Toward Expanding the Scope of Radiology Report Summarization to Multiple Anatomies and Modalities

Radiology report summarization (RRS) is a growing area of research. Given the Findings section of a radiology report, the goal is to generate a summary (called an Impression section) that highlights the key observations and conclusions of the radiology study. However, RRS currently faces essential limitations. First, many prior studies conduct experiments on private datasets, preventing the reproduction of results and fair comparisons across different systems and solutions. Second, most prior approaches are evaluated solely on chest X-rays. To address these limitations, we propose a dataset (MIMIC-RRS) involving three new modalities and seven new anatomies based on the MIMIC-III and MIMIC-CXR datasets. We then conduct extensive experiments to evaluate the performance of models both within and across modality-anatomy pairs in MIMIC-RRS. In addition, we evaluate their clinical efficacy via RadGraph, a factual correctness metric.


Introduction
The radiology report documents and communicates crucial findings in a radiology study. A standard radiology report usually consists of a Background section that describes the exam and patient information, a Findings section, and an Impression section (Kahn Jr et al., 2009). In a typical workflow, a radiologist first dictates the detailed findings into the report and then summarizes the salient findings into the more concise Impression section, based also on the condition of the patient. Automating this summarization task is critical because the Impression section is the most important part of a radiology report, and manual summarization can be time-consuming and error-prone.
Despite its importance, we identify three weaknesses in the ongoing work on radiology report summarization. First, the most recent studies (Zhang et al., 2018, 2020; Hu et al., 2022) and organized challenges (Ben Abacha et al., 2021) on automated radiology report summarization systems focus solely on chest X-rays. The reason is that the only two open-access and curated datasets, namely MIMIC-CXR (Johnson et al., 2019) and Open-i Chest X-ray (Demner-Fushman et al., 2012), exclusively contain chest X-ray radiology reports. In some rarer cases, researchers omit to disclose the modality and anatomy of the radiology reports used for their experiments (Karn et al., 2022). Second, existing models are optimized to generate summaries that score highly on the ROUGE metric (Lin, 2004). As investigated in previous studies, this does not guarantee factually correct summaries (Zhang et al., 2020). So far, only one "factually-oriented" metric has been proposed, based on CheXbert (Smit et al., 2020). While this addition is a good step towards evaluating the factual correctness of summaries, it is limited to chest X-rays. Finally, newly proposed models (Karn et al., 2022; Hu et al., 2022) present an increased architectural complexity that offers only marginal improvements on the existing evaluation metrics for summarization. This, in turn, makes the replication of studies more difficult.
To address these three limitations, we consequently present three contributions: • We release a pre-processed and curated dataset of radiology reports for new modalities (MR and CT) and anatomies (Chest, Head, Neck, Sinus, Spine, Abdomen, Pelvis). Our dataset is based on the MIMIC-III database (Johnson et al., 2016) and is suitable for radiology report summarization: each report contains clear Findings and Impression sections.
• We present a new summarization metric, called the RadGraph score, that evaluates the factual completeness and correctness of the generated radiology impressions. We show that this score is suitable for every modality and anatomy of our new dataset. We also show that the RadGraph score can be turned, without further modification, into a reward suitable for Reinforcement Learning (RL) optimization.
• We present a new simple summarization system that not only acts as a strong baseline on the new datasets but also outperforms previous replicable research on chest X-rays. Because our new system is small (few trainable parameters) and fast (low FLOPs), it is suitable for RL, which is by nature computationally expensive.
Our paper is structured as follows: we first describe our three contributions, namely our new MIMIC-III summarization dataset (Section 2), the RadGraph score (Section 3) and our baseline model (Section 4).
We then proceed to outline the experiments carried out (Section 5) and present our results (Section 6).

For each report, we extract the Findings and Impression sections. However, the findings section is not always labeled as such. With the help of one board-certified radiologist, and for each modality-anatomy pair, we create a mapping of the section headers that act as "findings". As an example, for CT head, the findings could be referred to as "non-contrast head ct", "ct head", "ct head without contrast", "ct head without iv contrast", "head ct", "head ct without iv contrast" or "cta head". This "findings" mapping contains up to 537 candidate sections for our whole dataset. We also discarded reports where multiple studies are pooled in the same radiology report, leading to multiple intricate observations in the impression section. We release our mapping as well as the code to recreate the dataset from scratch (Appendix B). In addition, a few comments can be made from Figure 2. As expected, for all anatomy-modality pairs, the findings section is significantly longer than the impression section (up to +315% for MR abdomen). The findings sections of all our new pairs are also much longer than those of chest X-rays: the MIMIC-CXR dataset averages 49 words per findings section, whereas MR Abdomen and MR Pelvis average 205 and 174 words respectively. We also note that CT Chest, CT Head and CT Abdomen-Pelvis have relatively large vocabulary sizes (given their sample sizes), with 20,909, 19,813 and 18,933 words respectively. Surprisingly, the CT Abdomen-Pelvis impressions contain a larger vocabulary than the findings, as opposed to the MR Pelvis and MR Abdomen impressions, which contain respectively 36% and 37% fewer words than their findings counterparts.
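To make the extraction procedure concrete, the sketch below shows how such a header mapping can be applied to a raw report. This is a minimal sketch, not the released pipeline (Appendix B): the mapping shown contains only the CT-head headers quoted above, and the assumption that a section starts with a "<header>:" line is ours.

```python
import re

# Illustrative subset of the findings-header mapping described above;
# the released mapping (Appendix B) contains up to 537 candidate headers.
FINDINGS_HEADERS = {
    ("CT", "head"): [
        "non-contrast head ct", "ct head", "ct head without contrast",
        "ct head without iv contrast", "head ct",
        "head ct without iv contrast", "cta head",
    ],
}

def extract_sections(report: str, modality: str, anatomy: str):
    """Return (findings, impression) from a raw report, or None if
    either section cannot be located."""
    text = report.lower()
    findings = impression = None
    # Assumption: each section starts with "<header>:" at the beginning
    # of a line and runs until the next "<header>:" line or end of report.
    section = r":\s*(.+?)(?=^\S[^:\n]*:|\Z)"
    for header in FINDINGS_HEADERS.get((modality, anatomy), []) + ["findings"]:
        match = re.search("^" + re.escape(header) + section, text,
                          flags=re.M | re.S)
        if match:
            findings = match.group(1).strip()
            break
    match = re.search("^impression" + section, text, flags=re.M | re.S)
    if match:
        impression = match.group(1).strip()
    if findings and impression:
        return findings, impression
    return None
```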

RadGraph score
The goal of our RadGraph score is to provide an evaluation of the factual correctness and completeness of a generated impression. It does so by evaluating two semantic components of the impression: the correctness of the named entities and of the relations between entities.
We divide this section into two subsections: we first present RadGraph in Section 3.1 and then describe how we leverage RadGraph to compute the proposed RadGraph score in Section 3.2.

RadGraph

To design our new evaluation metric, we leverage the RadGraph dataset (Jain et al., 2021), which contains board-certified radiologist annotations of chest X-ray reports, corresponding to 14,579 entities and 10,889 relations. The RadGraph authors also released a PubMedBERT model (Gu et al., 2021) pre-trained on these annotations to annotate new reports. An example of annotation can be seen in Figure 3. Before moving on to the next section, we quickly describe the concepts of entities and relations.

Entities An entity is defined as a continuous span of text that can include one or more adjacent words. Entities in RadGraph center around two concepts: Anatomy and Observation. Three uncertainty levels exist for Observation, leading to four different entities: Anatomy (ANAT-DP), Observation: Definitely Present (OBS-DP), Observation: Uncertain (OBS-U), and Observation: Definitely Absent (OBS-DA).

Relations A relation is a directed edge between two entities, labeled with a relation type such as "modified", "located at", or "suggestive of".
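As an illustration, a RadGraph-style annotation for a short phrase such as "no pleural effusion" might look as follows. This is a hand-written example in the spirit of the public RadGraph release; the exact field names and label assignments should be checked against Jain et al. (2021).

```python
# Hand-written illustration of a RadGraph-style annotation for the
# phrase "no pleural effusion" (labels assigned for illustration only).
annotation = {
    "entities": {
        "1": {"tokens": "pleural", "label": "ANAT-DP", "relations": []},
        "2": {"tokens": "effusion", "label": "OBS-DA",
              # "effusion" is located at "pleural" (entity "1")
              "relations": [["located_at", "1"]]},
    }
}
```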

Score
Using the RadGraph annotation scheme and pretrained model, we design an F-score-style reward that measures the factual consistency and completeness of the generated impression (also called the hypothesis impression) compared to the reference impression.
To do so, we treat the RadGraph annotations of an impression as a graph $G(V, E)$ with the set of nodes $V = \{v_1, v_2, \ldots, v_{|V|}\}$ containing the entities and the set of edges $E = \{e_1, e_2, \ldots, e_{|E|}\}$ containing the relations between pairs of entities. The graph is directed, meaning that the edge $e = (v_1, v_2) \neq (v_2, v_1)$. An example is depicted in Figure 4. Each node or edge of the graph also has a label, which we denote $v_i^L$ for an entity $v_i$ (for example "OBS-DP" or "ANAT-DP") and $e_{ij}^L$ for a relation $e = (v_i, v_j)$ (such as "modified" or "located at").
To design our RadGraph score, we focus on the nodes $V$ and whether or not a node has a relation in $E$. For a hypothesis impression $y$, we create a new set of triplets $T_y = \{(v_i, v_i^L, \mathbb{1}[\exists\, e \in E : v_i \in e])\}_{i=1}^{|V|}$. In other words, a triplet contains an entity, the entity label, and whether or not this entity has a relation. We proceed to construct the same set for the reference report $\hat{y}$ and denote this set $T_{\hat{y}}$.
Finally, our score is defined as the harmonic mean of the precision $|T_y \cap T_{\hat{y}}| / |T_y|$ and the recall $|T_y \cap T_{\hat{y}}| / |T_{\hat{y}}|$ between the hypothesis set $T_y$ and the reference set $T_{\hat{y}}$, giving a value between 0 and 100. As an illustration, we provide in Appendix C the sets $V$, $E$ and $T$ of the graph $G$ in Figure 4.
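A minimal sketch of the triplet construction and the harmonic-mean score follows. The annotation format (an `entities` dict with `tokens`, `label` and `relations` fields, as in the example of Section 3.1) is assumed to match the public RadGraph release and should be verified against its inference code.

```python
def to_triplets(annotation: dict) -> set:
    """Build the triplet set T = {(entity text, entity label, has_relation)}
    from a RadGraph-style annotation."""
    triplets = set()
    for ent in annotation["entities"].values():
        has_relation = len(ent["relations"]) > 0
        triplets.add((ent["tokens"].lower(), ent["label"], has_relation))
    return triplets

def radgraph_score(hyp_annotation: dict, ref_annotation: dict) -> float:
    """Harmonic mean of precision and recall between the hypothesis and
    reference triplet sets, scaled to [0, 100]."""
    t_hyp = to_triplets(hyp_annotation)
    t_ref = to_triplets(ref_annotation)
    if not t_hyp or not t_ref:
        return 0.0
    overlap = len(t_hyp & t_ref)
    precision = overlap / len(t_hyp)
    recall = overlap / len(t_ref)
    if precision + recall == 0:
        return 0.0
    return 100 * 2 * precision * recall / (precision + recall)
```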

Generalization to other modalities and anatomies
As mentioned in Section 3.1, the RadGraph model is trained on chest X-ray entities and relations. Therefore, it is not obvious whether RadGraph can be ported to other modalities and anatomies. As a partial validation of the use of RadGraph for our experiments, we asked one board-certified radiologist to subjectively evaluate the entities extracted by the RadGraph model on two randomly selected reports for each anatomy-modality pair. Three examples of those reports are shown in Table 2. In the selected reports, no entities were omitted by the RadGraph model.

Model
In this section, we detail our third contribution. We first describe in Section 4.1 our simple baseline by detailing the architecture of our findings-to-impression model and then explain in Section 4.2 how this model can be trained using RL to directly optimize our RadGraph score.

Architecture
To encode the findings, we use a BERT encoder (Vaswani et al., 2017; Devlin et al., 2019) and use its final representation as textual features $H$. To generate impressions, we use a BERT decoder with cross-attention over the textual features. More formally, the cross-attention of a decoder transformer layer is written:

$$\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^\top}{\sqrt{d}}\right)V \quad (1)$$

where $Q$ is the decoder hidden state of size $d$, and $K$ and $V$ are the textual features $H$.
We denote the number of layers in the BERT encoder and decoder as L.
The encoder and decoder can be pre-trained (e.g., initialized from BioMed-RoBERTa (Gururangan et al., 2020)), in which case the number of layers L is defined by the pre-trained model. We can also train both the encoder and decoder from scratch with a custom number of layers L, as sketched below. More details are available in our Experiments section.
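The sketch below shows both configurations using the HuggingFace transformers API; this is one possible realization of the architecture described above (the paper's own code uses the ViLMedic library, see Appendix B), and the model identifier is the public BioMed-RoBERTa checkpoint.

```python
from transformers import BertConfig, EncoderDecoderConfig, EncoderDecoderModel

# Option 1: warm-start the encoder and decoder from a pre-trained
# checkpoint; L is then fixed by the checkpoint, and the decoder's
# cross-attention layers are randomly initialized.
model = EncoderDecoderModel.from_encoder_decoder_pretrained(
    "allenai/biomed_roberta_base", "allenai/biomed_roberta_base"
)

# Option 2: train from scratch with a custom number of layers L.
L = 4
enc_cfg = BertConfig(num_hidden_layers=L)
dec_cfg = BertConfig(num_hidden_layers=L, is_decoder=True,
                     add_cross_attention=True)
model = EncoderDecoderModel(
    config=EncoderDecoderConfig.from_encoder_decoder_configs(enc_cfg, dec_cfg)
)
```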

Training
If we denote θ as the model parameters, then θ is learned by maximizing the likelihood of the hypothesis impression $Y = (y_1, y_2, \cdots, y_n)$, or in other words, by minimizing the negative log-likelihood (NLL). The objective function is given by:

$$\mathcal{L}_{\text{NLL}}(\theta) = -\sum_{t=1}^{n} \log p_\theta(y_t \mid y_{<t}) \quad (2)$$

After the NLL training, we start an RL training by optimizing our RadGraph score. The loss function is now given by:

$$\mathcal{L}_{\text{RL}}(\theta) = -\mathbb{E}_{Y \sim p_\theta}\left[r(Y)\right] \quad (3)$$

where $r(Y)$ is the reward of the generated report. We use the SCST algorithm (Rennie et al., 2017) to approximate the expected gradient of our non-differentiable reward function. The expression becomes:

$$\nabla_\theta \mathcal{L}_{\text{RL}}(\theta) \approx -\left(r(Y) - r(\bar{Y})\right) \nabla_\theta \log p_\theta(Y) \quad (4)$$

Here $r(\bar{Y})$ acts as a baseline (Sutton et al., 1998) to reduce the variance of $r(Y)$. In our case, $r(\bar{Y})$ is estimated by sampling from the model during training.
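A condensed sketch of one SCST update implementing Equation (4) follows. Here `radgraph_reward` is a hypothetical batch-wise wrapper around the RadGraph score of Section 3, the baseline Ȳ is drawn by sampling without gradients (matching the text; greedy decoding is another common SCST choice), and token-alignment and padding details are elided for brevity.

```python
import torch

def scst_step(model, tokenizer, findings_ids, ref_impressions, radgraph_reward):
    """One self-critical (SCST) update approximating Equation (4)."""
    # Sampled hypothesis Y, and baseline Y_bar drawn without gradients.
    sample_ids = model.generate(findings_ids, do_sample=True, max_length=128)
    with torch.no_grad():
        baseline_ids = model.generate(findings_ids, do_sample=True,
                                      max_length=128)

    hyps = tokenizer.batch_decode(sample_ids, skip_special_tokens=True)
    bases = tokenizer.batch_decode(baseline_ids, skip_special_tokens=True)
    r_sample = torch.tensor(radgraph_reward(hyps, ref_impressions))
    r_baseline = torch.tensor(radgraph_reward(bases, ref_impressions))

    # log p_theta(Y): sum of token log-probabilities of the sampled
    # sequence, recomputed with a differentiable forward pass.
    logits = model(input_ids=findings_ids, labels=sample_ids).logits
    logp = torch.log_softmax(logits, dim=-1)
    token_logp = logp.gather(-1, sample_ids.unsqueeze(-1)).squeeze(-1)

    # Equation (4): loss = -(r(Y) - r(Y_bar)) * log p_theta(Y).
    loss = -((r_sample - r_baseline) * token_logp.sum(dim=-1)).mean()
    loss.backward()
    return loss.item()
```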

Experiments
In this section, we detail the experiments carried out to evaluate our three contributions.

Model
For our BERT encoder-decoder of Section 4.1, we use two pre-trained models, BioMed-RoBERTa (Gururangan et al., 2020) and PubMedBERT (Gu et al., 2021), and also train models from scratch using L = 2, L = 4 and L = 8. When training from scratch, the rest of the model parameters can be found in Appendix D.
The details of NLL and RL training can be found in Appendix E.

Metrics
We proceed to evaluate our systems using the ROUGE-1, ROUGE-2 and ROUGE-L metrics (Lin, 2004) to be consistent with prior work. We also report the RadGraph score (Section 3) for both the MIMIC-CXR and MIMIC-III experiments.
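For reference, ROUGE can be computed with an off-the-shelf package; the snippet below uses the `rouge-score` package, one common implementation choice rather than necessarily the one used in prior work.

```python
from rouge_score import rouge_scorer

# Assumes the `rouge-score` package (pip install rouge-score);
# scores are reported as F-measures scaled to 100.
scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"],
                                  use_stemmer=True)

def rouge_metrics(hypothesis: str, reference: str) -> dict:
    scores = scorer.score(reference, hypothesis)  # target first, then prediction
    return {name: 100 * s.fmeasure for name, s in scores.items()}
```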
For MIMIC-CXR, we also use the $F_1$CheXbert score (Zhang et al., 2020) alongside our RadGraph score to evaluate the factual correctness of the generated impressions. This metric uses CheXbert (Smit et al., 2020), a Transformer-based model trained to label the abnormalities of a chest X-ray given a radiology report (or an impression) as input. $F_1$CheXbert is the F1 score between the predictions of CheXbert on the hypothesis impression $y$ and on the corresponding reference impression $\hat{y}$. The F1 score is calculated over 14 abnormalities to be consistent with Hu et al. (2022). Because the abnormalities are specific to chest X-rays, we only use this metric for MIMIC-CXR.
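A minimal sketch of this comparison follows, assuming a hypothetical `chexbert_predict` helper that maps an impression string to a binary vector over the 14 abnormality labels; the actual label extraction should follow Smit et al. (2020) and the averaging should follow Hu et al. (2022).

```python
import numpy as np
from sklearn.metrics import f1_score

def f1_chexbert(hyp_impressions, ref_impressions, chexbert_predict):
    """F1 between CheXbert predictions on hypothesis and reference
    impressions, computed over the 14 abnormality labels."""
    # Each prediction is a binary vector of length 14 (one per abnormality).
    hyp = np.array([chexbert_predict(y) for y in hyp_impressions])
    ref = np.array([chexbert_predict(y) for y in ref_impressions])
    # The labels extracted from the reference impressions act as ground truth.
    return 100 * f1_score(ref, hyp, average="micro")
```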

Results
This section is divided into two parts. First, we show the results of our preliminary experiments on MIMIC-CXR to validate our baselines and training setup (Section 6.1). We then discuss the results on our new MIMIC-III dataset (Section 6.2).

Table 3: Summary of our results discussed throughout Section 6. Each reported score is the average of 5 independent runs. NLL refers to Equation (2), RL refers to Equation (4).

Preliminary experiments on MIMIC-CXR
The results of our preliminary experiments on MIMIC-CXR can be found in Table 3.
(NLL) When training with NLL, our best performing model is L = 4, i.e., our BERT encoder-decoder trained from scratch with four hidden layers. On the factual-correctness metrics, this model achieves 71.25 points of $F_1$CheXbert and 45.95 points of RadGraph score, an improvement of +0.52 and +1.15 points respectively compared to Hu et al. (2022), and of +1.13 and +1.01 points compared to our biggest model L = 8. The model L = 4 also reports the highest ROUGE-L: 46.82. These metrics can be further improved by ensembling models. Surprisingly, using BioMed-RoBERTa and PubMedBERT as pre-training did not help improve our results. On the contrary, they report the lowest scores of all our model variants. Nevertheless, this aligns with the low results of Liu and Lapata (2019), where the authors also used a pre-trained BERT encoder and decoder for summarization. In addition, in the task of Radiology Report Generation (X-ray to impression), Miura et al. (2021) reported the strongest performance using a one-layered BERT decoder.
(RL) Using RL to directly optimize the RadGraph score further improved the metrics across the board. Compared to L4 (NLL), L4 (RL) reports an improvement of +2.28 (+4.96%) RadGraph score and +3.61 (+5.1%) $F_1$CheXbert, but only +0.28 (+0.5%) ROUGE-L. Figure 5 shows that our L4 (RL) model is also encouraged to generate more entities and relations than its NLL counterpart. However, L4 (RL) generated more "ANAT-DP" entities and "located at" relations than found in the reference. Conversely, it did not generate any "OBS-U" entities or "suggestive of" relations. One hypothesis is that, because "OBS-U" and "suggestive of" are under-represented in the dataset, the RL model could not learn to use them correctly, leading to a negative reward, which in turn made the model less and less likely to use them. Given these encouraging results, we decided to carry out the remainder of our experiments using the L4 model.

Experiments on MIMIC-III
The results of our experiments on MIMIC-III can be found in Table 3 (NLL) and Table 5 (RL). As shown in Figure 2, the findings and impressions in the MIMIC-III dataset are substantially longer than in MIMIC-CXR; hence, the summarization task is more difficult. As an example, we highlight in Table 4 the impressions generated by two of our models. Such examples show that our model L4 (NLL) has learned, to some extent, to summarize effectively. A more in-depth study would be required to evaluate the outputs on the whole test set, for different modalities and anatomies.
(RL) As shown in Table 5, directly optimizing the RadGraph score using Reinforcement Learning improves the ROUGE and RadGraph scores for all modalities and anatomies. It shows convincingly that RadGraph can be used for various types of reports, either as a score to evaluate radiology impressions or as a reward to optimize factual correctness.

Conclusion
In this paper, we present three original contributions to address the current weaknesses of the task of Radiology Report Summarization. First, we release a pre-processed and curated dataset of radiology reports for new modalities (MR and CT) and anatomies (Chest, Head, Neck, Sinus, Spine, Abdomen, Pelvis) based on the MIMIC-III database (Section 2). This allows future research to extend the scope of this task beyond chest X-rays. Then, we presented a new summarization metric, called the RadGraph score (Section 3), that evaluates the factual completeness and correctness of the generated radiology impressions. We showed qualitatively (Section 3.3) and empirically (Section 6.2) that this new evaluation metric is suitable for every modality and anatomy of our new dataset. Finally, we presented a new simple summarization system (Section 4.1) that not only acts as a strong baseline on the new datasets but also outperforms previous replicable research on chest X-rays (Section 6). Our model, the dataset, and the score can be found and replicated via Appendix B.

Table 5: Performance of our L4 (RL) model on the five largest sets of our MIMIC-III summarization dataset, with improvements in % compared to the NLL counterparts from Table 3.

Future work
RadGraph has proven to be an efficient tool and could be used in ways other than those presented in this paper.

Appendix A. Related work
We restrict this section to Radiology Report Summarization.
The first attempt at automatic summarization of radiology findings into natural-language impression statements was proposed by Zhang et al. (2018). Their contribution is a first baseline on the task, using a bidirectional LSTM as encoder and decoder. Importantly, they found that about 30% of the radiology summaries generated by neural models contain factual errors. Subsequently, Zhang et al. (2020) proposed the $F_1$CheXbert score to evaluate the factual correctness of the generated impressions. They also used RL to directly optimize that metric. Finally, both Hu et al. (2021) and Hu et al. (2022) used the Biomedical and Clinical English Model Packages of the Stanza Python NLP Library (Zhang et al., 2021) to extract medical entities. The former study used the entities to construct a Graph Neural Network, used as input in their summarization pipeline, while the latter used the entities to mask the findings in a contrastive pre-training.
We believe this paper is an original contribution to the aforementioned work. Following Zhang et al. (2018), our goal is to release new summarization corpora and baselines for new modalities and anatomies; we do so by releasing 11 new anatomy-modality pairs. Similarly to Zhang et al. (2020), we continue the effort of proposing a new metric that evaluates the factual correctness and completeness of the generated impression, namely the RadGraph score. Finally, we improve on the work of Hu et al. (2021) and Hu et al. (2022) in two ways: 1) we use semantic annotations from a pre-trained model supervised on board-certified radiologist annotations, as opposed to Stanza, which leverages unsupervised biomedical and clinical text data; 2) we leverage relation annotations between entities, a feature that was not available to prior work.

Appendix B. Code and data release
To help with further research, we also make our code publicly available. More specifically, we release the code of the RadGraph score as well as the training of our baseline. We also release the script to download, pre-process and split the radiology reports of the MIMIC-III database as per our experiments.
To download the MIMIC-III database, researchers are required to formally request access via a process documented on the MIMIC website. There are two key steps that must be completed before access is granted: 1) the researcher must complete a recognized course in protecting human research participants that includes Health Insurance Portability and Accountability Act (HIPAA) requirements.
2) the researcher must sign a data use agreement, which outlines appropriate data usage and security standards, and forbids efforts to identify individual patients.
Our research has been carried out using the ViLMedic library (Delbrouck et al., 2022). Our code is available at https://github.com/jbdel/vilmedic.