Improving Radiology Summarization with Radiograph and Anatomy Prompts

The impression is crucial for referring physicians to grasp key information, since it is concluded from the findings and the reasoning of radiologists. To alleviate the workload of radiologists and reduce repetitive human labor in impression writing, many researchers have focused on automatic impression generation. However, recent works on this task mainly summarize the corresponding findings and pay less attention to the radiology images. In clinical practice, radiographs can provide more detailed, valuable observations to enhance radiologists' impression writing, especially for complicated cases. Besides, each sentence in the findings usually focuses on a single anatomy, so sentences only need to be matched to the corresponding anatomical regions instead of the whole image, which benefits textual and visual feature alignment. Therefore, we propose a novel anatomy-enhanced multimodal model to promote impression generation. In detail, we first construct a set of rules to extract anatomies and insert these prompts into each sentence to highlight anatomy characteristics. Then, two separate encoders are applied to extract features from the radiograph and the findings. Afterward, we utilize a contrastive learning module to align these two representations at the overall level and use co-attention to fuse them at the sentence level with the help of the anatomy-enhanced sentence representations. Finally, the decoder takes the fused information as input to generate impressions. Experimental results on two benchmark datasets confirm the effectiveness of the proposed method, which achieves state-of-the-art results.


Introduction
A radiology report of an examination is used to describe normal and abnormal conditions with one medical image and two important text sections: findings and impression. The findings section is a free-text description of a clinical radiograph (e.g., chest X-ray), providing the medical image's detailed observations. Meanwhile, the impression is a more concise statement about critical observations, summarized from the findings, the images, and the inference of radiologists, and provides clinical suggestions, such that in practice clinicians prefer to read the impression to locate the prominent observations and evaluate their differential diagnoses. However, writing impressions is time-consuming and in high demand, which draws many researchers to focus on automatic impression generation (AIG) to alleviate the workload of radiologists (Gharebagh et al., 2020; Hu et al., 2021; Zhang et al., 2018, 2020c; Hu et al., 2022a; MacAvaney et al., 2019).
For example, Gharebagh et al. (2020); Hu et al. (2021); Karn et al. (2022) propose to extract medical ontologies and entities from the findings and then utilize graph neural networks (GNNs), dual encoders, or reinforcement learning to integrate this knowledge into general sequence-to-sequence models for promoting AIG. Yet, most existing studies mainly focus on fully using the findings to produce impressions and pay little attention to the medical radiograph. Since some diseases tend to have similar observations, they are difficult to diagnose clearly from the textual statements alone. In this situation, most radiologists consider both the image and the findings to make a more accurate clinical suggestion in the impression. Besides, many approaches have been proposed for radiology report generation, whose goal is to generate the findings from a given medical image, and have achieved considerable success (Chen et al., 2021; Zhang et al., 2020a), further showing the value of the knowledge in the medical image. In radiology reports, each findings section can be regarded as a textual representation of the corresponding medical image, and each image is a visual representation of the findings, such that these two modalities can be effectively aligned.
Therefore, we propose a task that integrates the images and anatomy-enhanced findings for impression generation. According to our communication with radiologists, each sentence in the findings focuses on a single anatomy, so the sentence-level representation should be easier to align to a certain anatomical region of the image. To enhance such a process, we first construct some rules under the guidance of radiologists and utilize these rules to extract the main anatomies from each sentence. Then we put these anatomies at the beginning of the sentence to emphasize anatomy information. Next, we use a visual extractor to extract visual features from the radiology image and apply a Transformer-based text encoder to embed the corresponding findings. Afterward, an extra encoder is used to further model the visual features, whose output is aligned to the textual representation at the document level by a contrastive learning module. Finally, we employ co-attention to integrate the visual and textual features at the sentence level to obtain the final fused representation, which is then input to the decoder to generate the impression. Experimental results on two benchmark datasets, MIMIC-CXR and OpenI, demonstrate the effectiveness of our proposed model, which achieves better performance than most existing studies. Furthermore, an analysis of impression length shows that our proposed multimodal model is better at long impression generation, obtaining significant improvements when the impression is longer than 20 words.

Method
We follow existing studies on report generation (Chen et al., 2020; Zhou et al., 2021) and impression generation (Zhang et al., 2018; Gharebagh et al., 2020; Hu et al., 2021) and utilize the standard sequence-to-sequence paradigm for this task. In doing so, we regard the patch features extracted from the radiology image X_I as one of the source inputs. The other input is the findings sequence X_F = [s_1, ..., s_M], where M is the number of sentences and each sentence s_i is a sequence of tokens. The goal is to utilize X_I and X_F to find a target impression Y = [y_1, ..., y_i, ..., y_L] that summarizes the most critical observations, where L is the number of tokens, y_i ∈ V is the generated token, and V is the vocabulary of all possible tokens. The impression generation process can be defined as:

p(Y | X_I, X_F) = ∏_{t=1}^{L} p(y_t | y_1, ..., y_{t−1}, X_I, X_F)    (1)

For this purpose, we train the proposed model to minimize the negative conditional log-likelihood of Y given X_I and X_F:

L_G = − Σ_{t=1}^{L} log p(y_t | y_1, ..., y_{t−1}, X_I, X_F; θ)    (2)

where θ denotes the trainable parameters of the model. The overall architecture of the model is shown in Figure 2.

Visual Extractor
We employ a pre-trained convolutional neural network (CNN) (e.g., ResNet (He et al., 2016)) to extract features from X_I. We follow Chen et al. (2020) to decompose the image into multiple regions of equal size and then expand these patch features into a sequence:

X_I = f_ve(image) = [im_1, ..., im_P]

where f_ve(•) refers to the visual extractor, im_i is the i-th patch feature, and P is the number of patches.
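The decomposition step can be sketched as follows. In the model, the feature map would come from a pre-trained CNN such as ResNet; here we use a random array, and the helper name `decompose_into_patches` is hypothetical.

```python
import numpy as np

def decompose_into_patches(feature_map, grid=7):
    """Split a CNN feature map (H, W, C) into an equal-size grid and
    flatten the cells into a patch sequence [im_1, ..., im_P].
    Each region is mean-pooled into one patch feature vector."""
    H, W, C = feature_map.shape
    assert H % grid == 0 and W % grid == 0, "map must divide evenly"
    ph, pw = H // grid, W // grid
    patches = []
    for r in range(grid):
        for c in range(grid):
            cell = feature_map[r*ph:(r+1)*ph, c*pw:(c+1)*pw, :]
            patches.append(cell.mean(axis=(0, 1)))  # pool each region
    return np.stack(patches)  # shape: (grid*grid, C)

fmap = np.random.rand(14, 14, 2048)   # stand-in for a ResNet feature map
X_I = decompose_into_patches(fmap, grid=7)
print(X_I.shape)  # (49, 2048)
```

With a 7×7 grid this yields P = 49 patch features, each of which can later be attended to as a candidate anatomical region.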

Sentence Anatomy Prompts
It is known that each sentence in the findings usually focuses on describing observations of a single anatomy, such as the lungs or the heart, instead of stating observations of multiple anatomies in one sentence. This might be because many radiologists draw on radiology report templates when writing findings, and most templates follow this characteristic, describing medical observations anatomy by anatomy. For example, radiology report templates on the RadReport website (https://radreport.org/) mainly divide the findings into six sections: Lungs, Pleural Spaces, Heart, Mediastinum, Osseous Structures, and Additional Findings. Motivated by this, we manually construct a rule lexicon under the guidance of radiologists to extract anatomy information from each sentence, with the details shown in Table 1. After that, we deal with the different types of sentences as follows:

• Type I: For a sentence that only describes observations of a single anatomy, we assign the sentence to the corresponding anatomy type. For example, the sentence "The lungs are hyperexpanded and mild interstitial opacities" only contains one anatomy (i.e., lungs), and thus we assign the type lungs to this sentence.

• Type II: Although most sentences focus on a single anatomy, some still mention multiple anatomies. For these sentences, we follow the priority ranking from normal observations to comparisons, as shown in Table 1. For instance, although both the lungs and the pleural spaces appear in the sentence "lungs are grossly clear, and there are no pleural effusions", we assign this sentence to the type normal observations.

• Type III: The remaining sentences are marked with a special type, other observations.

Next, we insert the anatomy type into the corresponding sentence, modifying the original sentence as "anatomy: sentence". For instance, the type lungs is inserted into "The lungs are hyperexpanded and mild interstitial opacities" as "lungs: The lungs are hyperexpanded and mild interstitial opacities". In this way, the original findings X_F are updated to an anatomy-enhanced version X′_F.
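The rule-based assignment can be sketched as follows; the mini-lexicon here is hypothetical and far smaller than the paper's hand-crafted Table 1, but the Type I/II/III logic mirrors the description above.

```python
# Hypothetical mini-lexicon; the real one is built with radiologists.
ANATOMIES = {
    "lungs": ["lung", "pulmonary"],
    "pleural spaces": ["pleural"],
    "heart": ["heart", "cardiac", "cardiomegaly"],
}
# Priority categories for multi-anatomy sentences (Type II), ranked
# from normal observations down to comparisons as in Table 1.
PRIORITY = [
    ("normal observations", ["clear", "no effusion", "unremarkable"]),
    ("comparisons", ["compared", "prior", "unchanged"]),
]

def anatomy_prompt(sentence):
    """Assign an anatomy type and prepend it as 'anatomy: sentence'."""
    text = sentence.lower()
    matched = [a for a, kws in ANATOMIES.items()
               if any(kw in text for kw in kws)]
    if len(matched) == 1:        # Type I: a single anatomy
        atype = matched[0]
    elif len(matched) > 1:       # Type II: fall back to priority ranking
        atype = next((cat for cat, kws in PRIORITY
                      if any(kw in text for kw in kws)), matched[0])
    else:                        # Type III: nothing matched
        atype = "other observations"
    return f"{atype}: {sentence}"
```

For instance, `anatomy_prompt("The lungs are hyperexpanded and mild interstitial opacities")` yields a sentence prefixed with `lungs:`, while the two-anatomy example from Type II falls through to `normal observations:`.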

Text Encoder
Pre-trained language models have achieved great success in many NLP tasks (Hu et al., 2022b,c; Zhong and Chen, 2021; Xu et al., 2021b; Fang et al., 2023a,b; Hu et al., 2023). Therefore, we employ the pre-trained BioBERT model (Lee et al., 2020) as our text encoder to extract features from the findings:

[h_1, ..., h_N] = f_te(X′_F)

where f_te(•) refers to the text encoder and h_i is a high-dimensional vector representing token x_i.
We regard the representation of the [CLS] token of sentence s_i (i.e., h^CLS_i) as the i-th sentence representation.

Document-level Cross-Modal Alignment
In radiology reports, the findings and the radiology image usually describe the same medical observations through different media (i.e., text and vision, respectively). To pull the image representation close to the output of the text encoder, we first utilize an extra Transformer encoder to further model the visual features X_I:

[c_1, ..., c_P] = f_ie(X_I)

where the outputs are the hidden states c_i encoded from the input visual features in subsection 2.1 and f_ie(•) refers to the Transformer image encoder. Afterward, we use mean pooling to obtain the overall representations of the findings and the corresponding image:

z_F = MeanPool(h_1, ..., h_N),  z_I = MeanPool(c_1, ..., c_P)

Owing to the characteristics of the radiology report, z_I and z_F should be close to each other if the image and the findings come from the same examination.
On the contrary, radiology images and reports from different examinations tend to contain distinct medical observations and should therefore be distant from each other. We thus introduce a contrastive learning module to map positive samples closer and push apart negative ones, where positive means that z_I and z_F come from the same pair (i.e., the same examination) and negative refers to samples from different pairs. For example, assume there are two examinations, (findings_1, image_1) and (findings_2, image_2); in this case, for findings_1, image_1 is a positive sample while image_2 is a negative one. We follow Gao et al. (2021) to compute the cosine similarity between a representation and its positive and negative examples. Then, for a batch of 2Q examples z ∈ {z_I} ∪ {z_F}, we compute the contrastive loss for each z_m as:

L_m = −log( exp(sim(z_m, z_m⁺)/τ) / Σ_{n=1, n≠m}^{2Q} exp(sim(z_m, z_n)/τ) )

where sim(•, •) is the cosine similarity and τ is a temperature hyperparameter. The total contrastive loss is the mean loss over all examples:

L_CL = (1/2Q) Σ_{m=1}^{2Q} L_m
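Under this formulation, a minimal numpy sketch of the batch contrastive loss, assuming index-aligned image/findings pairs, looks as follows (the function name is hypothetical):

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def contrastive_loss(z_I, z_F, tau=0.1):
    """InfoNCE-style loss over a batch of Q paired image/findings
    vectors (shape (Q, d)); row q of z_I and z_F comes from the same
    examination. For each anchor, the positive is its paired vector
    and the negatives are all representations from other anchors."""
    Q = len(z_I)
    zs = list(z_I) + list(z_F)    # the 2Q examples
    pos = list(z_F) + list(z_I)   # positive partner of each anchor
    losses = []
    for m, z_m in enumerate(zs):
        # similarities to every other example (skip the anchor itself)
        sims = [cosine(z_m, z_n) / tau
                for n, z_n in enumerate(zs) if n != m]
        num = np.exp(cosine(z_m, pos[m]) / tau)
        losses.append(-np.log(num / np.sum(np.exp(sims))))
    return float(np.mean(losses))
```

Well-aligned pairs drive the positive similarity toward 1 and shrink the loss, while representations paired with the wrong examination keep it large.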

Sentence-Level Co-Attention Fusion
As mentioned in subsection 2.2, each sentence in the findings usually focuses on a single anatomy, meaning that sentence-level textual information can be mapped to the corresponding anatomical regions of the image. Therefore, we propose to utilize the anatomy-enhanced sentence representations to align with the image. In detail, as introduced in subsection 2.3, we extract the anatomy-enhanced sentence representations from the text encoder and use them to perform co-attention to fuse the two modalities. We first treat h^CLS as the query and the image representations c as the key and value matrices, and compute the attention weights with the softmax function:

a_i = softmax(h^CLS_i cᵀ / √d)

where a_i can be viewed as a probability distribution over the image features, which is then used to compute a weighted sum:

m_i = a_i c

Afterward, conversely, c is regarded as the key and value matrices and h^CLS serves as the query, and we adopt a similar method to obtain another fusion representation:

b_j = softmax(c_j (h^CLS)ᵀ / √d),  n_j = b_j h^CLS    (11)

After that, we obtain the updated image and sentence representations by adding the fusion vectors to the original ones:

ĥ^CLS_i = h^CLS_i + m_i,  ĉ_j = c_j + n_j
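The sentence-level fusion can be sketched as plain scaled dot-product cross-attention in numpy; a full implementation would add learned query/key/value projection matrices, which this sketch omits, and the helper names are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def co_attention(h_cls, c):
    """Bidirectional cross-attention between sentence representations
    h_cls (M, d) and image patch representations c (P, d).
    Returns the residually updated sentence and image features."""
    d = h_cls.shape[-1]
    # sentences attend over image patches
    a_text = softmax(h_cls @ c.T / np.sqrt(d), axis=-1)   # (M, P)
    text_ctx = a_text @ c                                  # weighted sum
    # image patches attend over sentences
    a_img = softmax(c @ h_cls.T / np.sqrt(d), axis=-1)     # (P, M)
    img_ctx = a_img @ h_cls
    # add the fusion vectors to the original representations
    return h_cls + text_ctx, c + img_ctx
```

Each sentence thus pools the image patches it attends to (ideally its anatomical region), and each patch symmetrically pools the sentences describing it, before both are passed on to the decoder.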

Decoder
The backbone decoder in our model is that of the standard Transformer, where e = [c, h^CLS, h] serves as the input to the decoder so as to improve the decoding process. The decoding process at time step t can be formulated as a function of the previous outputs (i.e., y_1, ..., y_{t−1}) and the feature input e:

y_t = f_de(e, y_1, ..., y_{t−1})

where f_de(•) refers to the Transformer-based decoder, and this process iterates until a complete impression is generated. We define the final loss function as a linear combination of the impression generation loss and the contrastive objective:

L = L_G + λ L_CL

where λ is a tuned hyperparameter controlling the weight of the contrastive loss.
Experimental Setting

Datasets

We conduct experiments on two benchmark datasets:

• OPENI: a public dataset containing 7,470 chest X-ray images and 3,955 corresponding reports, collected by Indiana University.

• MIMIC-CXR: a large-scale radiography dataset with 473,057 chest X-ray images and 206,563 reports.

We follow Hu et al. (2021) to remove the following cases: (a) incomplete reports without findings or impressions; (b) reports whose findings have fewer than ten words or whose impression has fewer than two words. Besides, since some reports have multiple radiology images from different views, such as posteroanterior, anteroposterior, and lateral, we select only one image, from the posteroanterior or anteroposterior view. As for the partition, we follow Chen et al. (2020) to split OpenI and MIMIC-CXR, where the former is split 70%/10%/20% for train/validation/test and the latter follows its official split.
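The filtering rules above can be sketched as a single predicate; the report dict keys (`findings`, `impression`, `views`) are hypothetical field names, not from the released datasets.

```python
def keep_report(report):
    """Apply the filtering rules: drop incomplete or too-short reports,
    then keep a single frontal image per report. Returns the selected
    image path, or None if the report is filtered out."""
    f = report.get("findings", "")
    imp = report.get("impression", "")
    if not f or not imp:                               # (a) incomplete
        return None
    if len(f.split()) < 10 or len(imp.split()) < 2:    # (b) too short
        return None
    # keep one image: posteroanterior preferred, else anteroposterior
    for view in ("posteroanterior", "anteroposterior"):
        if view in report["views"]:
            return report["views"][view]
    return None                                        # no frontal view
```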

Baseline and Evaluation Metrics
To illustrate the validity of our proposed model, we use the following models as our main baselines:

• BASE-FINDINGS and BASE-IMAGE: unimodal models, where the former utilizes a pre-trained text encoder and a randomly initialized Transformer-based decoder, and the latter replaces the text encoder with an image encoder.

• BASE: the backbone multimodal summarization model with pre-trained image and text encoders and a Transformer-based decoder, which utilizes both findings and images to generate impressions.

• BASE+DCA and BASE+AP: multimodal summarization models, where the former utilizes document-level representations to align findings and images, and the latter utilizes the rules to add anatomy prompts to each sentence.

We follow Zhang et al. (2020c) in using summarization and factual consistency (FC) metrics to examine model performance. Specifically, we use ROUGE (Lin, 2004) and report F1 scores of ROUGE-1 (R-1), ROUGE-2 (R-2), and ROUGE-L (R-L) as summarization metrics. Meanwhile, a pre-trained CheXbert (Smit et al., 2020) is used to recognize 14 types of observations from the reference and the generated impression, respectively, whose detected results are used to calculate the precision, recall, and F1 score for measuring FC.
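As a minimal illustration of the summarization metric, ROUGE-1 F1 reduces to unigram overlap between the reference and the generated impression; the actual evaluation uses the standard ROUGE toolkit, which additionally handles stemming and the ROUGE-2/L variants.

```python
from collections import Counter

def rouge1_f1(reference, candidate):
    """ROUGE-1 F1 as plain unigram-overlap precision/recall/F1."""
    ref = Counter(reference.lower().split())
    cand = Counter(candidate.lower().split())
    overlap = sum((ref & cand).values())   # clipped unigram matches
    if overlap == 0:
        return 0.0
    p = overlap / sum(cand.values())       # precision over candidate
    r = overlap / sum(ref.values())        # recall over reference
    return 2 * p * r / (p + r)
```

For example, a generated impression identical to the reference scores 1.0, and one sharing half its unigrams scores 0.5.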

Implementation Details
In our experiments, we select biobert-base-cased (https://github.com/dmis-lab/biobert) as the pre-trained text encoder.

Overall Results
To explore the effect of integrating image and text to generate impressions, we compare our model with the corresponding single-modal summarization baselines in Table 2. Moreover, we conduct experiments on the different model variants, with the results reported in Table 2, where BASE+AP+DCA denotes our full model. Several observations can be drawn. First, the comparisons among BASE+DCA, BASE+AP, and BASE illustrate the effectiveness of each component of our proposed model (i.e., contrastive learning and lexicon matching). Second, our full model (i.e., BASE+AP+DCA) achieves the best results among these baselines, which confirms the validity of our design combining contrastive learning and anatomy information planning. On the one hand, contrastive learning maps an image closer to its corresponding findings and pushes apart unpaired ones, effectively aligning the two modalities at the document level. On the other hand, highlighting anatomy characteristics can help the model align each sentence feature to the corresponding organ or body-part region of the image, further improving feature fusion between the modalities. Third, in terms of FC metrics on the MIMIC-CXR dataset, our proposed model outperforms all baselines and achieves higher F1 scores, indicating that our model generates more accurate impressions. This is because our model enhances feature matching between findings and images, facilitating the extraction of critical information and thus better impression generation.

Comparison with Previous Studies
We further compare our model with existing methods, with the results reported in Table 3. Our model outperforms the other methods, even though those studies utilize complicated structures to enhance generation; e.g., WGSUM utilizes a complicated graph structure, and R2GEN uses a recurrent relational memory. In addition, it is surprising that CLIPABS achieves worse performance than the text-based models (i.e., TRANSABS, WGSUM, and AIG_CL). This might be because CLIP pays more attention to the images and is less powerful at encoding text, while textual features are more important for this task.

Human Evaluation
We also conduct a human evaluation to assess the quality of the generated impressions with respect to three metrics: Readability, Accuracy, and Completeness (Gharebagh et al., 2020). In detail, we randomly select 100 chest X-ray images and their findings and impressions from the test set of MIMIC-CXR, as well as the impressions generated by the different models. Afterward, three experts who are familiar with radiology reports are invited to evaluate the generated impressions, with the results shown in Table 4. We observe that our model is better than BASE, with more of our model's impressions rated higher in quality than those of BASE, further confirming the effectiveness of our model. Meanwhile, when comparing our model against the references, we find that although some cases are rated worse than the ground truth (9%, 18%, and 10% on the three metrics), most of the impressions from our model are at least as good as the reference impressions.

Impression Length
To test the effect of impression length in AIG, we categorize the generated impressions on the MIMIC-CXR test set into several groups according to the length of the reference impression, with the R-1 scores shown in Figure 3. Note that the average impression length for MIMIC-CXR is 17 words. These models tend to perform worse as impression length increases, especially in the last group, where all obtain their worst R-1 scores. Our proposed model achieves more promising results in most groups, except the first, where BASE-FINDINGS achieves the best results, illustrating that our model is better at generating longer impressions. The main reason is that short impressions usually describe normal observations without complicated abnormalities, so the findings are enough to convey such information, and images may introduce redundant noise due to their level of detail. In contrast, for long impressions, detailed visual information can complement textual features and help the model accurately grasp complex observations.
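The length-bucketed evaluation can be sketched as follows; the bucket edges and the simple unigram-F1 scorer are illustrative stand-ins for the actual grouping and ROUGE toolkit used for Figure 3.

```python
from collections import Counter

def unigram_f1(ref, cand):
    """Toy unigram-overlap F1 used as a stand-in for ROUGE-1."""
    r, c = Counter(ref.split()), Counter(cand.split())
    o = sum((r & c).values())
    return 0.0 if o == 0 else 2 * o / (sum(r.values()) + sum(c.values()))

def score_by_length(pairs, edges=(10, 20, 30)):
    """Bucket (reference, generated) pairs by reference word count and
    average the score per bucket, mirroring the Figure 3 setup."""
    buckets = {}
    for ref, gen in pairs:
        n = len(ref.split())
        label = next((f"<={e}" for e in edges if n <= e), f">{edges[-1]}")
        buckets.setdefault(label, []).append(unigram_f1(ref, gen))
    return {k: sum(v) / len(v) for k, v in buckets.items()}
```

Comparing per-bucket averages across models then shows where a model's advantage lies, e.g. in the long-impression groups.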

Case Study
To further qualitatively investigate the effectiveness of our proposed model, we conduct a case study on the impressions generated by different models whose inputs are X-ray images and the corresponding findings. The results are shown in Figure 4, where different colors represent observations found at different locations. We observe that OURS produces better impressions than the BASE model: the impressions from our model cover almost all the key points in these two examples with the help of the corresponding image regions. On the contrary, the BASE model ignores some critical observations present in the reference impressions, such as "right basilar loculated hydropneumothorax" in the first example and "Stable mild cardiomegaly" in the second, and even generates some unrelated information (e.g., "No pneumonia" in the second case).
Related Work

Multimodal Summarization
With the growth of multimedia data, multimodal summarization has recently become a hot topic, and many works have focused on this area, whose goal is to generate a summary from multimodal data, such as text and vision (Zhu et al., 2018; Li et al., 2018; Zhu et al., 2020; Li et al., 2020; Im et al., 2021; Atri et al., 2021; Delbrouck et al., 2021). For example, Li et al. (2017) proposed to generate a textual summary from a set of asynchronous documents, images, audio, and videos via budgeted maximization of submodular functions.

Radiology Report Generation
Image captioning is a traditional task that has received extensive research interest (You et al., 2016; Aneja et al., 2018; Xu et al., 2021a). Radiology report generation can be treated as an extension of image captioning to the medical domain, aiming to describe radiology images in text (i.e., findings), and has achieved considerable improvements in recent years (Chen et al., 2020; Zhang et al., 2020a; Liu et al., 2019b, 2021b; Zhou et al., 2021; Boag et al., 2020; Pahwa et al., 2021; Jing et al., 2019; Zhang et al., 2020b; You et al., 2021; Liu et al., 2019a). Liu et al. (2021a) employed competence-based curriculum learning to improve report generation, starting from simple reports and then gradually consuming harder ones.

Radiology Impression Generation
Summarization is a fundamental text generation task in natural language processing (NLP) that has drawn sustained attention over the past decades (See et al., 2017; Liu and Lapata, 2019; Duan et al., 2019; Chen and Bansal, 2018; Lebanoff et al., 2019). Impression generation can be regarded as a special type of summarization in the medical domain, aiming to summarize the findings and generate the impression, and many methods have been proposed for this area (Gharebagh et al., 2020; Hu et al., 2021; Zhang et al., 2018; Hu et al., 2022a; Karn et al., 2022; MacAvaney et al., 2019; Zhang et al., 2020c; Delbrouck et al., 2022). MacAvaney et al. (2019) and Gharebagh et al. (2020) proposed to extract medical ontologies and then utilize a separate encoder to extract features from such critical words to improve the decoding process and thus promote AIG. Hu et al. (2021) further constructed a word graph from medical entities and the dependency tree and then utilized a GNN to extract features from the graph to guide the generation process. However, recent works in this area mainly focus on the text sections while failing to fully exploit the valuable information in the corresponding radiology images.

Conclusion
This paper proposes an anatomy-enhanced multimodal summarization framework that integrates radiology images and text to facilitate impression generation. In detail, for radiology images, we use a visual extractor to extract detailed visual features. For the findings, we first insert anatomy prompts into each sentence via keywords and rules and then apply a pre-trained encoder to distill features from the modified findings. Afterward, we employ a contrastive learning module to align the visual and textual features at the document level and use co-attention to fuse the two features at the sentence level, which are then input to the decoder to improve impression generation. Experimental results on two benchmark datasets illustrate the effectiveness of our model, especially for long impression generation, where it achieves significant improvements.

Limitations
Although our model achieves considerable improvements, as shown in Figure 3 it tends to show a slight decrease on short impression generation, which needs to be addressed in the future. In this paper, we follow previous studies and only utilize English radiology report datasets to verify the effectiveness of our proposed model, so its effectiveness in other languages remains unverified; the main reason is that most publicly available radiology report datasets center on English. In addition, our model requires relatively more parameters than models that only use the findings to generate impressions.

Figure 1: An example of a radiology report and its chest X-ray image, where different colors indicate the alignment of different sentences to the image.

Figure 2: The overall architecture of our proposed model. The green box provides the sentence anatomy prompts, while the document-level contrastive learning and sentence-level co-attention fusion modules are shown in the purple and red boxes. ①, ②, ③ indicate different pairs (i.e., an image and its corresponding findings).
Figure 3: R-1 scores of the impressions generated by different models, where OURS represents BASE+AP+DCA. Note that when the word-based impression length is longer than 20, the p-value is less than 0.05.

Figure 4: Examples of impressions generated by BASE and BASE+AP+DCA, as well as the reference impressions. Observations about the lungs, tubes, and heart are located in the red, blue, and green boxes, respectively.

Table 1: Details of the lexicon, where the left column gives the anatomy type and the right column the keywords and rules used to match sentences.

Table 2: Performance of all baselines and our model on the test sets of the OPENI and MIMIC-CXR datasets. R-1, R-2, and R-L refer to ROUGE-1, ROUGE-2, and ROUGE-L; P, R, and F-1 represent precision, recall, and F1 score.

Table 3: Comparison of our proposed model with previous studies on the test sets of OPENI and MIMIC-CXR with respect to the ROUGE metrics. CHESTXRAYBERT is regarded as a weak reference since its data processing method is not public.

As shown in Table 2, BASE-FINDINGS outperforms BASE-IMAGE, illustrating that textual features are more valuable than visual ones, because the gap between two related texts is smaller than that between vision and text.

Table 4: Results of the human evaluation. The top three rows give the comparison between BASE+AP+DCA and BASE; the bottom three rows compare BASE+AP+DCA with the reference impressions.