Boosting Radiology Report Generation by Infusing Comparison Prior

Recent transformer-based models have made significant strides in generating radiology reports from chest X-ray images. However, a prominent challenge remains: these models often lack prior knowledge, resulting in synthetic reports that mistakenly reference non-existent prior exams. This discrepancy can be attributed to a knowledge gap between radiologists and the generation models: while radiologists possess patient-specific prior information, the models receive only X-ray images at a specific time point. To tackle this issue, we propose a novel approach that leverages a rule-based labeler to extract comparison prior information from radiology reports. This extracted comparison prior is then seamlessly integrated into state-of-the-art transformer-based models, enabling them to produce more realistic and comprehensive reports. Our method is evaluated on English report datasets, namely IU X-ray and MIMIC-CXR. The results demonstrate that our approach surpasses baseline models in terms of natural language generation metrics. Notably, our model generates reports that are free from false references to non-existent prior exams, setting it apart from previous models. By addressing this limitation, our approach represents a significant step towards bridging the gap between radiologists and generation models in the domain of medical report generation.


Introduction
Analyzing radiology images and writing medical reports is an important task commonly performed during the diagnostic process. At the same time, writing a radiology report is a laborious and time-consuming job for radiologists and requires years of training to identify and describe specific abnormalities on medical images. Motivated by the success of image captioning models in deep learning, many works have proposed models for automated radiology report generation from chest X-ray images (Yuan et al., 2019; Xue et al., 2018; Jing et al., 2017; Liu et al., 2019a). Such automatic generation can alleviate the high workload of radiologists and accelerate the diagnostic process by providing a preliminary report that includes useful keywords or observations (Johnson et al., 2019; Chen et al., 2020).
Despite the relative success of recent approaches to radiology report generation from chest X-ray images (Endo et al., 2021; Johnson et al., 2019; Chen et al., 2020; Miura et al., 2020; Ramirez-Alonso et al., 2022; Nooralahzadeh et al., 2021), one of the most important problems remains unsolved: the model should be given the appropriate priors that are usually provided to radiologists. For example, radiologists are informed about the existence of previous reports and X-ray images and are encouraged to write reports by comparing the current exam to old ones to indicate the improvement, deterioration, or progression of the patient's state. These reports often contain specific words or phrases such as "compared to the previous exam", "in the interval", "referring to the prior X-ray", and so on. In this paper, we refer to these words or phrases as prior expressions. Prior expressions also appear in MIMIC-CXR (Johnson et al., 2019) and IU X-ray (Demner-Fushman et al., 2016), the medical datasets generally used to train and evaluate report generation models. A model trained on these datasets is likely to generate reports with inappropriate or misused prior expressions, consequently resulting in relatively low performance metrics.
In Table 1, we compare ground truth reports with reports generated by two recent models, R2Gen (Chen et al., 2020) and M 2 Tr (Cornia et al., 2020), with respect to prior expressions. It can be clearly seen that inappropriate priors are included in the synthetic reports. For example, the first report generated by R2Gen mentions a comparison with a previous exam, although the ground truth report does not include any prior information. The second report of M 2 Tr writes "again noted," which implies that a previous image exists, although the ground truth report has no prior expression.
Even these state-of-the-art (SOTA) models learn prior expressions improperly, as such expressions appear widely in the reports of IU X-ray and MIMIC-CXR. This is mainly due to a fundamental difference between how radiologists write reports and how models generate them. Radiologists are usually given not only prior information about the patient, such as previous exams and medical history, but also the current X-ray images, while report generation models receive only the X-ray images from a single time point. With such restricted prior information, it is impossible for the model to generate a comprehensive and insightful report as the experts do.
Based on this observation, we infuse prior information into existing models in order to reduce the knowledge gap between the generation models and radiologists. By doing so, we expect the improved model to produce more informative and practical reports. Because the existing datasets (IU X-ray and MIMIC-CXR) do not contain prior information (previous X-ray images), we derive it in a data-driven way in consultation with experienced radiologists. In particular, we focus on comparison phrases that indicate whether a specific medical report is from the first or a follow-up exam for each patient.

Related Work
Initial research (Bai and An, 2018;Liu et al., 2019b) in radiology report generation employed a simple encoder-decoder architecture where an encoder extracts key features from medical images and converts them to a latent vector, and a decoder produces the target text from the latent vector. In a typical setting, CNN (LeCun et al., 2015) is used as the encoder and LSTM (Hochreiter and Schmidhuber, 1997) is chosen as the decoder. Then, visual attention mechanisms were applied to highlight specific features of images and generate more interpretable reports (Zhang et al., 2017;Jing et al., 2017;Wang et al., 2018;Yin et al., 2019;Yuan et al., 2019). Recent studies (Lovelace and Mortazavi, 2020;Chen et al., 2020;Nooralahzadeh et al., 2021;Miura et al., 2020) explored more sophisticated architectures with transformers to obtain more complete and consistent medical reports.
On the other hand, generating medical reports can be regarded as a retrieval task, because similar sentences and a specific form of writing are repeated across most reports. Simply reusing the diagnostic text from visually similar X-ray images may yield a more consistent and accurate report than generating a whole report from scratch. Prior work showed the superiority of the retrieval-based model, which outperformed most encoder-decoder-based models. Recently, Endo et al. (2021) developed a retrieval-based model called CXR-RePaiR that adopts contrastive language-image pre-training (CLIP) (Radford et al., 2021) and achieved state-of-the-art (SOTA) performance on both existing metrics and their newly introduced ones.
As seen in Table 1, even the best models so far suffer from unexpected prior expressions. Therefore, we propose a two-step approach to address this problem: (1) build a rule-based labeler to distinguish reports with and without prior expressions, and (2) extend the original architectures (R2Gen and M 2 Tr) to take the comparison prior as input.
In the first step, we design a rule-based labeler that detects specific comparison prior expressions and classifies each report as a first exam or a follow-up exam depending on the detection. We adapt the negation and classification stages of the CheXpert labeler (Irvin et al., 2019) to design our labeler. In the next stage, we feed our prior label as input to the original SOTA architectures to generate more practical and complete reports and compare them with the original models. In this way, our model receives the comparison prior, which radiologists are usually also given in a real diagnostic situation, and is capable of generating a more comprehensive and consistent medical report.
Recently, Ramesh et al. (2022) proposed a new dataset called MIMIC-PRO, in which they detected and modified all reports containing hallucinated references to non-existent prior exams. Hallucinated references are a similar concept to the prior expressions in our paper. Ramesh et al. (2022) suggested a BioBERT-based model that paraphrases or removes sentences referring to previous reports or images, arguing that these expressions confuse the model and lead to falsely referenced sentences. With the help of experts, they built a so-called clean MIMIC-CXR test set and compared models trained on MIMIC-CXR and MIMIC-PRO. Compared to their work, we choose a different direction to alleviate the problem of comparison priors: instead of removing the comparison prior entirely, we include the prior information in the model and let it generate a more comprehensive report in an end-to-end fashion. Writing comparisons to prior exams in radiology reports is inevitable in real medical practice, and building a clean and accurate dataset from real reports is laborious. Thus, our work focuses on how to directly apply the comparison prior to existing models such as R2Gen and M 2 Tr.

Method
In this section, we explain our novel two-step approach, which integrates the comparison prior into existing models to emulate the realistic process of radiologists. First, we design a rule-based labeler that determines whether a medical report includes prior expressions by detecting specific patterns in the report (Section 3.1). The comparison prior is then incorporated into previous SOTA models (R2Gen and M 2 Tr) for the medical report generation task (Section 3.2).

Rule-based Labeler
Our rule-based labeler follows the basic structure of the CheXpert labeler (Irvin et al., 2019), which detects the presence of 14 observations in radiology reports based on fixed rules designed by experts. Accordingly, our labeler also consists of three distinct stages: mention extraction, mention classification, and mention aggregation. The labeler receives the Findings section of a radiology report as input and generates a binary output (0 or 1): a negative label (0) denotes a report without prior expressions, and a positive label (1) means that the report contains prior expressions.
Figure 1: A conceptual diagram of our approach. The report generation models (R2Gen and M 2 Tr) consist of a Visual Extractor, Encoder, and Decoder. Our key idea is to infuse comparison priors generated by our rule-based labeler into (1) Visual Embedding V and (2) Latent Representation L.

Mention extraction A mention is defined as a specific keyword that is likely to appear in prior expressions, such as "previous", "prior", "preceding", "previously", "again", "comparison", "interval", "increase", "decrease", "enlarge", and so on. In this stage, mentions are extracted from each report and marked in each sentence. Even if a sentence includes one of the designated keywords, we cannot confirm the existence of a prior expression at this step because the keyword might be used in another context. For example, the word "comparison" can appear as "with no comparison studies," which means the report does not include any prior expression.
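The mention-extraction stage can be sketched as a simple keyword scan over the report text. The keyword list below is illustrative rather than the paper's exact set, and the helper name is hypothetical:

```python
import re

# Illustrative keyword list; the paper's full set is defined by its authors.
MENTION_KEYWORDS = [
    "previous", "prior", "preceding", "previously", "again",
    "comparison", "interval", "increase", "decrease", "enlarge",
]

def extract_mentions(report: str) -> list[tuple[int, str]]:
    """Return (position, keyword) pairs for every mention found in the report.

    Note: the word-boundary match deliberately skips inflected forms such as
    "enlarged"; a production labeler would handle morphology as well.
    """
    pattern = re.compile(r"\b(" + "|".join(MENTION_KEYWORDS) + r")\b", re.IGNORECASE)
    return [(m.start(), m.group(1).lower()) for m in pattern.finditer(report)]

mentions = extract_mentions("Heart size is stable in comparison to the prior exam.")
```

Note that, as the text above stresses, extraction alone does not decide the label: "comparison" in "with no comparison studies" is still extracted here and only ruled out by the later classification stage.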
Mention classification After extracting mentions in the first stage, our labeler decides whether each mention matches a predefined prior expression. As similar expressions are repeatedly used in reports to indicate comparison with previous exams, we are able to formalize the patterns of these prior expressions into several key phrases in consultation with experienced radiologists. For instance, "compared / similar to {mention}" ensures the existence of a prior report, where "{mention}" stands for keywords such as "previous", "preceding", and "prior". "{mention} seen/identified/visualized/ ... /noted" also becomes a prior expression when "{mention}" denotes keywords such as "again" and "previously".
Mention aggregation In the last stage, the labeler simply combines the classified mentions and produces a negative label (0) or a positive label (1), where the negative and positive labels denote reports without and with prior expressions, respectively. Example outputs of the rule-based labeler are shown in Table 2, and the numbers of negative and positive exams in the IU X-ray and MIMIC-CXR datasets are shown in Table 3.
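Putting the three stages together, a minimal sketch of the labeler might look as follows. The phrase patterns and negation rules are illustrative assumptions, not the exact rules used in the paper:

```python
import re

# Illustrative mention keywords for the two phrase families described above.
MENTIONS = r"(?:previous|prior|preceding)"
TEMPORAL = r"(?:again|previously)"

POSITIVE_PATTERNS = [
    # "compared / similar to {mention}"
    re.compile(r"\b(?:compared|similar)\s+to\b.*\b" + MENTIONS + r"\b", re.IGNORECASE),
    # "{mention} seen / identified / visualized / noted"
    re.compile(r"\b" + TEMPORAL + r"\s+(?:seen|identified|visualized|noted)\b", re.IGNORECASE),
]
# Contexts in which a keyword does NOT imply a prior exam (classification stage).
NEGATION_PATTERNS = [
    re.compile(r"\bno\s+comparison\s+studies\b", re.IGNORECASE),
]

def label_report(findings: str) -> int:
    """Aggregation: return 1 if any sentence contains a prior expression, else 0."""
    for sentence in re.split(r"(?<=[.!?])\s+", findings):
        if any(p.search(sentence) for p in NEGATION_PATTERNS):
            continue  # keyword used in a non-prior context
        if any(p.search(sentence) for p in POSITIVE_PATTERNS):
            return 1  # a single prior expression makes the whole report positive
    return 0
```

Usage: `label_report("Degenerative changes are again noted.")` returns the positive label 1, while a report stating "No comparison studies available." stays negative despite containing the keyword "comparison".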

Extending Model
In this section, we explain how we integrate the comparison prior into existing models such as R2Gen and M 2 Tr to generate a more informative and comprehensive report.

Generation Process
The generation process of R2Gen and M 2 Tr can be framed as in Figure 1: input radiology images X → visual embedding V → latent representation L → output report Y. First, chest X-ray images X are given as input to the visual extractor, where X includes the frontal image X_f and the lateral image X_l such that X = {X_f, X_l}. The output of the visual extractor is the visual embedding V = {v_1, v_2, ..., v_S}, which consists of patch features v_s ∈ R^d, where d is the size of the feature vectors. Next, V passes through several transformer layers in the encoder to obtain the latent representation L = {l_1, l_2, ..., l_T} with latent feature vectors l_t ∈ R^f. Finally, the decoder generates the final output report Y from L.
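The X → V → L → Y shape flow can be traced with a minimal NumPy sketch in which the real visual extractor, encoder, and decoder are replaced by stand-in random projections. All dimensions and function bodies here are illustrative assumptions, not the actual networks:

```python
import numpy as np

rng = np.random.default_rng(0)
S, d = 49, 512   # S patch features of size d (illustrative sizes)
T, f = 49, 512   # T latent vectors of size f (illustrative sizes)

def visual_extractor(images):          # X = {X_f, X_l} -> V
    # Stand-in for the CNN: just produces a (S, d) patch-feature matrix.
    return rng.standard_normal((S, d))

def encoder(V):                        # V -> L, via transformer layers
    # Stand-in for the transformer encoder: a single linear projection.
    W = rng.standard_normal((d, f)) / np.sqrt(d)
    return V @ W

def decoder(L):                        # L -> report tokens Y
    # Stand-in for the autoregressive decoder: a fixed token list.
    return ["the", "lungs", "are", "clear", "."]

X = {"frontal": rng.standard_normal((224, 224)),
     "lateral": rng.standard_normal((224, 224))}
V = visual_extractor(X)
L = encoder(V)
Y = decoder(L)
```

The point of the sketch is only the tensor contract between stages: V has shape (S, d), L has shape (T, f), and the decoder consumes L alone.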

Infusing Comparison Prior
Table 4: Training results of the original models and models infused with prior information. The results of our approaches are shown in gray rows. All metrics are averaged over 3 runs.

The comparison prior P ∈ R is generated by our rule-based labeler and denotes a negative (0) or positive (1) label. We incorporate the comparison prior into the original data pipeline in such a way that adding the prior neither changes the architecture nor introduces any additional weights to train; otherwise, it would become hard to measure the effect of the comparison prior on the generative models. As a result, we add the prior P to both the Visual Embedding V and the Latent Representation L in the generation models shown in Figure 1. The encoder should be given the prior information so that it can generate an appropriate intermediate representation. Furthermore, we also add P to L, since the knowledge of P could be weakened after the deep transformer layers in the encoder. The decoder then generates the output report from the latent representation conditioned on P. This whole process emulates the radiologists' examination with prior exams. Therefore, our new visual embedding V_new and new latent representation L_new can be calculated as follows:

V_new = V ⊕ P,  L_new = L ⊕ P,

where ⊕ indicates element-wise summation. The strength of our method is that it is applicable to most existing transformer-based models and does not require an extra dataset or information.
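Under the assumption that the scalar prior broadcasts over every feature position (shapes here are illustrative), the element-wise infusion can be sketched as:

```python
import numpy as np

rng = np.random.default_rng(0)
V = rng.standard_normal((49, 512))   # visual embedding (illustrative shape)
L = rng.standard_normal((49, 512))   # latent representation (illustrative shape)

P = 1.0  # comparison prior from the labeler: 0.0 (no prior) or 1.0 (prior exists)

# Element-wise summation (⊕): the scalar prior broadcasts over every feature,
# so no architectural change or extra trainable weights are needed.
V_new = V + P
L_new = L + P
```

Because broadcasting adds the same constant to every entry, the same two lines apply unchanged to any transformer-based encoder-decoder, which is the portability the text claims.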

Experiment
Architecture For visual extraction, we employ pretrained Convolutional Neural Networks (CNNs) such as DenseNet121 (Huang et al., 2017) and ResNet (He et al., 2016). We empirically find that DenseNet is more effective in our generation task and therefore set it as our base visual extractor. We follow the original structures of the Meshed-Memory Transformer M 2 Tr (Cornia et al., 2020) and the Relational Memory-driven Transformer R2Gen (Chen et al., 2020) to construct our encoder and decoder.
Datasets We evaluate our proposed methods on two representative datasets widely used in medical report generation: IU X-ray (Demner-Fushman et al., 2016) and MIMIC-CXR (Johnson et al., 2019). IU X-ray is a publicly available radiology dataset that includes 7,470 chest X-ray images and 3,955 radiology reports. Each radiology report is paired with one frontal-view image and one optional lateral-view image. MIMIC-CXR is a large chest radiograph database that contains 473,057 chest X-ray images and 206,563 reports. We train our models only on intact data triples with two images (frontal and lateral) and one report (Findings section). The datasets are divided into train, validation, and test sets following the original split in Chen et al. (2020).
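The filtering for intact study triples might look like the following sketch. The record structure and field names are hypothetical; the actual MIMIC-CXR and IU X-ray loaders differ:

```python
# Hypothetical study records; real dataset loaders provide richer metadata.
studies = [
    {"frontal": "f1.png", "lateral": "l1.png", "findings": "Lungs are clear."},
    {"frontal": "f2.png", "lateral": None,     "findings": "Heart size normal."},
    {"frontal": "f3.png", "lateral": "l3.png", "findings": ""},
]

def is_intact(study: dict) -> bool:
    """Keep only studies with both views and a non-empty Findings section."""
    return bool(study["frontal"]) and bool(study["lateral"]) and bool(study["findings"])

kept = [s for s in studies if is_intact(s)]
```

Here only the first record survives: the second lacks a lateral view and the third has an empty Findings section, mirroring the "intact data triples" criterion above.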
Training Details We first generate the comparison prior of each report using the rule-based labeler. Then, we train our model with the two images and the comparison prior as input and the medical report as output. The Adam optimizer is adopted with an initial learning rate of 0.00005 for the visual extractor and 0.0001 for the encoder-decoder model, and the learning rate decays at pre-defined steps. We conducted all experiments with 3 different seeds and a batch size of 16 on an NVIDIA GeForce GTX 1080 Ti GPU. Our implementation is based on the publicly available code of Chen et al. (2020) and Nooralahzadeh et al. (2021).

Evaluation Metrics
We report general natural language generation (NLG) metrics, including BLEU (Papineni et al., 2002), CIDEr (Vedantam et al., 2015), and ROUGE-L (Lin, 2004), which are commonly used to evaluate the quality of generated text. BLEU measures the n-gram overlap between the generated text and the reference text; CIDEr computes a TF-IDF-weighted similarity over n-grams, considering both unigrams and multi-word phrases; and ROUGE-L evaluates the longest common subsequence between the generated text and the reference text. These metrics allow for a quantitative comparison of the generated reports against the ground truth and previous models, providing insight into the performance of the proposed approach.
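As a concrete illustration of ROUGE-L, the sketch below computes the LCS-based F1 score in plain Python. The whitespace tokenization and the balanced F1 weighting are simplifying assumptions; reference implementations also apply stemming and a recall-weighted F-measure:

```python
def lcs_length(a: list[str], b: list[str]) -> int:
    """Longest common subsequence length via dynamic programming."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]

def rouge_l_f1(candidate: str, reference: str) -> float:
    """ROUGE-L as LCS-based F1 over whitespace tokens (simplified)."""
    c, r = candidate.split(), reference.split()
    lcs = lcs_length(c, r)
    if lcs == 0:
        return 0.0
    prec, rec = lcs / len(c), lcs / len(r)
    return 2 * prec * rec / (prec + rec)

score = rouge_l_f1("the heart is normal in size", "heart size is normal")
```

Unlike BLEU's contiguous n-grams, the LCS rewards in-order but non-contiguous matches, which is why ROUGE-L is sensitive to sentence-level word order.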

Results
In Table 4, we present the results of our proposed approach, which infuses prior information into state-of-the-art natural language generation models, on two medical report generation datasets: IU X-ray and MIMIC-CXR. On both datasets, our approach outperforms the original models across all NLG metrics, demonstrating its efficacy in improving the quality of the generated medical reports.

On the IU X-ray dataset, our approach improves the R2Gen and M 2 Tr models by an average of 11.58% and 4.49% across all NLG metrics, respectively, compared to the original models. CIDEr shows the greatest improvement, with increases of 31.46% and 7.00% for R2Gen and M 2 Tr. This indicates that, as measured by CIDEr, our approach generates more diverse and relevant reports that correlate better with human judgments of quality than other metrics.

On the MIMIC-CXR dataset, our approach improves the R2Gen and M 2 Tr models by 8.40% and 9.62% across all NLG metrics, respectively, compared to the original models. The improvement is most noticeable in RG-L, with increases of 8.27% for R2Gen and 11.83% for M 2 Tr. This suggests that our method produces reports whose sentence structure matches the references more closely, which is especially important in medical reports, where language errors can have serious consequences. We find that the highest-order n-grams (i.e., n=3, 4) achieve the greatest improvements, indicating that incorporating external prior information is especially useful for generating more fluent and informative sentences, which typically contain longer phrases and more complex structures. Overall, our findings show that incorporating external prior information can improve the performance of existing NLG models for medical report generation, resulting in more informative and accurate medical reports.
By instilling additional domain-specific knowledge into the models, we are able to generate more accurate and informative reports without requiring extra trainable weights or additional training data.

Analysis
In this section, we compare the ground truth reports, the synthetic reports created by our proposed model, and those of two previously published models, R2Gen and M 2 Tr. We want to see how effective our model is at generating concise and accurate reports with no irrelevant or false priors. Table 5 in the Appendix shows example reports generated by each model alongside the ground truth for the same radiology image. The first two rows compare reports generated by R2Gen and by R2Gen with prior infusion (our model). As can be seen, R2Gen generates false prior expressions such as "compared to prior examination", "unchanged from prior", and "again unchanged", which refer to non-existent prior exams. In contrast, our model generates more concise and accurate reports with no prior expressions, thereby achieving higher performance on NLG metrics. Similarly, the last two rows of Table 5 compare reports generated by M 2 Tr and by our proposed model. M 2 Tr generates reports with false prior expressions such as "present on the previous exam" and "again noted", while our model does not contain any comparison phrases. Moreover, a report that includes prior expressions tends to be longer than one without them, because comparison requires additional explanation. However, the report generation model is not actually given any previous exams for comparison, which means a report with prior expressions delivers irrelevant or wrong extra information. By conditioning the generation on priors, our models can directly control these phrases. Overall, the synthetic reports generated by our proposed model are more concise and accurate than those generated by R2Gen and M 2 Tr, as evidenced by the higher performance on NLG metrics: they avoid irrelevant or false prior expressions and contain only relevant and accurate information.
These succinct and precise reports generated by our model will effectively assist radiologists in practice.

Conclusion
In this work, we proposed a novel approach to generating medical reports from chest X-ray images by reducing the gap between radiologists' knowledge and the generation model's lack of prior information. Specifically, we developed a rule-based labeler to extract comparison priors from the radiology reports in the IU X-ray and MIMIC-CXR datasets, which were then incorporated into state-of-the-art models for conditional report generation. Our approach emulates the realistic diagnostic process of radiologists, who have access to patients' prior information. Experimental results show that our method outperforms previous state-of-the-art models in terms of NLG metrics and significantly reduces the number of falsely referenced prior exams. Our analysis shows that incorporating comparison priors results in more accurate and concise reports and has the potential to improve the quality and efficiency of medical report generation for chest X-ray images, ultimately benefiting healthcare professionals and patients. Furthermore, our work highlights the potential of generating medical reports in an end-to-end fashion, should a dataset including all previous exams become available in the near future.
Table 5: Ground truth reports from IU X-ray (first column), reports generated by R2Gen and M 2 Tr (second column), and reports generated by our model (third column). Prior expressions are written in bold.