Contrastive Attention for Automatic Chest X-ray Report Generation

Recently, chest X-ray report generation, which aims to automatically generate descriptions of given chest X-ray images, has received growing research interest. The key challenge of chest X-ray report generation is to accurately capture and describe the abnormal regions. In most cases, the normal regions dominate the entire chest X-ray image, and the corresponding descriptions of these normal regions dominate the final report. Due to such data bias, learning-based models may fail to attend to abnormal regions. In this work, to effectively capture and describe abnormal regions, we propose the Contrastive Attention (CA) model. Instead of solely focusing on the current input image, the CA model compares the current input image with normal images to distill the contrastive information. The acquired contrastive information can better represent the visual features of abnormal regions. According to the experiments on the public IU-X-ray and MIMIC-CXR datasets, incorporating our CA into several existing models can boost their performance across most metrics. In addition, according to the analysis, the CA model can help existing models better attend to the abnormal regions and provide more accurate descriptions, which are crucial for an interpretable diagnosis. Specifically, we achieve state-of-the-art results on the two public datasets.


Introduction
A medical report is a paragraph containing multiple sentences that describe both the normal and abnormal regions in the chest X-ray image. Chest X-ray images and their corresponding reports are widely used in clinical diagnosis (Delrue et al., 2011). However, writing medical reports requires particular domain knowledge (Goergen et al., 2013), and only experienced radiologists can accurately interpret chest X-ray images and note down corresponding findings in a coherent manner. An automatic chest X-ray report generation system (Jing et al., 2018, 2019; Liu et al., 2019b, 2021) can reduce the workload of radiologists and is in urgent need (Brady et al., 2012; Delrue et al., 2011).

Figure 1: By contrasting current input images and known normal images, it could be easier to capture the suspicious abnormal regions (red bounding boxes). The images with green boxes are normal.
In recent years, several deep learning-based methods have been proposed for automatic chest X-ray report generation (Jing et al., 2018, 2019; Li et al., 2018; Chen et al., 2020c; Liu et al., 2021); however, there are serious data deviation problems in the medical report corpus. For example, 1) normal images dominate the dataset over abnormal ones (Shin et al., 2016); 2) given an input image, the normal regions usually dominate the image and their descriptions dominate the medical report (Jing et al., 2019; Liu et al., 2021). Such data deviation may prevent learning-based methods from capturing the rare but important abnormal regions (e.g., lesion regions). As a result, learning-based models tend to generate plausible general reports with no prominent abnormal narratives (Jing et al., 2019; Li et al., 2018; Yuan et al., 2019). In clinical practice, accurate detection and depiction of abnormalities are more helpful in disease diagnosis and treatment planning. Therefore, existing learning-based methods may fail to assist radiologists in clinical decision-making (Goergen et al., 2013).
To capture the abnormal regions in a chest X-ray image, a natural intuition is to compare it with normal images and identify the differences. As Figure 1 shows, given known normal images, it could be easier for models to learn and identify the suspicious abnormal regions (red bounding boxes). Therefore, we propose the Contrastive Attention (CA) model (see Figure 2), which is based on the attention mechanism (Bahdanau et al., 2015a; Vaswani et al., 2017). The CA model can be easily integrated into existing learning-based methods and enables them to better capture and describe the abnormalities. We build the CA model in the following three steps: 1) we first build a set of normal images, all extracted from the training dataset; 2) we introduce the Aggregate Attention to prioritize normal images that are closer to the current input image and filter out normal images that appear different; 3) we further introduce the Differentiate Attention to distill the common features between the input image and the refined normal images. The acquired common features are then subtracted from the visual features of the input image. In this manner, the residual visual features of the input image are treated as the contrastive information that captures the differentiating properties between the input image and normal images.
We evaluate our approach on two datasets: the widely-used benchmark IU-X-ray dataset and the recently released large-scale MIMIC-CXR dataset (Johnson et al., 2019). On both automatic metrics and human evaluations, existing methods equipped with the proposed Contrastive Attention model outperform their baselines, which supports our argument and demonstrates the effectiveness of our approach.
Overall, the main contributions of our work are:
• We propose the Contrastive Attention model to capture and depict the abnormalities by comparing the input image with known normal images. The proposed approach can be easily incorporated into existing models to improve their performance.
• We evaluate our approach on two public datasets. After equipping our Contrastive Attention model, the baselines achieve up to 14% gain and 17% gain in BLEU-4 on the MIMIC-CXR and IU-X-ray datasets, respectively.
• More encouragingly, we achieve state-of-the-art performance on the two public datasets, i.e., IU-X-ray and MIMIC-CXR. Moreover, we invite professional clinicians to conduct a human evaluation to measure the effectiveness of our approach in terms of its usefulness for clinical practice.

Related Works
In this section, we describe the related works in three categories: 1) Image Captioning; 2) Chest X-ray Report Generation; and 3) Contrastive Learning.
Image Captioning Image captioning aims to understand given images and generate corresponding descriptive sentences (Chen et al., 2015). The task combines image understanding and language generation. In recent years, a large number of encoder-decoder based neural systems have been proposed for image captioning (Cornia et al., 2020; Pan et al., 2020; Pei et al., 2019; Venugopalan et al., 2015; Vinyals et al., 2015; Rennie et al., 2017; Lu et al., 2017; Anderson et al., 2018; Liu et al., 2018, 2019a). However, the sentence generated by image captioning is usually short and describes only the most prominent visual contents, which cannot fully represent the rich feature information of the image. Visual paragraph generation (Krause et al., 2017), which aims to generate long and coherent reports or stories to describe visual contents, has recently attracted increasing research interest. However, due to the data bias in the medical domain, the widely-used hierarchical LSTM in visual paragraph generation does not perform very well in automatic chest X-ray report generation and tends to produce normal reports (Xue et al., 2018; Li et al., 2018; Jing et al., 2019).
Chest X-ray Report Generation Inspired by the success of deep learning models on image captioning, many encoder-decoder based frameworks have been proposed (Jing et al., 2018, 2019; Liu et al., 2019b, 2021; Yuan et al., 2019; Xue et al., 2018; Li et al., 2018; Zhang et al., 2020a; Kurisinkel et al., 2021; Ni et al., 2020; Nishino et al., 2020; Chen et al., 2020c; Wang et al., 2021; Boag et al., 2019; Syeda-Mahmood et al., 2020; Yang et al., 2020; Lovelace and Mortazavi, 2020; Zhang et al., 2020b; Miura et al., 2021). Specifically, Jing et al. (2018) proposed a hierarchical LSTM with the attention mechanism (Bahdanau et al., 2015b; You et al., 2016). Yuan et al. (2019) further incorporated medical concepts to enrich the decoder with descriptive semantics. Xue et al. (2018) proposed a multimodal recurrent model containing an iterative decoder with visual attention to improve the coherence between sentences. Miura et al. (2021) proposed an Exact Entity Match Reward and an Entailing Entity Match Reward to improve the factual completeness and consistency of the generated reports, resulting in significant improvements in clinical accuracy. Liu et al. (2021) introduced reinforcement learning and a medical knowledge graph for chest X-ray report generation. However, some errors occur in the reports generated by existing methods, such as duplicate reports and inexact descriptions (Xue et al., 2018; Yuan et al., 2019).

Contrastive Learning
The work most related to our contrastive attention mechanism is in the field of contrastive learning (Chen et al., 2020a; He et al., 2020; Hénaff et al., 2020; Grill et al., 2020; Chen et al., 2020b; Radford et al., 2021; Jia et al., 2021), which learns similar/dissimilar image representations from data that are organized into similar/dissimilar image pairs. In image captioning, Dai and Lin (2017) introduced contrastive learning to extract contrastive information from additional images into captioning models to improve the distinctiveness of the generated captions. Moreover, Song et al. (2018) and Duan et al. (2019) proposed contrastive attention mechanisms for person re-identification and summarization, respectively. Song et al. (2018) utilized a pre-provided person and background segmentation to learn features contrastively from the body and background regions, so that the two can be easily discriminated. Duan et al. (2019) contrastively attended to relevant and irrelevant parts of the source sentence for abstractive sentence summarization. In this work, we leverage the contrastive information between the input image and normal images to help models efficiently capture and describe the abnormalities for automatic chest X-ray report generation.

Approach
In Section 3.1, we formulate the automatic chest X-ray report generation problem; in Section 3.2, we introduce the Contrastive Attention in detail.

Problem Formulation
Given a chest X-ray image I, the goal of automatic chest X-ray report generation is to generate a coherent report R that addresses the findings of I. Most existing methods (Jing et al., 2018) adopt encoder-decoder frameworks, which normally include an image encoder (He et al., 2016) and a report decoder (Krause et al., 2017). The encoder-decoder framework can be formulated as:

$$V = \text{Encoder}(I), \qquad R = \text{Decoder}(V), \tag{1}$$

where the image encoder, e.g., ResNet (He et al., 2016), generates the visual features $V \in \mathbb{R}^{N_I \times d}$, and the report decoder, e.g., the Hierarchical LSTM (Krause et al., 2017), generates the report R from V. Specifically, in the Hierarchical LSTM, a paragraph-level LSTM first generates topic vectors to represent the sentences, then a sentence-level LSTM takes each topic vector as input to generate the corresponding sentence. As a result, the Hierarchical LSTM can better model a paragraph of multiple sentences (a report) than a single LSTM (Jing et al., 2018; Krause et al., 2017). Given the ground truth medical report provided by radiologists for the input chest X-ray image, existing methods train the encoder-decoder framework by minimizing a training loss, e.g., the cross-entropy loss. Due to limited space, please refer to Huang et al. (2019); Jing et al. (2018) for a detailed introduction.
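To make the decoder side concrete, below is a minimal PyTorch sketch of such a hierarchical decoder, assuming a fixed number of sentences and omitting word embeddings and teacher forcing; all names and dimensions (HierarchicalDecoder, max_sents, max_words) are illustrative rather than the authors' implementation.

```python
import torch
import torch.nn as nn

class HierarchicalDecoder(nn.Module):
    def __init__(self, vocab_size, d=512, max_sents=6):
        super().__init__()
        self.max_sents = max_sents
        # Paragraph-level LSTM: consumes the global image feature and emits
        # one topic vector per sentence.
        self.para_lstm = nn.LSTMCell(d, d)
        # Sentence-level LSTM: unrolls each topic vector into a word sequence.
        self.sent_lstm = nn.LSTM(d, d, batch_first=True)
        self.out = nn.Linear(d, vocab_size)

    def forward(self, v_global, max_words=20):
        b, d = v_global.shape                 # (batch, d) global visual feature
        h = v_global.new_zeros(b, d)
        c = v_global.new_zeros(b, d)
        sentence_logits = []
        for _ in range(self.max_sents):
            h, c = self.para_lstm(v_global, (h, c))   # topic vector = h
            # Feed the topic at every word step; word embeddings and teacher
            # forcing are omitted to keep the sketch short.
            topic = h.unsqueeze(1).repeat(1, max_words, 1)
            words, _ = self.sent_lstm(topic)
            sentence_logits.append(self.out(words))   # (b, max_words, vocab)
        return sentence_logits

decoder = HierarchicalDecoder(vocab_size=4000)
logits = decoder(torch.randn(2, 512))   # 6 tensors of shape (2, 20, 4000)
```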
In this paper, we adopt ResNet-50 (He et al., 2016) to extract the visual features, i.e., the output of the last convolutional layer is used as the visual information:

$$V = \text{ResNet}(I)\, W_I, \tag{2}$$

where I denotes the input image, $\text{ResNet}(I) \in \mathbb{R}^{49 \times 2048}$, and $W_I \in \mathbb{R}^{2048 \times d}$ reduces the dimension from 2,048 to d. Specifically, d is set to 512, resulting in $V \in \mathbb{R}^{N_I \times d}$ with $N_I = 49$ and $d = 512$. Moreover, we apply average pooling to obtain the global visual feature:

$$\hat{v} = \frac{1}{N_I} \sum_{i=1}^{N_I} v_i. \tag{3}$$

After the above calculations, we obtain the visual feature vectors $V = \{v_1, v_2, \dots, v_{N_I}\} \in \mathbb{R}^{N_I \times d}$ and the global visual feature vector $\hat{v} \in \mathbb{R}^{1 \times d}$.

Figure 2: Illustration of our proposed Contrastive Attention, which consists of the Aggregate Attention and the Differentiate Attention. In particular, the Aggregate Attention is devoted to finding the normal images that are closest to the current input image in the normality pool. The Differentiate Attention is devoted to summarizing the common information between the input image and the closest normal images and subtracting it from the input image to capture the differentiating properties between the input image and the normal images.
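As a reference point, the following is a hedged sketch of this feature extraction (Eqs. (2)-(3)), assuming a standard torchvision ResNet-50 backbone and a 224×224 input; the CheXpert fine-tuning mentioned later in the settings is omitted.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet50

class Encoder(nn.Module):
    def __init__(self, d=512):
        super().__init__()
        backbone = resnet50(pretrained=True)  # ImageNet weights
        # Keep everything up to the last conv block; drop the global
        # pooling and the classification head.
        self.cnn = nn.Sequential(*list(backbone.children())[:-2])
        self.proj = nn.Linear(2048, d)  # plays the role of W_I in Eq. (2)

    def forward(self, image):                  # image: (b, 3, 224, 224)
        fmap = self.cnn(image)                 # (b, 2048, 7, 7)
        V = fmap.flatten(2).transpose(1, 2)    # (b, 49, 2048), N_I = 49
        V = self.proj(V)                       # (b, 49, d)
        v_hat = V.mean(dim=1)                  # Eq. (3): global feature (b, d)
        return V, v_hat
```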

Contrastive Attention
In this paper, we propose the Contrastive Attention to enable models to capture the differentiating properties between the input image and normal images. To this end, we first collect a normality pool $P = \{\hat{v}^{\text{Normal}}_1, \dots, \hat{v}^{\text{Normal}}_{N_P}\} \in \mathbb{R}^{N_P \times d}$, which consists of $N_P = 1{,}000$ normal images randomly extracted from the training dataset, where $\hat{v}^{\text{Normal}}_i \in \mathbb{R}^{1 \times d}$ denotes the global visual feature of the i-th extracted normal image. Then, as shown in Figure 2, the proposed Contrastive Attention introduces the Aggregate Attention and the Differentiate Attention, which aim to obtain the contrastive information between the input image $\hat{v}$ and the normality pool P.
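A minimal sketch of how such a normality pool might be built, assuming hypothetical helpers `train_images` (an iterable of image tensors) and `is_normal` (a predicate derived from the reports), together with the `Encoder` sketch above:

```python
import random
import torch

@torch.no_grad()
def build_normality_pool(encoder, train_images, is_normal, n_pool=1000):
    # Keep only images whose reports contain no abnormal findings.
    normal = [img for img in train_images if is_normal(img)]
    sampled = random.sample(normal, n_pool)    # random extraction, N_P = 1,000
    feats = [encoder(img.unsqueeze(0))[1] for img in sampled]  # global features
    return torch.cat(feats, dim=0)             # P: (N_P, d)
```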
Aggregate Attention Since the images in the normality pool P are all normal, there is no ranking order among them, so it is natural to treat all normal images equally in capturing the contrastive information. However, as shown in Figure 3, there are many noisy images in the normality pool (see the purple boxes in Figure 3). For example, some normal images have orientations or rotation angles different from the input image, and we cannot directly compare these images with the input image. Such noisy images introduce noisy information and prevent the Contrastive Attention from capturing accurate abnormal regions efficiently. Motivated by these observations, we introduce the Aggregate Attention to increase the weights of normal images that are close to the current input image and lower the weights of images that are not, thereby improving the contrasting process. To implement the Aggregate Attention, we utilize the dot-product attention (Vaswani et al., 2017), which is defined as:

$$M = (x W_x)(y W_y)^\top, \qquad \text{Att}(x, y) = \text{softmax}(M)\, y, \tag{4}$$

where $W_x, W_y \in \mathbb{R}^{d \times d}$ are learnable parameters. Given $x \in \mathbb{R}^{N_x \times d}$ and $y \in \mathbb{R}^{N_y \times d}$, the acquired M is of shape $N_x \times N_y$; the softmax is conducted on each row of M, resulting in $\text{Att}(x, y) \in \mathbb{R}^{N_x \times d}$. We then apply Eq. (4) to $\hat{v} \in \mathbb{R}^{1 \times d}$ and $P \in \mathbb{R}^{N_P \times d}$:

$$\hat{v}^{\text{Closest}} = \text{Att}(\hat{v}, P). \tag{5}$$
The attention mechanism computes the similarity between $\hat{v}$ and each $\hat{v}^{\text{Normal}}_i$ in P, and $\text{Att}(\hat{v}, P) \in \mathbb{R}^{1 \times d}$ is the attended vector for $\hat{v}$. In this way, we can increase the weights of normal images that are similar to the current input image and lower the weights of those that are dissimilar.
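A compact PyTorch sketch of Eqs. (4)-(5); the module name `Att` and the single-query usage below are illustrative:

```python
import torch
import torch.nn as nn

class Att(nn.Module):
    """Dot-product attention of Eq. (4)."""
    def __init__(self, d=512):
        super().__init__()
        self.W_x = nn.Linear(d, d, bias=False)
        self.W_y = nn.Linear(d, d, bias=False)

    def forward(self, x, y):
        # x: (N_x, d), y: (N_y, d)
        M = self.W_x(x) @ self.W_y(y).t()   # (N_x, N_y) similarity scores
        A = torch.softmax(M, dim=-1)        # row-wise softmax
        return A @ y                        # (N_x, d) attended vectors

# Eq. (5): attend from the global image feature over the normality pool.
att = Att()
v_hat = torch.randn(1, 512)     # global feature of the input image
P = torch.randn(1000, 512)      # normality pool
v_closest = att(v_hat, P)       # (1, 512)
```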
Moreover, following Lin et al. (2017), we repeat the Aggregate Attention n times with different learnable attention weights to further promote the performance of our approach:

$$P' = [\text{Att}_1(\hat{v}, P); \text{Att}_2(\hat{v}, P); \dots; \text{Att}_n(\hat{v}, P)] \in \mathbb{R}^{n \times d}, \tag{6}$$

where $[\cdot\,; \cdot]$ stands for the concatenation operation.
In particular, for chest X-ray images, the result $\text{Att}(\hat{v}, P)$ in Eq. (5) corresponds to the normal images that are closest to the entire input image. However, a single $\text{Att}(\hat{v}, P)$ cannot capture those normal images that are closest to the input image only in a specific part rather than the entire image. Fortunately, by repeating the function $\text{Att}(\hat{v}, P)$ in Eq. (5) n times, our Aggregate Attention can efficiently find the closest normal images from multiple aspects.

Differentiate Attention To learn the contrastive information between the input image and the closest normal images, we first attempt to capture their similarity, i.e., the common information, and then remove this similar portion from the input image to obtain the contrastive information. To this end, we introduce the Differentiate Attention. The first step is learning to summarize the common information $v_c \in \mathbb{R}^{1 \times d}$ between the current input image $\hat{v} \in \mathbb{R}^{1 \times d}$ and the closest normal images $P' \in \mathbb{R}^{n \times d}$. In implementation, we employ the same dot-product attention mechanism as in Eq. (4), used as self-attention, followed by an average pooling operation:

$$v_c = \text{AvgPool}\left(\text{Att}([\hat{v}; P'], [\hat{v}; P'])\right), \tag{7}$$

where $[\cdot\,; \cdot]$ denotes the row-wise concatenation operation, $[\hat{v}; P']$ is of shape $(n+1) \times d$, and the $\text{Att}([\hat{v}; P'], [\hat{v}; P'])$ function outputs a matrix of shape $(n+1) \times d$.
In this way, via such a self-attention mechanism, we exploit the similarity between P' and $\hat{v}$ to capture the significant common information. Next, to obtain the contrastive information $v_d \in \mathbb{R}^{1 \times d}$, we remove (i.e., subtract) the common information $v_c \in \mathbb{R}^{1 \times d}$ from the input image $\hat{v} \in \mathbb{R}^{1 \times d}$:

$$v_d = \hat{v} - v_c. \tag{8}$$

At last, we update the original image features, i.e., $\hat{v} \in \mathbb{R}^{1 \times d}$ and $V$:

$$\hat{v} \leftarrow \text{ReLU}([\hat{v}, v_d]\, W), \qquad v_i \leftarrow \text{ReLU}([v_i, v_d]\, W), \tag{9}$$

where $[\cdot\,, \cdot]$ denotes concatenation along the feature dimension, $\text{ReLU}(\cdot)$ represents the ReLU activation function and $W \in \mathbb{R}^{2d \times d}$ is the matrix for linear transformation. The resulting $\hat{v}$ and $V = \{v_1, v_2, \dots, v_{N_I}\}$ are used to replace the original image features in Eq. (1) and are then fed into existing models to generate coherent reports.
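Putting the pieces together, the following is a hedged sketch of the full Contrastive Attention (Eqs. (6)-(9)), reusing the `Att` module from the sketch above; it is a schematic of the described computation, not the authors' released code.

```python
import torch
import torch.nn as nn

class ContrastiveAttention(nn.Module):
    def __init__(self, d=512, n=6):
        super().__init__()
        # n Aggregate Attention heads with separate weights (Eq. (6)).
        self.agg_heads = nn.ModuleList([Att(d) for _ in range(n)])
        self.self_att = Att(d)          # self-attention for Eq. (7)
        self.W = nn.Linear(2 * d, d)    # fusion matrix W in Eq. (9)

    def forward(self, v_hat, V, pool):
        # v_hat: (1, d) global feature; V: (N_I, d) patch features;
        # pool: (N_P, d) normality pool.
        closest = torch.cat([h(v_hat, pool) for h in self.agg_heads], dim=0)  # P': (n, d)
        joint = torch.cat([v_hat, closest], dim=0)       # [v_hat; P']: (n+1, d)
        v_c = self.self_att(joint, joint).mean(dim=0, keepdim=True)  # Eq. (7)
        v_d = v_hat - v_c                                # Eq. (8): contrastive info
        # Eq. (9): fuse the contrastive information into all features.
        v_hat_new = torch.relu(self.W(torch.cat([v_hat, v_d], dim=-1)))
        V_new = torch.relu(self.W(torch.cat([V, v_d.expand_as(V)], dim=-1)))
        return v_hat_new, V_new

ca = ContrastiveAttention()
v_hat_new, V_new = ca(torch.randn(1, 512), torch.randn(49, 512), torch.randn(1000, 512))
```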
In our subsequent analysis, we show that the contrastive features indeed focus on the abnormal regions and provide a better starting point for downstream models.

Experiments

Datasets We evaluate our approach on two public datasets:
• IU-X-ray is a widely-used benchmark dataset. There are 70%, 10% and 20% of the instances in the training set, validation set and test set, respectively.
• MIMIC-CXR is the recently released largest dataset to date and consists of 377,110 chest X-ray images and 227,835 reports from 64,588 patients. Following Chen et al. (2020c) and Liu et al. (2021), we use the official splits to report our results: there are 368,960 images in the training set, 2,991 in the validation set and 5,159 in the test set. We convert all tokens of the reports to lower case and remove tokens whose frequency of occurrence in the training set is less than 10, resulting in a vocabulary of around 4k words.
Baselines We experiment with two lines of baselines, originally designed for image captioning and chest X-ray report generation, respectively.
Settings For our Contrastive Attention model, the model size d is set to 512. Based on the average performance on the validation set, the n in the Aggregate Attention is set to 6. For the normality pool, we randomly extract 1,000 normal images, i.e., $N_P = 1{,}000$, from the training set of each of the two datasets. To re-implement the baselines, following previous works, we adopt the ResNet-50 (He et al., 2016), pretrained on ImageNet and fine-tuned on the CheXpert dataset (Irvin et al., 2019), to extract the patch visual features, where each feature has dimension 2,048 and is projected to 512. Besides, we utilize the paired images of a patient as the input for IU-X-ray and a single image as the input for MIMIC-CXR to ensure consistency with the experimental settings of previous works (Chen et al., 2020c). For all baselines, since our focus is to provide explicit abnormal region features, which tend to improve existing baselines, we keep the inner structure of the baselines untouched and preserve the original parameter settings and training strategies.

Table 1: Performance of automatic evaluations on the test sets of the MIMIC-CXR and IU-X-ray datasets. † denotes our own implementation. B-n, M and R-L are short for BLEU-n, METEOR and ROUGE-L, respectively. Higher is better in all columns. In this paper, red colored numbers denote the best results across all approaches in the table. As we can see, most baseline models enjoy a comfortable improvement with our approach.

Table 2: Comparison with existing state-of-the-art methods on the test sets of the MIMIC-CXR and IU-X-ray datasets. As we can see, we achieve state-of-the-art performance on the major metrics on the two datasets.

Main Evaluation
Metrics We first perform the automatic evaluation to conduct a fair comparison. To measure performance, we adopt the widely-used evaluation toolkit (Chen et al., 2015) to calculate the standard metrics: BLEU (Papineni et al., 2002), METEOR (Banerjee and Lavie, 2005) and ROUGE-L (Lin, 2004).
Specifically, BLEU and METEOR were originally designed for machine translation evaluation, while ROUGE was originally proposed for the automatic evaluation of text summarization.
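For reference, a hedged sketch of computing these metrics, assuming the pycocoevalcap package (a common Python port of the toolkit of Chen et al. (2015)); note that METEOR additionally requires a Java runtime:

```python
from pycocoevalcap.bleu.bleu import Bleu
from pycocoevalcap.meteor.meteor import Meteor
from pycocoevalcap.rouge.rouge import Rouge

def evaluate(gts, res):
    # gts/res: dict mapping an image id to a list of reference/generated reports.
    scores = {}
    bleu, _ = Bleu(4).compute_score(gts, res)          # BLEU-1 ... BLEU-4
    for i, b in enumerate(bleu, start=1):
        scores[f"BLEU-{i}"] = b
    scores["METEOR"], _ = Meteor().compute_score(gts, res)
    scores["ROUGE-L"], _ = Rouge().compute_score(gts, res)
    return scores

gts = {"img1": ["the heart is within normal limits in size ."]}
res = {"img1": ["heart size is normal ."]}
print(evaluate(gts, res))
```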

Results
The results on the test sets of the MIMIC-CXR and IU-X-ray datasets are reported in Table 1. As we can see, our Contrastive Attention successfully boosts the baselines with improvements of up to 14% and 17% in terms of BLEU-4 score on MIMIC-CXR and IU-X-ray, respectively, where Setting (g) achieves the greatest improvements. The results prove the effectiveness and generalization capabilities of our approach across a wide range of models. Moreover, in Table 2, we choose five competitive models, including the current state-of-the-art models, i.e., SentSAT + KG (Zhang et al., 2020a) and R2Gen (Chen et al., 2020c), for comparison. For these competitive models, we directly report the results from the original papers. Table 2 shows that when our approach is applied to the Multi-Attention, we outperform these existing state-of-the-art models on the major metrics on the IU-X-ray and MIMIC-CXR datasets, which further proves the effectiveness of our Contrastive Attention.

Table 3: Results of human evaluation on the MIMIC-CXR and IU-X-ray datasets for comparing our method with baselines in terms of the fluency of the generated reports, the comprehensiveness of the generated true abnormalities and the faithfulness to the ground truth reports. All values are reported in percentage (%).

Clinical Efficacy
Metrics The metrics used in Table 1 measure the match between the generated reports and the ground truth reports, but are not specialized for the abnormalities in the reports. Therefore, to measure the accuracy of descriptions of clinical abnormalities, we follow Chen et al. (2020c) and adopt the CheXpert labeler (Irvin et al., 2019), which labels a given report with respect to 14 categories related to thoracic diseases and support devices, to further report the clinical efficacy metrics. As a result, we can calculate the clinical efficacy scores by comparing the generated reports with the ground truth reports over the 14 categories, producing Precision, Recall and F1 scores.
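A sketch of this computation, assuming the CheXpert labeler has already been run offline and has produced binary label matrices of shape (num_reports, 14) for the generated and ground truth reports; the micro-averaging choice below is an assumption:

```python
import numpy as np
from sklearn.metrics import precision_recall_fscore_support

def clinical_efficacy(gen_labels: np.ndarray, gt_labels: np.ndarray):
    # gen_labels, gt_labels: (num_reports, 14) binary matrices from the
    # CheXpert labeler (1 = the category is positively mentioned).
    p, r, f1, _ = precision_recall_fscore_support(
        gt_labels.reshape(-1), gen_labels.reshape(-1),
        average="binary", zero_division=0)
    return {"Precision": p, "Recall": r, "F1": f1}

# Toy example: 2 reports, 14 categories each.
gt = np.random.randint(0, 2, size=(2, 14))
print(clinical_efficacy(gt, gt))  # a perfect match gives P = R = F1 = 1.0
```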

Results
The results are shown in Table 4. As we can see, our approach boosts the performance of the baselines under all clinical efficacy metrics. The results prove our arguments and the effectiveness of our approach in helping the baselines correctly capture and depict the abnormalities. Specifically, our Multi-Attention w/ CA outperforms R2Gen (Chen et al., 2020c) by relative margins of 6%, 9% and 10% in terms of Precision, Recall and F1 scores, respectively. The superior clinical efficacy scores, which measure the accuracy of descriptions of clinical abnormalities, demonstrate that our approach can help existing models produce higher quality descriptions of clinical abnormalities.

Human Evaluation
Metrics For medical-related tasks, it is important to know (1) on what fraction of images with abnormalities the system did not mention the abnormality, and (2) on what fraction of images the system described an abnormality that does not exist according to doctors. To this end, we randomly select 200 samples from the IU-X-ray and MIMIC-CXR test sets, 100 samples from each dataset. Specifically, it is important to generate accurate reports (faithfulness) with comprehensive true abnormalities (comprehensiveness), and it is unacceptable to generate repeated sentences (fluency). Therefore, we invite several professional clinicians to compare our approach and the baselines independently and evaluate the perceptual quality, including the fluency of the generated reports, the comprehensiveness of the generated true abnormalities and the faithfulness to the ground truth reports. The clinicians are unaware of which model generates each report.

Results
To conduct the human evaluation, we select a representative chest X-ray report generation baseline, HLSTM (Krause et al., 2017), and a competitive one, Multi-Attention (Huang et al., 2019). The results in Table 3 show that our approach enjoys an obvious advantage in all three aspects, i.e., fluency, comprehensiveness and faithfulness, meaning that the reports generated by the "Baseline w/ Contrastive Attention" models are of higher clinical quality, which also proves the advantage of our approach in clinical practice. In particular, with our proposed Contrastive Attention, the winning chances of the models increase by a maximum of 54 − 13 = 41 points and 71 − 10 = 61 points in terms of the comprehensiveness metric on the MIMIC-CXR and IU-X-ray datasets, respectively. This demonstrates the effectiveness of our approach in helping existing baselines generate more accurate abnormality descriptions, and thus improves the usefulness of the models in assisting radiologists in clinical decision-making and reducing their workload.

Table 5: Quantitative analysis of our Contrastive Attention, which includes the Differentiate Attention (DA) and the Aggregate Attention (AA). We conduct the analysis on a widely-used baseline model, HLSTM (Krause et al., 2017), and a competitive baseline model, Multi-Attention (Huang et al., 2019).
Overall From the results of the automatic and human evaluations, we can see that our proposed Contrastive Attention provides a solid basis for describing chest X-ray images, especially the abnormalities. As a result, our approach successfully boosts the baselines and achieves new state-of-the-art results on the MIMIC-CXR and IU-X-ray datasets, which verifies the effectiveness of the proposed approach and indicates that our approach is robust to variations in model structures, hyper-parameters (e.g., learning rate and batch size), and learning paradigms.

Analysis
We conduct analysis on the benchmark IU-X-ray dataset to better understand our proposed approach.

Quantitative Analysis
We conduct the quantitative analysis on two representative models, i.e., HLSTM and Multi-Attention, to evaluate the contribution of each component.

Effect of Contrastive Attention Our Contrastive Attention consists of the Differentiate Attention (DA) and the Aggregate Attention (AA). As shown in Table 5(b), the DA improves the performance over all metrics, especially for the HLSTM (Krause et al., 2017), which does not incorporate an attention mechanism to make more efficient use of the image features. We can see that a gain of up to 5% in BLEU-4 score makes the "HLSTM w/ DA" an equally competitive model as the "HLSTM+att+Dual" model in Table 1(h). This indicates that the contrastive information extracted by the DA contains sufficiently accurate abnormal information, which is vital in improving the performance of chest X-ray report generation. In other words, our approach can ease the design of neural models for the task. The AA is devoted to identifying the closest normal images and filtering out the noisy images, which improves the contrasting process in the DA. As expected, Table 5(c) shows that the AA consistently boosts the performance of the baselines under all metrics, which further demonstrates the effectiveness of our approach.
Effect of n In this section, we analyze the effect of n in our Aggregate Attention (see Eq. (6)). Table 5(f-j) shows that when n is smaller than 6, the performance increases as n increases. The reason may be that repeating the Aggregate Attention n times with different learnable attention weights encourages the model to identify the closest normal images from n aspects, e.g., organs or tissues: if an aspect of a normal image is similar to the input image, it will be identified as one of the closest normal images. In this way, the Aggregate Attention can capture accurate and robust closest normal images from n aspects. To verify this, we randomly visualize two attention weights of the Aggregate Attention in Figure 3. As we can see, the Aggregate Attention can indeed identify the closest normal images from multiple aspects, e.g., 'Bone/Clavicle' (blue boxes) and 'Right Lung' (green boxes), which supports our arguments. Moreover, a larger n brings noise, i.e., normal images that are not similar to the current input image, into the model and thus impairs the performance.

Ground Truth
The heart is within normal limits in size. Surgical suture material projects over the right lung apex.

Multi-Attention
There is mild cardiomegaly. There is a small right pleural effusion. There is no pneumothorax. The aorta is tortuous. There is no focal airspace consolidation.

w/ CA
Heart size is normal. There is a moderate left sided pleural effusion. No acute bony abnormalities. There is left lower lobe airspace disease. There is a small right pleural effusion. There is no pneumothorax.

Figure 3: Examples of the generated reports and the visualization of the Contrastive Attention (CA). Please view in color. The CA model can capture the abnormal region (red bounding box) by contrasting the input image with normal images. Besides, our Aggregate Attention in the CA model can find the closest normal images (blue and green boxes, visualized from two different attention weights; see Eq. (6)) and filter out the noisy images (purple boxes). The red colored text denotes the abnormal descriptions in the ground truth report. Underlined text denotes generated wrong sentences. Bold text denotes generated true abnormalities. As we can see, the abnormal region and the abnormal descriptions generated by our method show significant alignment with the ground truth reports.

Qualitative Analysis
In this section, we show the reports generated by the baseline models, i.e., HLSTM and Multi-Attention, and by the "Baseline w/ CA" models, together with the visualization in Figure 3, to intuitively analyze the strength of our Contrastive Attention model. As we can see, the HLSTM tends to produce repeated findings and normal findings, which results from the overwhelming normal findings in the dataset, i.e., the data deviation (Shin et al., 2016). The Multi-Attention, with the help of the attention mechanism, can describe abnormalities, but some of them are incorrect (underlined text). The reason is that it is difficult for the Multi-Attention model to efficiently learn, from a dataset with such data deviation, the medical expertise needed to correctly detect the abnormal regions. Since our Contrastive Attention model can efficiently capture the suspicious abnormal regions by contrasting the input images with normal images and transfer this ability to the downstream models and datasets, it helps multiple baseline models detect and describe comprehensive and accurate abnormalities. As a result, the "Baseline w/ CA" models generate fluent and accurate reports supported by accurate abnormal descriptions, showing significant alignment with the ground truth reports.

Conclusion
In this paper, we propose the Contrastive Attention model to capture abnormal regions by contrasting the input image with normal images for chest X-ray report generation. The experiments on two public datasets demonstrate the effectiveness of our approach, which can be easily incorporated into existing models to boost their performance under most metrics. The clinical efficacy scores and the human evaluation further prove our arguments and the effectiveness of our approach in helping existing models capture and depict the abnormalities. Specifically, we achieve state-of-the-art results on the two datasets with the best human preference, which could better assist radiologists in clinical decision-making and reduce their workload.
In the future, there are two potential ways to improve the Contrastive Attention. First, it may be better to perform the contrastive attention on patch features rather than the global feature. Second, it may be better to utilize multiple feature maps from different convolutional layers rather than only the feature maps of the last convolutional layer.

Ethical Considerations
In this work, we focus on helping several existing chest X-ray report generation systems better capture and describe the abnormalities. To this end, we provide a detailed human evaluation in Table 3 (Section 4.4) and an automatic evaluation in terms of clinical efficacy metrics in Table 4 (Section 4.3) to know (1) on what fraction of images with abnormalities the system did not mention the abnormality, and (2) on what fraction of images the system described an abnormality that does not exist according to doctors. The results show that our work can help existing systems generate more accurate descriptions of clinical abnormalities, improving the usefulness of existing systems in assisting radiologists in clinical decision-making and reducing their workload. In particular, given a large number of medical images, the systems can automatically generate medical reports, so that radiologists only need to make revisions rather than write a new report from scratch. This study uses the public MIMIC-CXR and IU-X-ray datasets. All protected health information was de-identified. De-identification was performed in compliance with Health Insurance Portability and Accountability Act (HIPAA) standards in order to facilitate public access to the datasets. Deletion of protected health information (PHI) from structured data sources (e.g., database fields that provide patient name or date of birth) was straightforward. All necessary patient/participant consent has been obtained and the appropriate institutional forms have been archived.