Cross-modal Memory Networks for Radiology Report Generation

Medical imaging plays a significant role in clinical diagnosis, where the text reports of the images are essential for understanding them and facilitating later treatments. Generating these reports automatically can lighten the workload of radiologists and significantly promote clinical automation, and has therefore attracted much attention in applying artificial intelligence to the medical domain. Previous studies mainly follow the encoder-decoder paradigm and focus on the aspect of text generation; few consider the importance of cross-modal mappings or explicitly exploit such mappings to facilitate radiology report generation. In this paper, we propose cross-modal memory networks (CMN) to enhance the encoder-decoder framework for radiology report generation, where a shared memory is designed to record the alignment between images and texts so as to facilitate the interaction and generation across modalities. Experimental results illustrate the effectiveness of our proposed model, which achieves state-of-the-art performance on two widely used benchmark datasets, i.e., IU X-Ray and MIMIC-CXR. Further analyses also show that our model is able to better align information from radiology images and texts, and thus helps generate more accurate reports in terms of clinical indicators.


Introduction
Interpreting radiology images (e.g., chest X-rays) and writing diagnostic reports are essential operations in clinical practice and normally require considerable manual workload. Therefore, radiology report generation, which aims to automatically generate a free-text description based on a radiograph, is highly desired to ease the burden of radiologists while maintaining the quality of health care. Recently, substantial progress has been made in research on automated radiology report generation models (Johnson et al., 2019; Liu et al., 2019; Jing et al., 2019). Most existing studies adopt a conventional encoder-decoder architecture, with convolutional neural networks (CNNs) as the encoder and recurrent (e.g., LSTM/GRU) or non-recurrent networks (e.g., Transformer) as the decoder, following the image captioning paradigm (Vinyals et al., 2015; Anderson et al., 2018). Although these methods have achieved remarkable performance, they are still restrained from fully employing the information across radiology images and reports, such as the mappings demonstrated in Figure 1, where aligned visual and textual features point to the same content. The restraint comes both from the limited annotated correspondences between image and text available for supervised learning and from the lack of model designs suited to learning such correspondences. Unfortunately, few studies2 are dedicated to solving this restraint. Therefore, a better solution is expected to model the alignments across modalities and further improve generation ability, although promising results are continuously acquired by other approaches (Liu et al., 2019; Jing et al., 2019; Chen et al., 2020).

Figure 1: An example chest X-ray image and its report. Findings: "There is no focal consolidation, pleural effusion or pneumothorax. Bilateral nodular opacities that most likely represent nipple shadows. The cardiomediastinal silhouette is normal. Clips project over the left lung, potentially within the breast. The imaged upper abdomen is unremarkable." Impression: "No acute cardiopulmonary process."

† Corresponding author.
1 Our code and the best performing models are released at https://github.com/cuhksz-nlp/R2GenCMN.
2 Along this research track, there is to date only one study, which uses a multi-task learning framework with a co-attention mechanism to explicitly explore information linking particular parts of a radiograph and its corresponding report.
Figure 2: The overall architecture of our proposed approach, where the visual extractor, encoder and decoder are shown in gray dashed boxes with the details omitted. The cross-modal memory networks are illustrated in blue dashed boxes, presenting the detailed process of memory querying and responding.

In this paper, we propose an effective yet simple approach to radiology report generation enhanced by cross-modal memory networks (CMN), which are designed to facilitate the interactions across modalities (i.e., images and texts). In detail, we use a memory matrix to store cross-modal information and use it to perform memory querying and memory responding for the visual and textual features: for memory querying, we extract the memory vectors most related to the input visual and textual features from the matrix and compute their weights; for memory responding, we generate responses by weighting the queried memory vectors. Afterwards, the responses corresponding to the input visual and textual features are fed into the encoder and decoder, so as to generate reports enhanced by this explicitly learned cross-modal information. Experimental results on two benchmark datasets, IU X-RAY and MIMIC-CXR, confirm the validity and effectiveness of our proposed approach, which achieves state-of-the-art performance on both datasets. Further analyses examine the effects of different factors on our model and show that it is able to generate reports with meaningful image-text mappings while requiring few extra parameters in doing so.

The Proposed Approach
We regard radiology report generation as an image-to-text generation task, for which there exist several solutions (Vinyals et al., 2015; Xu et al., 2015; Anderson et al., 2018; Cornia et al., 2019). Although images are organized in a 2-D format, we follow the standard sequence-to-sequence paradigm for this task, as in Chen et al. (2020). In detail, the source sequence is X = {x_1, x_2, ..., x_s, ..., x_S}, where x_s ∈ R^d are features extracted by the visual extractor from a radiology image I, and the target sequence is the corresponding report Y = {y_1, y_2, ..., y_t, ..., y_T}, where y_t ∈ V are the generated tokens, T the length of the report and V the vocabulary of all possible tokens. The entire generation process is thus formalized as a recursive application of the chain rule

p(Y|I) = ∏_{t=1}^{T} p(y_t | y_1, ..., y_{t-1}, I)    (1)

The model is then trained to maximize p(Y|I) through the conditional log-likelihood of Y given I:

θ* = arg max_θ Σ_{t=1}^{T} log p(y_t | y_1, ..., y_{t-1}, I; θ)    (2)

where θ denotes the parameters of the model. An overview of the proposed model is shown in Figure 2, with the cross-modal memories emphasized. The details of our approach are described in the following subsections with regard to its three major components, i.e., the visual extractor, the cross-modal memory networks, and the encoder-decoder process enhanced by the memory.
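The factorization in Eq. (1) and the log-likelihood objective in Eq. (2) can be illustrated with toy numbers (a minimal sketch with hypothetical per-step probabilities, not real model outputs):

```python
import math

# Toy per-step conditional probabilities p(y_t | y_<t, I) that a
# hypothetical decoder might assign to the ground-truth tokens.
step_probs = [0.9, 0.7, 0.8, 0.95]

# Eq. (1): p(Y|I) is the product of the per-step conditionals.
p_report = math.prod(step_probs)

# Training minimizes the negative log-likelihood, i.e. the sum of
# -log p(y_t | y_<t, I) over all decoding steps.
nll = -sum(math.log(p) for p in step_probs)

print(round(p_report, 4))   # product of the conditionals
print(round(nll, 4))        # equals -log(p_report)
```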

Visual Extractor
To generate radiology reports, the first step is to extract the visual features from radiology images. In our approach, the visual features X of a radiology image I are extracted by a pre-trained convolutional neural network (CNN), such as VGG (Simonyan and Zisserman, 2015) or ResNet (He et al., 2016). Normally, an image is decomposed into regions of equal size, i.e., patches, and their features (representations) are extracted from the last convolutional layer of the CNN. Once extracted, the features are expanded into a sequence by concatenating them from each row of the patches on the image. The resulting representation sequence is used as the source input for all subsequent modules, and the process is formulated as

{x_1, x_2, ..., x_S} = f_v(I)

where f_v(·) refers to the visual extractor.
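The row-by-row flattening of patch features into a source sequence can be sketched as follows (a minimal NumPy illustration; the grid size and feature dimension are placeholders, not necessarily those of the actual extractor):

```python
import numpy as np

# Hypothetical CNN output for one image: d feature channels over an
# h x w grid of patches (sizes are illustrative only).
d, h, w = 512, 7, 7
feature_map = np.random.rand(d, h, w)

# Concatenate patch features row by row into a source sequence
# X = {x_1, ..., x_S} with S = h * w and x_s in R^d.
X = feature_map.reshape(d, h * w).T   # shape (S, d)

print(X.shape)  # (49, 512)
```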

Cross-modal Memory Networks
To model the alignment between images and texts, existing studies tend to map between them directly from their encoded representations (e.g., via a co-attention mechanism). However, this process suffers from the limitation that representations across modalities are hard to align, so an intermediate medium is expected to enhance and smooth such mapping.
To address this limitation, we propose to use CMN to better model the image-text alignment, so as to facilitate the report generation process. With the proposed CMN, the mapping and encoding can be described by the following procedure. Given a source sequence {x_1, x_2, ..., x_S} (features extracted by the visual extractor) from an image, we feed it to this module to obtain the memory responses of the visual features {r_x1, r_x2, ..., r_xS}. Similarly, given a generated sequence {y_1, y_2, ..., y_{t-1}} with its embeddings, it is also fed to the cross-modal memory networks to output the memory responses of the textual features {r_y1, r_y2, ..., r_y(t-1)}. In doing so, the shared information of visual and textual features can be recorded in the memory, so that the entire learning process is able to explicitly map between images and texts. Specifically, the cross-modal memory networks employ a matrix to preserve information for the encoding and decoding processes, where each row of the matrix (i.e., a memory vector) records particular cross-modal information connecting images and texts. We denote the matrix as M = {m_1, m_2, ..., m_i, ..., m_N}, where N represents the number of memory vectors and m_i ∈ R^d is the memory vector at row i, with d referring to its dimension. During report generation, CMN operates with two main steps, namely querying and responding, whose details are described as follows.

Memory Querying

We apply multi-thread querying to perform this operation, where each thread follows the same procedure described below.
In querying memory vectors, the first step is to ensure that the input visual and textual features are in the same representation space. Therefore, we convert each memory vector in M as well as the input features through linear transformations

k_i = m_i · W_k
q_s = x_s · W_q
q_t = y_t · W_q

where W_k and W_q are trainable weights for the conversion. Then we separately extract the memory vectors most related to the visual and textual features according to their distances D^s_i and D^t_i, where the number of extracted memory vectors is controlled by a hyper-parameter K that regularizes how much memory is used. We denote the queried memory vectors as {k^s_1, k^s_2, ..., k^s_j, ..., k^s_K} and {k^t_1, k^t_2, ..., k^t_j, ..., k^t_K}. Afterwards, the importance weight of each queried memory vector with respect to the visual and textual features is obtained by normalizing over all distances:

w^s_i = exp(D^s_i) / Σ_{i'=1}^{K} exp(D^s_{i'})
w^t_i = exp(D^t_i) / Σ_{i'=1}^{K} exp(D^t_{i'})

Note that the above steps are applied in each thread so as to allow memory querying from different memory representation subspaces.
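The querying procedure above can be sketched as follows (a single-thread NumPy illustration with toy sizes; the scaled dot product used as the distance is an assumption, and W_k, W_q stand in for the trainable weights):

```python
import numpy as np

rng = np.random.default_rng(0)
N, d, K = 2048, 64, 32             # toy sizes; the paper uses d = 512, N = 2048, K = 32

M = rng.normal(size=(N, d))        # memory matrix, one cross-modal vector per row
x = rng.normal(size=(d,))          # one input visual (or textual) feature
W_k = rng.normal(size=(d, d))      # projects memory vectors (placeholder weights)
W_q = rng.normal(size=(d, d))      # projects the input feature (placeholder weights)

# Map memory and input into a shared space (one thread shown; the paper
# runs several parallel "threads", analogous to attention heads).
k = M @ W_k                        # (N, d)
q = x @ W_q                        # (d,)

# Similarity between the query and every memory slot; a scaled dot
# product is assumed here as the distance measure.
D = k @ q / np.sqrt(d)             # (N,)

# Keep only the K most related memory vectors and softmax-normalize
# their scores into importance weights.
top = np.argsort(D)[-K:]
w = np.exp(D[top] - D[top].max())
w /= w.sum()

print(top.shape, w.shape)          # (32,) (32,)
print(round(w.sum(), 6))           # 1.0
```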

Memory Responding
The responding process is also conducted in a multi-thread manner corresponding to the querying process. For each thread, we first perform a linear transformation on the queried memory vectors via

v^s_i = k^s_i · W_v
v^t_i = k^t_i · W_v

where W_v is a trainable weight matrix. Then, we obtain the memory responses for the visual and textual features by weighting over the transformed memory vectors:

r_x = Σ_{i=1}^{K} w^s_i · v^s_i
r_y = Σ_{i=1}^{K} w^t_i · v^t_i

where w^s_i and w^t_i are the weights obtained from memory querying. As with memory querying, we apply memory responding in all threads so as to obtain responses from different memory representation subspaces.
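Continuing in the same spirit, the responding step can be sketched as follows (a single-thread NumPy illustration with toy sizes; W_v stands in for the trainable transformation, and the queried vectors and weights are randomly generated rather than taken from a real querying pass):

```python
import numpy as np

rng = np.random.default_rng(1)
d, K = 64, 32                       # toy sizes (the paper uses d = 512, K = 32)

k_queried = rng.normal(size=(K, d)) # the K memory vectors returned by querying
w = rng.random(K)
w /= w.sum()                        # importance weights from the querying step
W_v = rng.normal(size=(d, d))       # placeholder for the trainable transformation

# Transform each queried memory vector, then take the weighted sum to
# form the memory response for this feature (one thread shown; the
# paper repeats this per thread, analogous to attention heads).
v = k_queried @ W_v                 # (K, d)
r = w @ v                           # (d,)

print(r.shape)  # (64,)
```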

Encoder-Decoder
Since the quality of the input representation plays an important role in model performance (Pennington et al., 2014; Song et al., 2017; Peters et al., 2018; Devlin et al., 2019), the encoder-decoder in our model is built upon a standard Transformer, a powerful architecture that has achieved state-of-the-art results in many tasks, where the memory responses of visual and textual features serve as the input to the encoder and decoder so as to enhance the generation process. In detail, as the first step, the memory responses {r_x1, r_x2, ..., r_xS} for the visual features are fed into the encoder through

{z_1, z_2, ..., z_S} = f_e(r_x1, r_x2, ..., r_xS)

where f_e(·) represents the encoder. Then the resulting intermediate states {z_1, z_2, ..., z_S} are sent to the decoder at each decoding step, jointly with the memory responses {r_y1, r_y2, ..., r_y(t-1)} for the textual features of tokens generated in previous steps, so as to generate the current output y_t by

y_t = f_d(z_1, z_2, ..., z_S, r_y1, r_y2, ..., r_y(t-1))    (15)

where f_d(·) refers to the decoder. To generate a complete report, the above process is repeated until the generation is finished.

Datasets

Following previous studies (Jing et al., 2019; Chen et al., 2020), we only generate the findings section and exclude samples without a findings section for both datasets. For IU X-RAY, we use the same split (i.e., 70%/10%/20% for the train/validation/test sets) as in previous work, and for MIMIC-CXR we adopt its official split. Table 1 shows the statistics of the datasets in terms of the numbers of images, reports and patients, and the average length of reports, with respect to the train/validation/test sets.

Baseline and Evaluation Metrics
To examine our proposed model, we use the following ones as the main baselines in our experiments.

Table 2: NLG and CE evaluations of different models on the test sets of the IU X-RAY and MIMIC-CXR datasets. BL-n denotes the BLEU score using up to n-grams (n = 1, ..., 4); MTR and RG-L denote METEOR and ROUGE-L, respectively. The average improvement over all NLG metrics compared to BASE is presented in the "AVG. ∆" column.
The baselines include conventional image captioning models. Following Chen et al. (2020), we evaluate the above models with two types of metrics: conventional natural language generation (NLG) metrics and clinical efficacy (CE) metrics8. The NLG metrics9 include BLEU (Papineni et al., 2002), METEOR (Denkowski and Lavie, 2011) and ROUGE-L (Lin, 2004). For the CE metrics, CheXpert (Irvin et al., 2019)10 is applied to label the generated reports, and the results are compared with the ground truth in 14 categories related to thoracic diseases and support devices. We use precision, recall and F1 to evaluate model performance with the CE metrics.
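For illustration, the CE computation over 14 binary CheXpert-style labels can be sketched as follows (a minimal NumPy example with made-up labels; micro-averaging over all label slots is an assumption, as the averaging scheme is not restated here):

```python
import numpy as np

# Toy CheXpert-style labels for 3 reports x 14 categories
# (1 = condition labeled positive). Real labels would come from
# the CheXpert labeler, not be hand-written like this.
truth = np.array([[1, 0, 1] + [0] * 11,
                  [0, 1, 0] + [0] * 11,
                  [1, 1, 0] + [0] * 11])
pred  = np.array([[1, 0, 0] + [0] * 11,
                  [0, 1, 1] + [0] * 11,
                  [1, 0, 0] + [0] * 11])

# Micro-averaged counts over all (report, category) slots.
tp = np.logical_and(pred == 1, truth == 1).sum()
fp = np.logical_and(pred == 1, truth == 0).sum()
fn = np.logical_and(pred == 0, truth == 1).sum()

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
print(precision, recall)  # 0.75 0.6 for this toy example
```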

Implementation Details
To ensure consistency with the experimental settings of previous work (Chen et al., 2020), we use two images of a patient as input for report generation on IU X-RAY and one image for MIMIC-CXR. For the visual extractor, we adopt ResNet101 (He et al., 2016) pretrained on ImageNet (Deng et al., 2009) to extract patch features with 512 dimensions each. For the encoder-decoder backbone, we use a randomly initialized Transformer with 3 layers, 8 attention heads and 512-dimensional hidden states. For the memory matrix in CMN, its dimension and the number of memory vectors N are set to 512 and 2048, respectively, and it is also randomly initialized. For memory querying and responding, the number of threads and K are set to 8 and 32, respectively. We train our model with a cross-entropy loss and the Adam optimizer (Kingma and Ba, 2015). The learning rates of the visual extractor and the other parameters are set to 5 × 10^-5 and 1 × 10^-4, respectively, and both are decayed by a factor of 0.8 per epoch for all datasets. For report generation, we set the beam size to 3 to balance the effectiveness and efficiency of all models. Note that the optimal hyper-parameters mentioned above were obtained by evaluating the models on the validation sets of the two datasets.

8 Note that CE metrics only apply to MIMIC-CXR because the labeling schema of CheXpert is designed for MIMIC-CXR, which is different from that of IU X-RAY.
9 https://github.com/tylin/coco-caption
10 https://github.com/MIT-LCP/mimic-cxr/tree/master/txt/chexpert
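The per-epoch learning-rate decay described above can be sketched as follows (a minimal illustration of the schedule; `lr_at_epoch` is a hypothetical helper, not part of the released code):

```python
# Learning-rate schedule sketch: the stated initial rates with a
# multiplicative decay of 0.8 per epoch.
def lr_at_epoch(initial_lr, epoch, decay=0.8):
    """Learning rate after `epoch` full decay steps (hypothetical helper)."""
    return initial_lr * decay ** epoch

visual_lr = [lr_at_epoch(5e-5, e) for e in range(4)]   # visual extractor
other_lr = [lr_at_epoch(1e-4, e) for e in range(4)]    # other parameters

print(visual_lr[0], other_lr[0])   # 5e-05 0.0001
print(round(other_lr[1], 6))       # 8e-05 after one epoch of decay
```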

Effect of Cross-Modal Memory
The main experimental results on the two aforementioned datasets are shown in Table 2, where BASE+CMN denotes our model (same below). Several observations can be drawn. First, both BASE+MEM and BASE+CMN outperform the vanilla Transformer (BASE) on both datasets with respect to the NLG metrics, which confirms the validity of incorporating memory to introduce more knowledge into the Transformer backbone. Such knowledge may come from the hidden structures and regularity patterns shared among radiology images and their reports, which the memory modules are able to model explicitly and reasonably so as to promote the recognition of diseases (and symptoms) and the generation of reports. Second, the comparison between BASE+CMN and the two baselines on different metrics confirms the effectiveness of our proposed model, with significant improvements. In particular, BASE+CMN outperforms BASE+MEM by a large margin, which indicates the usefulness of CMN in learning cross-modal features with a shared structure rather than separate ones. Third, when comparing between the datasets, the performance gains of BASE+CMN over the two baselines (i.e., BASE and BASE+MEM) on MIMIC-CXR are larger than those on IU X-RAY. This observation owes to the fact that MIMIC-CXR is relatively larger, which helps the learning of the alignment between images and texts, so that CMN helps report generation more on MIMIC-CXR. Moreover, this size effect also explains why our model shows the same trend on the CE metrics on MIMIC-CXR as on the NLG metrics, where it outperforms all baselines in terms of precision, recall and F1.

Comparison with Previous Studies
To further demonstrate the effectiveness of our model, we compare it with existing models on the same datasets, with results on both NLG and CE metrics reported in Table 3. We have the following observations. First, cross-modal memory shows its effectiveness in this task, where our model outperforms COATT, although both improve report generation through the alignment of visual and textual features. The reason might be that our model uses a shared memory matrix as a medium to softly align visual and textual features instead of aligning them directly with a co-attention mechanism, which unifies cross-modal features within the same representation space and facilitates the alignment process. Second, our model confirms the superiority of simplicity when compared with more complicated models. For example, HRGR uses manually extracted templates and CMAS-RL utilizes reinforcement learning with a careful design of adaptive rewards, yet our model achieves better results with a rather simpler method. Third, applying memory to both encoding and decoding further improves the generation ability of the Transformer when compared with R2GEN, which only uses memory in decoding. This observation complies with our intuition that the cross-modal operation tightens the connection between encoding and decoding, so that our model generates higher-quality reports. Fourth, although there are other models (i.e., COATT and HRGR) exploiting extra information (such as private datasets for visual extractor pre-training), our model still achieves state-of-the-art performance without requiring such information. This reveals that, in this task, the hidden structures among images and texts, and a good solution for exploiting them, are more essential in promoting report generation performance.

Analysis
Memory Size To analyze the impact of memory size, we train our model with different numbers of memory vectors, i.e., N ranging from 32 to 4096, with the results on MIMIC-CXR shown in Figure 3. It is observed that, first, enlarging the memory by the number of vectors results in better overall performance when the memory matrix is relatively small (N ≤ 1024), which can be explained by the fact that, within a certain memory capacity, a larger memory helps store more cross-modal information; second, once the memory matrix grows beyond a threshold, increasing the number of memory vectors no longer promises a better outcome. An explanation may be that, when the matrix becomes too large, the memory vectors cannot be fully updated, so they do not help the generation process and instead act as noise.
More interestingly, even with a rather large memory size (i.e., N = 4096), only 3.34% extra parameters are added to the model compared to BASE, which shows that introducing memory into the report generation process through our model comes at a small price.
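For reference, the parameter count contributed by the memory matrix alone (N vectors of dimension d = 512) can be computed directly; note that the 3.34% figure depends on the full model size, which is not restated here:

```python
# Parameters contributed by the memory matrix alone: N vectors of
# dimension d. `memory_params` is an illustrative helper, not part of
# the released code; the full-model percentage is not computed here.
def memory_params(n_vectors, dim=512):
    return n_vectors * dim

for n in (32, 512, 2048, 4096):
    print(n, memory_params(n))
# N = 4096 adds 4096 * 512 = 2,097,152 memory parameters
```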

Number of Queried Memory Vectors
To analyze how querying impacts report generation, we try CMN with different numbers of queried vectors, i.e., K ranging from 1 to 512, and show the results in Figure 4. We find that the number of queried vectors should be neither too small nor too large: enlarging K leads to better results when K ≤ 32, and beyond this threshold the performance starts to drop. The reason might be overfitting in memory updating: when K is small, the memory matrix is sparsely updated in each iteration and is thus hard to overfit, while more queried vectors cause intensive updating of the matrix, so that some essential vectors are over-updated accordingly. As a result, finding the optimal number of queried vectors (i.e., 32) offers useful guidance for further improving report generation by controlling the querying process.
Case Study To further investigate qualitatively how our model learns the alignments between visual and textual information, we perform a case study on the reports generated by different models for an input chest X-ray image chosen from MIMIC-CXR. Figure 5 shows the image with its ground-truth report, and different generated reports with selected mappings between visual features (parts of the image) and textual features (words and phrases), where the mapped areas on the image are highlighted in different colors. In general, BASE+CMN is able to generate more accurate descriptions (in terms of better visual-textual mapping) in the report, while the other baselines are inferior in doing so. For instance, the normal medical conditions and abnormalities presented in the chest X-ray image are covered by the report generated by BASE+CMN (e.g., "severe cardiomegaly", "pulmonary edema" and "pulmonary arteries"), and the related regions on the image are precisely located with respect to the texts, while the areas highlighted on the image by the other models are inaccurate.

Figure 6: T-SNE visualization of memory vectors with an example input image and its partial generated report from the MIMIC-CXR test set. The queried vectors for visual and textual features are indicated by arrows.
To further illustrate how the alignment works between visual and textual features, we perform a t-SNE visualization of the memory vectors linked to an image and its generated report from the MIMIC-CXR test set. It is observed that the word "lung" in the report and the visual feature for the lung region of the image query similar memory vectors from CMN; a similar observation holds for "hemidiaphragms" and its corresponding regions on the image. This case confirms that memory vectors are an effective intermediate medium for interaction between image and text features.

Related Work
In general, the task most closely related to ours is image captioning, a cross-modal task involving natural language processing and computer vision, which aims to describe images in sentences (Vinyals et al., 2015; Xu et al., 2015; Anderson et al., 2018; Cornia et al., 2019).
Among these studies, the most related, Cornia et al. (2019), also proposed to leverage memory matrices to learn a priori knowledge for visual features using memory networks (Sukhbaatar et al., 2015; Zeng et al., 2018; Santoro et al., 2018; Nie et al., 2020; Diao et al., 2020; Tian et al., 2020b), but this operation is only performed during the encoding process. Different from that work, the memory in our model is designed to align the visual and textual features, and the memory operations (i.e., querying and responding) are performed in both the encoding and decoding processes.
Recently, many advanced NLP techniques (e.g., pre-trained language models) have been applied to tasks in the medical domain (Pampari et al., 2018; Wang et al., 2018; Alsentzer et al., 2019; Tian et al., 2019, 2020a; Lee et al., 2020). As one of the applications and extensions of image captioning to the medical domain, radiology report generation aims to depict radiology images with professional reports. Existing methods were designed either to better align images and texts or to exploit the highly patternized features of report texts. For the former, a co-attention mechanism was proposed to simultaneously explore visual and semantic information within a multi-task learning framework. For the latter, a template database was introduced to incorporate patternized information, and Chen et al. (2020) improved radiology report generation by applying a memory-driven Transformer to model patternized information. Compared to these studies, our model offers an effective yet simple alternative for generating radiology reports, where a soft intermediate layer is provided to facilitate the mappings between visual and textual features, so that more accurate descriptions are produced for generation.

Conclusion
In this paper, we propose to generate radiology reports with cross-modal memory networks, where a memory matrix is employed to record the alignment and interaction between images and texts, with memory querying and responding performed to obtain the shared information across modalities. Experimental results on two benchmark datasets demonstrate the effectiveness of our model, which achieves state-of-the-art performance. Further analyses investigate the effects of hyper-parameters and show that our model is able to better align information from images and texts so as to generate more accurate reports, while enlarging the memory matrix does not significantly affect the overall model size.