Writing by Memorizing: Hierarchical Retrieval-based Medical Report Generation

Medical report generation is one of the most challenging tasks in medical image analysis. Although existing approaches have achieved promising results, they either require a predefined template database to retrieve sentences or ignore the hierarchical nature of medical report generation. To address these issues, we propose MedWriter, which incorporates a novel hierarchical retrieval mechanism to automatically extract both report-level and sentence-level templates for clinically accurate report generation. MedWriter first employs a Visual-Language Retrieval (VLR) module to retrieve the most relevant reports for the given images. To guarantee logical coherence between generated sentences, a Language-Language Retrieval (LLR) module is introduced to retrieve relevant sentences based on the previously generated description. Finally, a language decoder fuses image features with features from the retrieved reports and sentences to generate meaningful medical reports. We verified the effectiveness of our model through automatic and human evaluation on two datasets, i.e., Open-i and MIMIC-CXR.


Introduction
Medical report generation is the task of generating reports based on medical images, such as radiology and pathology images. Given that this task is time-consuming and cumbersome, researchers endeavor to relieve the burden on physicians by automatically generating findings and descriptions from medical images with machine learning techniques.
Existing studies can be roughly divided into two categories, i.e., generation-based and retrieval-based approaches. Generation-based methods, including LRCN (Donahue et al., 2015), CoAtt (Jing et al., 2018), and MvH+AttL (Yuan et al., 2019), focus on generating image captions with an encoder-decoder model that leverages image features. However, they are unable to produce linguistically diverse descriptions or depict rare but prominent medical findings. Retrieval-based methods, such as HRGR-Agent (Li et al., 2018) and KEPP (Li et al., 2019), instead memorize templates to generate standardized reports from a predefined retrieval database. However, the quality of the generated reports depends heavily on the manually curated template database. Besides, they only use sentence-level templates for generation and fail to learn report-level templates, which prevents them from generating more accurate reports.
To address the aforementioned issues, we propose a new framework called MedWriter, shown in Figure 1. MedWriter introduces a novel hierarchical retrieval mechanism working with a hierarchical language decoder to automatically learn dynamic report and sentence templates from the data for generating accurate and professional medical reports. MedWriter is inspired by how physicians write medical reports in real life: they keep report templates in mind and then generate reports for new images by using the key information they find in the medical images to update the templates sentence by sentence.
In particular, we use three modules to mimic this process. First, MedWriter generates report-level templates with the Visual-Language Retrieval (VLR) module using the visual features as queries. To generate accurate reports, MedWriter also predicts disease labels based on the visual features and extracts medical keywords from the retrieved reports. We propose a multi-query attention mechanism to learn the report-level template representations. Second, to make the generated reports more coherent and fluent, we propose a Language-Language Retrieval (LLR) module, which learns sentence-level templates for the next sentence generation by analyzing between-sentence correlations in the retrieved reports. Finally, a hierarchical language decoder generates the full report using the visual features and the report-level and sentence-level template representations. This two-level retrieval mechanism for memorization helps generate accurate and diverse medical reports. To sum up, our contributions are:
• To the best of our knowledge, we are the first to model the memory retrieval mechanism at both the report and sentence levels. By imitating standardized medical report writing in real life, our memory retrieval mechanism effectively utilizes the two-layer hierarchy of templates in medical texts. This design allows MedWriter to generate more clinically accurate and standardized reports.
• On top of the retrieval modules, we design a new multi-query attention mechanism to fuse the retrieved information for medical report generation. The fused information can be well incorporated with the existing image and report-level information, improving the quality of the generated reports.
• Experiments conducted on two large-scale medical report generation datasets, i.e., Open-i and MIMIC-CXR, show that MedWriter achieves better performance than state-of-the-art baselines measured by CIDEr, ROUGE-L, and BLEU.
Besides, case studies show that MedWriter provides more accurate and natural descriptions for medical images through domain expert evaluation.

Related work
Generation-based report generation Visual captioning is the process of generating a textual description given an image or a video. The dominant neural network architecture for the captioning task is the encoder-decoder framework (Bahdanau et al., 2014; Vinyals et al., 2015; Mao et al., 2014) with attention mechanisms (Xu et al., 2015; You et al., 2016; Lu et al., 2017; Anderson et al., 2018; Wang et al., 2019). As a sub-task in the medical domain, early studies directly apply state-of-the-art encoder-decoder models such as CNN-RNN (Vinyals et al., 2015), LRCN (Donahue et al., 2015), and AdaAtt (Lu et al., 2017) to the medical report generation task. To further improve long text generation with domain-specific knowledge, later generation-based methods introduce a hierarchical LSTM with co-attention (Jing et al., 2018) or use medical concept features (Yuan et al., 2019) to attentively guide the report generation. In addition, reinforcement learning (Liu et al., 2019) has been utilized to ensure that the generated radiology reports correctly describe the clinical findings.
To avoid generating clinically non-informative reports, external domain knowledge such as knowledge graphs (Zhang et al., 2020; Li et al., 2019) and anchor words (Biswal et al., 2020) has been utilized to promote the medical value of diagnostic reports. CLARA (Biswal et al., 2020) also provides an interactive solution that integrates doctors' judgment into the generation process.
Retrieval-based report generation Retrieval-based approaches are usually hybridized with generation-based ones to improve the readability of generated medical reports. For example, KERP (Li et al., 2019) uses abnormality graphs to retrieve the most related sentence templates during generation. HRGR-Agent (Li et al., 2018) incorporates retrieved sentences into a reinforcement learning framework for medical report generation. However, these models all require a template database as input. Different from them, MedWriter automatically learns both report-level and sentence-level templates from the data, which significantly enhances its applicability.

Figure 2: Details of the proposed MedWriter model for medical report generation. The left part is used to learn report template representations via the visual-language retrieval (VLR) module, which is further used to generate the first sentence via the hierarchical language decoder. The right part shows the details of the language-language retrieval (LLR) module used for generating the remaining sentences.

The LLR module works on the sentence level and retrieves a series of candidates that are most likely to be the next sentence from the retrieval pool given the generated language context. Finally, MedWriter generates accurate, diverse, and disease-specific medical reports with a hierarchical language decoder that fuses the visual, linguistic, and pathological information obtained by the VLR and LLR modules. To improve the effectiveness and efficiency of retrieval, we first pretrain the VLR and LLR modules to build up a retrieval pool for medical report generation as follows.

VLR module pretraining
The VLR module aims to retrieve the most relevant medical reports from the training report corpus for the given medical images. The retrieved reports are further used to learn an abstract template for generating new high-quality reports. Towards this goal, we introduce a self-supervised pretraining task that judges whether an image-report pair comes from the same subject, i.e., image-report matching. It is based on the intuitive assumption that an image-report pair from the same subject shares certain common semantics. More importantly, the disease types associated with the images and the report should be similar. Thus, in the pretraining task, we also take disease categories into consideration.

Disease classification
The input of the VLR module is a series of multi-modal images {I_i}_{i=1}^b and the corresponding report r, where b is the number of images. We employ a Convolutional Neural Network (CNN) f_v(·) as the image encoder to obtain the feature of a given image, i.e., v_i = f_v(I_i). With all the extracted features {v_i}_{i=1}^b, we add them together as the input of the disease classification task, which is further used to learn the disease type representation as follows:

c_pred = W_cls AvgPool(Σ_{i=1}^b v_i) + b_cls,   (1)

where W_cls ∈ R^{c×d} and b_cls ∈ R^c are the weight and bias terms of a linear model, AvgPool is the average pooling operation, c is the number of disease classes, and c_pred ∈ R^c is used to compute disease probabilities for a multi-label classification task with a sigmoid function, i.e., p_dc = sigmoid(c_pred).
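The disease classification head of Eq. (1) can be sketched in plain numpy. This is a minimal illustration of the computation (sum the per-image feature maps, average-pool over spatial locations, apply a linear layer, then a sigmoid); the function and variable names are ours, not the paper's implementation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def disease_classification(feats, W_cls, b_cls):
    """Sketch of Eq. (1): sum the per-image feature maps, average-pool over
    the spatial locations, and apply a linear layer; a sigmoid then turns
    c_pred into per-disease probabilities for multi-label classification."""
    summed = np.sum(feats, axis=0)        # (d, k, k): sum over the b images
    pooled = summed.mean(axis=(1, 2))     # (d,): spatial average pooling
    c_pred = W_cls @ pooled + b_cls       # (c,): disease type representation
    p_dc = sigmoid(c_pred)                # (c,): disease probabilities
    return c_pred, p_dc
```

In the full model, the cross-entropy loss over p_dc against the ground-truth disease labels would be one of the two VLR pretraining objectives.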

Image-report matching
The next pretraining task for VLR is to predict whether an image-report pair belongs to the same subject. In this subtask, after obtaining the image features {v_i}_{i=1}^b and the disease type representation c_pred, we extract a context visual vector v via pathological attention.
First, for each image feature v_i, we use the disease type representation c_pred to learn a spatial attention score through a linear transformation:

a_v = W_a tanh(W_v v_i + W_c c_pred),   (2)

where a_v ∈ R^{k×k}, and W_a, W_v, and W_c are linear transformation matrices. After that, we use the normalized spatial attention score α_v = softmax(a_v) to aggregate visual features over all locations (x, y) of the feature map, and compute the context vector v of the input image set {I_i}_{i=1}^b with a linear layer on the concatenation of all the attended representations:

v̄_i = Σ_{(x,y)} α_v(x, y) v_i(x, y),   v = W_f concat(v̄_1, ..., v̄_b).   (3)

For the image-report matching task, we also need a language representation, which is extracted by a BERT (Devlin et al., 2018) model f_l(·) as the language encoder. f_l(·) converts the medical report r into a semantic vector r = f_l(r) ∈ R^d. Finally, the probability of the input pair ({I_i}_{i=1}^b, r) coming from the same subject is computed as

p_vl = sigmoid(v^T W_vl r).   (4)

Given these two sub-tasks, we simultaneously optimize the cross-entropy losses for both disease classification and image-report matching to train the VLR module.
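A runnable sketch of the pathological attention and the matching score follows. The exact parameterization of Eqs. (2)-(4) is our reading of the paper, so every weight matrix here (W_a, W_v, W_c, W_f, W_m) is an illustrative assumption rather than the published implementation.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def pathological_attention(feats, c_pred, W_a, W_v, W_c, W_f):
    """Attend over the k x k spatial locations of each image feature map,
    conditioned on the disease representation c_pred (in the spirit of
    Eqs. 2-3), then fuse the per-image context vectors with a linear layer."""
    contexts = []
    for v_i in feats:                                   # v_i: (d, k, k)
        d, k, _ = v_i.shape
        flat = v_i.reshape(d, k * k)                    # locations as columns
        scores = W_a @ np.tanh(W_v @ flat + (W_c @ c_pred)[:, None])
        alpha = softmax(scores.ravel())                 # normalized over locations
        contexts.append(flat @ alpha)                   # (d,) attended feature
    return W_f @ np.concatenate(contexts)               # context visual vector v

def image_report_match(v, r, W_m):
    """Bilinear matching probability between the visual context v and a
    report embedding r, standing in for Eq. (4)."""
    return sigmoid(v @ W_m @ r)
```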

LLR module pretraining
A medical report usually has certain logical characteristics, such as describing the patient's medical images in a top-to-bottom order. Besides, the preceding and following sentences in a medical report may provide explanations for the same object or concept, or may hold certain juxtaposition, transition, or progression relations. Automatically learning such characteristics should help MedWriter generate high-quality medical reports. Towards this end, we propose to pretrain a language-language retrieval (LLR) module to search for the most relevant sentences for the next sentence generation. In particular, we introduce a self-supervised pretraining task for LLR that determines whether two sentences {s_i, s_j} come from the same report, i.e., sentence-sentence matching.
Similar to the VLR module, we use a BERT model f_s(·) as the sentence encoder to embed the sentence inputs, i.e., s_i = f_s(s_i). The probability that two sentences {s_i, s_j} come from the same medical report is then measured by

p_ll = sigmoid(s_i^T W_ll s_j).   (5)

Again, the cross-entropy loss is used to optimize the learning objective given the probability p_ll and the ground-truth label of whether s_i and s_j belong to the same medical report.
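The self-supervised pair construction for this task can be sketched as below. The bilinear form of p_ll and the negative-sampling scheme (one cross-report pair per report) are illustrative assumptions; the paper does not specify how negatives are drawn.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sentence_match_prob(s_i, s_j, W_ll):
    """Bilinear probability that two sentence embeddings come from the
    same report, standing in for Eq. (5)."""
    return sigmoid(s_i @ W_ll @ s_j)

def make_llr_pairs(reports):
    """Build self-supervised sentence pairs: label 1 for two sentences from
    the same report, 0 for sentences drawn from different reports."""
    pairs = []
    for idx, rep in enumerate(reports):
        for i in range(len(rep) - 1):
            pairs.append((rep[i], rep[i + 1], 1))       # positive pair
        other = reports[(idx + 1) % len(reports)]
        pairs.append((rep[0], other[0], 0))             # negative pair
    return pairs
```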

Retrieval-based report generation
Using the pretrained VLR and LLR modules, MedWriter generates a medical report for a sequence of input images {I_i}_{i=1}^b with a novel hierarchical retrieval mechanism and a hierarchical language decoder.
VLR module for report-level retrieval

Report retrieval Let D_r^(tr) = {r_j}_{j=1}^{N_tr} denote the set of all training reports, where N_tr is the number of reports in the training dataset. For each report r_j, MedWriter first obtains its vector representation using the language encoder f_l(·) in the VLR module, denoted as r_j = f_l(r_j). Let P_r = {r_j}_{j=1}^{N_tr} denote the set of training report representations. Given the multi-modal medical images {I_i}_{i=1}^b of a subject, the VLR module returns the top k_r medical reports {r_j}_{j=1}^{k_r} as well as the medical keywords within the retrieved reports.
Specifically, MedWriter extracts the image feature v for {I_i}_{i=1}^b using the pathological attention mechanism described in Section 3.1. According to Eq. (4), MedWriter then computes an image-report matching score p_vl between v and each r ∈ P_r. The top k_r reports {r_j}_{j=1}^{k_r} with the largest scores p_vl are considered the most relevant medical reports for the images and are selected as the template descriptions. From these templates, we identify n medical keywords {w_i}_{i=1}^n using a dictionary as a summarization of the template information. The medical keyword dictionary covers disease phenotypes, human organs, and tissues, and consists of the 36 medical keywords extracted from the training data with the highest frequency.
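The retrieval step reduces to scoring every training report representation and keeping the k_r best, then scanning the retrieved texts against the keyword dictionary. The sketch below uses a plain dot-product score in place of the pretrained matching head of Eq. (4); all names are illustrative.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def retrieve_reports(v, report_vecs, reports, k_r, keyword_dict):
    """Score every training report representation against the visual
    context v, keep the top k_r, and collect dictionary keywords from the
    retrieved texts. The dot-product score stands in for Eq. (4)."""
    scores = np.array([sigmoid(v @ r) for r in report_vecs])
    top = np.argsort(-scores)[:k_r]                     # indices of best matches
    retrieved = [reports[j] for j in top]
    keywords = sorted({w for rep in retrieved
                       for w in rep.lower().split() if w in keyword_dict})
    return retrieved, keywords
```

In MedWriter, k_r = 5 and the dictionary holds the 36 most frequent medical keywords from the training data.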
Report template representation learning The retrieved reports are highly related to the given images and should thus be helpful for report generation. To make full use of them, we learn a report template representation using the image feature v, the features {r_j}_{j=1}^{k_r} of the retrieved reports, the embeddings {w_i}_{i=1}^n of the medical keywords learned from pretrained word embeddings, and the disease embeddings {c_k}_{k=1}^m of the m disease labels predicted by the disease classification of Section 3.1.1.
We propose a new multi-query attention mechanism to learn the report template representation. Specifically, we use the image feature v as the key vector K, the retrieved report features {r_j}_{j=1}^{k_r} as the value matrix V, and the embeddings of both the medical keywords {w_i}_{i=1}^n and the disease labels {c_k}_{k=1}^m as the query vectors Q. We modify the original self-attention (Vaswani et al., 2017) into a multi-query attention: for each query vector Q_i in Q, we first compute a corresponding attended feature, and we then transform the concatenation of all attended features into the report template vector r_s:

r_s = concat(attn_1, ..., attn_{n+m}) W_O,   (6)

where attn_i = Attention(Q_i, K W_K, V W_V), and W_K, W_V, and W_O are transformation matrices. The Attention function is computed as

Attention(Q, K, V) = softmax(Q K^T / √d_g) V,

where Q, K, V are the queries, keys, and values in the general case, and d_g is the dimension of the query vector.
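The multi-query attention of Eq. (6) can be sketched as follows. Note one assumption: standard attention needs one key per value, while the paper names a single image feature as the key, so in this sketch the caller supplies a key matrix (e.g. the image feature tiled across the value rows). All parameter names are illustrative.

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_g)) V."""
    d_g = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_g)
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w = w / w.sum(axis=-1, keepdims=True)
    return w @ V

def multi_query_attention(queries, K, V, W_K, W_V, W_O):
    """Sketch of Eq. (6): attend once per query vector, concatenate the
    attended features, and project them into the template vector r_s."""
    attended = [attention(q[None, :], K @ W_K, V @ W_V)[0] for q in queries]
    return np.concatenate(attended) @ W_O
```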

LLR module for sentence-level retrieval
Since the retrieved reports {r_j}_{j=1}^{k_r} are highly associated with the input images, the sentences within those reports should contain instructive pathological information that is helpful for sentence-level generation. Towards this end, we first select sentences from the retrieved reports and then learn the sentence-level template representation.
Sentence retrieval We first split the retrieved reports into L candidate sentences {s_j}_{j=1}^L as the retrieval pool of the LLR module. Given the pretrained LLR sentence encoder f_s(·), we obtain the sentence-level feature pool P_s = {f_s(s_j)}_{j=1}^L = {s_j}_{j=1}^L. Let o_t denote the sentence generated at time t, with embedding o_t = f_s(o_t), which is used to find the k_s sentences {s_j}_{j=1}^{k_s} with the highest probabilities p_ll from the candidate sentence pool using Eq. (5) in Section 3.2.
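The pool construction and sentence-level retrieval above can be sketched as below; the period-based splitter and the dot-product score (standing in for p_ll) are simplifying assumptions.

```python
import numpy as np

def sentence_pool(retrieved_reports):
    """Split the k_r retrieved reports into the candidate sentence pool
    (a naive period-based splitter for illustration)."""
    return [s.strip() for rep in retrieved_reports
            for s in rep.split(".") if s.strip()]

def retrieve_sentences(o_t, pool_vecs, pool, k_s):
    """Score each candidate sentence embedding against the embedding o_t of
    the previously generated sentence and keep the k_s best; the dot
    product stands in for the matching probability p_ll of Eq. (5)."""
    scores = np.array([o_t @ s for s in pool_vecs])
    top = np.argsort(-scores)[:k_s]
    return [pool[j] for j in top]
```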
Sentence template representation learning Similar to the report template representation, we use the multi-query attention mechanism. From the retrieved k_s sentences, we extract the medical keywords {w_i}_{i=1}^n; together with the predicted disease labels {c_k}_{k=1}^m, their embeddings serve as the query vectors. The embeddings of the extracted sentences, i.e., {f_s(s_j)}_{j=1}^{k_s} = {s_j}_{j=1}^{k_s}, are treated as the value vectors. The key vector is the current sentence (word) hidden state h^s_t (h^w_i), which will be introduced in Section 3.3.3. According to Eq. (6), we obtain the sentence template representation at time t, denoted as u_t (u^w_i for word-level generation).

Hierarchical language decoder
With the features extracted by the retrieval mechanism described above, we apply a hierarchical decoder to generate radiology reports according to the hierarchical linguistic structure of medical reports. The decoder contains two layers: a sentence LSTM decoder that outputs sentence hidden states, and a word LSTM decoder that decodes the sentence hidden states into natural language. In this way, reports are generated sentence by sentence.
Sentence-level LSTM To generate the t-th sentence, MedWriter first uses the previous t − 1 sentences to learn the sentence-level hidden state h^s_t. Specifically, MedWriter learns the image feature v^s based on Eq. (3). When calculating the attention score with Eq. (2), we consider both the information from the previous t − 1 sentences (the hidden state h^s_{t−1}) and the predicted disease representation from Eq. (1), i.e., we replace c_pred with concat(h^s_{t−1}, c_pred). Then the concatenation of the image feature v^s, the report template representation r_s from Eq. (6), and the sentence template representation u^s_{t−1} is fed into the sentence LSTM to learn the hidden state h^s_t:

h^s_t = LSTM_s(concat(v^s, u^s_{t−1}, r_s), h^s_{t−1}),   (7)

where u^s_{t−1} is obtained using the multi-query attention: the key vector is the hidden state h^s_{t−1}, the value vectors are the representations of the sentences retrieved according to the (t − 1)-th sentence, and the query vectors are the embeddings of both the medical keywords extracted from the retrieved sentences and the predicted disease labels.
Word-level LSTM Based on the learned h^s_t, MedWriter conducts word-by-word generation with a word-level LSTM. To generate the (i + 1)-th word, MedWriter first learns the image feature v^w using Eq. (2) by replacing c_pred with h^w_i, where h^w_i is the hidden state of the i-th word. MedWriter then learns the sentence template representation u^w_i using the multi-query attention, where the key vector is the hidden state h^w_i, and the value and query vectors are the same as those used for calculating u^s_{t−1}. Finally, the concatenation of h^s_t, u^w_i, v^w, and r_s is taken as the input of the word-level LSTM to generate the (i + 1)-th word:

h^w_{i+1} = LSTM_w(concat(h^s_t, u^w_i, v^w, r_s), h^w_i),   p_{i+1} = softmax(FFN(h^w_{i+1})),   (8)

where FFN(·) is a feed-forward network. Note that for the first sentence, we set u^s_0 to 0 and randomly initialize h^s_0 to learn the sentence-level hidden state h^s_1. When generating the words of the first sentence, we set u^w_i to the 0 vector.

Datasets For both datasets, we tokenize all words with more than 3 occurrences and obtain 1,252 tokens on the Open-i dataset and 4,073 tokens on the MIMIC-CXR dataset, including four special tokens PAD, START, END, and UNK. The findings and impression sections are concatenated as the ground-truth reports. We randomly divide each dataset into train/validation/test sets with a ratio of 0.7/0.1/0.2. For the disease classification task, we use the 20 most frequent finding keywords extracted from MeSH tags as disease categories on the Open-i dataset and the 14 CheXpert categories on the MIMIC-CXR dataset.
Baselines On both datasets, we compare with four state-of-the-art image captioning models: CNN-RNN (Vinyals et al., 2015), CoAttn (Jing et al., 2018), MvH+AttL (Yuan et al., 2019), and V-L Retrieval. V-L Retrieval only uses the retrieved report template with the highest probability from our pretrained VLR module as the prediction, without the generation part. Due to the lack of open-source code for (Wang et al., 2018; Li et al., 2019, 2018; Donahue et al., 2015) and of the template databases for (Li et al., 2019, 2018), we only include their reported results on the Open-i dataset in our experiments.

Experimental setup
All input images are resized to 512 × 512, and the feature map from DenseNet-121 (Huang et al., 2017) is 1024 × 16 × 16. During training, we use random cropping and color histogram equalization for data augmentation.
To pretrain the VLR module, the maximum length of the report is restricted to 128 words. We train the VLR module for 100 epochs with an Adam (Kingma and Ba, 2014) optimizer, with an initial learning rate of 1e-5, 1e-5 for L2 regularization, and a mini-batch size of 16. To pretrain the LLR module, the maximum length of each sentence is set to 32 words. We optimize the LLR module for 100 epochs with an Adam (Kingma and Ba, 2014) optimizer with an initial learning rate of 1e-5 and a mini-batch size of 64. The learning rate is multiplied by 0.2 every 20 epochs.
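The LLR learning-rate schedule above (initial rate 1e-5, multiplied by 0.2 every 20 epochs) amounts to a simple step schedule; `step_lr` is an illustrative name, not the authors' code.

```python
def step_lr(epoch, base_lr=1e-5, gamma=0.2, step=20):
    """Step schedule used when pretraining the LLR module: the learning
    rate is multiplied by gamma=0.2 every 20 epochs."""
    return base_lr * gamma ** (epoch // step)
```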
To train the full model for MedWriter, we set the number of retrieved reports k_r = 5 and the number of retrieved sentences k_s = 5. We extract n = 5 medical keywords and predict m = 5 disease labels for report generation. Both the sentence and word LSTMs have 512 hidden units. We freeze the weights of the pretrained VLR and LLR modules and only optimize the language decoder. We set the initial learning rate to 3e-4 and the mini-batch size to 32. MedWriter takes 10 hours to train on the Open-i dataset and 3 days on the MIMIC-CXR dataset with four GeForce GTX 1080 Ti GPUs.

Table 1: Automatic evaluation on the Open-i and MIMIC-CXR datasets. * indicates the results reported in (Li et al., 2019).

Language evaluation Table 1 shows the CIDEr, ROUGE-L, BLEU, and AUC scores achieved by different methods on the test sets of Open-i and MIMIC-CXR. From Table 1, we make the following observations. First, compared with the generation-based models, the retrieval-based model that directly uses the template reports as results sets up a relatively strong baseline for medical report generation. Second, compared with V-L Retrieval, the other retrieval-based approaches perform much better in terms of all the metrics. This shows that by integrating information retrieval into the deep sequence generation framework, we can not only use the retrieved language information as templates to help generate long sentences, but also overcome the monotony of using the templates alone as the generations. Finally, the proposed MedWriter achieves the highest language scores on 5 of 6 metrics on the Open-i dataset and on all metrics on MIMIC-CXR among all methods. MedWriter not only improves over the current SOTA models CoAttn (Jing et al., 2018) by 5% and MvH+AttL (Yuan et al., 2019) by 4% on average on Open-i, but also goes beyond SOTA retrieval-based approaches like KERP (Li et al., 2019) and HRGR-Agent (Li et al., 2018) and significantly improves the performance, even without using manually curated template databases. This illustrates the effectiveness of automatically learning templates and adopting hierarchical retrieval in writing medical reports.
Clinical evaluation We train a report classification BERT model on each dataset and use it to judge whether the generated reports correctly reflect the ground-truth findings. The last column of Table 1 shows the mean ROC-AUC scores achieved by the reports generated by the different baselines. MedWriter achieves the highest AUC scores among all baselines. In addition, our method achieves AUC scores very close to those of professional doctors' reports, with 0.814/0.915 and 0.833/0.923 on the two datasets. This shows that the generation performance of MedWriter approaches the level of human domain experts, and that it holds great potential for identifying disease-related medical findings.
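For reference, the per-label ROC-AUC underlying the mean score can be computed rank-wise; this is the standard definition (the probability that a positive example outranks a negative one), not code from the paper, and in practice a library routine such as scikit-learn's `roc_auc_score` would be used.

```python
def roc_auc(labels, scores):
    """Rank-based ROC-AUC: the probability that a positive example is
    scored above a negative one (ties count half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```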
Human evaluation We also qualitatively evaluate the quality of the generated reports via a user study. We randomly select 50 samples from the Open-i test set and collect the ground-truth reports and the reports generated by both MvH+AttL (Yuan et al., 2019) and MedWriter. Two experienced radiologists were asked to rate each selected report in terms of whether it is realistic and relevant to the X-ray images. The ratings are integers from one to five; the higher, the better.

Qualitative analysis Figure 3 shows qualitative results of MedWriter and baseline models on the Open-i dataset. MedWriter not only produces longer reports than MvH+AttL but also accurately detects the medical findings in the images (marked in red and bold). Moreover, we find that MedWriter is able to put forward supplementary suggestions (marked in blue) and descriptions that are not in the original report but have diagnostic value. This merit stems from the memory retrieval mechanism, which introduces prior medical knowledge to facilitate the generation process.

Ablation study
We perform ablation studies on the Open-i and MIMIC-CXR datasets to investigate the effectiveness of each module in MedWriter. In each of the following studies, we change one module while keeping the others intact.
Figure 3: Examples of ground-truth and generated reports by MvH+AttL (Yuan et al., 2019) and MedWriter. Highlighted red phrases are medical abnormality terms that the generated and ground-truth reports have in common. Bold terms are common descriptions of normal tissues. Text in italics contradicts the ground-truth report. Supplementary comments to the original report are marked in blue.

Removing the VLR module In this experiment, the global report feature r_s is neglected in Eqs. (7) and (8), and the first sentence is generated based only on image features. The LLR module keeps its functionality; however, instead of looking for sentence-level templates in the retrieved reports, it searches for the most relevant sentences among all reports. As can be seen from Table 3, removing the VLR module ("w/o VLRM") reduces performance by 2% on average. This demonstrates that visual-language retrieval is capable of sketching out the linguistic structure of the whole report, and that the rest of the language generation is largely influenced by report-level context information.
Removing the LLR module The generation of the (t + 1)-th sentence is based only on the global report feature r_s and the image feature v, without using the retrieved sentence information in Eq. (8). Table 3 shows that removing the LLR module ("w/o LLRM") decreases the average evaluation scores by 4% compared with the full model. This verifies that the LLR module plays an essential role in generating long and coherent clinical reports.
Replacing the hierarchical language decoder We use a single-layer LSTM that treats the whole report as one long sentence and conducts the generation word by word. Table 3 shows that replacing the hierarchical language decoder with a single-layer LSTM ("w/o HLD") introduces a dramatic performance reduction. This shows that the hierarchical generative model greatly improves performance on long text generation tasks.

Conclusions
Automatically generating accurate reports from medical images is a key challenge in medical image analysis. In this paper, we propose a novel model named MedWriter to solve this problem based on hierarchical retrieval techniques. MedWriter consists of three main modules: the visual-language retrieval (VLR) module, the language-language retrieval (LLR) module, and the hierarchical language decoder. These three modules work tightly together to automatically generate medical reports. Experimental results on two datasets demonstrate the effectiveness of the proposed MedWriter, and qualitative studies show that MedWriter is able to generate meaningful and realistic medical reports.