Automated Generation of Accurate & Fluent Medical X-ray Reports

Our paper aims to automate the generation of medical reports from chest X-ray image inputs, a critical yet time-consuming task for radiologists. Existing medical report generation efforts emphasize producing human-readable reports, yet the generated text may not be well aligned with the clinical facts. Our generated medical reports, on the other hand, are fluent and, more importantly, clinically accurate. This is achieved by our fully differentiable and end-to-end paradigm containing three complementary modules: taking the chest X-ray images and the clinical history document of a patient as inputs, our classification module produces an internal checklist of disease-related topics, referred to as enriched disease embedding; the embedding representation is then passed to our transformer-based generator to produce the medical report; meanwhile, our generator also creates a weighted embedding representation, which is fed to our interpreter to ensure consistency with respect to the disease-related topics. Empirical evaluations demonstrate very promising results achieved by our approach on commonly-used metrics concerning language fluency and clinical accuracy. Moreover, noticeable performance gains are consistently observed when additional input information is available, such as the clinical document and extra scans from different views.


Introduction
Medical reports are the primary medium through which physicians communicate findings and diagnoses from the medical scans of patients. The process is usually laborious: typing out a medical report takes five to ten minutes on average (Jing et al., 2018), and it can also be error-prone. This has led to a surging need for automated generation of medical reports, to assist radiologists and physicians in making rapid and meaningful diagnoses. The potential efficiency gains and benefits could be enormous, especially during critical situations such as the COVID-19 pandemic or a similar one. Clearly, a successful medical report generation process is expected to possess two key properties: 1) clinical accuracy, to properly and correctly describe the diseases and related symptoms; 2) language fluency, to produce realistic and human-readable text.

* indicates equal contribution. Code is available at https://github.com/ginobilinie/xray_report_generation
Fueled by recent progress in the closely related computer vision problem of image-based captioning (Vinyals et al., 2015; Tran et al., 2020), there have been a number of research efforts in medical report generation in recent years (Jing et al., 2018, 2019; Li et al., 2018; Xue et al., 2018; Yuan et al., 2019; Wang et al., 2018; Lovelace and Mortazavi, 2020; Srinivasan et al., 2020). These methods often perform reasonably well in addressing the language fluency aspect; on the other hand, as also evidenced in our empirical evaluation, their results are notably less satisfactory in terms of clinical accuracy. We attribute this to two reasons. One is closely tied to the textual characteristics of medical reports, which typically consist of many long sentences describing various disease-related symptoms and topics in precise and domain-specific terms. This clearly sets the medical report generation task apart from a typical image-to-text problem such as image-based captioning. The other reason is the lack of full use of the rich contextual information that encodes prior knowledge. This information includes, for example, the patient's clinical document describing key clinical history and indications from doctors, and multiple scans from distinct views, information that typically exists in abundance in practical scenarios, as e.g. in the standard X-ray benchmarks of Open-I (Demner-Fushman et al., 2016) and MIMIC-CXR (Johnson et al., 2019).
The aforementioned observations motivate us to propose a categorize-generate-interpret framework that places specific emphasis on clinical accuracy while maintaining adequate language fluency of the generated reports. It consists of a classifier module that reads chest X-ray images (either single-view or multi-view) and related documents to detect diseases and output an enriched disease embedding, a transformer-based medical report generator, and a differentiable interpreter to evaluate and fine-tune the generated reports for factual correctness.

Figure 1: Our approach consists of three modules: a classifier that reads chest X-ray images and clinical history to produce an internal checklist of disease-related topics, a transformer-based generator to generate fluent text, and an interpreter to examine and fine-tune the generated text to be consistent with the disease-related topics.

The main contributions are two-fold:

• A differentiable end-to-end approach is proposed, consisting of three modules (classifier-generator-interpreter): the classifier module learns the disease feature representation via context modeling (section 3.1.3) and a disease-state aware mechanism (section 3.1.4); the generator module transforms the disease embedding into a medical report; the interpreter module reads and fine-tunes the generated reports, enhancing the consistency between the generated reports and the classifier's outputs.
• Empirically, our approach is shown to outperform a number of strong baselines over two widely-used benchmarks on an equal footing (i.e., without access to additional information). Moreover, empirical evidence demonstrates that clinical patient history as well as additional scans may play a vital role in improving the quality of the generated reports.
It is worth noting that most existing methods concentrate on the image-to-fluent-text aspect of the medical report generation problem; on the other hand, their results are considerably less adept at uncovering the intended disease and symptom related topics in the generated texts, the true gems upon which physicians base their decisions. To alleviate this issue, a graph-based approach has been considered: it starts by compiling a list of common abnormalities, transforms them into correlated disease graphs, and categorizes medical reports into templates for paraphrasing. Its practical performance is however less than stellar, which may be attributed to the fact that the approach is fundamentally based on detecting abnormalities from medical images, and thus may overlook other important information.

Transformers
The transformer (Vaswani et al., 2017) was first introduced in the context of machine translation with the purpose of expediting training and improving long-range dependency modeling. These goals are achieved by processing sequential data in parallel with an attention mechanism, consisting of a multi-head self-attention module and a feed-forward layer. By adopting multi-head self-attention mechanisms, including e.g. the graph attention network (Velickovic et al., 2018), recent transformer-based models have shown considerable advancement in many difficult tasks, such as image generation, story generation (Radford et al., 2018), question answering, and language inference (Devlin et al., 2019).

CheXpert Labeler
The CheXpert labeler (Irvin et al., 2019) is a rule-based system that extracts mentions from medical reports and classifies them into 14 common diseases. Each disease label is either positive, negative, uncertain, or unmentioned. The labeler is a crucial component in building large-scale chest X-ray datasets such as (Irvin et al., 2019; Johnson et al., 2019), where an alternative manual labeling process would take years of effort. It can also be used to evaluate the clinical accuracy of a generated medical report. Another important use of the CheXpert labeler is to facilitate the generation of medical reports. Since the rule-based CheXpert labeler is not differentiable, it has been treated as a score function estimator in reinforcement learning models that fine-tune the generated texts. However, such reinforcement learning methods are often computationally expensive and practically difficult to converge. As an alternative, Lovelace and Mortazavi (2020) propose an attention LSTM model and fine-tune the generated report via a differentiable Gumbel random sampling trick, with promising results.
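To make the Gumbel sampling trick referenced above concrete, the following is a minimal PyTorch sketch; the vocabulary size and temperature are illustrative, and this is not the exact model of Lovelace and Mortazavi (2020).

```python
# A minimal sketch of the Gumbel-softmax trick: a discrete word choice that
# still passes gradients to the generator's logits. Shapes are illustrative.
import torch
import torch.nn.functional as F

logits = torch.randn(1, 5000, requires_grad=True)           # scores over a 5000-word vocab
soft_one_hot = F.gumbel_softmax(logits, tau=1.0, hard=True)  # differentiable "sample"
word_id = soft_one_hot.argmax(dim=-1)                        # discrete word for the report
```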

Our Approach
Our framework consists of a classification module, a generation module, and an interpretation module, as illustrated in Fig. 1. The classification module reads multiple chest X-ray images and extracts a global visual feature representation via a multi-view image encoder, which is then disentangled into multiple low-dimensional visual embeddings. Meanwhile, the text encoder reads the clinical documents, including, e.g., the doctor's indication, and summarizes the content into a text-summarized embedding. The visual and text-summarized embeddings are entangled via an "add & layerNorm" operation to form contextualized embeddings in terms of disease-related topics. The generation module takes our enriched disease embedding as initial input and generates text word-by-word, as shown in Fig. 2. Finally, the generated text is fed to the interpretation module for fine-tuning, to align with the checklist of disease-related topics from the classification module. In what follows, we elaborate on these three modules in detail.

Figure 2: An exemplar illustration of our approach in action. Specifically, the enriched disease embedding produced by the classification module is fed into the generation module as the initial input. Then, at each time step, the hidden state $h_i$ is obtained to predict the next output word. Finally, the interpretation module takes as input all predicted outputs $\hat{W}$ to produce a checklist of disease-related topics, which is gauged against the same topics output by the classification module for consistency verification.

Multi-view Image Encoder
For each medical study consisting of m chest X-ray images, we extract a set of latent features $\{x_1, x_2, \ldots, x_m\}$, with each $x_i \in \mathbb{R}^c$, where c is the number of features, via a shared DenseNet-121 image encoder (Huang et al., 2017). Then, the multi-view latent features $x \in \mathbb{R}^c$ are obtained by max-pooling across the set of m latent features $\{x_i\}_{i=1}^{m}$, as proposed in (Su et al., 2015). When m = 1, the multi-view encoder reduces to a single-image encoder.
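Below is a minimal PyTorch sketch of such a multi-view encoder, assuming a torchvision DenseNet-121 backbone with global average pooling over the spatial map; the class and variable names are ours, not taken from the released code.

```python
# A sketch of the multi-view image encoder: one shared backbone per view,
# followed by max-pooling across views. For DenseNet-121, c = 1024.
import torch
import torch.nn as nn
import torchvision

class MultiViewImageEncoder(nn.Module):
    def __init__(self):
        super().__init__()
        # Shared backbone applied to every view.
        self.backbone = torchvision.models.densenet121(weights=None).features

    def forward(self, images):
        # images: (batch, m, 3, H, W); m chest X-ray views per study.
        b, m = images.shape[:2]
        feats = self.backbone(images.flatten(0, 1))    # (b*m, c, h, w)
        feats = feats.mean(dim=(2, 3)).view(b, m, -1)  # (b, m, c) global features
        # Max-pool across the m views; with m = 1 this is a single-image encoder.
        x, _ = feats.max(dim=1)                        # (b, c)
        return x
```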

Text Encoder
Let T be a text document of length l consisting of word embeddings $\{w_1, w_2, \ldots, w_l\}$, where $w_i \in \mathbb{R}^e$ embodies the i-th word in the text and e is the embedding dimension. We use the transformer encoder (Vaswani et al., 2017) as our text feature extractor to retrieve a set of hidden states $H = \{h_1, h_2, \ldots, h_l\}$, where $h_i \in \mathbb{R}^e$ is the feature of the i-th word attended to the other words in the text,

$$H = \mathrm{Transformer}(w_1, w_2, \ldots, w_l). \quad (1)$$

The entire document T is then summarized by $Q = \{q_1, q_2, \ldots, q_n\}$, representing n disease-related topics (e.g., pneumonia or atelectasis) to be queried from the document. We refer to the result of this retrieval process as the text-summarized embedding $D_{txt} \in \mathbb{R}^{n \times e}$,

$$D_{txt} = \mathrm{Softmax}(Q H^\top) H. \quad (2)$$

Here the matrix $Q \in \mathbb{R}^{n \times e}$ is formed by stacking the set of vectors $\{q_1, q_2, \ldots, q_n\}$, where each $q_i \in \mathbb{R}^e$ is randomly initialized and then learned via the attention process. Similarly, the matrix $H \in \mathbb{R}^{l \times e}$ is formed by $\{h_1, h_2, \ldots, h_l\}$ from Eq. (1). The term $\mathrm{Softmax}(Q H^\top)$ is the word attention heat-map for the n queried diseases in the document. The intuition here is that, for each disease (e.g., pneumonia) to be queried from the text document T, we only pay attention to the most relevant words (e.g., cough or shortness of breath) in the text that are associated with that disease, computed as a vector similarity (dot product). This way, the weighted sum of these words in Eq. (2) gives the feature that summarizes the document w.r.t. the queried disease.
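The disease-query attention of Eq. (2) can be sketched in a few lines of PyTorch; the module below is an illustrative reconstruction, with the queries Q as learnable parameters as described above.

```python
# A sketch of Eq. (2): n learnable disease queries attend over the word
# features H produced by the transformer text encoder of Eq. (1).
import torch
import torch.nn as nn

class DiseaseQuerySummarizer(nn.Module):
    def __init__(self, n_diseases: int, e: int):
        super().__init__()
        # Randomly initialized disease queries, learned via the attention process.
        self.Q = nn.Parameter(torch.randn(n_diseases, e))

    def forward(self, H):
        # H: (l, e) attended word features; attn: (n, l) word heat-map per disease.
        attn = torch.softmax(self.Q @ H.transpose(-2, -1), dim=-1)
        return attn @ H  # D_txt: (n, e), one summary vector per queried disease
```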

Contextualized Disease Embedding
The latent visual features $x \in \mathbb{R}^c$ are subsequently decoupled into low-dimensional disease representations, as illustrated in Fig. 1. They are regarded as the visual embedding $D_{img} \in \mathbb{R}^{n \times e}$, where each row is a vector $\varphi_j(x) \in \mathbb{R}^e$, $j = 1, \ldots, n$, defined as

$$\varphi_j(x) = x^\top A_j + b_j. \quad (3)$$

Here $A_j \in \mathbb{R}^{c \times e}$ and $b_j \in \mathbb{R}^e$ are learnable parameters of the j-th disease representation, n is the number of disease representations, and e is the embedding dimension. Now, together with the available clinical documents, the visual embedding $D_{img}$ and the text-summarized embedding $D_{txt}$ are entangled to form the contextualized disease representation $D_{fused} \in \mathbb{R}^{n \times e}$,

$$D_{fused} = \mathrm{LayerNorm}(D_{img} + D_{txt}). \quad (4)$$

Intuitively, the entanglement of visual and textual information allows our model to mimic the hospital workflow, screening the disease's visual representations conditioned on the patient's clinical history or the doctor's indication. For example, the doctor's indication in Fig. 1 shows cough and shortness-of-breath symptoms, so it is reasonable for a medical doctor to request a follow-up check for pneumonia. The radiologists receiving the doctor's indication may accordingly prioritize diagnosing the presence of pneumonia and related diseases based on the X-ray scans and look for specific abnormalities. As empirically shown in Table 4, the proposed contextualized disease representations bring a significant performance boost to the medical report generation task. Meanwhile, this embedding is basically a plain mingling of heterogeneous sources of information such as disease type (i.e., disease name) and disease state (e.g., positive or negative). As shown by the ablation study in Table 4, this embedding by itself is insufficient for generating accurate medical reports. This leads us to conceive the enriched representation below.
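A sketch of Eqs. (3)-(4) follows; the batched einsum over the n projection pairs $(A_j, b_j)$ is our own compact formulation of the per-disease projections, and the "add & layerNorm" fusion follows the description above.

```python
# A sketch of Eqs. (3)-(4): per-disease projections of the visual features,
# fused with the text-summarized embedding by add & layerNorm.
import torch
import torch.nn as nn

class ContextualizedDiseaseEmbedding(nn.Module):
    def __init__(self, n: int, c: int, e: int):
        super().__init__()
        self.A = nn.Parameter(torch.randn(n, c, e) * 0.02)  # A_j for j = 1..n
        self.b = nn.Parameter(torch.zeros(n, e))             # b_j
        self.norm = nn.LayerNorm(e)

    def forward(self, x, D_txt):
        # x: (c,) multi-view visual features; D_txt: (n, e) from the text encoder.
        D_img = torch.einsum('c,nce->ne', x, self.A) + self.b  # Eq. (3)
        return self.norm(D_img + D_txt)                        # Eq. (4): D_fused
```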

Enriched Disease Embedding
The main idea behind enriched disease embedding is to further encode informative attributes about disease states, such as positive, negative, uncertain, or unmentioned. Formally, let k be the number of states and $S \in \mathbb{R}^{k \times e}$ the state embedding, which is randomly initialized and then learned via the classification of $D_{fused}$. The confidence of classifying each disease into one of the k disease states is

$$p = \mathrm{Softmax}(D_{fused} S^\top). \quad (5)$$

$D_{fused}$ acts as the feature for this multi-label classification, and the classification loss is computed as

$$\mathcal{L}_{cls} = -\sum_{i=1}^{n} \sum_{j=1}^{k} y_{ij} \log(p_{ij}), \quad (6)$$

where $y_{ij} \in \{0, 1\}$ and $p_{ij} \in (0, 1)$ are the j-th ground-truth and predicted state values for the i-th disease, respectively. The state-aware embedding $D_{states} \in \mathbb{R}^{n \times e}$ is then computed as

$$D_{states} = \begin{cases} yS, & \text{training phase} \\ pS, & \text{otherwise.} \end{cases} \quad (7)$$

Here $y \in \{0, 1\}^{n \times k}$ holds the one-hot ground-truth labels of the disease-related topics, whereas $p \in (0, 1)^{n \times k}$ holds the predicted values. During training, the ground-truth disease states facilitate our generator in describing the diseases and related symptoms based on accurate information (teacher forcing). At test time, our generator instead furnishes its recount based on the predicted states. Finally, the enriched disease embedding $D_{enriched} \in \mathbb{R}^{n \times e}$ is the composition of the state-aware disease embedding $D_{states}$ (i.e., good or bad), the disease names $D_{topics}$ (i.e., which disease/topic), and the disease representations $D_{fused}$ (i.e., severity and details of the diseases),

$$D_{enriched} = D_{states} + D_{topics} + D_{fused}. \quad (8)$$

Like the disease queries Q, $D_{topics} \in \mathbb{R}^{n \times e}$ is randomly initialized, representing the diseases or topics to be generated; it is then learned in training through the medical report generation pipeline. The enriched disease embedding provides explicit and precise disease descriptions, and endows our follow-up generation module with a powerful data representation.
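The teacher-forcing switch of Eq. (7) and the composition of Eq. (8) can be sketched as follows; the plain-sum fusion mirrors our reconstruction of Eq. (8) and is illustrative rather than definitive.

```python
# A sketch of Eqs. (5)-(8), assuming k = 4 states (positive, negative,
# uncertain, unmentioned). Ground-truth states y are used during training
# (teacher forcing); predicted soft states p are used at test time.
import torch
import torch.nn as nn

class EnrichedDiseaseEmbedding(nn.Module):
    def __init__(self, n: int, k: int, e: int):
        super().__init__()
        self.S = nn.Parameter(torch.randn(k, e))       # state embedding S
        self.topics = nn.Parameter(torch.randn(n, e))  # D_topics, learned per disease

    def forward(self, D_fused, y=None):
        # D_fused: (n, e); y: (n, k) one-hot float ground-truth states (training only).
        p = torch.softmax(D_fused @ self.S.t(), dim=-1)   # Eq. (5): state confidence
        D_states = (y if y is not None else p) @ self.S   # Eq. (7): yS or pS
        return D_states + self.topics + D_fused, p        # Eq. (8): D_enriched
```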

The Generation Module
Our report generator is derived from the transformer encoder of (Vaswani et al., 2017). The network is formed by stacking a masked multi-head self-attention component and a feed-forward layer on top of each other N times, as illustrated in Fig. 2. The hidden state $h_i \in \mathbb{R}^e$ for each word position in the medical report is then computed based on the previous words and the disease embedding $D_{enriched} = \{d_i\}_{i=1}^{n}$,

$$h_i = \mathrm{Encoder}(w_i \mid w_1, w_2, \ldots, w_{i-1}, d_1, d_2, \ldots, d_n). \quad (9)$$

This is followed by predicting future words based on the hidden states,

$$p_{word} = \mathrm{Softmax}(H W^\top). \quad (10)$$

Here $W \in \mathbb{R}^{v \times e}$ is the entire vocabulary embedding, v is the vocabulary size, and l is the document length. Let $p_{word,ij}$ denote the confidence of selecting the j-th word of the vocabulary W for the i-th position in the generated medical report. The generator loss is defined as the cross entropy between the ground-truth words $y_{word}$ and the predicted words $p_{word}$,

$$\mathcal{L}_{gen} = -\sum_{i=1}^{l} \sum_{j=1}^{v} y_{word,ij} \log(p_{word,ij}). \quad (11)$$

Finally, the weighted word embeddings $\hat{W} \in \mathbb{R}^{l \times e}$, also known as the generated report, are

$$\hat{W} = p_{word} W. \quad (12)$$

It is worth noting that this set-up facilitates the backpropagation of errors from the follow-up interpretation module.
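A compact PyTorch sketch of Eqs. (10)-(12) is given below; the shapes and names are illustrative.

```python
# A sketch of Eqs. (10)-(12): project hidden states onto the vocabulary
# embedding W, compute the cross-entropy loss, and form the weighted word
# embedding W_hat through which gradients can flow back from the interpreter.
import torch
import torch.nn.functional as F

def generate_step(H, W, y_word):
    # H: (l, e) hidden states; W: (v, e) vocabulary embedding; y_word: (l,) targets.
    logits = H @ W.t()                          # (l, v)
    p_word = torch.softmax(logits, dim=-1)      # Eq. (10)
    loss_gen = F.cross_entropy(logits, y_word)  # Eq. (11)
    W_hat = p_word @ W                          # Eq. (12): differentiable "report"
    return loss_gen, W_hat
```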

The Interpretation Module
It is observed from empirical evaluations that the generated reports are often distorted in the process, such that they become inconsistent with the original output of the classification module: the enriched disease embedding that encodes the disease and symptom related topics. Inspired by the idea of CycleGAN, we consider a fully differentiable network module to estimate the checklist of disease-related topics based on the generator's output, and to compare it with the original output of the classification module. This provides a meaningful feedback loop to regulate the generated reports, which is used to fine-tune the generated report through the word representation outputs $\hat{W}$. Specifically, on top of the proposed text encoder (described in section 3.1.2) we build a classification network that classifies disease-related topics, as follows. First, the text encoder summarizes the current medical report $\hat{W}$ and outputs the report-summarized embedding of the queried diseases Q,

$$\hat{D}_{txt} = \mathrm{Softmax}(Q \hat{H}^\top) \hat{H}. \quad (13)$$

Here $\hat{H}$ is computed from the generated medical report $\hat{W}$ using Eq. (1). Second, each report-summarized embedding $\hat{d}_i \in \mathbb{R}^e$ (i.e., each row of the matrix $\hat{D}_{txt} \in \mathbb{R}^{n \times e}$) is classified into one of the k disease-related states (i.e., positive or negative), analogously to Eq. (5),

$$p_{int} = \mathrm{Softmax}(\hat{D}_{txt} S^\top). \quad (14)$$

Finally, the interpreter is trained to minimize the following multi-label classification loss,

$$\mathcal{L}_{int} = -\sum_{i=1}^{n} \sum_{j=1}^{k} y_{ij} \log(p_{int,ij}), \quad (15)$$

where $y_{ij} \in \{0, 1\}$ is the ground-truth disease label and $p_{int,ij} \in (0, 1)$ is the predicted disease label of the interpreter. In fine-tuning the generated medical reports $\hat{W}$, all interpreter parameters are frozen; the interpreter acts as a guide that forces the word representations $\hat{W}$ to be close to what it has learned from the ground-truth medical reports. If the weighted word embedding $\hat{W}$ deviates from the learned representation, leading to incorrect classification, a large loss value is imposed in the interpretation module. This pushes the generator toward producing a correct word representation.
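The frozen-interpreter feedback can be sketched as follows, assuming `interpreter` wraps the text encoder and state classifier of Eqs. (13)-(15); the function is an illustrative reconstruction, not the released implementation.

```python
# A sketch of the interpreter feedback loop: the interpreter's parameters are
# frozen, so minimizing its loss only shapes the generator's output W_hat.
import torch.nn.functional as F

def interpreter_loss(interpreter, W_hat, y):
    # W_hat: (l, e) weighted word embeddings from the generator;
    # y: (n,) ground-truth disease-state targets.
    for p in interpreter.parameters():
        p.requires_grad_(False)        # freeze: gradients flow into W_hat only
    logits = interpreter(W_hat)        # (n, k) disease-state logits, Eqs. (13)-(14)
    return F.cross_entropy(logits, y)  # Eq. (15), backpropagated to the generator
```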
Collectively, our model is trained in an end-to-end manner by jointly minimizing the total loss,

$$\mathcal{L}_{total} = \mathcal{L}_{cls} + \mathcal{L}_{gen} + \mathcal{L}_{int}. \quad (16)$$

Experiments
This section evaluates the medical report generation task on two fronts: language performance and clinical accuracy. Empirical evaluations are carried out on two widely-used chest X-ray datasets, Open-I (Demner-Fushman et al., 2016) and MIMIC-CXR (Johnson et al., 2019). Following (Lovelace and Mortazavi, 2020), we focus on generating text in the "findings" section as the corresponding medical report.

Open-I Dataset
The Open-I dataset (Demner-Fushman et al., 2016), collected by the Indiana University hospital network, contains 3,955 radiology studies that correspond to 7,470 frontal and lateral chest X-rays. Some radiology studies are associated with more than one chest X-ray image. Each study typically consists of impression, findings, comparison, and indication sections. Similar to the MIMIC-CXR dataset, we utilize both the multi-view chest X-ray images (frontal and lateral) and the indication section as our contextual inputs. For generating medical reports, we follow the existing literature (Jing et al., 2018; Srinivasan et al., 2020) by concatenating the impression and findings sections as the target output. An important note: the implementation details, dataset splits, preprocessing steps, generated examples, and qualitative analysis are described in the supplementary materials.

Language Generation Performance
Table 1 presents a comprehensive quantitative comparison of our approach and many baselines on the two benchmarks using the widely-used language evaluation metrics: BLEU-1 to BLEU-4 (Papineni et al., 2002), ROUGE-L (Lin, 2004), and METEOR (Banerjee and Lavie, 2005). Since all comparison methods have their own experimental setups, for a fair comparison we further categorize these methods along four aspects: single-view (SV), multi-view (MV), access to additional information (AI) such as the clinical document, and applying fine-tuning (FT) to the generated medical reports. The experiments in Table 1 show that our models outperform the baselines on most language metrics.
With a single X-ray image as the sole input, ours (SV) outperforms by a noticeable margin the best SOTA methods, CoAtt on Open-I and Transformer on MIMIC, respectively. We mainly attribute this to the utilization of the enriched disease embedding that explicitly incorporates disease-related topics. With multiple X-ray images as input, ours (MV) again outperforms the best comparison method, HRG-Transformer, on Open-I. With multiple X-ray images and additional clinical document information as input, ours (MV+T) outperforms the comparison method KERP on Open-I. Finally, with the complete contextual information available as input, ours (MV+T+I) outperforms all the comparison methods on both the Open-I and MIMIC datasets.

Clinical Accuracy Performance
To evaluate the clinical accuracy of the generated reports, we use the LSTM CheXpert labeler (Lovelace and Mortazavi, 2020) as a universal measurement. We compare the different methods based on accuracy, F-1, precision (prec.), and recall (rec.) metrics over 14 common diseases. Since the 14 diseases are independent, we also report the macro and micro scores. Intuitively, a high macro score means the detection of all 14 diseases improves, while a high micro score implies the dominant diseases improve (i.e., some diseases appear more frequently than others). As observed in Table 2, our clinical performance increases significantly over the baselines in both macro and micro scores. Among our ablation models in Table 2, the precision and accuracy scores of our contextualized variant (MV+T) tend to be higher, whereas its other scores are lower than those of the variant with the interpreter (MV+T+I). This opposite behavior is due to the interpreter, which encourages detecting diseases and thus increases False Positives (FP). Note that in the medical context it is usually critically important to lower the False Negative (FN) rate, so a high recall score with a slight decrease in precision is preferred.
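To make the macro/micro distinction concrete, the following snippet contrasts the two averages with scikit-learn on dummy labels; the numbers are illustrative and unrelated to Table 2.

```python
# Macro F1 averages per-disease scores equally; micro F1 pools all decisions,
# so frequently occurring diseases dominate. Labels below are dummy values.
from sklearn.metrics import f1_score

y_true = [[1, 0, 1], [0, 0, 1], [1, 1, 1]]  # 3 samples x 3 diseases
y_pred = [[1, 0, 1], [0, 1, 1], [1, 0, 1]]
print(f1_score(y_true, y_pred, average='macro'))  # unweighted mean over diseases
print(f1_score(y_true, y_pred, average='micro'))  # pooled over all decisions
```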

Human Evaluation
In addition to the automated evaluations, we ask an experienced medical doctor to evaluate our generated medical reports. Specifically, the chest X-ray images and ground-truth medical reports are given to the doctor. Then, the doctor evaluates the quality of the generated reports by assigning a score from 0 (totally disagree) to 10 (totally agree). The final score for each model is computed by averaging all scores (97 test samples for each proposed model).
It can be inferred from Table 3 that MV+T+I gives more accurate medical reports and that using the interpreter to fine-tune the outputs indeed improves the reports' quality. It is also clear from the human evaluation that incorporating clinical history information positively affects the final performance. Moreover, the human evaluation shows that most generated examples are good (8.031 on average), indicating the proposed model's effectiveness in terms of clinical accuracy.

Ablation Studies

Enriched Disease Embedding
We observe that the latent features $D_{fused}$ extracted from the classifier are insufficient for generating robust medical reports, as shown in Table 4. In human language, a meaningful story needs three factors: the topic (i.e., which disease), the tone (i.e., negative or positive), and the details (i.e., the severity). However, there is no guarantee that the learned latent features $D_{fused}$ contain all three required elements. On the other hand, with the explicit representations (i.e., $D_{fused}$, $D_{topics}$, and $D_{states}$), all three factors are preserved. Therefore, the enriched disease embedding $D_{enriched}$ can generate precise and complete medical reports, leading to substantial improvements in the language metrics.

Contextualized embedding
Table 4 also shows that our proposed "contextualized" version improves the language scores over the "regular" version, which reads only images. Notably, the contextualized version entangles the chest X-ray images with the clinical history, which is crucial for improving the generated report's quality and accommodating doctors' practical needs. It mimics how radiologists receive requests from medical doctors and write reports to answer their questions. Hence, the generated reports are believed to be more "on point" and receive higher language scores than in the regular "image-to-text" setting.

Limitations and Future Work
Our work has several limitations that future work can take into consideration for further improvement. Firstly, our model does not explicitly consider disease orientation or location (e.g., left or right, top or bottom). For example, future work could include visual-semantic embeddings (direction/orientation/location) to learn and localize diseases while generating medical reports. Secondly, our work does not support time-series relationships between different studies of a patient. This information is vital for analyzing existing diseases by comparing their size or structure over time, to determine whether a disease is getting worse. If these limitations are addressed, medical report systems can become much more reliable for real-world applications.
Notably, we observe some hallucinated findings (False Positives), where certain diseases are mistakenly described as positive in the generated reports. For example, some images with "pneumonia" are wrongly described as "pulmonary edema". In fact, human radiologists also misclassify some diseases (Satia et al., 2013). For example, Satia et al. (2013) show that radiologists or physicians can accurately identify normal lung X-ray images almost all of the time, but for abnormal lung X-ray images the diagnostic accuracy drops to only about 50%. For this reason, generating accurate medical reports is challenging even for experienced radiologists.
In the future, we will expand our work to related medical applications such as retinal and brain medical report generation on X-ray/MRI/CT scans. We believe that our model can generalize to a wide range of medical report generation problems, as common symptom or disease labels and medical reports are available in most medical scan datasets. Moreover, extending the current work to incorporate tabular data inputs could be another exciting direction, because some clinical information, such as a patient's age, blood pressure, or temperature, comes in tabular form (Cohen et al., 2020). In some cases, physicians must include this information in medical reports, and it cannot be inferred from reading the medical scans alone.

Conclusion
This paper introduces a novel three-module approach for generating medical reports from X-ray scans. Superior performance of our approach over state-of-the-art methods has been empirically demonstrated on widely-used benchmarks with a range of evaluation metrics. Our approach is also flexible and can work with additional input information, where consistent performance gains are observed. For future work, we plan to apply our approach to related medical report generation tasks that go beyond X-rays.