Word Graph Guided Summarization for Radiology Findings

Radiology reports play a critical role in communicating medical findings to physicians. In each report, the impression section summarizes essential radiology findings. In clinical practice, writing the impression is in high demand yet time-consuming and error-prone for radiologists. Therefore, automatic impression generation has emerged as an attractive research direction to facilitate such clinical practice. Existing studies mainly focused on introducing salient word information into the general text summarization framework to guide the selection of key content in radiology findings. However, for this task, a model needs not only to capture the important words in the findings but also to accurately describe their relations so as to generate high-quality impressions. In this paper, we propose a novel method for automatic impression generation, where a word graph is constructed from the findings to record the critical words and their relations, and a Word Graph guided Summarization model (WGSum) is then designed to generate impressions with the help of the word graph. Experimental results on two datasets, OpenI and MIMIC-CXR, confirm the validity and effectiveness of our proposed approach, which achieves state-of-the-art results on both datasets. Further experiments are also conducted to analyze the impact of different graph designs on the performance of our method.


Introduction
A radiology report usually contains a findings section (FINDINGS) describing detailed medical observations and an impression section (IMPRESSION) summarizing the most critical observations.

Figure 1: An example radiology report including its FINDINGS and IMPRESSION, and a word graph constructed from the FINDINGS. Different colored edges in the graph represent different types of relations. The curved arrow indicates the AIG task of generating the IMPRESSION from the FINDINGS.

In practice, the IMPRESSION is an essential part and
plays an important role in delivering critical findings to clinicians. Therefore, summarizing the FINDINGS helps to locate the most prominent observations, and automating this process greatly eases the workload of radiologists. Recently, many methods have been proposed for automatic impression generation (AIG) (Hassanpour and Langlotz, 2016; Gharebagh et al., 2020), which are mainly based on the sequence-to-sequence architecture with designs specific to the characteristics of this task so as to improve performance. Although these efforts are able to find the important words to promote AIG, less attention has been paid to leveraging the relation information among them. For example, in Figure 1, the observation word "effusion" and its modifier word "moderate" have a relation between them (describing the severity of the symptom), and this relation needs to appear in the IMPRESSION. Therefore, to better generate the IMPRESSION, in addition to using important words, it is necessary to recognize the relations among such words in the FINDINGS and describe them accordingly in AIG.
In this paper, we propose to enhance AIG via a summarization model integrated with a word graph, leveraging salient words and their relations in the FINDINGS. In detail, the word graph is constructed by identifying the important words in the FINDINGS and building connections among them via different typed relations. To exploit the word graph, a Word Graph guided Summarization model (WGSUM) is designed to perform AIG, where the information from the word graph is integrated into the backbone decoder (e.g., LSTM (See et al., 2017) or Transformer (Liu and Lapata, 2019)) from two aspects: enriching the decoder input as extra knowledge, and guiding the decoder to update its hidden states. Experimental results illustrate that WGSUM outperforms all baselines on two benchmark datasets, with state-of-the-art performance observed on both datasets in comparison with previous studies. Further analyses also investigate how different types of edges in the graph affect the performance of our proposed model.

The Proposed Method
We follow the standard sequence-to-sequence paradigm for AIG. In doing so, we regard the FINDINGS as the source sequence X = {x_1, ..., x_i, ..., x_N}, where N is the number of tokens in the FINDINGS, and the goal of the task is to generate a target sequence (i.e., the IMPRESSION) Y = {y_1, ..., y_i, ..., y_L}, where L is the number of tokens in the IMPRESSION. An overview of our method is shown in Figure 2, with the details illustrated in the following subsections.

The Overall Structure
The model used in our method contains three major components, i.e., the FINDINGS encoder, the graph encoder, and the graph guided decoder with their details and the training objective described below.
FINDINGS Encoder Given a FINDINGS, denoted by X with N tokens, an LSTM or the standard Transformer encoder is applied to model the sequence, and its output is the hidden state h^x. The process is formulated as

h^x = f_fe(X)    (1)

where f_fe(·) refers to the FINDINGS encoder.
Graph Encoder For the node list V and adjacency matrix A of the graph G constructed from the FINDINGS, we utilize graph neural networks (GNNs) to encode them, because GNNs are powerful in encoding graph-like information (Zheng and Kordjamshidi, 2020; Tian et al., 2021a,b). In detail, two encoders are employed to extract features from G, where one is used to construct the background information and the other is used to generate the dynamic guiding information.
The process is thus formalized as

z^b = f_gb(V, A)    (2)
z^l = f_gl(V, A)    (3)

where f_gb(·) and f_gl(·) refer to the two graph encoders, and z^b and z^l are the intermediate states used to generate the static background information and the dynamic guiding information, respectively.
Graph Guided Decoder In our model, z^b and z^l from the graph encoders are integrated into the backbone decoder (e.g., an LSTM or Transformer decoder) to perform the decoding process via

y_t = f_d(h^x, z^b, z^l, y_{<t})    (4)

where f_d(·) represents the decoder.
Objective Since the FINDINGS and the IMPRESSION are highly related, a pointer generator (PG) is also introduced to our model via

P(y_t | X, y_{<t}) = p_gen · P_vocab(y_t) + (1 − p_gen) Σ_{i: x_i = y_t} a^t_i    (5)

where a^t_i is the attention distribution over source tokens at step t, obtained by performing the attention mechanism on the source tokens; p_gen and (1 − p_gen) are the weights of predicting the next token from the vocabulary or copying it from the source tokens, respectively. The model is then trained to minimize the negative conditional log-likelihood of Y given X and G:

− Σ_{t=1}^{L} log p(y_t | y_1, ..., y_{t−1}, X, G; θ)    (6)

where θ are the parameters of the model.
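As a concrete illustration, the mixture in Equation (5) can be sketched in plain Python; the function and variable names below are illustrative, not taken from the paper's implementation:

```python
def pointer_generator_dist(p_vocab, attn, src_ids, p_gen):
    """Mix the vocabulary distribution with copy probabilities:
    P(y_t) = p_gen * P_vocab(y_t) + (1 - p_gen) * sum of a_i^t
    over source positions i whose token id equals y_t."""
    p_final = [p_gen * p for p in p_vocab]       # generation part
    for a_i, tok in zip(attn, src_ids):
        p_final[tok] += (1.0 - p_gen) * a_i      # copy part
    return p_final
```

If p_vocab and attn each sum to one, the mixture remains a valid probability distribution over the vocabulary.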

Graph Construction
In the FINDINGS, the most critical content that radiologists need to summarize is the abnormality, which usually includes the corresponding specific observations as well as their modifying words. Therefore, in our study, we first extract 5 types of entities from the FINDINGS: anatomy, observation, anatomy modifier, observation modifier, and uncertainty, where these entities are able to cover most of the key words that need to appear in the IMPRESSION. In addition, Gharebagh et al. (2020) has shown that fine-grained words are more effective than the entire ontology. Inspired by this idea, we regard the words from these entities as nodes in our graph. To avoid confusion, repeated words are treated as a single node in each FINDINGS even if they do not appear in the same entity. Besides the modifying relation, some other relations are also important for IMPRESSION generation, such as relations between anatomy and observation. For example, in Figure 1, the relation between "pleural" (anatomy) and "effusion" (observation) is needed to describe the detailed abnormality in the IMPRESSION. To capture these types of relations, we leverage dependency trees, which have been widely used to model word-word relations in many studies (Tian et al., 2020a,b; Pouran Ben Veyseh et al., 2020; Chen et al., 2021). Thus, we define three types of edges for our word graph. Note that each FINDINGS has its corresponding word graph:
• Type I: this type uses the natural order of words in an entity. In detail, we connect words if they are adjacent in the same entity. In Figure 1, the pink dashed lines serve as the Type I edges.
For example, "endotracheal tube" is an entity, so "endotracheal" is connected to "tube".
• Type II: this type uses the relations among entities within the same category (e.g., an observation and its modifier, or an anatomy and its modifier). As shown in Figure 1, given an observation "effusion" and an observation modifier "moderate", the relation is constructed by connecting them with a green dashed line.
• Type III: this type uses the relations among entities across different categories (e.g., observation and anatomy). Different from the previous two types, this type is able to provide global relation information, while the previous two types emphasize local information.
In detail, we construct a dependency tree using stanza in the Universal Dependencies (UD) format (Nivre et al., 2020). As shown in Figure 1, given "effusion" and its head "left", they are connected with a purple dashed line.
Then the nodes and edges in the graph are recorded by a node list V and an adjacency matrix A, which are then used as the input of the graph encoder.
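Putting the pieces together, the construction of V and A can be sketched as follows. This is a simplified sketch under the assumption that the entities and the Type II/III word pairs have already been extracted (e.g., with stanza); all names are illustrative:

```python
def build_word_graph(entities, modifier_pairs, dep_pairs):
    """Build the node list V and adjacency matrix A for one FINDINGS.

    entities      : list of entity word lists, e.g. ["endotracheal", "tube"]
    modifier_pairs: Type II word pairs (entity word, modifier in the same category)
    dep_pairs     : Type III word pairs taken from the dependency tree
    """
    # Repeated words collapse into a single node per FINDINGS.
    V = []
    for words in entities:
        for w in words:
            if w not in V:
                V.append(w)
    idx = {w: i for i, w in enumerate(V)}
    A = [[0] * len(V) for _ in V]

    def connect(u, v):
        A[idx[u]][idx[v]] = A[idx[v]][idx[u]] = 1  # undirected edge

    # Type I: adjacent words inside the same entity.
    for words in entities:
        for u, v in zip(words, words[1:]):
            connect(u, v)
    # Type II and Type III edges.
    for u, v in modifier_pairs + dep_pairs:
        connect(u, v)
    return V, A
```

On the Figure 1 example, "endotracheal"-"tube" would form a Type I edge, "effusion"-"moderate" a Type II edge, and "pleural"-"effusion" a Type III edge.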

Graph Guided Decoder
We utilize the graph to generate two different kinds of information, which work on two aspects: enriching the decoder input with static background information h^b, and controlling the decoder hidden state with dynamic guiding information h^l. Both are introduced in the following parts.
Background Information Since the graph can be considered a condensed version of the source sequence (i.e., the FINDINGS) that contains its most important information, it is appropriate to serve as static background information to enrich the decoder inputs. The first output z^b of the graph encoder is used to construct the background information. For the hidden state z^b_i in z^b of each node, we can obtain attention weights by

u^b_i = p_b^T tanh(W_b z^b_i + W_h h^f)    (7)
a^b = softmax(u^b)    (8)

where W_b, W_h and p_b are learnable parameters. For LSTM, we define h^f as the final hidden state, and for Transformer, we calculate the mean of all hidden states as h^f. The attention distribution a^b can be viewed as a probability distribution over the nodes in the graph. Next, a^b is used to produce a weighted sum of the nodes, and we obtain the static background information:

h^b = Σ_i a^b_i z^b_i    (9)

For clarity, we simplify Equations (7), (8) and (9) as a function AttCon(·). Therefore, h^b can be obtained by

h^b = AttCon(z^b, h^f)    (10)

In our model, the background information h^b is directly concatenated to the decoder input: each decoder input y_{t−1} at step t is expanded as [y_{t−1}; h^b].

Dynamic Guiding Information Since h^b remains unchanged and works as global static knowledge during the decoding process, to make the guidance more flexible, the other output z^l of the graph encoder in Equation (3) is used to generate dynamic guiding information. For the LSTM decoder, each cell updates its information from two states and one input, i.e., the cell state c_{t−1}, the hidden state s_{t−1} and the input y_{t−1}:

c_t, s_t = LSTM(y_{t−1}, c_{t−1}, s_{t−1})    (11)

where c_t usually contains rich contextual information and is thus appropriate for computing the guiding information:

h^l_t = AttCon(z^l, c_t)    (12)

For Transformer, the general decoder only has one hidden state s_t, which is the output of the last layer. In this part, we regard the output of the penultimate layer as another hidden state c_t, which is then used to generate the dynamic information h^l_t as in Equation (12). After obtaining the dynamic guidance h^l_t from the LSTM or Transformer decoder, it is utilized to update the decoder hidden state s_t by

s_t ← f_u([s_t; f_g(h^l_t)])    (13)

where f_g(·) and f_u(·) are fully connected layers.
Vocabulary Distribution To incorporate the FINDINGS information for the final prediction, we calculate the attention context vector g^c_t in the same way as Equations (7), (8) and (9), using the sequence encoder hidden state h^x as well as the updated s_t:

g^c_t = AttCon(h^x, s_t)    (14)

Then both g^c_t and the decoder hidden state s_t are used to calculate the vocabulary distribution at step t:

P_vocab(y_t) = softmax(W_v [s_t; g^c_t] + b_v)    (15)

where W_v and b_v are learnable parameters.
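The AttCon(·) abstraction used throughout the decoder can be sketched as follows, using plain-list linear algebra; the parameter shapes and names are our own assumptions, not the paper's code:

```python
import math

def matvec(W, x):
    """Matrix-vector product over plain lists."""
    return [sum(w * xj for w, xj in zip(row, x)) for row in W]

def att_con(z, h_f, W_b, W_h, p_b):
    """Score each node state z_i against the query h_f, softmax
    over the nodes, and return the attention-weighted sum of the
    node states (the AttCon(.) of Equations (7)-(9))."""
    q = matvec(W_h, h_f)
    scores = []
    for z_i in z:
        u = [math.tanh(a + b) for a, b in zip(matvec(W_b, z_i), q)]
        scores.append(sum(pj * uj for pj, uj in zip(p_b, u)))
    m = max(scores)                           # stabilised softmax
    e = [math.exp(s - m) for s in scores]
    total = sum(e)
    a = [x / total for x in e]                # attention distribution
    dim = len(z[0])
    return [sum(a_i * z_i[j] for a_i, z_i in zip(a, z)) for j in range(dim)]
```

The same routine serves both h^b = AttCon(z^b, h^f) and h^l_t = AttCon(z^l, c_t), differing only in which states and query are passed in.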

Baseline and Evaluation Metrics
To compare the performance of our proposed models, we use pointer-generator networks (PGNs) with LSTM and Transformer backbones as our baselines, together with a BASE model in which the encoder and decoder are replaced with the Transformer and the copy mechanism is removed. Besides, we also compare our model with those in previous studies, including extractive summarization models, e.g., LEXRANK (Erkan and Radev, 2004) and TRANSFORMEREXT (Liu and Lapata, 2019), as well as abstractive summarization models, e.g., CAVC, CGU and ONTOLOGYABS (Gharebagh et al., 2020). In our experiments, we use ROUGE metrics (Lin, 2004) to evaluate the generated IMPRESSIONs. We report F1 scores of ROUGE-1 (R-1), ROUGE-2 (R-2) and ROUGE-L (R-L), where R-1 and R-2 measure unigram and bigram overlap to assess informativeness, and R-L measures the longest common subsequence overlap to assess fluency. In addition, to evaluate factual consistency (FC), CheXbert (Smit et al., 2020) is utilized to detect 14 disease-related observations in the reference and generated impressions; precision, recall and F1 score are then used to evaluate the performance.
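For reference, R-L can be sketched as an LCS-based F1 over token lists. This is a simplified single-reference version for illustration, not the official ROUGE toolkit:

```python
def rouge_l_f1(reference, candidate):
    """ROUGE-L F1: longest-common-subsequence overlap between a
    reference and a candidate token list."""
    m, n = len(reference), len(candidate)
    dp = [[0] * (n + 1) for _ in range(m + 1)]    # LCS table
    for i in range(m):
        for j in range(n):
            dp[i + 1][j + 1] = (dp[i][j] + 1 if reference[i] == candidate[j]
                                else max(dp[i][j + 1], dp[i + 1][j]))
    lcs = dp[m][n]
    if lcs == 0:
        return 0.0
    recall, precision = lcs / m, lcs / n
    return 2 * recall * precision / (recall + precision)
```

The official toolkit additionally handles multiple references, stemming and sentence-level aggregation, which this sketch omits.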

Implementation Details
We employ stanza (Zhang et al., 2020), a Python-based natural language processing library, to recognize named entities and obtain the syntactic analysis. Then we use the extracted entities and dependency trees to construct a graph for each FINDINGS. Since the quality of text representation plays an important role in model performance (Mikolov et al., 2013; Song et al., 2017; Peters et al., 2018; Devlin et al., 2019; Joshi et al., 2020), we try two powerful FINDINGS encoders, namely LSTM and Transformer, which have achieved state-of-the-art results in many natural language processing tasks. For WGSUM (LSTM+GAT), we employ a 2-layer GAT with hidden size 200 as our graph encoder, a 2-layer Bi-LSTM encoder for the FINDINGS sequence with hidden size 100 for each direction, and a 1-layer LSTM decoder with hidden size 200. The dropout is set to 0.5 for the embedding layer. We use the Adam optimizer (Kingma and Ba, 2014) with a learning rate of 0.001. For WGSUM (TRANS+GAT), the graph encoder is a 2-layer GAT with hidden size 512, and the FINDINGS encoder is a 6-layer Transformer with hidden size 512 and feed-forward filter size 2,048. The decoder is also a 6-layer Transformer with hidden size 512.
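To make the graph encoder concrete, a single attention head of one GAT layer (Veličković et al., 2017) can be sketched in plain Python as follows; this is an illustrative single-head, bias-free sketch, not the configuration used in the experiments:

```python
import math

def gat_layer(H, A, W, a_src, a_dst):
    """One single-head GAT attention layer:
    e_ij = LeakyReLU(a_src . W h_i + a_dst . W h_j) over
    neighbours j (self-loop included), softmax over e_i*,
    then h'_i = sum_j alpha_ij W h_j."""
    def matvec(W, x):
        return [sum(w * xj for w, xj in zip(row, x)) for row in W]

    Wh = [matvec(W, h) for h in H]
    s_src = [sum(a * v for a, v in zip(a_src, h)) for h in Wh]
    s_dst = [sum(a * v for a, v in zip(a_dst, h)) for h in Wh]
    n, d = len(H), len(Wh[0])
    out = []
    for i in range(n):
        nbrs = [j for j in range(n) if A[i][j] or j == i]
        e = [s_src[i] + s_dst[j] for j in nbrs]
        e = [x if x > 0 else 0.2 * x for x in e]     # LeakyReLU(0.2)
        m = max(e)
        w = [math.exp(x - m) for x in e]
        total = sum(w)
        alpha = [x / total for x in w]               # attention weights
        out.append([sum(al * Wh[j][k] for al, j in zip(alpha, nbrs))
                    for k in range(d)])
    return out
```

Each node representation becomes an attention-weighted mixture of its neighbours' projected states, which is the "updating node representation via self-attention" property discussed in the analysis below.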

Effect of word graph
To illustrate the validity of the word graph, we conduct experiments with the aforementioned baselines on the two benchmark datasets. Besides, we also try two different types of GNNs, GCN (Kipf and Welling, 2016) and GAT (Veličković et al., 2017), as the graph encoder. The results are shown in Table 2. First, for encoding our word graph, GAT is more effective than GCN, bringing more significant improvements on the two datasets. The main reason might be that GAT is more powerful in updating node representations via self-attention. Second, integrating the word graph into the two different PGNs gains better performance on both datasets, which confirms the usefulness of the word graph. Third, on OPENI the LSTM-based models outperform the Transformer-based models by a larger margin, while on MIMIC-CXR the Transformer-based models are more effective. The main reason could be that the LSTM is able to obtain prominent performance on the small dataset, while the Transformer is more powerful under a large amount of data. Fourth, on the FC metrics on MIMIC-CXR, our proposed methods also outperform the BASE model, indicating that the generated IMPRESSIONs from our methods are more accurate and reasonable; this is because the word graph provides both key word and relation information for the generation process, so the decoder tends to produce words with correct relations.

Figure 4: The improvements (R-1) of WGSUM with different graph edges on the MIMIC-CXR dataset.

Comparison with Previous Studies
In this subsection, we compare our models with existing studies on the two datasets and report all results (i.e., ROUGE scores) in Table 3. Several observations can be made. First, the abstractive models are apparently more effective than the extractive models in AIG, owing to the characteristics of FINDINGS and IMPRESSION in radiology reports. Second, our models with the word graph show the effectiveness of both key word information and word relations in this task when compared to previous models that only leverage medical term information, e.g., ONTOLOGYABS, which only uses ontology information from the RadLex database. Third, our methods achieve the best performance among all previous models, which demonstrates that using background knowledge and dynamic guidance information to control the decoding process is an appropriate design to improve the quality of the generated IMPRESSIONs.

Expert Evaluation
Given the limitations of ROUGE metrics, we further conduct an expert evaluation for a better understanding of the generated IMPRESSIONs. We randomly select 100 generated IMPRESSIONs along with their corresponding reference IMPRESSIONs and FINDINGS from MIMIC-CXR. To avoid potential bias, we randomly order the predicted and reference IMPRESSIONs. We extend the metrics of Gharebagh et al. (2020) to four: Accuracy, Completeness, Conciseness and Readability. Three medical experts are employed to score each IMPRESSION on these metrics. Figure 3 presents the results of the human evaluation. We observe that although the reference IMPRESSIONs written by radiologists are still better, over 80% of the generated IMPRESSIONs have roughly equal or better quality. About 85%, 75%, 70%, and 94% of the generated IMPRESSIONs are rated equal to the human-written ones on Accuracy, Completeness, Conciseness, and Readability, respectively. In addition, 5%, 7%, 12%, and 2% of the generated IMPRESSIONs even surpass the reference IMPRESSIONs on these metrics.

Analyses
We conduct further analyses on Graph Edge, IMPRESSION Length, and Case Study.
Graph Edge As introduced before, our graph contains three types of edges, i.e., the entity interval edge (Type I), the entity modifier edge (Type II), and the edge from the dependency tree (Type III). To show the effect of different edges, we conduct experiments for WGSUM (LSTM+GAT) and WGSUM (Trans+GAT) with different edges on MIMIC-CXR. The improvements from these different edge combinations are shown in Figure 4. First, we observe that models incorporating the word graph outperform the baselines regardless of the edge type, indicating the effectiveness of combining the entity words and their relations into the word graph. Second, for both WGSUM (LSTM+GAT) and WGSUM (Trans+GAT), the Type III edge brings the most significant improvements, while Type I brings the least. The main reason might be that the dependency tree contains more comprehensive and accurate relations for the entity words in the FINDINGS. Third, WGSUM (Trans+GAT) usually obtains better results.

Figure 6: Examples of the generated IMPRESSIONs from two models (i.e., PG-TRANS and WGSUM (TRANS+GAT)), as well as the reference IMPRESSIONs.

IMPRESSION Length Another factor that could affect model performance is the number of tokens in the IMPRESSION. To test the effect of IMPRESSION length, we categorize all generated IMPRESSIONs in the MIMIC-CXR test set into 6 groups (within [15, 40] with an interval of 5) and compare the R-1 score for each group. The results are shown in Figure 5. There are several observations. First, as the IMPRESSION length increases, the performance of all models shows a downward trend, which indicates that long IMPRESSION generation is difficult for all models. Second, Transformer-based methods are more effective than LSTM-based models, especially for long IMPRESSIONs. The main reason might be that the Transformer is more powerful in dealing with longer sequences via its self-attention mechanism. Third, both WGSUM (TRANS+GAT) and WGSUM (LSTM+GAT) show their superiority when compared to their baselines and obtain better results in almost all of the groups.
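The grouping used in this analysis can be sketched as follows; the exact boundary convention is our assumption:

```python
def length_group(n_tokens):
    """Map an IMPRESSION length to one of the 6 groups used in the
    length analysis (within [15, 40] with an interval of 5)."""
    if not 15 <= n_tokens <= 40:
        return None                  # outside the analysed range
    return (n_tokens - 15) // 5      # group index 0..5
```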
Case Study To further analyze the effect of our proposed model, we perform a qualitative analysis on several cases with their reference and generated IMPRESSIONs from different models. Figure 6 shows three cases from MIMIC-CXR, where different colors on the text refer to varied key information. In the first case, when referring to the corresponding FINDINGS, our model generates a more complete IMPRESSION than the reference, e.g., "et tube terminates 3.9 cm above the carina" is a helpful text piece but does not appear in the reference IMPRESSION. In addition, compared to the reference IMPRESSIONs written by radiologists, our method covers almost all of the key information in the generated IMPRESSIONs. For example, the key information "moderate bilateral pleural effusions", "mild pulmonary edema" and "small opacity in the right media" in the three examples is not generated by the PG-TRANS model, yet it is necessary for describing the clinical condition.

Related Work
Our work focuses on summarizing the FINDINGS of radiology reports to generate the IMPRESSION, which is essentially an abstractive summarization task. For abstractive summarization, there exists a serious problem known as hallucination (Maynez et al., 2020), in which the generated summary contains fictional content. This problem also exists in AIG and could lead to misdiagnosis for the patient. To tackle this problem in the general domain, many attempts have been made to use guiding information to control the generation process and output a high-quality summary (Hsu et al., 2018; Pilault et al., 2020; Huang et al., 2020; Haonan et al., 2020). For the IMPRESSION generation task in the medical domain, several solutions also exist. For example, prior work encodes a section of the radiology report as background information to guide the decoding process. MacAvaney et al. (2019) employ the ontological terms extracted from the FINDINGS as medical terms, and then enhance the summarizer by selecting the salient information. Gharebagh et al.
(2020) further split ontological terms into words and then incorporate these words into summarization through a separate encoder. Compared to these studies, our model offers an alternative solution that robustly enhances the guidance with a word graph for summarizing the FINDINGS of radiology reports without requiring external resources. To the best of our knowledge, this is the first work employing word relation information for AIG.

Conclusion
In this paper, we propose a novel method for AIG, where a word graph is constructed from the FINDINGS by identifying the salient words and their relations, and a graph-based model, WGSUM, is designed to generate IMPRESSIONs with the help of the word graph. In doing so, the information from the word graph guides the decoding process through background information and dynamic guiding information. Experimental results on two benchmark datasets show the validity of our proposed method, which obtains state-of-the-art performance on both datasets. Further analyses on the effect of edge types demonstrate that our model can generate IMPRESSIONs with accurate medical terms.