Fusion: Towards Automated ICD Coding via Feature Compression

ICD coding aims to automatically assign International Classification of Diseases (ICD) codes to unstructured clinical notes or discharge summaries, which saves human labor and reduces errors. Although several studies have been proposed to solve this challenging task, none distinguishes the importance of different phrases within a word window. Intuitively, informative phrases should be more useful for the prediction. This paper proposes a feature compressed ICD coding model named Fusion to address this issue. In particular, we propose an attentive soft-pooling approach that compresses sparse and redundant word representations into informative and dense ones as local features. Next, we use a key-query attention mechanism to model the inner relations among local features and generate global features, which are further used to predict ICD codes. Experiments on two widely used datasets demonstrate that Fusion is comparable with the baselines. We also find that none of the state-of-the-art approaches performs significantly better than the others. Thus, automated ICD coding is still a challenging task.


Introduction
The International Classification of Diseases (ICD) coding system helps standardize the recording of diagnoses and treatments assigned to patients by medical professionals around the world. These ICD codes are generated from massive unstructured clinical notes. However, manual code assignment is labor-intensive and prone to errors. Thus, automatic ICD code assignment has become an urgent need in the healthcare domain.
Traditional machine learning methods (Larkey and Croft, 1996) tried to tackle this task based on feature extraction. However, they do not work well since clinical notes are noisy and complex. Recently, deep learning-based approaches (Cao et al., 2020; Xie et al., 2019; Li and Yu, 2020; Mullenbach et al., 2018) have been proposed to improve performance. Among them, convolutional methods (Cao et al., 2020; Xie et al., 2019; Li and Yu, 2020; Mullenbach et al., 2018) outperform other approaches. Besides, some studies try to incorporate external information to further improve performance (Cao et al., 2020; Xie et al., 2019). However, these methods still suffer from the following issues.
• Redundant Information Reduction. Clinical notes are noisy and complex, and only some key phrases are highly related to the coding. However, convolutional methods treat all word windows equally, ignoring that different words have different importance and should be weighted differently within each window. Besides, the sliding windows used in convolutional methods produce a lot of redundant information. Thus, it is important to reduce non-informative and redundant information and to distinguish the contributions of different convolutional features.
• Interactions among Local Features. Most existing approaches such as MultiResCNN (Li and Yu, 2020) only use the local features obtained from different filters for coding. However, they ignore the importance of interactions among different local features. For example, obstructive sleep apnoea (OSA) and insomnia are related to hypertension and ischaemic heart disease (Harrison and Wood, 1949). Thus, combining different local features may discover new useful patterns that improve coding.
To tackle these issues, we propose a feature compressed ICD coding model named Fusion, which automatically compresses local features and further learns global features to enhance coding performance. In particular, Fusion uses an attention-based soft-pooling approach to compress local features learned by word convolutions, which then pass through residual convolution blocks. After aggregating the local features from different convolutional filters, Fusion applies a key-query attention mechanism to model interactions among local features and obtain global ones. A code-wise attention mechanism is then used to learn a feature vector associated with each ICD code, which is finally used to make the prediction. Experiments on two public datasets show that Fusion outperforms state-of-the-art baselines over five evaluation metrics. Moreover, we find that none of the existing approaches outperforms the others on the MIMIC-III Full dataset. Thus, automated ICD coding remains an open challenge.

Related Work
Traditional machine learning models have been applied to automatically extract ICD codes using hand-crafted feature vectors as inputs (Larkey and Croft, 1996; Gundersen et al., 1996; Franz et al., 2000; Pestian et al., 2007; Farkas and Szarvas, 2008). However, they did not achieve satisfactory performance due to the difficulty of extracting useful features from complex and noisy clinical notes. Deep learning models have shown their superiority for this task, including recurrent-based deep models (Shi et al., 2017; Xu et al., 2019) and convolution-based models (Kim, 2014; Mullenbach et al., 2018; Cao et al., 2020; Li and Yu, 2020). In general, convolutional models perform better than recurrent-based ones. Several studies try to incorporate the advanced pretrained language model BERT (Devlin et al., 2019), ICD code descriptions (Mullenbach et al., 2018; Xie and Xing, 2018; Li and Yu, 2020), the ICD code structure (Wang et al., 2020; Cao et al., 2020), and knowledge graphs (Cao et al., 2020; Xie et al., 2019) to improve performance.

Model
The goal of automated ICD coding is to predict a set of unique ICD codes $Y$ from the code set $C = \{c_1, c_2, \cdots, c_s\}$ given a clinical note $D = \{w_1, w_2, \cdots, w_n\}$, where $Y \subseteq C$, $s$ is the number of unique ICD codes, and $n$ is the number of words in $D$. This task is challenging since $s$ is very large: over 15,000 for ICD-9 codes and over 60,000 for ICD-10 codes. Besides, extensive noisy information exists in the clinical note $D$.
To solve these challenges, we propose a feature denoised model (Fusion) for automated ICD coding as shown in Figure 1. This model consists of five modules: the input layer, the compressed convolutional layer, the feature aggregation layer, the code-wise attention layer, and the prediction layer. Next, we introduce the details of each module in the following subsections.

Input Layer
We take the clinical note $D = \{w_1, w_2, \cdots, w_n\}$ as the model input. For each unique word $w_i$, word2vec (Mikolov et al., 2013) is used to pretrain its embedding, denoted as $e_i$, a $d_e$-dimensional vector. Thus, the input of Fusion is a matrix $\mathbf{D} = \{e_1, e_2, \cdots, e_n\}$.
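As a concrete illustration, the input layer is just an embedding lookup. The vocabulary, note, and embedding table below are toy stand-ins (the paper pretrains real embeddings with word2vec; here they are random):

```python
import numpy as np

# Hypothetical toy vocabulary and a random stand-in for the pretrained
# word2vec embedding table (rows are d_e-dimensional word vectors).
vocab = {"patient": 0, "denies": 1, "chest": 2, "pain": 3}
d_e = 8
E = np.random.default_rng(0).normal(size=(len(vocab), d_e))

note = ["patient", "denies", "chest", "pain"]
D = E[[vocab[w] for w in note]]   # (n, d_e) input matrix for Fusion
print(D.shape)                    # (4, 8)
```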

Compressed Convolutional Layer
Given the input data $\mathbf{D}$, the compressed convolutional layer aims to learn dense and informative word representations, which are further used to learn the clinical note representation. In particular, we first use convolutional neural networks (CNNs) to learn word representations and then propose an attention-based soft-pooling approach to compress those representations. Finally, residual convolution blocks (He et al., 2016) are introduced on top of the compressed features, as in MultiResCNN (Li and Yu, 2020).

Word Convolution
CNNs are powerful for text classification tasks (Kim, 2014) in that they have multiple filters with different kernel sizes (i.e., word windows) to capture diverse patterns. Let $m$ be the number of filters. The kernel size of each filter $f_i$ is denoted as $k_i$. Thus, we can apply $m$ different 1-dimensional convolutions on the input data $\mathbf{D}$. For the $i$-th filter, we have

$$\{x^i_1, x^i_2, \cdots, x^i_n\} = \mathrm{conv}(\mathbf{D}; W^i_x), \quad (1)$$

where $\mathrm{conv}(\cdot;\cdot)$ represents the 1-dimensional convolutional operation, and $W^i_x$ denotes the learned parameters.
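A minimal NumPy sketch of one such filter, assuming same-length padding and a hypothetical kernel shape `(k, d_e, d_c)`; the paper applies m such filters with different kernel sizes, and the dimensions below are illustrative:

```python
import numpy as np

def word_conv(D, W, b):
    """1-D convolution over word embeddings with same-length padding.

    D: (n, d_e) embedding matrix; W: (k, d_e, d_c) kernel of width k;
    b: (d_c,) bias. Returns (n, d_c) word representations.
    """
    n, d_e = D.shape
    k, _, d_c = W.shape
    pad = k // 2
    Dp = np.pad(D, ((pad, k - 1 - pad), (0, 0)))
    H = np.empty((n, d_c))
    for j in range(n):
        window = Dp[j:j + k]  # (k, d_e) word window around position j
        H[j] = np.tensordot(window, W, axes=([0, 1], [0, 1])) + b
    return H

rng = np.random.default_rng(0)
D = rng.normal(size=(10, 8))      # n=10 words, d_e=8
W = rng.normal(size=(3, 8, 16))   # kernel size k=3, d_c=16 output channels
H = word_conv(D, W, np.zeros(16))
print(H.shape)                    # (10, 16)
```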

Attention-based Soft-pooling
The word convolutional operation uses sliding windows, which produces redundant information in adjacent word representations. Thus, to remove such information, we propose to compress the word representations learned by Eq. (1) via an attention-based soft-pooling operation. Given a word $w_j$, its neighboring words $\{w_{j+1}, \cdots, w_{j+g-1}\}$, and the corresponding filter $f_i$, we first learn the local attention scores

$$[\alpha^i_j, \cdots, \alpha^i_{j+g-1}] = \mathrm{softmax}(W^i_\alpha [x^i_j, \cdots, x^i_{j+g-1}] + b),$$

where $W^i_\alpha$ and $b$ are learnable parameters. Then we conduct attention-based soft-pooling on the $g$ words and obtain the compressed representation

$$o^i_p = \sum_{t=j}^{j+g-1} \alpha^i_t x^i_t. \quad (2)$$
In this way, the $n$ word representations learned by Eq. (1) are replaced by $P = \lceil n/g \rceil$ new representations, i.e., $\{o^i_1, o^i_2, \cdots, o^i_P\}$, which reduces the number of word representations and yields denser ones.

Residual Convolution Block
For each filter $f_i$, we now have a denoised matrix $\{o^i_1, o^i_2, \cdots, o^i_P\}$ that represents the input $\mathbf{D}$. To avoid vanishing gradients and ease training, we also introduce residual blocks on top of the compressed features. In particular, we replace the batch normalization layer with a group normalization layer. Let $a$ denote the number of residual blocks; we then have $r^i_p = \mathrm{ResidualBlock}(\{o^i_p, \cdots, o^i_{p+a-1}\})$.
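A rough sketch of such a block, assuming two linear transforms with GroupNorm and ReLU in place of the paper's convolutional residual blocks (He et al., 2016); the shapes and layer composition here are illustrative only:

```python
import numpy as np

def group_norm(X, num_groups, eps=1e-5):
    """GroupNorm over the channel axis of X: (P, C) (no affine params)."""
    P, C = X.shape
    Xg = X.reshape(P, num_groups, C // num_groups)
    mu = Xg.mean(axis=2, keepdims=True)
    var = Xg.var(axis=2, keepdims=True)
    return ((Xg - mu) / np.sqrt(var + eps)).reshape(P, C)

def residual_block(O, W1, W2, num_groups=2):
    """y = x + f(x): a residual connection around two normalized transforms.
    A sketch; the actual blocks use convolutions as in He et al. (2016)."""
    h = np.maximum(group_norm(O @ W1, num_groups), 0.0)  # norm + ReLU
    return O + group_norm(h @ W2, num_groups)            # skip connection

rng = np.random.default_rng(2)
O = rng.normal(size=(5, 8))  # P=5 compressed features, 8 channels
R = residual_block(O, rng.normal(size=(8, 8)), rng.normal(size=(8, 8)))
print(R.shape)               # (5, 8): shape is preserved by the block
```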

Feature Aggregation Layer
Since $m$ filters are used to obtain $m$ compressed feature matrices, we concatenate them as the local features, i.e., $l_p = [r^1_p, \cdots, r^m_p]$. Then the whole document can be represented by the matrix $\mathbf{D}_l = \{l_1, l_2, \cdots, l_P\}$. However, such an aggregation only takes local information into account and ignores interactions with the remaining words.
Thus, we propose to use the key-query attention mechanism (Vaswani et al., 2017) to learn a global feature representation for each compressed word window, yielding the global features $\mathbf{D}_g = [g_1, \cdots, g_P] = \mathrm{attention}([l_1, \cdots, l_P])$.
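The aggregation step can be illustrated with single-head scaled dot-product attention; the projection matrices `Wq`, `Wk`, `Wv` are the usual key-query-value parameterization (Vaswani et al., 2017), though the paper does not specify head count or dimensions:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(L, Wq, Wk, Wv):
    """Scaled dot-product self-attention over the local features L: (P, d).

    Each global feature g_p is a weighted sum of ALL local features, so
    every compressed word window can interact with the whole document.
    """
    Q, K, V = L @ Wq, L @ Wk, L @ Wv
    A = softmax(Q @ K.T / np.sqrt(K.shape[1]), axis=1)  # (P, P) weights
    return A @ V                                        # (P, d) globals

rng = np.random.default_rng(3)
L = rng.normal(size=(5, 8))  # P=5 local features of dimension d=8
G = self_attention(L, *(rng.normal(size=(8, 8)) for _ in range(3)))
print(G.shape)               # (5, 8)
```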

Code-wise Attention Layer
Due to the large number of labels, directly using the global features $\mathbf{D}_g$ to make predictions may not perform well. Thus, we use a code-wise attention layer to generate a matching vector for each ICD code, which is then used to make the prediction. Let $u_k$ represent the embedding of the $k$-th ICD code $c_k$. We calculate attention weights over all the global features using $u_k$, i.e., $[\gamma^k_1, \cdots, \gamma^k_P] = \mathrm{softmax}([u_k^\top g_1, \cdots, u_k^\top g_P])$. The code-wise vector is then obtained as $v_k = \sum_{p=1}^{P} \gamma^k_p g_p$.
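The code-wise attention and the sigmoid scoring of the prediction layer can be sketched jointly; all shapes and parameter names below are illustrative:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def code_wise_predict(G, U, W_out):
    """Per-code attention over global features, then sigmoid scoring.

    G: (P, d) global features; U: (s, d) ICD code embeddings;
    W_out: (s, d) per-code output vectors. Returns (s,) probabilities.
    """
    gamma = softmax(U @ G.T, axis=1)   # (s, P) code-wise attention weights
    V = gamma @ G                      # (s, d): one vector v_k per code
    logits = (V * W_out).sum(axis=1)   # w_k^T v_k for each code k
    return 1.0 / (1.0 + np.exp(-logits))

rng = np.random.default_rng(4)
G = rng.normal(size=(5, 8))            # P=5 global features, d=8
probs = code_wise_predict(G, rng.normal(size=(50, 8)),
                          rng.normal(size=(50, 8)))
print(probs.shape)                     # (50,): one probability per code
```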

Prediction Layer
Using the code-wise vector $v_k$, we can make a prediction with the sigmoid function, i.e., $\hat{y}_k = \sigma(w_k^\top v_k)$, where $w_k$ is a learnable parameter vector. Finally, a cross-entropy loss over each clinical note $D$ is used to optimize the proposed model.

Experiments

Datasets
We use the MIMIC-III dataset (Johnson et al., 2016) to extract ICD-9 codes from discharge summaries. We use the same settings as previous works (Mullenbach et al., 2018; Shi et al., 2017; Li and Yu, 2020; Cao et al., 2020). The MIMIC-III 50 dataset contains the top 50 most frequent codes, with 8,067, 1,574, and 1,730 discharge summaries for training, development, and testing, respectively. The MIMIC-III Full dataset consists of 8,921 codes, with 47,719, 1,631, and 3,372 discharge summaries for training, development, and testing, respectively. The number of labels in the MIMIC-III Full dataset is significantly greater than in the MIMIC-III 50 dataset, making the task more difficult.

Metrics and Parameter Settings
We follow previous work (Mullenbach et al., 2018) and use macro- and micro-averaged AUC (area under the ROC curve), macro- and micro-averaged F1, and Precision@K as metrics. We report Precision@5 (P@5) for MIMIC-III 50 and P@8 for MIMIC-III Full. We use the same parameter settings as MultiResCNN (Li and Yu, 2020)1, and set g to 2 in our experiments, i.e., we compress two adjacent features together.
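For reference, Precision@K measures the fraction of the top-K scored codes that are true labels, averaged over notes; a minimal implementation:

```python
import numpy as np

def precision_at_k(y_true, y_score, k):
    """P@k averaged over examples.

    y_true: (N, s) binary label matrix; y_score: (N, s) predicted scores.
    """
    topk = np.argsort(-y_score, axis=1)[:, :k]          # top-k code indices
    hits = np.take_along_axis(y_true, topk, axis=1)     # 1 if code is true
    return hits.mean()

# Toy example: note 1 has both top-2 codes correct (P@2 = 1.0),
# note 2 has one of two correct (P@2 = 0.5); the average is 0.75.
y_true = np.array([[1, 0, 1, 0], [0, 1, 0, 0]])
y_score = np.array([[0.9, 0.2, 0.8, 0.1], [0.3, 0.7, 0.6, 0.2]])
print(precision_at_k(y_true, y_score, 2))  # 0.75
```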

Baselines
Existing studies either take only clinical notes as the input or incorporate external information alongside the notes to enhance performance. Our work belongs to the first category. For the "note only" category, we employ C-MemNN (Prakash et al., 2017), C-LSTM-ATT (Shi et al., 2017), CAML (Mullenbach et al., 2018), DR-CAML (Mullenbach et al., 2018), and MultiResCNN (Li and Yu, 2020) as baselines. We also use HyperCore (Cao et al., 2020) and MSATT-KG (Xie et al., 2019) as baselines, which incorporate the ICD code ontology to enhance performance. Since all the approaches use the same settings, we directly use the results reported in the original papers.

1 https://bit.ly/3opDmjM

Results
Table 1 shows the experimental results of all approaches in terms of different metrics. We observe that Fusion outperforms all the baselines on the MIMIC-III 50 dataset in terms of all metrics. Compared to the best baselines, the Macro F1, Micro F1, and P@5 scores obtained by Fusion improve by over 7%, 6%, and 5%, respectively. These results demonstrate the effectiveness of the proposed feature compression and aggregation approaches for the automated ICD coding task. Compared with the baselines that take only notes as input, Fusion achieves the highest scores on the MIMIC-III Full dataset. Although HyperCore and MSATT-KG incorporate external information to improve performance, the performance of Fusion is still comparable. We also observe that on the MIMIC-III Full dataset, none of the methods is significantly better than the others. The reason may be that none of the models can be trained sufficiently given the huge number of ICD code labels and the noisy, sparse, and unstructured clinical notes, which makes this task more challenging.

Ablation Study
In this section, we remove parts of the full Fusion model to validate the contribution of each module. Table 2 shows the ablation study results. "MaxPool" means replacing our soft-pooling layer with a traditional max-pooling layer. As shown in Table 2, the results drop on all metrics, which indicates the benefit of the proposed soft-pooling layer. Max-pooling loses part of the critical information during compression and is not fully differentiable. With soft-pooling, the key information can be better preserved during compression, since the selection process is guided by the gradient. "DocLevel" refers to replacing the code-wise attention layer with a single document-level attention. This attention is based on the document feature, and all codes share the same attention weights during prediction instead of calculating code-specific ones. Thus, all codes use the same feature for prediction, and much unrelated information is kept. For example, we do not want to preserve heart-failure-related information while predicting the COPD code. As a result, most scores drop significantly compared to the original design. The introduction of code-specific attention allows the predictor to dynamically adjust the attention weights for each case, so redundant information can be better removed.

Conclusion
In this paper, we propose Fusion for the automated ICD coding task. Fusion focuses on compressing redundant feature information, distinguishing the importance of adjacent phrases, and modeling interactions among local features. We conduct experiments on two widely used datasets to show the effectiveness of Fusion in terms of five evaluation metrics. From the experimental results on the MIMIC-III Full dataset, we find that automated ICD coding remains challenging due to the noisy data and the large number of ICD code labels.