Automatic ICD Coding via Interactive Shared Representation Networks with Self-distillation Mechanism

The ICD coding task aims at assigning codes from the International Classification of Diseases to clinical notes. Since manual coding is laborious and error-prone, many methods have been proposed for automatic ICD coding. However, existing works ignore either the long-tail distribution of code frequencies or the noisiness of clinical notes. To address these issues, we propose an Interactive Shared Representation Network with a Self-Distillation Mechanism. Specifically, the interactive shared representation network builds connections among codes while modeling their co-occurrence, thereby alleviating the long-tail problem. Moreover, to cope with noisy text, we encourage the model to focus on the noteworthy parts of a clinical note and extract valuable information through a self-distillation learning mechanism. Experimental results on two MIMIC datasets demonstrate the effectiveness of our method.


Introduction
The International Classification of Diseases (ICD) is a healthcare classification system launched by the World Health Organization. It contains a unique code for each disease, symptom, sign, and so on. Analyzing clinical data and monitoring health issues becomes more convenient with the adoption of ICD codes (Shull, 2019) (Choi et al., 2016) (Avati et al., 2018). The ICD coding task aims at assigning proper ICD codes to a clinical note, and has drawn much attention due to the importance of these codes. The task is usually undertaken manually by experienced coders. However, the manual process is labor-intensive and error-prone (Adams et al., 2002). A knowledgeable coder with medical experience has to read a whole clinical note containing thousands of words in medical terminology and assign multiple codes from a large candidate set, e.g., roughly 15,000 codes in the ninth version (ICD-9) and 60,000 codes in the tenth version (ICD-10) of the ICD taxonomy. On the one hand, medical experts with specialized ICD coding skills are hard to train. On the other hand, the task is challenging even for professional coders, due to the large candidate code set and tedious clinical notes. According to statistics, the cost incurred by coding errors and the financial investment spent on improving coding quality are estimated at $25 billion per year in the US (Lang, 2007).
Automatic ICD coding methods (Stanfill et al., 2010) have been proposed to overcome the deficiencies of manual annotation, treating it as a multi-label text classification task. As shown in Figure 1, given a plain clinical text, the model tries to predict all the standardized codes from ICD-9. Recently, neural networks were introduced (Mullenbach et al., 2018) (Falis et al., 2019) (Cao et al., 2020) to replace the manual feature engineering of traditional machine learning methods (Larkey and Croft, 1996) (Perotte et al., 2014) for ICD coding, and great progress has been made. Although effective, these methods either ignore the long-tail distribution of code frequencies or fail to address the noisy text in clinical notes. In the following, we introduce these two characteristics and the reasons why they are critical for automatic ICD coding. Long-tail: The long-tail problem is an unbalanced data distribution phenomenon, and it is particularly noticeable when accompanied by a large target label set.
According to our statistics, the top 10% of high-frequency codes in MIMIC-III (Johnson et al., 2016) account for 85% of total code occurrences, and 22% of the codes have fewer than two annotated samples. This is intuitive because people usually catch a cold but seldom have cancer. Trained with such long-tailed data, a neural automatic ICD coding method is inclined to make wrong predictions biased toward high-frequency codes. Fortunately, intrinsic relationships among different diseases can be utilized to mitigate this deficiency. For example, polyneuropathy in diabetes is a complication of diabetes with a lower probability than other complications, since the long-term effect of vessel lesions on nerves only emerges in the late stage. If a model could learn the information shared between polyneuropathy in diabetes and the more common disease diabetes, the prediction space would narrow to the set of complications of diabetes. Further, by utilizing dynamic code co-occurrence (the cascade relationship among complications of diabetes), the confidence of predicting polyneuropathy in diabetes gradually increases with the successive occurrence of vessel blockages, angina pectoris, and hypertrophy of the kidney. Therefore, how to learn shared information while considering dynamic code co-occurrence is a crucial and challenging issue.
Noisy text: The noisy text problem means that much of the information in clinical notes is redundant or misleading for the ICD coding task. Clinical notes are usually written by doctors and nurses with different writing styles, accompanied by polysemous abbreviations, abundant medication records, and repetitive records of physical indicators. According to our statistics, only about 10% of the words in a clinical note, on average, contribute to the code assignment task. The remaining words are largely medication records and repetitive physical indicators, which are not just redundant but can also mislead ICD coding. For example, two critical patients with entirely different diseases could take similar medicines and have similar physical indicators during a rescue course. We argue that such noisy clinical notes are hard to read for both humans and machines. Training on such noisy text confuses the model about where to focus, leading to wrong decisions due to semantic deviation. Therefore, another challenging problem is how to deal with noisy text in the ICD coding task.
In this paper, we propose an Interactive Shared Representation Network with Self-Distillation Mechanism (ISD) to address the above issues.
To mitigate the disadvantage caused by the long-tail issue, we extract representations shared among high-frequency and low-frequency codes from clinical notes. Codes with different occurrence frequencies all make binary decisions based on this shared information rather than on individually learned attention distributions. Additional experiments indicate that these shared representations capture common information relevant to ICD codes. Further, we feed the shared representations into an interaction decoder for refinement. The decoder is additionally supervised by two code completion tasks to ensure that dynamic code co-occurrence patterns are learned.
To alleviate the noisy text issue, we further propose a self-distillation learning mechanism to ensure that the extracted shared representations focus on the noteworthy parts of a long clinical note. The teacher makes predictions from a constructed, purified text containing all crucial information; meanwhile, the student takes the original clinical note as input. The student is forced to learn the teacher's shared representations under identical target codes.
The contributions of this paper are as follows: 1) We propose a framework capable of dealing with the long-tail and noisy text issues in the ICD coding task simultaneously.
2) To relieve the long-tail issue, we propose an interactive shared representation network, which captures the internal connections among codes of different frequencies. To handle noisy text, we devise a self-distillation learning mechanism that guides the model to focus on the important parts of clinical notes.
3) Experiments on two widely used ICD coding datasets, MIMIC-II and MIMIC-III, show that our method outperforms state-of-the-art methods by 4% and 2% in macro F1, respectively. The source code is available at www.github.com/tongzhou21/ISD.

Related Work
ICD coding is an important task that has been in the limelight for decades. Feature-based methods (Larkey and Croft, 1996) were first brought to solve this task. (Mullenbach et al., 2018) propose a convolutional neural network with an attention mechanism to capture each code's desired information in the source text, which also exhibits interpretability. (Xie and Xing, 2018) develop a tree LSTM to utilize code descriptions.
To further improve performance, customized structures were introduced to utilize the code co-occurrence and code hierarchy of ICD taxonomies. (Cao et al., 2020) embedded the ICD codes into hyperbolic space to exploit their hierarchical nature and constructed a co-graph to import code co-occurrence priors. We argue that they capture code co-occurrence in a static manner rather than as dynamic multi-hop relations. (Vu et al., 2020) learn an attention distribution for each code and introduce a hierarchical joint learning architecture to handle tail codes. Their use of a set of middle representations to deal with the long-tail issue is similar to our shared representation setting, while our method enables every label to choose its desired representation from shared attention rather than from its upper-level node, offering more flexibility.
The direct solutions to dealing with an imbalanced label set are re-sampling the training data (Japkowicz and Stephen, 2002) or re-weighting the labels in the loss function (Wang et al., 2017). Some studies treat the classification of tail labels as a few-shot learning task. (Song et al., 2019) use a GAN to generate label-wise features according to ICD code descriptions. (Huynh and Elhamifar, 2020) propose shared multi-attention for multi-label image labeling. Our work further constructs a label interaction module over the label-relevant shared representations to utilize dynamic label co-occurrence.
Many efforts have tried to normalize noisy texts before feeding them to downstream tasks. (Vateekul and Koomsubha, 2016) (Joshi and Deshpande, 2018) apply pre-processing techniques to Twitter data for sentiment classification. (Lourentzou et al., 2019) utilize a seq2seq model for text normalization. Others target noisy input in an end-to-end manner by designing customized architectures. Different from previous work on noisy text, our method neither needs extra text processing nor brings in task-specific parameters.

Method
This section describes our interactive shared representation learning mechanism and self-distillation learning paradigm for ICD coding. Figure 2 shows the architecture of the interactive shared representation network and the inference workflow of our method. We first encode the source clinical note into hidden states with a multi-scale convolutional neural network. A shared attention module then extracts code-relevant information shared among all codes. A multi-layer bidirectional Transformer decoder, inserted between the shared attention representation extraction module and code prediction, establishes connections among the shared code-relevant representations.

Multi-Scale Convolutional Encoder
We employ convolutional neural networks (CNN) for source text representation because the computational complexity induced by the length of clinical notes is non-negligible, even though other sequential encoders such as recurrent neural networks or the Transformer (Vaswani et al., 2017) could, in theory, capture longer text dependencies. A CNN can encode local n-gram patterns, which are critical in text classification, with high computational efficiency. The words in the source text are first mapped into a low-dimensional word embedding space, constituting a matrix E = {e_1, e_2, ..., e_{N_x}}, where N_x is the clinical note's length and each e is a word vector of dimension d_e. As shown in Eq. 1 and 2, we concatenate the convolutional representations from a kernel set C = {c_1, c_2, ..., c_S} with different sizes k_c into the hidden representation matrix H.
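As a minimal sketch of the encoder described above, the following NumPy code concatenates ReLU-activated, same-padded convolution outputs from several kernel sizes. All dimensions and weights are hypothetical placeholders; the actual model is trained end-to-end in PyTorch.

```python
import numpy as np

def conv1d_same(E, W, b):
    # E: (N, d_e) word embeddings; W: (k, d_e, d_out) kernel; b: (d_out,).
    # Same-padding 1-D convolution over the word axis, followed by ReLU.
    k = W.shape[0]
    pad = k // 2
    Ep = np.pad(E, ((pad, pad), (0, 0)))
    H = np.empty((E.shape[0], W.shape[2]))
    for i in range(E.shape[0]):
        window = Ep[i:i + k]                               # (k, d_e)
        H[i] = np.tensordot(window, W, axes=([0, 1], [0, 1])) + b
    return np.maximum(H, 0.0)

def multi_scale_encode(E, kernels):
    # Concatenate per-kernel-size outputs along the feature axis.
    return np.concatenate([conv1d_same(E, W, b) for W, b in kernels], axis=1)

rng = np.random.default_rng(0)
d_e, d_out = 100, 32                       # d_out = code embedding size / 4
E = rng.normal(size=(50, d_e))             # a 50-word clinical note
kernels = [(0.1 * rng.normal(size=(k, d_e, d_out)), np.zeros(d_out))
           for k in (5, 7, 9, 11)]
H = multi_scale_encode(E, kernels)         # (50, 128) hidden representation
```

With four kernel sizes each emitting one quarter of the final width, the concatenated hidden matrix H keeps one vector per word, ready for the attention module.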

Shared Attention
The label attention method learns a code-specific document representation for each code. We argue that the attention of a rare code cannot be well learned due to the lack of training data. Motivated by (Huynh and Elhamifar, 2020), we propose shared attention to bridge the gap between high-frequency and low-frequency codes by learning shared representations H^S through attention. The code set, with N_l codes in total, is represented by code embeddings E^l = {e^l_1, e^l_2, ..., e^l_{N_l}} derived from their text descriptions. A set of trainable shared attention queries of size N_q × d_l is introduced, denoted E^q = {e^q_1, e^q_2, ..., e^q_{N_q}}, where the number of shared queries N_q is a hyperparameter. E^q then computes the shared attention representations H^S = {h^S_1, h^S_2, ..., h^S_{N_q}} from the hidden representation H in Eq. 3 to 5. Ideally, these shared representations reflect the code-relevant information in the source text, and we can predict codes from H^S. Each code i chooses a shared representation from H^S as its code-specific vector via the highest dot product score s_i.
The dot product score is further used to compute the final score ŷ^l through the sigmoid function.
Under the supervision of the binary cross-entropy loss, the shared representations learn to represent code-relevant information.
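The shared-attention-and-selection step can be sketched as below. Shapes, the query count, and the code count are illustrative assumptions, not the paper's exact configuration:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def shared_attention(H, E_q):
    # H: (N_x, d) encoded note; E_q: (N_q, d) trainable shared queries.
    alpha = softmax(E_q @ H.T, axis=1)   # one attention distribution per query
    return alpha @ H                     # H_S: (N_q, d) shared representations

def predict_codes(H_S, E_l):
    # Each code selects the shared representation with the highest dot
    # product (its score s_i); a sigmoid turns the score into a probability.
    s = (E_l @ H_S.T).max(axis=1)        # (N_l,)
    return 1.0 / (1.0 + np.exp(-s))

rng = np.random.default_rng(1)
H = rng.normal(size=(50, 128))            # hidden states from the CNN encoder
E_q = rng.normal(size=(64, 128))          # N_q = 64 shared queries
E_l = 0.05 * rng.normal(size=(500, 128))  # hypothetical set of 500 ICD codes
y_hat = predict_codes(shared_attention(H, E_q), E_l)
```

Note that all N_l codes score themselves against the same N_q shared vectors, which is what lets rare codes reuse attention learned from frequent ones.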

Interactive Shared Attention
The above shared attention mechanism lacks interaction among the pieces of code-relevant information, which is of great importance in the ICD coding task. We implement this interaction through a bidirectional multi-layer Transformer decoder D with additional code completion tasks. The shared representations H^S are treated as an orderless sequential input to the decoder D. Each Transformer layer models interaction among the shared representations H^S through self-attention, and interaction between the shared representations and the source text through source-sequence attention.
To make sure the decoder could model the dynamic code co-occurrence pattern, we propose two code set completion tasks, shown at the bottom of Figure 3.
(1) Missing code completion: We construct the code sequence L_tgt of a real clinical note X in the training set, randomly masking one code l_mis. The decoder takes this code sequence as input to predict the masked code.
L_mis = −log P(l_mis | L_tgt \ {l_mis} ∪ {l_mask}, X)    (9)

(2) Wrong code removal: Similar to the above task, we construct a code sequence from L_tgt, but randomly add a wrong code l_wro. The decoder aims to suppress the wrong code's representation by mapping it to a special mask representation l_mask.
L_rem = −log P(l_mask | L_tgt ∪ {l_wro}, X)    (10)

With these two tasks, the decoder learns to generate purified code-relevant information of higher quality. The decoder is plugged in to refine the shared representations H^S into H^{S'}, so the subsequent dot product scores are calculated on H^{S'}.
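The construction of training instances for the two completion tasks can be illustrated in plain Python. The helper names, the mask token, and the example codes are all hypothetical, not taken from the authors' implementation:

```python
import random

MASK = "<mask>"

def missing_code_example(gold_codes, rng):
    # Task (1): hide one gold code; the decoder must recover it from the
    # remaining codes plus the clinical note X.
    target = rng.choice(gold_codes)
    inputs = [MASK if c == target else c for c in gold_codes]
    return inputs, target

def wrong_code_example(gold_codes, all_codes, rng):
    # Task (2): inject a code that does not belong; the decoder should map
    # the injected slot to the special mask representation.
    wrong = rng.choice([c for c in all_codes if c not in gold_codes])
    return gold_codes + [wrong], MASK

rng = random.Random(0)
gold = ["250.00", "357.2", "401.9"]
inputs, target = missing_code_example(gold, rng)
inputs2, target2 = wrong_code_example(gold, gold + ["427.31", "584.9"], rng)
```

Both tasks supervise the same decoder, forcing it to model which codes plausibly co-occur with which.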

Self-distillation Learning Mechanism
We argue that learning the desired shared attention distribution over such long clinical text is difficult: the attention weights α_i tend to be smooth, which brings in lots of unnecessary noise. Therefore, we propose a self-distillation learning mechanism, shown by the gray dotted lines in Figure 3. With this mechanism, the model can learn superior intermediate representations from itself without introducing a separately trained model. Given a single clinical note X with target code set L_tgt for training, we derive two input paths to the model. The teacher's input consists of the code text descriptions X^{L_tgt} = {X^{l_1}, X^{l_2}, ..., X^{l_{N_{l_tgt}}}}. We encode these code descriptions separately and concatenate them into a flat sequence of hidden states H^{L_tgt} = {H^{l_1}; H^{l_2}; ...; H^{l_{N_{l_tgt}}}}, where N_{l_tgt} is the number of codes in L_tgt, so the subsequent processing in our model is unaffected. We optimize the teacher's prediction result ŷ^{tgt}_i through a binary cross-entropy loss.
The student takes the original clinical note X as input and is also optimized with a BCE loss. We assume that an original clinical note with thousands of words contains the information for all desired codes, along with many less essential words, whereas the teacher's input contains exactly the information indicating the codes to be predicted, without any noise. Ideal shared representations obtained from attention are supposed to collect only code-relevant information. Hence we treat the teacher's shared representations H^{L_tgt} as a perfect example for the student. A distillation loss encourages the two representation sequences to be similar.

L_dist = min_σ Σ_i (1 − cosine(H^{L_tgt}_i, H^S_{σ(i)}))

Since we treat the shared representations without an order restriction, every teacher representation may choose a suitable student counterpart while also accounting for the other teachers' choices. The matching is implemented with the Hungarian algorithm (Kuhn, 1955), which finds the assignment with the globally minimal total cosine distance, where σ denotes a permutation of the original representation sequence.
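A toy version of the matching step is sketched below, using brute-force search over permutations in place of the Hungarian algorithm (equivalent for the tiny sizes used here; at scale one would use a proper assignment solver such as SciPy's linear_sum_assignment). The tensor shapes are made up:

```python
import numpy as np
from itertools import permutations

def cosine(u, v):
    return (u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))

def distill_loss(H_teacher, H_student):
    # Total cosine distance under the globally optimal one-to-one matching
    # between teacher and student shared representations.
    n = len(H_teacher)
    cost = np.array([[1.0 - cosine(H_teacher[i], H_student[j])
                      for j in range(n)] for i in range(n)])
    return min(sum(cost[i, p[i]] for i in range(n))
               for p in permutations(range(n)))

rng = np.random.default_rng(2)
teacher = rng.normal(size=(4, 16))
student = teacher[[2, 0, 3, 1]]          # same vectors, shuffled order
loss = distill_loss(teacher, student)    # near zero: matching undoes shuffle
```

Because the loss is computed over the best matching rather than position by position, the student is free to store code-relevant information in any of its shared slots.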

Training
The complete training pipeline of our method is shown in Figure 3. The final loss function is the weighted sum of the above losses.
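The combination can be sketched as follows. The individual loss values here are made up, and the weight on the main BCE loss is an assumption (taken as 1.0); the λ values match those reported in the parameter settings:

```python
def total_loss(losses, lambdas):
    # Weighted sum of the individual supervisory signals.
    return sum(lambdas[name] * value for name, value in losses.items())

# Hypothetical per-batch loss values; lambda weights follow our settings,
# with the weight of the main BCE loss assumed to be 1.0.
losses = {"bce": 0.7, "mis": 0.4, "rem": 0.3, "tgt": 0.6, "dist": 0.2}
lambdas = {"bce": 1.0, "mis": 5e-4, "rem": 5e-4, "tgt": 0.5, "dist": 1e-3}
total = total_loss(losses, lambdas)
```

The small weights on the completion and distillation terms keep these auxiliary signals from dominating the main prediction loss.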

Datasets
For a fair comparison, we follow the datasets used by previous work on ICD coding (Mullenbach et al., 2018) (Cao et al., 2020), namely MIMIC-II (Jouhet et al., 2012) and MIMIC-III (Johnson et al., 2016), where MIMIC-III is an extension of MIMIC-II. Both datasets contain discharge summaries manually tagged with sets of ICD-9 codes. Our dataset preprocessing is consistent with (Mullenbach et al., 2018).

Metrics and Parameter Settings
As in previous work (Mullenbach et al., 2018), we evaluate our method using micro and macro F1 and AUC, as well as P@8, the proportion of correctly predicted codes among the top-8 predictions. We implement our method in PyTorch (Paszke et al., 2019). We perform a grid search over all hyperparameters for each dataset; parameter selections are based on the trade-off between validation performance and training efficiency. We set the word embedding size to 100. We build the vocabulary using the CBOW Word2Vec method (Mikolov et al., 2013) to pre-train word embeddings on all MIMIC data, keeping the 52,254 most frequent words. The multi-scale convolution filter sizes are 5, 7, 9, and 11, and each filter's output size is one quarter of the code embedding size. We set the code embedding size to 128 for MIMIC-II and 256 for MIMIC-III. The shared representation size is 64. We use a two-layer Transformer for the interactive decoder. For the loss function, we set λ_mis = 5e-4, λ_rem = 5e-4, λ_tgt = 0.5, and λ_dist = 1e-3 to adjust the scales of the different supervisory signals. We use Adam for optimization with an initial learning rate of 3e-4, keeping other settings at their defaults.
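The P@8 metric is simple to compute; the scores and the gold code set below are a made-up example:

```python
def precision_at_k(scores, gold, k=8):
    # Proportion of gold codes among the k highest-scored predictions.
    topk = sorted(scores, key=scores.get, reverse=True)[:k]
    return sum(code in gold for code in topk) / k

scores = {"401.9": 0.9, "428.0": 0.8, "427.31": 0.7, "414.01": 0.6,
          "250.00": 0.5, "584.9": 0.4, "518.81": 0.3, "599.0": 0.2,
          "V45.81": 0.1}
gold = {"401.9", "427.31", "250.00", "584.9"}
print(precision_at_k(scores, gold))  # 0.5
```

Here four of the eight highest-scored codes are in the gold set, giving P@8 = 0.5.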

Baselines
We compare our method with the following baselines: HA-GRU: A Hierarchical Attention Gated Recurrent Unit model proposed by (Baumel et al., 2017) to predict ICD codes on the MIMIC-II dataset.
CAML & DR-CAML: (Mullenbach et al., 2018) proposed the Convolutional Attention Network for Multi-Label classification (CAML), which learns an attention distribution for each label. DR-CAML (Description Regularized CAML) is an extension incorporating the text descriptions of codes.
MSATT-KG: The Multi-Scale Feature Attention and Structured Knowledge Graph Propagation model captures variable n-gram features and selects multi-scale features through a densely connected CNN and a multi-scale feature attention mechanism. A GCN is also employed to capture the hierarchical relationships among medical codes.
MultiResCNN: The Multi-Filter Residual Convolutional Neural Network was proposed by (Li and Yu, 2020). It utilizes a multi-filter convolutional layer to capture variable n-gram patterns and a residual mechanism to enlarge the receptive field.
HyperCore: Hyperbolic and Co-graph Representation was proposed by (Cao et al., 2020). It explicitly models the code hierarchy through hyperbolic embeddings and learns code co-occurrence through a GCN.
LAAT & JointLAAT: The Label Attention model (LAAT) for ICD coding was proposed by (Vu et al., 2020), learning attention distributions over LSTM hidden states for each code. JointLAAT extends LAAT with hierarchical joint learning.

Compared with State-of-the-art Methods
The left parts of Table 1 and Table 2 show the results of our method on the MIMIC-III and MIMIC-II datasets with the full ICD code sets. Compared with previous methods that generate an attention distribution for each code, our method achieves better results on most metrics, indicating the effectiveness of the shared attention mechanism. It is noteworthy that the macro results improve more significantly than the micro results relative to previous methods. Since the macro indicators are mainly affected by the performance on tail codes, this shows that our approach benefits from the interactive shared representations among codes of different frequencies.
Compared with the static code co-occurrence interaction implemented in (Cao et al., 2020), our method achieves higher scores, indicating that the dynamic code interaction module captures more complex interactive information than the limited steps of message passing in a GCN.
The right part of Table 1 shows the results of our method on the MIMIC-III dataset with the 50 most frequent codes, demonstrating that our approach's performance does not fall behind on a more balanced label set.

Ablation Experiments
To investigate the effectiveness of the proposed components, we perform ablation experiments on the MIMIC-III-full dataset. The ablation results in Table 3 indicate that none of the ablated models achieves a result comparable to our full version, demonstrating that each component contributes a certain improvement to our model.
(1) Effectiveness of Self-distillation. When we discard the whole self-distillation part (w/o self-distillation), the performance drops, demonstrating the effectiveness of self-distillation. To further investigate whether the module's contribution merely comes from the additional constructed training data, we retain the teacher path but remove the loss between shared representations (w/o distillation loss); the performance still slightly drops. We conclude that although the constructed training data in the teacher path has a positive effect, the distillation itself still plays a role.
(2) Effectiveness of Shared Representation. When we remove the self-distillation mechanism (w/o self-distillation), the contribution of the shared representations can be deduced by comparison with CAML: our version still holds a 1.1% advantage in macro F1, indicating the effectiveness of the shared representations. (3) Effectiveness of Code Completion Tasks. When we drop the missing code completion and wrong code removal tasks (w/o code completion tasks), the code interaction decoder is optimized with the final prediction loss only. The performance is then even worse than that of the model without the whole code interaction module (w/o co-occurrence decoder), indicating that the additional code completion tasks are the guarantee for modeling dynamic code co-occurrence. Further, compared with a model using label attention instead of our shared representations (w/o shared representation), the performance is also worse, showing that the code completion tasks likewise underpin the effectiveness of the shared representations. Without this self-supervision, the shared information is obscure and performance drops due to the addition of dubiously oriented model parameters.

Discussion
To further explore the proposed interactive shared attention mechanism, we compare various numbers of shared representations in our method, and visualize the attention distributions of different shared representations over the source text, as well as the information they extract.
(1) The Analysis of Shared Representation Size. As shown in Table 4, both overly large and overly small sizes harm the final performance. When the size is set to 1, the shared representation degrades into a single global representation; one vector compelled to predict multiple codes causes the performance to drop, as Table 4 shows. We also tried initializing the shared embeddings from ICD's hierarchical parent nodes. Specifically, there are 1,159 unique three-character prefixes in the raw ICD code set of MIMIC-III-full, and we initialize the shared embeddings with the mean vectors of their corresponding child codes. Although this introduces hierarchical prior knowledge, the computational complexity and uneven node selection make the model hard to optimize and prone to overfitting high-frequency parent nodes.

Table 5: Attention distribution visualization over a clinical note for different shared representations. Clinical note: "chief complaint elective admit major surgical or invasive procedure recoiling acomm aneurysm history of present illness on she had a crushing headache but stayed at home the next day ... angiogram with embolization and or stent placement medication take aspirin 325mg ..." Codes: 437.3 (cerebral aneurysm, nonruptured); 39.75 (endovascular repair of vessel); 88.41 (arteriography of cerebral arteries). We determine the shared representations according to the target codes' choices. Since the attention scores are calculated over hidden states encoded by the multi-scale CNN, we take the most salient word as the center of a 5-gram and highlight it.

Table 6: Standard deviation of attention weights.
Model | Standard Deviation
ISD (Ours) | 0.013992
w/o self-distillation | 0.004605
(2) Visualization of Shared Attention Distribution. The attention distributions of different shared representations, shown in Table 5, indicate that they have learned to focus on different source text patterns in the noisy clinical note to represent code-relevant information.
(3) The Analysis of Self-distillation. As shown in Table 6, the attention weights over clinical text learned with the self-distillation mechanism are sharper than those from the original learning process. Combined with Table 5, we conclude that the self-distillation mechanism helps the model focus more on the desired words of the clinical text.

Conclusion
This paper proposes an interactive shared representation network and a self-distillation mechanism for the automatic ICD coding task, addressing the long-tail and noisy text issues. The shared representations bridge the gap between the learning processes of frequent and rare codes, and the code interaction module models dynamic code co-occurrence, further improving the performance on tail codes. Moreover, the self-distillation learning mechanism helps the shared representations focus on code-related information in noisy clinical notes. Experimental results on two MIMIC datasets indicate that our proposed model significantly outperforms previous state-of-the-art methods.