ICDBigBird: A Contextual Embedding Model for ICD Code Classification

The International Classification of Diseases (ICD) system is the international standard for classifying diseases and procedures during a healthcare encounter and is widely used for healthcare reporting and management purposes. Assigning correct codes for clinical procedures is important for clinical, operational and financial decision-making in healthcare. Contextual word embedding models have achieved state-of-the-art results in multiple NLP tasks. However, these models have yet to achieve state-of-the-art results in the ICD classification task, since one of their main disadvantages is that they can only process documents that contain a small number of tokens, which is rarely the case with real patient notes. In this paper, we introduce ICDBigBird, a BigBird-based model that integrates a Graph Convolutional Network (GCN), which takes advantage of the relations between ICD codes in order to create 'enriched' representations of their embeddings, with a BigBird contextual model that can process longer documents. Our experiments on a real-world clinical dataset demonstrate the effectiveness of our BigBird-based model on the ICD classification task, as it outperforms the previous state-of-the-art models.


Introduction
Real-world data in healthcare refers to patient data routinely collected during clinic encounters such as visits and hospitalizations. After each clinical visit, a set of codes representing diagnostic and procedural information is submitted to various regulatory agencies (Farkas and Szarvas, 2008). The International Classification of Diseases (ICD) system is the most widely used coding system, maintained by the World Health Organization (Avati et al., 2018). Assigning the most appropriate codes is an important task in healthcare, since erroneous ICD codes could seriously affect an organization's ability to accurately measure patient outcomes (Ji et al., 2020).
Contextual word embedding models (such as ELMo (Peters et al., 2018) and BERT (Devlin et al., 2019)) have achieved state-of-the-art results in many NLP tasks. However, recent attempts to use contextual models on the ICD classification task have failed to achieve state-of-the-art results (Zhang et al., 2020), mainly due to the fact that they can only process documents that contain a small number of tokens. Advances such as the BigBird model (Zaheer et al., 2020) allow contextual models to process long documents, thus reducing the risk of losing information from the original texts.
In this paper, we present a novel model for the ICD classification task. Specifically: (i) we are the first, to the best of our knowledge, to propose the combined usage of a Graph Convolutional Network (based on the normalized point-wise mutual information) and a contextual embedding model for the ICD classification task; (ii) we introduce a novel attention layer on top of a BigBird model which has the ability to process long documents; and (iii) our experiments on a real-world clinical dataset demonstrate the effectiveness of our ICD-BigBird model on the ICD classification task as it outperforms previous state-of-the-art models.

ICD Graph Convolutional Network
A Graph Convolutional Network (GCN) (Kipf and Welling, 2017) is a neural network architecture that can capture the general knowledge about the connections between entities. Specifically, GCN builds a symmetric adjacency matrix based on a predefined relationship graph, and the representation of each node is calculated according to its neighbours.
We use a GCN to capture a more 'enriched' representation for each of the ICD codes. To build the ICD-GCN, we first construct the adjacency matrix A ∈ R^{n×n} (where n is the number of unique ICD codes) to represent the connections between ICD codes using the normalized point-wise mutual information (NPMI) (Lu et al., 2020):

NPMI(i, j) = \frac{\log\left( p(i, j) / (p(i)\, p(j)) \right)}{-\log p(i, j)}

where i and j are different ICD codes, p(i, j) = N(i, j)/N and p(i) = N(i)/N, N(i, j) is the number of documents labeled with both the i and j codes, N(i) is the number of documents labeled with the i code, and N is the total number of documents in the training set that our model was trained on. We create an edge between two codes if their NPMI value is greater than a threshold, which we empirically set to 0.2 after experimenting with different threshold values.
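The NPMI-based adjacency construction above can be sketched as follows. This is a minimal illustration, not the authors' released code; the function name, input format (one set of code ids per document) and the self-loop convention are assumptions.

```python
import math
from itertools import combinations

def npmi_adjacency(doc_labels, n_codes, threshold=0.2):
    """Build a binary adjacency matrix from the normalized point-wise
    mutual information (NPMI) between ICD code co-occurrences.
    doc_labels: list of sets of integer code ids, one set per document."""
    N = len(doc_labels)
    single = [0] * n_codes          # N(i): documents containing code i
    pair = {}                       # N(i, j): documents containing both i and j
    for labels in doc_labels:
        for i in labels:
            single[i] += 1
        for i, j in combinations(sorted(labels), 2):
            pair[(i, j)] = pair.get((i, j), 0) + 1

    A = [[0.0] * n_codes for _ in range(n_codes)]
    for (i, j), n_ij in pair.items():
        p_ij = n_ij / N
        p_i, p_j = single[i] / N, single[j] / N
        # NPMI(i, j) = log(p(i,j) / (p(i) p(j))) / (-log p(i,j))
        npmi = 1.0 if p_ij == 1.0 else math.log(p_ij / (p_i * p_j)) / (-math.log(p_ij))
        if npmi > threshold:
            A[i][j] = A[j][i] = 1.0
    for i in range(n_codes):
        A[i][i] = 1.0               # self-loops, as is standard for GCNs
    return A
```

Codes that always co-occur get an NPMI of 1 and are connected; codes that never co-occur have no pair entry and stay disconnected.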
It should be noted that we decided to create the adjacency matrix of the ICD-GCN by taking advantage of the NPMI values instead of considering the hierarchical associations of the ICD codes because we mainly focused on the task of classifying the top 50 most frequent ICD codes (Shi et al., 2017), where we found that there exists little to no hierarchical connection between these codes.
We then construct a definition (sentence) embedding matrix for all the ICD codes using their ICD-9 (sentence) definitions from the MIMIC III dataset (Johnson et al., 2016) and the pre-trained sentence transformer embedding model in (Reimers and Gurevych, 2019), which has been shown to outperform other state-of-the-art sentence embedding methods.
An updated representation of all ICD codes from the ICD-GCN is calculated as follows:

H_{gcn} = \hat{A} X W

where X ∈ R^{n×m} is the definition embedding matrix, n is the number of ICD codes, m is the size of the sentence-definition embedding of each ICD code, W ∈ R^{m×h} is the weight matrix, h is the BigBird's hidden dimension, and \hat{A} = D^{-1/2} A D^{-1/2} is the normalized symmetric adjacency matrix, where D is the diagonal degree matrix with D_{ii} = \sum_j A_{ij}. Finally, we concatenate the output of the ICD-GCN with the initial embeddings of the ICD codes in order to get a richer representation of the codes (Rios and Kavuluru, 2018):

U = [X \,;\, H_{gcn}], \quad U ∈ R^{n×(m+h)}
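The GCN propagation and concatenation step can be sketched numerically as below. The ReLU nonlinearity is an assumption (a common choice for GCN layers; the activation is not specified here), and the function name is illustrative.

```python
import numpy as np

def gcn_enriched_codes(X, A, W):
    """One GCN layer over the ICD code graph, followed by concatenation
    with the original definition embeddings (a sketch).
    X: (n, m) definition embeddings; A: (n, n) adjacency with self-loops;
    W: (m, h) weight matrix."""
    D = A.sum(axis=1)                          # node degrees
    D_inv_sqrt = np.diag(1.0 / np.sqrt(D))
    A_hat = D_inv_sqrt @ A @ D_inv_sqrt        # Â = D^{-1/2} A D^{-1/2}
    H_gcn = np.maximum(A_hat @ X @ W, 0.0)     # ReLU(Â X W), shape (n, h); activation assumed
    return np.concatenate([X, H_gcn], axis=1)  # U ∈ R^{n×(m+h)}
```

The concatenation preserves the raw definition embedding of each code alongside its neighborhood-aware representation.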

ICDBigBird Model
Assume a discharge summary has n words; the model's tokenizer generates tokens for each word in the document. The tokens are then passed through multiple attention-based layers and the model produces the final contextual representation of the document H ∈ R^{t×h}, where t = 4096 is the number of tokens and h is the BigBird's hidden dimension. We use a fully connected linear layer to create \hat{H}, the final embedding representation of the BigBird's embeddings:

\hat{H} = H W_1

where \hat{H} ∈ R^{t×(m+h)} and W_1 ∈ R^{h×(m+h)}. Afterwards, we apply a per-label attention mechanism in order to highlight the information in the contextual representation of each document that is most relevant to each ICD code. Formally, using U ∈ R^{n×(m+h)}, the 'updated' ICD code sentence-definition embedding matrix, we compute the attention as:

A = softmax(U \hat{H}^{T})

where A ∈ R^{n×t}. After the calculation of the attention scores, the output of the attention layer is calculated as:

V = A \hat{H}

where V ∈ R^{n×(m+h)}. Given the 'updated' representation V, we compute a probability for each label l by using a pooling operation and a sigmoid transformation over the linear projection of V:

\hat{y}_l = \sigma\Big( \sum_{j} V_{l,j}\, W_{l,j} \Big)

where W ∈ R^{n×(m+h)}. As the ICD task is a multi-label scenario, the loss function that is typically used is a multi-label binary cross-entropy loss:

L_{BCE}(y, \hat{y}) = -\sum_{l} \big[ y_l \log \hat{y}_l + (1 - y_l) \log(1 - \hat{y}_l) \big]
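The per-label attention and sigmoid scoring described above can be sketched as follows. This is a simplified NumPy illustration under the stated shapes, not the actual PyTorch implementation; the function and variable names are ours.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def per_label_attention(H_hat, U, W_out):
    """Per-label attention over token representations (a sketch).
    H_hat: (t, m+h) projected token embeddings;
    U:     (n, m+h) enriched ICD code embeddings;
    W_out: (n, m+h) per-label output weights."""
    A = softmax(U @ H_hat.T, axis=1)       # (n, t): attention of each code over tokens
    V = A @ H_hat                          # (n, m+h): code-specific document repr.
    logits = (V * W_out).sum(axis=1)       # linear projection, one logit per code
    return 1.0 / (1.0 + np.exp(-logits))   # sigmoid -> per-code probability
```

Each ICD code attends to the tokens most relevant to it, so the document representation differs per label rather than being a single pooled vector.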
where y is the ground-truth label vector and \hat{y} are the ICD codes that our model predicted for each document. However, due to the extremely imbalanced nature of the ICD codes, we chose to adopt the Label-Distribution-Aware Margin (LDAM) loss (Cao et al., 2019). In the LDAM loss function, a label-dependent margin ∆_i is subtracted from the output value before the sigmoid function:

\hat{y}^{\Delta}_i = \sigma\Big( \sum_{j} V_{i,j}\, W_{i,j} - \mathbb{1}(y_i = 1)\, \Delta_i \Big)

where 1(.) outputs 1 if y_i = 1, and \Delta_i = C / n_i^{1/4}, where n_i is the number of instances of the i-th ICD code in the training data and C is a constant. Thus we use L_{LDAM} = L_{BCE}(y, \hat{y}^{\Delta}).
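The LDAM-adjusted binary cross-entropy can be sketched as below, adapting the margin of Cao et al. (2019) to the multi-label sigmoid setting described here; the function name and the per-code count input are illustrative assumptions.

```python
import numpy as np

def ldam_bce(logits, y, n_per_code, C=2.0):
    """Label-Distribution-Aware-Margin BCE (a sketch).
    logits, y: (n,) arrays of per-code logits and 0/1 labels;
    n_per_code: training-set frequency of each ICD code;
    C: margin constant (set to 2 in the paper's experiments)."""
    delta = C / np.power(np.asarray(n_per_code, dtype=float), 0.25)
    z = logits - y * delta                  # subtract margin only where y_i = 1
    p = 1.0 / (1.0 + np.exp(-z))
    eps = 1e-12                             # numerical safety for log
    return -np.mean(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))
```

Rare codes get a larger margin Δ_i, forcing the model to separate them from the decision boundary more confidently.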

Dataset
Following previous research work in the ICD classification task (Mullenbach et al., 2018; Ji et al., 2020; Li and Yu, 2020), we conduct our experiments on the MIMIC III dataset (Johnson et al., 2016). Our experiments on this dataset are consistent with its intended use, as it was created and shared for research purposes (as stated in its license 1). We also manually checked the dataset for information that uniquely identifies individual people and for offensive content, and found no indication of either. We extract the free-text discharge summaries and clinical notes containing the 50 most frequent ICD codes from the MIMIC III dataset, and we concatenate the discharge summaries and notes from the same hospitalization admission into a single document. We use the training/validation/testing split from (Mullenbach et al., 2018; Li and Yu, 2020) for a fair comparison; our subset of MIMIC-III contains 8066 documents for training, 1573 for validation and 1729 for testing. Following the preprocessing procedure outlined in (Ji et al., 2020), the documents are tokenized and each token is converted to lowercase; any token that contains no alphabetic characters is removed. Instead of truncating the documents to 2500 words, we set the token limit to 4096 so that our ICDBigBird model can take full advantage of the information in each document, as 1345 documents contain more than 2500 words (with a maximum, minimum and average length of 7567, 105 and 1609 words respectively).
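The preprocessing steps described above (tokenize, lowercase, drop non-alphabetic tokens, truncate to the BigBird limit) can be sketched as follows. This is a simplified stand-in for the pipeline of Ji et al. (2020); the whitespace tokenizer and function name are assumptions.

```python
import re

MAX_TOKENS = 4096  # BigBird input limit used in this work

def preprocess(document):
    """Tokenize a clinical note, lowercase it, drop tokens containing
    no alphabetic characters, and truncate to the token limit (a sketch)."""
    tokens = [t.lower() for t in re.findall(r"\S+", document)]
    tokens = [t for t in tokens if any(c.isalpha() for c in t)]
    return tokens[:MAX_TOKENS]
```

Purely numeric tokens such as dates and blood-pressure readings are removed, while mixed tokens containing letters are kept.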

Experimental Setup
We provide the search strategy and the bounds for each hyperparameter as follows: the batch size is set between 32 and 64, and the learning rate is chosen among 2e-5, 3e-5 and 5e-5. We set the number of training epochs between 25 and 30 to allow for maximal performance. The best values are chosen based on micro-F1 scores on the validation set (computed with the evaluation code from https://github.com/jamesmullenbach/caml-mimic). The final hyperparameter selection of our ICDBigBird model is a batch size of 32 and a learning rate of 2e-5, trained for 30 epochs, and we empirically set the C constant of the LDAM loss to 2. We also use the AdamW optimizer (Loshchilov and Hutter, 2019) to optimize the parameters of the model. All the contextual embedding models are implemented using the transformers library (Wolf et al., 2020) on PyTorch 1.7.1. All experiments are executed on a Tesla K80 GPU with 64GB of system RAM on Ubuntu 18.04.5 LTS.

Results
We benchmark our ICDBigBird model against existing state-of-the-art models on the task of classifying the top 50 most frequent ICD codes. For all models we evaluate the micro- and macro-averaged F1 score, the area under the receiver operating characteristic curve (AUC-ROC) and the precision at k codes with k=5 (P@5). In Table 1, we can observe that our model outperforms all other models in micro- and macro-averaged F1 and in the P@5 score, with comparable performance on the other two metrics (with the DCAN model (Ji et al., 2020) achieving the best AUC-ROC results). Finally, our model contains 110,565,170 parameters, with an average running time of 893,354 seconds.
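The P@5 metric used above can be sketched as follows: for each document, take the 5 highest-scoring codes and measure what fraction are true labels. The function name and array layout are illustrative.

```python
import numpy as np

def precision_at_k(y_true, y_score, k=5):
    """Precision@k averaged over documents (a sketch).
    y_true:  (docs, codes) binary label matrix;
    y_score: (docs, codes) predicted probabilities."""
    topk = np.argsort(-y_score, axis=1)[:, :k]          # indices of top-k codes
    hits = np.take_along_axis(y_true, topk, axis=1)     # 1 where a top-k code is correct
    return hits.mean()
```

Unlike F1, P@5 does not depend on a decision threshold, which makes it convenient for comparing ranking quality across models.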

Ablation Study
In order to evaluate the effect of each feature on the performance of ICDBigBird, we conduct an ablation study. The results are presented in Table 1. (i) Firstly, we investigate whether the ability of the BigBird model to process long documents boosts the performance of our model. It can be observed that contextual model architectures that can process only short documents of at most 512 tokens (BERT, BioBERT, Bio_ClinicalBERT) cannot match the performance of a BigBird architecture, even when pre-trained on medical documents (BioBERT and Bio_ClinicalBERT). (ii) Furthermore, we examine the performance of the BigBird model when we artificially limit the length of the documents to 512 tokens (BigBird 512 tokens), which is the maximum number of tokens that the BERT model can process. We observe that the performance improvement brought by the BigBird model is lost, making its performance equivalent to that of the BERT model. This experiment demonstrates that one of the main reasons the BigBird model outperforms the BERT model is its utilization of the additional information in longer documents (4096 tokens) for the ICD automatic encoding task. (iii) In addition, we examined the effect of the GCN by testing the performance of the contextual embeddings without enriching them with information from the code definitions: first by removing the attention mechanism and placing an ICD classifier on top of the [CLS] token (BigBird without attention), and second by substituting the GCN attention mechanism with a typical linear attention mechanism (Linear Attention) (Mullenbach et al., 2018). It can be observed that our model benefits from the attention mechanism, as it cannot achieve optimal performance without it.
Also, the fact that the GCN graph attention mechanism achieves better performance than a typical linear attention mechanism is a strong indication that the connections between the ICD codes provide valuable information. (iv) Finally, we investigated the effect of using the definitions of the codes to initialize their embeddings. In our experiments, a model with a random initialization of the code embeddings (R. embedding) achieved sub-optimal performance, and thus we can conclude that using the codes' definitions to initialize their embeddings has a positive effect on the model's performance.

Discussion-Related Work
Recent developments in NLP have introduced deep learning models that achieve strong performance on the ICD classification task. In (Shi et al., 2017), the authors introduced a model that used word/character embeddings and recurrent neural networks (LSTMs) to generate representations of the diagnosis descriptions and of the ICD codes. In addition, the authors in (Mullenbach et al., 2018) introduced an attention-based convolutional neural network (CNN) model which incorporates an attention mechanism in order to identify the most relevant segments that contain medical information.
Furthermore, prior work has explored the use of GCNs for the ICD classification task (Rios and Kavuluru, 2018; Chalkidis et al., 2020), and our attention mechanism can be viewed as an extension of the structured attention mechanism of (Cao et al., 2020). However, our model differs in several respects: (i) Our work uses normalized point-wise mutual information to create the edges, while the model in (Cao et al., 2020) used raw co-occurrence counts to create a weighted graph. This is a key difference in the ICD coding problem, as the method in (Cao et al., 2020) does not capture the relation between two highly correlated but 'unpopular' codes. (ii) In addition, the authors in (Cao et al., 2020) created the code embedding vectors by averaging the word embeddings of each code's descriptor, whereas our work uses pre-trained sentence embedding models, which have achieved better performance. (iii) Finally, the model in (Cao et al., 2020) used a Convolutional Neural Network (CNN) encoder, while our work uses a contextual (BigBird) model to produce document embeddings.
The results of the experiments indicate that these changes are important for the ICD classification task by demonstrating that a contextual model can achieve state-of-the-art results for this task.

Conclusion and Future Work
We present the ICDBigBird model, a novel contextual model for the ICD coding task. ICDBigBird has the ability to integrate a graph embedding model that takes advantage of the relations between ICD codes with a BigBird contextual model that can process longer documents. Experiments on the MIMIC III dataset have shown that the ICDBigBird model outperforms previous state-of-the-art models. As for future work, we plan to address the limitations of this study, including (i) testing ICDBigBird on other medical datasets to examine its generalizability, strengths and limitations, (ii) experimenting on the task of classifying the full ICD code set and (iii) examining the performance of the model on datasets in other languages (Almagro et al., 2020).

Acknowledgement
We acknowledge the generous support from Amazon Research Awards, a MITACS Accelerate Grant (#IT19239) and Semantic Health Inc.

Ethical Consideration
The ICD coding task is crucial for making clinical, operational and financial decisions in healthcare. Traditionally, medical coders review clinical documents and manually assign the appropriate ICD codes by following specific coding guidelines. Models such as our ICDBigBird could significantly reduce the time and cost of data extraction and reporting.
However, we need to be aware of the risks of over-relying on any automatic encoding model. No matter how efficient an automatic encoding model is, it is still possible to misclassify patients' condition with erroneous ICD codes which may affect their treatment. Thus we believe that any automatic encoding model should only be used to assist, not replace the judgement of trained clinical professionals.