Medical Code Assignment with Gated Convolution and Note-Code Interaction

Medical code assignment from clinical text is a fundamental task in clinical information system management. As medical notes are typically lengthy and the code space of medical coding systems is large, this task is a long-standing challenge. Recent work applies deep neural network models to encode medical notes and assign medical codes to clinical documents. However, these methods remain ineffective: they neither fully encode the lengthy, semantically rich medical notes nor explicitly exploit the interactions between notes and codes. We propose a novel method, a gated convolutional neural network with note-code interaction (GatedCNN-NCI), for automatic medical code assignment that overcomes these challenges. Our method captures the rich semantic information of lengthy clinical text by utilizing embedding injection and gated information propagation in the medical note encoding module. With a novel note-code interaction design and a graph message passing mechanism, we explicitly capture the underlying dependency between notes and codes, enabling effective code prediction. A weight sharing scheme is further designed to decrease the number of trainable parameters. Empirical experiments on real-world clinical datasets show that our proposed model outperforms state-of-the-art models in most cases, and our model size is on par with lightweight baselines.


Introduction
Automatic medical code assignment is a routine healthcare task for medical information management and clinical decision support. The International Classification of Diseases (ICD) coding system, maintained by the World Health Organization (WHO), is widely used among various coding systems; thus, the medical code assignment task is also called ICD coding. The task uses all types of clinical notes to predict medical codes in a supervised manner with human-annotated codes (Perotte et al., 2014) and is formulated as a multi-class, multi-label text classification problem in the medical domain.
While there is a growing body of work on automatic medical code assignment (Prakash et al., 2017; Shi et al., 2017; Mullenbach et al., 2018; Ji et al., 2020), this task remains challenging from the perspectives of note representation and code prediction. First, medical note representation, a critical step in understanding medical notes, is formidably challenging due to the lengthy and complex semantic information in discharge documents. A medical note typically contains thousands of tokens because of the various diagnoses and procedures a patient undergoes. Furthermore, clinical notes use a vocabulary with many professional words and phrases, making it hard for a neural network model to encode and understand the critical information. Second, the medical coding system has a very high-dimensional and sparse label space, which renders code prediction incredibly difficult. For example, the ICD-9 and ICD-10 coding systems contain more than 14,000 and 68,000 codes, respectively, yet a patient is typically diagnosed with only a couple of codes from the whole coding space.
Early works on medical code assignment typically followed statistical approaches: they either employed rule-based methods (Farkas and Szarvas, 2008) or applied classification methods such as SVM and Bayesian ridge regression (Lita et al., 2008) to assign codes. These methods are shallow and do not exploit the complex semantic information in medical notes, leading to unsatisfactory performance. Recently, natural language processing (NLP) techniques based on deep learning have been developed (Mullenbach et al., 2018; Li and Yu, 2020; Cao et al., 2020; Ji et al., 2020), which learn the note representation via convolutional neural networks. Specifically, CAML (Mullenbach et al., 2018), MultiResCNN (Li and Yu, 2020) and DCAN (Ji et al., 2020) treat ICD coding as a general text classification problem and develop complex neural encoders to learn the note representation. HyperCore (Cao et al., 2020) proposes hyperbolic embeddings to capture code hierarchy and co-occurrence. However, these approaches are still ineffective, as they do not explicitly capture the fine-grained interactions between textual elements and medical codes. These interactions naturally represent the interdependencies between complex medical words and the associated codes, and thus should be well exploited.

Figure 1: Illustration of the GatedCNN-NCI model architecture. The gating mechanism controls the information propagation. Textual features interact with each code vector in the note-code interaction module. FCN is a fully connected layer.
This paper puts forward a novel neural architecture, Gated Convolutional Neural Network with Note-Code Interaction (GatedCNN-NCI), for effective medical code assignment. Our goal is to learn rich representations from clinical notes and to exploit the interactions between medical texts and clinical codes. To capture the long sequential history of clinical documents, we design a novel dilated information propagation component with a forgetting mechanism to selectively utilize useful information for note representation learning. To tackle the large label space, we formulate textual notes and medical codes as a complete bipartite graph and develop a graph message passing approach to capture the explicit interaction between notes and codes. The ICD code descriptions are used as an external medical knowledge source to learn more accurate code representations that preserve the semantic relations of the codes. Considering practical applications in real-world medical institutes, especially those with limited computing resources, our architecture also prioritizes computational efficiency in the design of its sub-modules. Our contributions are itemized as follows.
• We propose a CNN-based neural architecture with dilation and a gating mechanism for clinical text encoding. We enhance feature representation learning with 1) embedding injection, which enhances the deeper features of lengthy clinical notes; and 2) a gating mechanism that controls the information propagation.

Related Work
Classical medical coding systems used rule-based methods (Farkas and Szarvas, 2008), studied feature selection (Medori and Fairon, 2010), and applied classification models such as SVM and Bayesian ridge regression (Lita et al., 2008). Neural approaches include attentive LSTM networks for matching diagnosis descriptions (Shi et al., 2017) and a GRU network with hierarchical attention (Baumel et al., 2018). Prakash et al. (2017) used Wikipedia as a knowledge source and proposed condensed memory networks (C-MemNNs) with iterative condensation of memory representations. Although CNNs are traditionally applied in computer vision, many ICD coding methods utilize convolutional architectures. CAML (Mullenbach et al., 2018) used a CNN with multiple filters and label attention. Li et al. (2018) adopted the doc2vec embedding and a CNN architecture, and Bai and Vucetic (2019) incorporated online knowledge sources. The recent MultiResCNN model (Li and Yu, 2020) extensively concatenated and stacked CNNs with multi-filter convolution and residual learning. HyperCore (Cao et al., 2020) utilized hyperbolic embedding and co-graph representation to capture the code hierarchy.

Problem Definition
The input clinical note with n words is denoted as x_{1:n} = x_1, . . ., x_n, where each x_i is a word (or token). The medical coding system is the set of all possible diagnosis and procedure codes, denoted as C. Medical code assignment learns a function f : (x_{1:n}, D) → y, where y ∈ R^m is the medical code vector at discharge, m is the number of medical codes, and D is an optional external knowledge source. This paper uses the ICD coding system and naturally utilizes the official textual ICD code descriptions as the external knowledge source.
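As a concrete illustration of this multi-label formulation (a minimal sketch; the function name and the 0.5 threshold are our own illustrative choices, not part of the model), the code vector y can be obtained by thresholding per-code prediction scores:

```python
def assign_codes(code_scores, threshold=0.5):
    """Turn per-code prediction scores (one per code in C, length m)
    into a binary multi-label assignment y in {0, 1}^m."""
    return [1 if score >= threshold else 0 for score in code_scores]
```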

High-level Model Architecture
The high-level model architecture of GatedCNN-NCI is illustrated in Fig. 1. Our model consists of two main components, i.e., stacked gated CNN layers for clinical note encoding and a note-code interaction module that fuses the external ICD code descriptions. The stacked gated CNNs include three sub-modules, i.e., dilated convolution, embedding injection, and a gating mechanism.
We use word2vec (Mikolov et al., 2013) to train word embeddings from raw tokens. The word embedding matrix of a clinical note is denoted as [w_1, . . ., w_n]^T ∈ R^{n×d_e}, where d_e is the dimension of the word vectors. We then input the word embeddings into stacked gated CNN layers for long-range information propagation. The stacked module uses dilated convolution as its backbone (Oord et al., 2016). To further enhance feature learning, we inject the original embedding into each stacked layer. The gating mechanism originates from the long short-term memory network (LSTM) (Hochreiter and Schmidhuber, 1997); we adopt the LSTM-like gate (Dauphin et al., 2017) to control the information flow.
To avoid blurry memory in higher layers, we inject the original word embeddings (Bai et al., 2019). Label interaction has been studied by Wang and Jiang (2016) and Du et al. (2019). We utilize descriptive knowledge from the ICD code descriptions and develop the note-code interaction to capture the relational match between clinical note features and ICD codes. To reduce the training cost and stabilize the training process, we also introduce a weight sharing mechanism across the stacked CNNs (Bai et al., 2019).

Dilated Convolutional Layers
We use one-dimensional convolution with dilation as the backbone of our encoder, which takes the word embeddings X ∈ R^{n×d_e} as input. Dilated CNNs have exhibited a significant capacity for long-sequence modeling and are computationally efficient due to parallelism (Bai et al., 2018). Specifically, we apply a 1D convolution operator Conv1D(x; f), with a filter f : {0, . . ., k − 1} → R, to each dimension of the word vectors. Given a sequence of one-dimensional elements x ∈ R^n, the one-dimensional dilated convolution F_d is denoted as

F_d(s) = Σ_{i=0}^{k−1} f(i) · x_{s−d·i},

where d is the dilation size (i.e., the space between kernel elements), s is the index of an element of the input sequence, k is the convolving kernel (a.k.a. filter) size, and s − d · i refers to past time steps. The dilation size d and the kernel size k control the receptive field. The 1D dilated convolution has d_h output channels, i.e., for each of the d_e input channels, d_h convolutional features are learned through the dilated Conv1D. CNN layers can be stacked to learn in-depth features.
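For a single channel, the dilated convolution described above can be sketched in plain Python (an illustrative, unoptimized version with zero padding for positions before the sequence start; actual models use library Conv1D layers):

```python
def dilated_conv1d(x, f, d):
    """Causal 1D dilated convolution over one channel of sequence x:
    F_d(s) = sum_{i=0}^{k-1} f[i] * x[s - d*i], where out-of-range
    past positions are treated as zeros (left padding)."""
    k = len(f)
    out = []
    for s in range(len(x)):
        acc = 0.0
        for i in range(k):
            t = s - d * i  # reach back d*i steps into the past
            if t >= 0:
                acc += f[i] * x[t]
        out.append(acc)
    return out
```

With kernel size k = 2 and dilation d = 2, each output mixes the current element with the one two steps earlier, which is how stacking layers with growing dilation enlarges the receptive field.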

Embedding Injection
Our hypothesis when encoding a very long clinical sequence is that a deep neural encoding architecture tends to forget important information, mainly because the clinical note contains a wealth of professional expressions about the patient's diagnoses; thus, in-depth features become blurry as the number of neural layers increases. We propose to inject the original word embeddings into each intermediate layer of the proposed architecture, reminding the network of the original diagnostic notes and mitigating the failure to extract meaningful in-depth features. We denote the hidden representation at the l-th layer as H^l ∈ R^{n×d_h}, where d_h is the hidden dimension. The word embeddings are concatenated into the l-th layer hidden representation as

J^l = [X, H^l],

where J^l ∈ R^{n×(d_e+d_h)} are the deep features enhanced with the original clues, used as the new input to the next convolutional encoding layer. We randomly initialize the H^0 matrix for the first convolutional layer.
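Embedding injection is a simple per-token concatenation along the feature axis; a minimal sketch using Python lists of per-token feature vectors (the function name is ours):

```python
def inject_embedding(X, H):
    """Concatenate the original word embeddings X (n x d_e) with the
    l-th layer hidden states H (n x d_h) along the feature axis,
    giving J with shape (n, d_e + d_h)."""
    assert len(X) == len(H), "one hidden state per token"
    return [x_row + h_row for x_row, h_row in zip(X, H)]  # list concat per token
```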

Gating Mechanism
Embedding injection of the original word vectors brings low-level features to higher levels, which may hinder feature learning in the higher layers. Thus, we develop an LSTM-style gating mechanism to control the information flow and capture a long history in the sequence. Unlike a recurrent gate such as the LSTM's, which controls the information flow along the time dimension, this gating mechanism controls the flow through the depth of the stacked layers. The gating mechanism is depicted in Fig. 2, where σ and tanh are the sigmoid and hyperbolic tangent activation functions, respectively. After the embedding injection, the dilated CNN upsamples the injected signal J^l into U^l ∈ R^{n×d_u} at the l-th layer. We divide U^l into four matrices with the same dimension, i.e., I, O, G and F ∈ R^{n×d_g}, such that

[I, O, G, F] = U^l.

Here, we have d_u = 4 × d_g. Then, these four matrices are fed into the LSTM-like gating module that controls what information should be propagated:

C^l = σ(F) ⊙ C^{l−1} + σ(I) ⊙ tanh(G),
H^l = σ(O) ⊙ tanh(C^l),

where C^l is the cell state at the l-th layer, H^l is the hidden state produced by the gated unit, and ⊙ denotes element-wise multiplication. The embedding injection trick concatenates the original word embedding X and the hidden representation H^l, and the dilated convolutional layer upsamples the concatenation to get the new feature U^{l+1} at the (l + 1)-th layer, denoted as

U^{l+1} = Conv1D([X, H^l]; f).

Gated CNNs can be stacked into a deep architecture, as shown in the general framework of Fig. 1. As a result, our model can represent a large-sized context and extract hierarchical features at each layer. Moreover, the gating mechanism can extract important features to remember and focus on, while less critical features are forgotten and ignored at each layer.
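At a single position, an LSTM-style gate of the kind described above reduces to a few scalar operations; a minimal sketch under that assumption (the function names are ours, and inputs are the four gate pre-activations split from U^l):

```python
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def gated_update(i, o, g, f, c_prev):
    """LSTM-style gate applied element-wise at one position:
    c = sigmoid(f) * c_prev + sigmoid(i) * tanh(g)   # forget + input
    h = sigmoid(o) * tanh(c)                         # output gate
    Returns (h, c): the hidden state and the new cell state."""
    c = sigmoid(f) * c_prev + sigmoid(i) * math.tanh(g)
    h = sigmoid(o) * math.tanh(c)
    return h, c
```

With zero pre-activations every gate is half-open, so half of the previous cell state survives to the next layer, which is the "forgetting" behavior the text describes.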

Note-Code Interaction as Message Passing
To capture the explicit note-code interaction (NCI) between medical codes and textual mentions, we build a complete bipartite graph G = {U, V, E}, where U = {w_i}_{i=1}^n and V = {c_j}_{j=1}^m represent the words and the ICD codes, respectively, and E is the fully connected edge set. For simplicity, we omit the superscript of the last convolutional features U^{l+1} extracted by the stacked gated CNNs and use the textual features U as the node features of the vertex set U in the note-code bipartite graph. We incorporate the WHO's ICD code descriptions to represent the medical knowledge about ICD codes. For example, ICD code 240 in Fig. 1 is about simple and unspecified goiter. Instead of merely using the ICD code index to represent the prediction target, we include the code description, which contains rich domain knowledge. Word embeddings of each description are averaged to obtain the code vectors V ∈ R^{m×d_v}, where m is the number of codes and d_v is the embedding dimension. We take the code vectors as the node features of the vertex set V.
Our novel formulation as a bipartite graph preserves the source-target matching between textual features and ICD code vectors. We utilize the graph message passing mechanism (Gilmer et al., 2017; Wu et al., 2020) to infer fine-grained clues about the dependencies between textual features and code semantics. The composition function NCI : R^{n×d_u} × R^{m×d_v} → R^m is denoted as

NCI(U, V) = f_θ(g_ξ(U, V)),

where g_ξ with parameter ξ is a neural message function and f_θ with parameter θ is an output function. It takes the textual features of all tokens in a note and the code vectors as inputs and produces an interaction score between the note and each code. To improve computational efficiency, we take the dot product as the message function g_ξ. The explicit interaction score between token w_i and code c_j is calculated as

I_{j,i} = V_{j,:} U_{i,:}^T,

where U_{i,:} is the row vector of textual features representing the i-th word and V_{j,:} is the row vector of the ICD code matrix representing the j-th code in the ICD code set. We set d_u = d_v and obtain the interaction matrix I ∈ R^{m×n} with the dot product. We then use a fully connected network f_θ to calculate the scores of the note-code interactions as output. Similar to the matrix factorization formulation of language models (Yang et al., 2017; Li et al., 2020), this dot-product interaction between notes and codes approximates the point-wise mutual information of note-code co-occurrence.
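The dot-product message function amounts to a code-by-token score matrix; a minimal pure-Python sketch (the learned output network f_θ is omitted, and the function name is ours):

```python
def nci_interaction(U, V):
    """Interaction matrix I in R^{m x n} over the complete bipartite
    graph: I[j][i] = <V[j], U[i]>, the dot product between the j-th
    code vector and the i-th token's textual features (d_u = d_v)."""
    return [[sum(v_k * u_k for v_k, u_k in zip(v_row, u_row))
             for u_row in U]
            for v_row in V]
```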

Parameter-efficient Weight Sharing
The embedding injection and convolutional feature concatenation make the hidden features high-dimensional. Moreover, as a result of stacking deep layers, the overall model can become cumbersome. Thus, we utilize a weight sharing mechanism (Bai et al., 2019) to decrease the number of parameters. Specifically, we share the weights of the gated CNN layers across time steps and across depth through the neural layers. This mechanism has two benefits. First, it decreases the number of trainable parameters because weights across the network are tied. Second, it provides a form of regularization that stabilizes the training process.
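Weight sharing across depth simply reuses one set of layer parameters at every stacked layer; a schematic sketch (the layer function here is an arbitrary stand-in, not the actual gated CNN):

```python
def run_shared_stack(x, layer_fn, shared_params, n_layers):
    """Apply the same parameterized layer repeatedly: every depth
    reuses `shared_params`, so the parameter count is independent of
    the number of stacked layers."""
    for _ in range(n_layers):
        x = layer_fn(x, shared_params)
    return x
```

Because `shared_params` is the only trainable state, stacking more layers deepens the computation without growing the model.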

Objective and Training
We formulate ICD code assignment as a multi-label, multi-class classification problem. We adopt the binary cross-entropy loss, denoted as

L = −Σ_{i=1}^{m} [y_i log(ŷ_i) + (1 − y_i) log(1 − ŷ_i)],

where y_i ∈ {0, 1} is the ground-truth label, ŷ_i is the sigmoid score of the prediction, and m is the number of ICD codes. We use the Adam optimizer (Kingma and Ba, 2014) to train the model with backpropagation.
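The loss can be sketched directly (an illustrative implementation, not the training code; a small epsilon is our addition to guard the logarithm):

```python
import math

def binary_cross_entropy(y_true, y_hat, eps=1e-12):
    """Multi-label BCE summed over the m ICD codes:
    L = -sum_i [ y_i*log(yhat_i) + (1 - y_i)*log(1 - yhat_i) ]"""
    return -sum(y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps)
                for y, p in zip(y_true, y_hat))
```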

Experiments
In the experimental analysis on real-world datasets, we compare our proposed model with several recent strong baselines. Our code is publicly available.

Datasets
This paper focuses on textual discharge summaries from a hospital stay. Specifically, we use raw notes, ICD diagnoses, and procedures of patients from two public clinical datasets, i.e., MIMIC-II and MIMIC-III, in our experiments. Discharge summaries, labeled with a set of ICD-9 diagnosis and procedure codes, include descriptions of procedures performed by the physician, diagnosis notes, the patient's medical history, and discharge instructions.

MIMIC-II. The first dataset of clinical notes is from the Multiparameter Intelligent Monitoring in Intensive Care II (MIMIC-II) database (Saeed et al., 2011). We follow the standard train-test split of Perotte et al. (2014), where 90% and 10% of the 22,815 non-empty discharge summaries are used for training and testing, respectively.

MIMIC-III. The second dataset is from the updated Medical Information Mart for Intensive Care III (MIMIC-III) repository (Johnson et al., 2016), containing patients admitted to the Intensive Care Unit (ICU) at a US medical center between 2001 and 2012. We use the "noteevents" table in the latest version 1.4, with 58,576 hospital admissions. Free-text discharge summaries in the MIMIC-III database are extracted to form the clinical text dataset. The experimental evaluation considers two settings. The first uses the full set of ICD codes. Following Shi et al. (2017) and Mullenbach et al. (2018), an additional experiment is conducted on the subset of MIMIC-III with the top 50 most frequent labels. This MIMIC-III top-50 subset has a train/dev/test split of 8,066, 1,573, and 1,729 samples.

Settings
Preprocessing We preprocess the textual documents following the procedures developed by Mullenbach et al. (2018) and Li and Yu (2020). The NLTK package is utilized for tokenization, and all tokens are converted to lowercase. All words appearing in fewer than three training documents are replaced with "unk". We truncate all documents at a length of 2,500 tokens. The word embeddings are initialized with vectors pre-trained on all discharge notes with the continuous bag-of-words (CBOW) method of word2vec (Mikolov et al., 2013).
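The preprocessing steps can be sketched as follows (a simplified stand-in: a letters-only regex tokenizer instead of NLTK, with the thresholds from the text; the function name is hypothetical):

```python
import re
from collections import Counter

def preprocess(train_docs, min_doc_freq=3, max_len=2500):
    """Lowercase and tokenize, replace words occurring in fewer than
    `min_doc_freq` training documents with "unk", then truncate each
    document to `max_len` tokens."""
    tokenized = [re.findall(r"[a-z]+", doc.lower()) for doc in train_docs]
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))  # count each word once per document
    return [[t if doc_freq[t] >= min_doc_freq else "unk" for t in tokens][:max_len]
            for tokens in tokenized]
```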
Hyper-parameters Some standard settings follow prior work. For example, the word embedding dimension is 100, and the dropout rate is 0.2. The Adam optimizer (Kingma and Ba, 2014) is used to optimize the model parameters. For the remaining hyper-parameters, random search is utilized to find the optimal settings. The search ranges or choices of specific hyper-parameters are listed in Table 1. The search interval for the learning rate is [1e−6, 1e−2]. Besides, we optimize the kernel size, the number of levels of residual connections, and the hidden representation dimension.
Evaluation Metrics We use the area under the receiver operating characteristic curve (AUC-ROC), F1-score, and precision at k (P@k) for evaluation. We set k = 5 for the MIMIC-III subset with the top-50 frequent codes and k = 8 for the full sets of MIMIC-II and MIMIC-III. In the multi-label classification setting, we use two averaging strategies, i.e., micro and macro. The macro scores are obtained by averaging the respective label-wise scores across all labels. Micro scores give more weight to frequent labels by considering all labels jointly. We run the experiments five times and report the mean ± standard deviation.
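The two averaging strategies can be sketched for the F1-score (an illustrative implementation on binary label matrices; libraries such as scikit-learn provide equivalent metrics):

```python
def micro_macro_f1(y_true, y_pred):
    """Micro- and macro-averaged F1 for multi-label predictions.
    y_true, y_pred: lists of samples, each a 0/1 vector over labels."""
    m = len(y_true[0])
    tp, fp, fn = [0] * m, [0] * m, [0] * m
    for yt, yp in zip(y_true, y_pred):
        for j in range(m):
            tp[j] += yt[j] * yp[j]
            fp[j] += (1 - yt[j]) * yp[j]
            fn[j] += yt[j] * (1 - yp[j])

    def f1(t, p, n):
        return 2 * t / (2 * t + p + n) if (2 * t + p + n) else 0.0

    macro = sum(f1(tp[j], fp[j], fn[j]) for j in range(m)) / m  # average label-wise F1
    micro = f1(sum(tp), sum(fp), sum(fn))  # pool counts over all labels jointly
    return micro, macro
```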

Baselines
We consider the following baseline models. MultiResCNN (Li and Yu, 2020) and HyperCore (Cao et al., 2020) are two recent strong models with state-of-the-art performance. Bi-GRU (Mullenbach et al., 2018) uses a simplified gated recurrent unit with bi-direction, where the last hidden representations are used for classification. C-MemNN (Prakash et al., 2017) introduces an iterative condensation of memory representations and utilizes Wikipedia as an external knowledge source to enhance memory networks by preserving the hierarchical structure in the memory. AttentiveLSTM (Shi et al., 2017) encodes clinical descriptions and ICD long titles jointly with character- and word-level LSTM networks and uses an attention mechanism to match important diagnosis snippets. CAML (Mullenbach et al., 2018) integrates CNNs and a label-wise attention mechanism to learn rich representations; it has a variant called DR-CAML that uses ICD code descriptions to regularize the loss function. LEAM (Wang et al., 2018) encodes two channels of inputs and leverages the compatibility between word and label embeddings to calculate attention scores. MultiResCNN (Li and Yu, 2020) combines residual learning (He et al., 2016) and multi-channel concatenation with different convolutional filters, achieving good performance in most settings. HyperCore (Cao et al., 2020) utilizes hyperbolic embedding and co-graph representation with code hierarchy, gaining slightly better performance than MultiResCNN.

Results
Our model performs consistently the best for frequent labels. First, it beats all models on the MIMIC-III subset with top-50 codes (columns 2-6 in Table 2). For the micro scores, which give more weight to frequent labels, our model also has the best predictive metrics (columns 8 & 10 in Table 2 and columns 3 & 5 in Table 3). Moreover, our model is also competitive on the remaining metrics: it consistently has the best P@k scores and, at worst, the second-best macro scores on all datasets.

MIMIC-III (Top-50 Codes)
The first experiment uses the MIMIC-III subset with top-50 codes, showing the models' performance in predicting frequent diagnoses. The results in Table 2 show that our model outperforms all the baselines on all the evaluation metrics. Notably, our model gains a macro F1-score higher by 2% and a micro F1-score higher by 1.6% than the state of the art.

MIMIC-III (Full Codes)
We then run our model on the MIMIC-III dataset with full codes. Our model outperforms most baselines, gaining the best scores in micro AUC-ROC, macro F1, micro F1, and precision@8. In macro AUC-ROC, our model ranks second.

MIMIC-II (Full Codes)
On the third dataset, MIMIC-II, we also predict the full set of ICD-9 codes. Our model achieves predictive performance on par with two recent strong baselines, MultiResCNN and HyperCore. We gain the best scores in micro AUC-ROC, micro F1-score, and P@8. The macro AUC-ROC and macro F1 scores of our model are the second best among the compared models.

Comparison with BERT
Bidirectional Encoder Representations from Transformers (BERT) (Devlin et al., 2019) has recently revolutionized the NLP community, and the pre-trained language model has been applied to many downstream NLP tasks. We compare our model's performance with the BERT model and a domain-specific variant, ClinicalBERT (Alsentzer et al., 2019), pre-trained on the clinical text of MIMIC-III. For the BERT model, we use the uncased BERT-base with a hidden dimension of 768. Because these two BERT models require a maximum sequence length of 512, we truncate the text sequence for our model as well to ensure a fair comparison. BERT models have two special tokens, i.e., [CLS] and [SEP]; thus, we truncate clinical notes to a length of 510. We use Huggingface's Transformers framework when implementing these two models. The results in Table 4 show that pre-training the language model with domain data improves performance, and our model performs better on most evaluation metrics.

Model Size
We compare the number of trainable parameters (Table 5) of our model with two models with qualified performance, i.e., CAML (Mullenbach et al., 2018) and MultiResCNN (Li and Yu, 2020). HyperCore (Cao et al., 2020) did not publish its code or provide the values of all hyper-parameters; thus, we omit it from this comparison. Our proposed model is more efficient than MultiResCNN in terms of the number of trainable parameters. The CAML model has the fewest parameters but performs poorly in prediction. Our model has much better predictive performance than the CAML model, with only a slight increase in model size.

CAML (Mullenbach et al., 2018): 6.2M
MultiResCNN (Li and Yu, 2020): 11.9M
ClinicalBERT (Alsentzer et al., 2019): 113.8M
GatedCNN-NCI (Ours): 7.6M

Ablation Study
We further conduct an ablation study to investigate the effectiveness of the different components of our proposed model. We evaluate two variants, each removing one critical component. The first variant, without NCI, replaces the note-code interaction with max-pooling and a linear projection. The second variant removes the gating mechanism that controls the information propagation over the CNN layers. Table 6 compares the experimental results on the MIMIC-III subset with top-50 codes. Performance drops to some extent after removing either of these two modules, which shows the effectiveness of the proposed architecture. Moreover, the note-code interaction module contributes slightly more than the gating mechanism. A possible explanation is that the explicit interaction preserves the semantics of the medical codes well and captures the relation between codes and notes in the embedding space.

Case Study
We conduct a case study to interpret an example prediction. Table 7 shows the predictions for a clinical note of a patient with cardiovascular diseases and diabetes. The patient also had 'dyspnea on exertion' as a symptom, which can be caused by either pneumonia or cardiac diseases. Our model and MultiResCNN predict the correct diagnosis codes: coronary atherosclerosis (ICD code 414.01), hypertension (401.9), and diabetes (250.00). When predicting procedure codes, MultiResCNN is confused by the dyspnea on exertion and incorrectly predicts pneumonia-related treatments: endotracheal intubation (96.04) and invasive mechanical ventilation (96.71). Our model correctly predicts the cardiac catheterization procedure and the diagnostic interventions of heart surgery (39.61) and coronary artery bypass (36.15). Hence, our model is not misled by the ambiguous interpretation of dyspnea on exertion but learns the correct cardiac-related context, consistent with the rest of the note.

Conclusion
Medical code assignment from clinical notes is a fundamental task for healthcare information systems and diagnosis decision support. This paper proposes a novel framework with gated convolutional neural networks and a note-code message passing mechanism for automated medical code assignment. Our solution can learn meaningful features from lengthy clinical documents and effectively control the deep propagation of the information flow. Moreover, the message passing mechanism enhances the semantics of the ICD code space and models the note-code interaction to improve medical code prediction. Experiments show the effectiveness of our proposed method.

Table 1 :
Range and choices of hyper-parameter search

Table 2 :
Results on MIMIC-III with top-50 and full codes. "-" indicates no results reported in the original paper. Bold text denotes the best and italic text the second best.

Table 3 :
Results on MIMIC-II full codes. Bold text denotes the best and italic text the second best.

Table 4 :
Comparison with BERT and ClinicalBERT using the MIMIC-III top-50 code dataset with sequence length truncated at 510.

Table 5 :
Number of trainable parameters

Table 7 :
Case study on a clinical note with cardiac-related diseases (bold, in green). Dyspnea on exertion (italic, in red) can be caused by cardiac- or pneumonia-related diseases. Note: "old male with multiple cardiac risk factors and dyspnea on exertion . . ., he then underwent further workup which included a cardiac catheterization that revealed significant coronary artery disease. he was then transferred for surgical evaluation".