A Two-Stage Decoder for Efficient ICD Coding

Clinical notes in healthcare facilities are tagged with codes from the International Classification of Diseases (ICD), a standardized list of classification codes for medical diagnoses and procedures. ICD coding is a challenging multilabel text classification problem due to noisy clinical document inputs and a long-tailed label distribution. Recent automated ICD coding efforts improve performance by encoding medical notes and codes with additional data and knowledge bases. However, most of them do not reflect how human coders generate the codes: first, the coders select general code categories and then look for specific subcategories that are relevant to a patient's condition. Inspired by this, we propose a two-stage decoding mechanism to predict ICD codes. Our model uses the hierarchical properties of the codes to split the prediction into two steps: first, we predict the parent code, and then we predict the child code based on the previous prediction. Experiments on the public MIMIC-III data set show that our model performs well in single-model settings without external data or knowledge.


Introduction
Medical records and clinical documentation contain critical information about patient care, disease progression, and medical operations. After a patient's visit, medical coders process them and extract key diagnoses and procedures according to the International Classification of Diseases (ICD) system (WHO, 1948). Such codes are used for predictive modeling of patient care and health status, for insurance claims, billing mechanisms, and other hospital operations (Tsui et al., 2002).
Although the healthcare industry has seen many innovations, many challenges related to manual operations still remain. One of these challenges is manual ICD coding, which requires understanding long and complex medical records with a vast vocabulary and sparse content. Coders must select a small subset from a continuously expanding set of ICD codes (from around 15,000 codes in ICD-9 to around 140,000 codes in ICD-10 (WHO, 2016)). Therefore, manual ICD coding may result in errors and cause revenue loss or improper allocation of care-related resources. Thus, automated ICD coding has received attention not only from the industry but also from the academic community.
Before the rise of deep learning methods, automated ICD coding applied rules or decision tree-based methods (Farkas and Szarvas, 2008; Scheurwegs et al., 2017). The focus has now shifted to neural networks, following two strands of approaches. The first encodes medical documents using pretrained language models (Li and Yu, 2020; Liu et al., 2021), adapts pretrained language models to make them suitable for the clinical domain (Lewis et al., 2020), or injects language models with medical knowledge such as taxonomy, synonyms, and abbreviations of medical diseases (Yang et al., 2022; Yuan et al., 2022). The second improves the representations of pretrained language models by capturing the relevance between the document and label metadata such as label descriptions (Mullenbach et al., 2018; Vu et al., 2020; Kim and Ganapathi, 2021; Zhou et al., 2021), co-occurrences (Cao et al., 2020), hierarchy (Falis et al., 2019; Vu et al., 2020; Liu et al., 2022), or thesaurus knowledge such as synonyms (Yuan et al., 2022). Although these approaches are supposed to alleviate problems specific to medical coding, such as the special vocabulary and the large label set, they fall short.
Intuitively, human coders generate the codes in two stages: first, the coders select the general codes and then look for specific subcategories that are relevant to a patient's condition. The advantage of adapting this approach to neural networks is that at each stage of the prediction, we deal with a smaller output space and we can have more confidence when predicting the next stage. Therefore, in this paper, we introduce a simple two-stage decoding framework for ICD coding to mimic the process of human coders. Our approach leverages the hierarchical structure of the ICD codes, i.e., their parent-child relationships, to decode. The first stage predicts the parent codes; the second stage uses the document representation and the predicted parent codes to predict the child codes. Experiments on the MIMIC-III data sets demonstrate the effectiveness of our proposed method. In particular, our simple method outperforms models that use external knowledge and data. Since our models (Figure 1) are based on LSTMs, they require less computing power and can be trained faster than other larger models.
Two-stage Decoding Framework

ICD codes follow a hierarchical structure. In this work, we consider the characters before the dot (.) in an ICD code as the parent label and the code that has to be predicted as the child label. For example, for the child label 39.10 (Actinomycosis of lung), the parent code is 39 (Actinomycotic infections). Let P and L represent the sets of parent codes and child codes for a medical note x, respectively. It is worth noting that if we know the child codes, we can use the above definition to find the corresponding parent codes; this means that knowing L is equivalent to knowing both L and P. Then the probability of the child labels is:

P(L | x) = P(L, P | x) = P(P | x) P(L | P, x)

This factorization allows us to compute the prediction scores of the parent codes first, and then, conditioned on them and the document, we can obtain the prediction scores of the child codes. Therefore, we can model the ICD coding task using a decoder framework where we generate parent labels before predicting child labels. In this case, we adapt the decoder framework to the multilabel setting, where at each decoding stage we predict multiple labels at once, instead of one label at a time as in a standard decoder.
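Since each child code determines its parent deterministically, the parent label set can be derived from the child label set with a simple string operation. A minimal sketch (the helper names are ours, not from the paper):

```python
def parent_code(child_code: str) -> str:
    """Parent label = characters before the dot; codes without a dot are their own parent."""
    return child_code.split(".")[0]

def parent_set(child_codes):
    """Derive the parent label set P from the child label set L."""
    return {parent_code(c) for c in child_codes}

# Example from the paper: child code 39.10 (Actinomycosis of lung)
# maps to parent code 39 (Actinomycotic infections).
```

This is also how the parent-level supervision signal can be obtained for free from the existing child-level annotations, without any extra labeling.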

Model Architecture
We now describe the components of our model: the document encoder, the first decoding stage for the parent codes, and the second decoding stage for the child codes.

Document Encoder Given a medical note of n tokens x = (x_1, ..., x_n), we embed each token in the document into a dense vector representation. Subsequently, the token representations are passed to a single-layer BiLSTM encoder to obtain the contextual representations [h_1, h_2, ..., h_n]. Finally, we obtain the encoding matrix H ∈ R^{n×d}.
First Decoding Stage At this stage, similar to Vu et al. (2020), we take the embeddings of all parent labels P ∈ R^{|L_P|×d_e} to compute the attention scores and obtain the label-specific representations as:

s(P, H) = P tanh(W H^T)
att(P, H) = S(s(P, H)) H
P(P | x) = σ(rds(V ⊙ att(P, H)))

where S(·), σ(·), and rds(·) denote the row-wise softmax, sigmoid, and reduce-sum (over the last dimension) operations; W ∈ R^{d_e×d} are the weight parameters of a linear transformation; V ∈ R^{|L_P|×d} is the weight matrix of a label-wise fully connected layer that yields the parent label logits; and ⊙ is the element-wise product.
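The first decoding stage can be sketched in a few lines of numpy. This is a shape-level illustration with toy sizes and random weights, not the trained model; the variable names mirror the symbols above:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, de, num_parents = 6, 8, 5, 4       # toy sizes: tokens, hidden dim, label-emb dim, |L_P|

H = rng.normal(size=(n, d))              # contextual token representations from the encoder
P = rng.normal(size=(num_parents, de))   # parent label embeddings
W = rng.normal(size=(de, d))             # linear transformation
V = rng.normal(size=(num_parents, d))    # label-wise fully connected layer weights

def row_softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

s = P @ np.tanh(W @ H.T)                 # attention scores, shape (|L_P|, n)
att = row_softmax(s) @ H                 # label-specific representations, shape (|L_P|, d)
logits = (V * att).sum(axis=-1)          # rds(V ⊙ att): one logit per parent label
p_parent = 1.0 / (1.0 + np.exp(-logits)) # sigmoid -> P(P | x)
```

Each parent label attends to the note with its own attention distribution, so each gets its own d-dimensional summary of the document before scoring.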
Second Decoding Stage At this stage, we take the label embeddings of all child labels L ∈ R^{|L|×d_e} and the probabilities of the parent labels predicted in the previous stage as input, and obtain the label-specific representations as:

s(L, P) = L tanh(W_P ⊙ P(P | x)^T)
att(L, P) = S(s(L, P)) P
att(L, H) = S(L tanh(W H^T)) H
P(L | P, x) = σ(rds(V_LH ⊙ att(L, H)) + rds(V_LP ⊙ att(L, P)))

where we perform a 'soft' embedding of the parent labels by taking the element-wise product between the matrix W_P ∈ R^{d_e×|L_P|} and the sigmoid probabilities of the parent labels. V_LH ∈ R^{|L|×d} and V_LP ∈ R^{|L|×d_e} are the weight matrices of two label-wise fully connected layers that compute the child label logits.
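The second stage can be sketched the same way. Again this is a toy-sized, randomly initialized illustration; the exact dimensions and placement of the nonlinearity follow our reading of the equations above:

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, de = 6, 8, 5
num_parents, num_children = 4, 10

H = rng.normal(size=(n, d))                   # token representations from the encoder
P = rng.normal(size=(num_parents, de))        # parent label embeddings
L = rng.normal(size=(num_children, de))       # child label embeddings
p_parent = rng.uniform(size=(num_parents,))   # sigmoid probabilities from stage one

W = rng.normal(size=(de, d))                  # document-attention transform
W_P = rng.normal(size=(de, num_parents))      # 'soft' parent-embedding weights
V_LH = rng.normal(size=(num_children, d))     # label-wise weights, document branch
V_LP = rng.normal(size=(num_children, de))    # label-wise weights, parent branch

def row_softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# Attention over the document, as in stage one but with child label embeddings.
att_LH = row_softmax(L @ np.tanh(W @ H.T)) @ H           # (|L|, d)

# 'Soft' parent embedding: scale W_P columns by the parent probabilities, then attend.
s_LP = L @ np.tanh(W_P * p_parent[np.newaxis, :])        # (|L|, |L_P|)
att_LP = row_softmax(s_LP) @ P                           # (|L|, de)

logits = (V_LH * att_LH).sum(-1) + (V_LP * att_LP).sum(-1)
p_child = 1.0 / (1.0 + np.exp(-logits))                  # P(L | P, x)
```

Because the parent probabilities rescale the parent-attention weights, a confidently predicted parent contributes more strongly to its children's representations.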
Training Objective & Inference The total training loss is the sum of the binary cross-entropy losses for predicting the parent and child labels:

Loss = BCE(P(P | x), y_P) + BCE(P(L | P, x), y_L)

where y_P and y_L are the gold parent and child label vectors. For inference, we assign a child label to a document if both the corresponding parent label score and the child label score are greater than predefined thresholds.
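The inference rule can be written as a short filter. The helper name and the threshold values below are illustrative, not from the paper:

```python
def predict_codes(parent_probs, child_probs, parent_of, t_parent=0.5, t_child=0.5):
    """Assign a child label when both its own score and its parent's score
    clear their thresholds (0.5 here is an illustrative default)."""
    return [c for c, pc in child_probs.items()
            if pc > t_child and parent_probs.get(parent_of(c), 0.0) > t_parent]

parents = {"39.10": "39", "401.9": "401"}
preds = predict_codes(
    {"39": 0.9, "401": 0.2},
    {"39.10": 0.8, "401.9": 0.95},
    parent_of=parents.get,
)
# "401.9" is suppressed because its parent's score (0.2) is below threshold,
# even though its own score (0.95) is high.
```

This is where the two-stage design pays off at test time: a confident child score cannot override an implausible parent category.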

Experiment Settings
Setup We conduct experiments on the MIMIC-III data set (Johnson et al., 2016). Following the previous work Joint LAAT (Vu et al., 2020), we consider two versions of the MIMIC-III data set: MIMIC-III-Full, consisting of the complete set of 8,929 codes, and MIMIC-III-50, consisting of the 50 most frequent codes. Similarly to Yang et al. (2022), we use macro and micro AUC and F1, as well as precision@k (k = 8 for MIMIC-III-Full and k = 5 for MIMIC-III-50). For both data sets, we train on a single 16GB Tesla P100 GPU. We detail the relevant training hyperparameters and the statistics of the data sets in the Appendix.
We compare our models with recent state-of-the-art work using the results from Yang et al. (2022). Among them, Joint LAAT is the most similar to our work because it uses a similar attention mechanism and considers both parent and child labels; therefore, we use it as the comparison in our ablation studies. We run our models five times with the same hyperparameters using different random seeds and report the average scores.

MIMIC-III-Full
From the results shown in Table 1, we see that our model achieves a micro F1 of 58.4%, the highest among "single" models that do not rely on external data/knowledge. Specifically, our model outperforms Joint LAAT by about 2.5%, 0.2%, 0.9%, 0.3%, and 0.9% in macro AUC, micro AUC, micro F1, macro F1, and precision@8, respectively. Moreover, our model is on par with MSMN (Yuan et al., 2022), which uses code synonyms collected from the UMLS (Bodenreider, 2004). The improvements of other models (Huang et al., 2022; Yang et al., 2022) most likely stem from the use of external information in the form of knowledge injected into pre-trained language models. We leave the integration of such information into our proposed model architecture for future work.

MIMIC-III-50
From the results on the right-hand side of Table 1, our model produces a micro F1 of 71.83%, the highest among single models. Specifically, our model surpasses Joint LAAT (Vu et al., 2020) by nearly 1.0%, 2.0%, and 0.4% absolute improvement in micro F1, macro F1, and precision@5, respectively. Moreover, the macro F1 of our model is on par with that of the much more complex state-of-the-art method KEPTLongformer (Yang et al., 2022). This demonstrates that our model can be adapted to classification problems with either a large or a small number of labels while achieving competitive results in both cases.

Ablation Study
To evaluate the effectiveness of our model, we conduct an ablation study on MIMIC-III-Full, comparing it with Joint LAAT. Rather than integrating parent label prediction scores as supplementary features in the child label representation, as done in Joint LAAT, we let child label representations attend to both the parent label and document representations. We show that this approach drives performance improvements in two respects: parent label prediction and performance on labels grouped by frequency of appearance. Table 2 compares the results of parent label prediction: our model outperforms Joint LAAT by 0.9%, 0.9%, and 0.7% absolute in macro F1, micro F1, and Precision@8, which naturally yields the better child label prediction performance reported in the previous sections. Even when considering only the cases where both models predict the parent labels correctly, our approach still achieves a micro F1 score of 65.5%, outperforming Joint LAAT's 65.0%. This demonstrates that both parent code and child code prediction benefit from our approach.

Performance in Label Frequency Groups
To better understand our model's predictions, we divide the medical codes into five groups based on their frequencies in MIMIC-III-Full: 1−10, 11−50, 51−100, 101−500, and >500, following Wang et al. (2022). We list the statistics of all groups in the Appendix and compare the micro F1 of the different groups in Figure 2. Overall, we outperform Joint LAAT in all groups. The relative improvements are most noticeable in the rare-frequency group (25% relative improvement in the 1−10 group, vs. 2% or less in the other groups). A possible explanation is that the parent label space is smaller than the full label space, which results in more training samples per parent label and thus allows the model to learn better representations. As the parent label representations are used to compute the child label representations, low-frequency child labels can benefit from representations learned from their high-frequency siblings.
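The frequency grouping used in this analysis can be reproduced with a small helper. The boundaries follow the groups named above; the function names are ours:

```python
from collections import Counter

# Frequency group boundaries from the analysis (bounds inclusive).
GROUPS = [(1, 10), (11, 50), (51, 100), (101, 500), (501, float("inf"))]

def group_of(freq):
    """Return the (lo, hi) group a training-set frequency falls into."""
    for lo, hi in GROUPS:
        if lo <= freq <= hi:
            return (lo, hi)
    return None  # frequency 0: code never appears in the training data

def bucket_codes(train_codes):
    """Map each code to its frequency group, given the list of code occurrences
    in the training set."""
    counts = Counter(train_codes)
    return {code: group_of(f) for code, f in counts.items()}
```

Per-group micro F1 is then computed by restricting the label set to each bucket before scoring.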

Conclusion
In this paper, we have presented a novel, simple but effective two-stage decoding model that leverages the hierarchical structure of the ICD codes to decode from parent-level codes to child-level codes.
Experiments on the MIMIC-III data set show that our model outperforms other single-model work and achieves on-par results with models using external data/knowledge. Our ablation studies validate the effectiveness of our model in predicting the code hierarchy and codes in different frequency groups. In future work, we intend to integrate our decoder with better document or label representations to further improve performance.

Limitations
As established, medical coding is an important task for the healthcare industry. Efforts toward its successful automation have wide-ranging implications, from increasing the speed and efficiency of clinical coders and reducing their errors, to saving expenses for hospitals and ultimately improving the quality of care for patients. With this goal in mind, however, the study presented here should be contextualized by the two main limitations that we identify and outline below.
As with other data-driven approaches, the evaluation performance discussed in our paper is limited by the choice of the (static) MIMIC-III data set. This data set could be seen as lacking diversity, as it features only a fraction of all possible ICD-9 codes and contains medical notes collected in English from a single department, for patients belonging to a specific demographic. While our approach makes no explicit assumptions about the nature of the data other than the hierarchy of labels, in the absence of formal guarantees we cannot make rigorous statements about the efficacy of our (or indeed any related) approach on clinical data gathered in different settings, such as other languages, countries, or departments.
The second limitation is of a more practical nature: since 2015, the ICD-9 coding system has been phased out in favor of the more expressive ICD-10 system, so ICD-9 coding has limited application in practice. However, as with its predecessor, the ICD-10 codes are organized in an even richer hierarchy, which should enable the successful application of our proposed approach.

Appendix

Figure 1 :
Figure 1: The architecture of our proposed two-stage decoding model. Encoder (left side): the medical note is embedded and passed through a bidirectional LSTM. Decoder (right side): the first-level parent label decoding uses parent label embeddings and attention over the note; the second-level child label predictions are made by attending to the parent labels and the note.

Table 1 :
Results on the MIMIC-III-Full and MIMIC-III-50 test sets. All our experiments are run with five different random seeds and we report the mean results. The results of other models, except PLM-ICD, are collected from Yang et al. (2022). For PLM-ICD, we follow the authors' instructions to reproduce the results.

Table 2 :
Parent label prediction results on MIMIC-III-Full.
Table 3 depicts the dataset statistics of the MIMIC-III-Full and MIMIC-III-50 datasets, Table 4 details the choice of hyperparameters of our best-performing model, and Table 5 outlines the characteristics of the label frequency groupings used in the ablation study in Section 3.3.

Table 3 :
Statistics of MIMIC-III dataset under full codes and top-50 codes settings.

Table 4 :
Hyper-parameters used for training on MIMIC-III-Full and MIMIC-III-50.

Table 5 :
Label frequency distribution.