Clinical Outcome Prediction from Admission Notes using Self-Supervised Knowledge Integration

Outcome prediction from clinical text can prevent doctors from overlooking possible risks and help hospitals to plan capacities. We simulate patients at admission time, when decision support can be especially valuable, and contribute a novel *admission to discharge* task with four common outcome prediction targets: Diagnoses at discharge, procedures performed, in-hospital mortality and length-of-stay prediction. The ideal system should infer outcomes based on symptoms, pre-conditions and risk factors of a patient. We evaluate the effectiveness of language models to handle this scenario and propose *clinical outcome pre-training* to integrate knowledge about patient outcomes from multiple public sources. We further present a simple method to incorporate ICD code hierarchy into the models. We show that our approach improves performance on the outcome tasks against several baselines. A detailed analysis reveals further strengths of the model, including transferability, but also weaknesses such as handling of vital values and inconsistencies in the underlying data.


Introduction
Clinical professionals make decisions about patients under strong time constraints. The patient information at hand is often unstructured, e.g. in the form of clinical notes written by other medical personnel in limited time. Clinical decision support (CDS) systems can help in these scenarios by pointing towards related cases or certain risks. Clinical outcome prediction is a fundamental task of CDS systems, in which the patient's development is predicted based on data from their Electronic Health Record (EHR). In this work we focus on textual EHR data available at admission time. Figure 1 shows a sample admission note with highlighted parts that, according to medical doctors, must be considered when evaluating a patient.
Encoding clinical notes with pre-trained language models. Neural models need to extract relevant facts from such notes and learn complex relations between them in order to associate certain clinical outcomes. Pre-trained language models such as BERT (Devlin et al., 2019) have been shown to both extract information from noisy text and capture task-specific relations in an end-to-end fashion (Tenney et al., 2019; van Aken et al., 2019). We thus base our work on these models and pose the following questions:
• Can pre-trained language models learn to predict patient outcomes from their admission information only?
• How can we integrate knowledge about outcomes that doctors gain from medical literature and previous patients?
• How well would these models work in clinical practice? Are they able to interpret common risk factors? Where are they failing?
Simulating patients at admission time. Existing work on text-based outcome prediction focuses on progress notes after a certain time of a patient's hospitalisation (Huang et al., 2019). This is mostly due to a lack of publicly available admission notes and poses some problems: 1) Doctors might miss specific outcome risks early in admission and 2) progress notes already contain information about clinical decisions made at admission time (Boag et al., 2018). We propose to simulate newly arrived patients by extracting admission notes from MIMIC III discharge summaries. We are thus able to give doctors hints towards possible outcomes from the very beginning of an admission and can potentially prevent early mistakes. We can also help hospitals in planning resources by indicating how long a patient might stay hospitalised.

Figure 1: Admission to discharge sample that demonstrates the outcome prediction task. The model has to extract patient variables and learn complex relations between them in order to predict the clinical outcome.
Integrating knowledge with specialised outcome pre-training. Gururangan et al. (2020) recently emphasized the importance of domain- and task-specific pre-training for deep neural models. Consequently we propose to enhance language models pre-trained on the medical domain with a task-specific clinical outcome pre-training. Besides processing clinical language with idiosyncratic and specialized terms, our models are thus able to learn about patient trajectories and symptom-disease associations in a self-supervised manner. We derive this knowledge from two main sources: 1) Previously admitted patients and their outcomes. This knowledge is usually stored by hospitals in unlabelled clinical notes. 2) Scientific case reports and knowledge bases that describe diseases, their presentations in patients and prognoses. We introduce a method for incorporating these sources by creating a suitable pre-training objective from publicly available data.
Contributions. We summarize the major contributions of this work as follows: 1) A novel task setup for clinical outcome prediction that simulates the patient's admission state and predicts the outcome of the current admission. 2) We introduce self-supervised clinical outcome pre-training, which integrates knowledge about patient outcomes into existing language models. 3) We further propose a simple method that injects hierarchical signals into ICD code prediction. 4) We compare our approaches against multiple baselines and show that they improve performance on four relevant outcome prediction tasks with up to 1,266 classes. We show that the models are transferable by applying them to a second public dataset without additional fine-tuning. 5) We present a detailed analysis of our model that includes a manual evaluation of samples conducted by medical professionals.

Related Work
Using clinical notes for outcome prediction. Boag et al. (2018) studied the predictive value of clinical notes with simple approaches such as bag-of-words. Recent work increasingly applies neural models to compensate for the noisy nature of the data and the complexity of patterns. Hashir and Sawhney (2020) and Suresh et al. (2018) further showed that outcome prediction benefits from a multitask setup. In contrast to earlier work we apply neural models to admission notes in an admission to discharge setup.
Pre-trained language models for the clinical domain. While pre-trained language models are successful in many areas of NLP, there has been little work on applying them to the clinical domain (Qiu et al., 2020). Alsentzer et al. (2019) and Huang et al. (2019) both pre-trained BERT-based models on clinical data. They evaluated their work on readmission prediction and other NLP tasks. We are the first to evaluate pre-trained language models on multiple clinical outcome tasks with large label sets. We further propose a novel pre-training objective specifically for the clinical domain.
Prediction of diagnoses and procedures. The majority of work on diagnosis and procedure prediction covers either single diagnoses (Liu et al., 2018;Choi et al., 2018) or coarse-grained groups (Peng et al., 2020;Sushil et al., 2018). We argue that models should predict diseases and procedures in a fine-grained manner to be beneficial for doctors. Thus we use all diagnosis and procedure codes from the data for our outcome prediction tasks.
ICD coding vs. outcome prediction. There is a variety of work in the related field of automated ICD coding (Xie et al., 2018; Falis et al., 2019). Zhang et al. (2020) recently presented a model able to identify up to 2,292 ICD codes from text. However, ICD coding differs from outcome prediction in that diseases are directly extracted from text rather than inferred from symptom descriptions and patient history. We further discuss this distinction in Section 6.

Clinical Admission to Discharge Task
Clinical outcome prediction can be defined in different ways. We approach the task from a doctor's perspective and predict the outcome of a current admission from the time of the patient's arrival to the hospital unit. We describe our setup as follows.

Clinical Notes from MIMIC III
As our primary data source, we use the freely available MIMIC III v1.4 database (Johnson et al., 2016). It contains de-identified EHR data including clinical notes in English from the Intensive Care Unit (ICU) of Beth Israel Deaconess Medical Center in Massachusetts between 2001 and 2012. We focus our work on discharge summaries in particular and the outcome information associated with an admission. Similar to previous work, we filter out notes about newborns and remove duplicates.

Creating Admission Notes from Discharge Summaries
The state of a patient is commonly summarized in an ongoing document, which finally concludes in a discharge summary. We extract the sections known at admission time from these summaries (see Appendix B.1). As shown in Table 1, the notes contain about 400 words on average. The selection of admission sections as well as the resulting structure of the notes were verified by medical doctors. This newly created admission dataset enables us to make predictions on the outcome of a current admission. At inference time, doctors can then use the model's predictions on textual data from newly arrived patients.

Outcome Prediction Tasks
We select four relevant tasks for outcome prediction in consultation with medical professionals. All tasks take admission notes as input.
Diagnosis prediction. A main goal of clinical outcome prediction is to support medical professionals in the process of differential diagnosis. We thus take all diagnoses associated with an admission into account and frame the task as an extreme multi-label classification. Diagnoses are encoded as ICD-9 codes in the MIMIC III database. Following Choi et al. (2017), we group ICD-9 diagnosis codes from the database from 4-digit into 3-digit codes to reduce complexity while still obtaining granular suggestions. This results in a total of 1,266 diagnosis codes, which are distributed over our dataset splits as shown in Table 2. The labels are power-law distributed with a long tail of very rare codes.
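The grouping into 3-digit categories can be sketched as follows. This is a minimal sketch, not the paper's exact implementation (which follows Choi et al., 2017); in particular, the special handling of V and E codes is our assumption about standard ICD-9 conventions:

```python
def to_3digit(icd9_code: str) -> str:
    """Collapse a full ICD-9 code into its 3-digit parent category.

    Numeric and V codes keep their first three characters; E codes
    (external causes) conventionally keep four.  Simplified sketch.
    """
    code = icd9_code.replace(".", "")
    if code.startswith("E"):
        return code[:4]
    return code[:3]
```

For example, both 403.0 (Malignant hypertensive renal disease) and 403.9 collapse into the same category 403, which reduces the label space while keeping suggestions clinically granular.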
Procedure prediction. Procedures are either diagnostics or treatments applied to a patient during a stay. Similarly to diagnosis prediction, this is an extreme multi-label task. We again group the ICD-9 codes from the MIMIC III database into 3-digit codes. In total there are 711 procedure codes labelled in the database, following a power-law distribution similar to the diagnosis codes.
In-hospital mortality prediction. Predicting a patient's mortality risk is a fundamental part of the triage process. In-hospital mortality in particular describes whether a patient died during the current admission and is a binary classification task. The percentage of deceased patients in the data is around 10% (see Table 3). As some notes contain direct indications of mortality such as patient deceased within the admission sections, we apply an additional filter for those terms.
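The filtering of notes that leak the mortality outcome can be sketched as a simple term matcher. The term list below is illustrative; the paper does not publish its exact filter terms:

```python
# Illustrative leak terms; the paper's exact filter list is an assumption.
LEAK_TERMS = ["patient deceased", "patient expired", "time of death"]

def has_mortality_leak(note: str) -> bool:
    """Return True if the admission note directly reveals the outcome."""
    text = note.lower()
    return any(term in text for term in LEAK_TERMS)

notes = [
    "Pt admitted with sepsis and hypotension.",
    "Family was notified, patient deceased at 0430.",
]
clean_notes = [n for n in notes if not has_mortality_leak(n)]
```

Removing such notes prevents the classifier from solving the task by keyword spotting instead of actual risk assessment.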
Length-of-stay prediction. The duration of an ICU stay is important information for hospitals in order to plan the allocation of resources. We group patients into four major categories regarding their length of stay: Under 3 days, 3 to 7 days, 1 week to 2 weeks, more than 2 weeks. These categories were recommended by medical doctors in order to make the results as useful as possible in clinical practice. Table 3 shows the samples per class.
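The bucketing of stays into the four classes above can be sketched as below; the exact boundary handling (inclusive vs. exclusive) is our assumption, since the paper only names the ranges:

```python
def los_bucket(days: float) -> int:
    """Map a length of stay in days to the four classes used in the paper:
    0: under 3 days, 1: 3 to 7 days, 2: 1 to 2 weeks, 3: more than 2 weeks.
    Boundary conventions are an assumption."""
    if days < 3:
        return 0
    if days <= 7:
        return 1
    if days <= 14:
        return 2
    return 3
```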

Integrating Clinical Knowledge Into Language Models
We propose clinical outcome pre-training, a way to integrate knowledge about clinical outcomes into pre-trained language models. We further introduce an additional step to incorporate ICD code hierarchy into our multi-label classification tasks.

Training objective. Our proposed training objective (Figure 2) is strongly related to the Next Sentence Prediction (NSP) task introduced by Devlin et al. (2019). In NSP the model gets two sentences as input and predicts whether the second follows the first sentence. This way models such as BERT learn relations between sentences. We convert this setting so that the model instead learns relations between admissions and outcomes. From common sections in patient notes, we create two categories: sections that are created at admission, A, and sections that are created after admission, e.g. at discharge time, D. Given a patient note N, we split it into sections A_N ∈ A and D_N ∈ D. We remove all other sections. We then sample token sequences from these sections to get t^A_{N,1..k} ∈ A_N and t^D_{N,1..k} ∈ D_N, where k is randomly set between 30 and 50 tokens. We then train the model to maximize P(Same Patient | X_NN) and P(Other Patient | X_NM), where X_NM = Enc(t^A_{N,1..k} ⊕ t^D_{M,1..k}), M is a randomly sampled document from the same batch and Enc refers to the BioBERT encoding. As in the original NSP setting, we apply negative sampling (X_NM) for 50% of the examples. We apply the same strategy on medical articles and case reports, so that A represents sections describing symptoms and risk factors, and D represents sections that describe outcomes of a disease or case.
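The pair construction for this NSP-style objective can be sketched as follows. Details such as the sampling of span offsets and the batching strategy are our assumptions; the sketch only shows the same-patient vs. other-patient pairing with 50% negative sampling:

```python
import random

def make_outcome_pairs(patients, k_min=30, k_max=50, seed=0):
    """Build (admission_span, outcome_span, label) pre-training pairs.

    `patients` is a list of (admission_tokens, outcome_tokens) tuples,
    one per note.  Half of the pairs keep the patient's own outcome
    span (label 1, "Same Patient"), half swap in a span from another
    note (label 0, "Other Patient").  Span length k is sampled between
    k_min and k_max tokens, as in the paper.
    """
    rng = random.Random(seed)
    pairs = []
    for i, (adm, out) in enumerate(patients):
        k = rng.randint(k_min, k_max)
        adm_span = adm[:k]
        if rng.random() < 0.5 and len(patients) > 1:
            # negative sample: outcome span from a different patient
            j = rng.choice([x for x in range(len(patients)) if x != i])
            pairs.append((adm_span, patients[j][1][:k], 0))
        else:
            pairs.append((adm_span, out[:k], 1))
    return pairs
```

Each pair would then be encoded jointly (admission span, separator, outcome span) and fed to the binary same-patient classifier head.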
Data sources. We create the pre-training dataset from multiple public sources. To integrate knowledge that doctors gain from previous patients and medical literature, we create two groups of sources: 1) Patients, which includes 32,721 discharge summaries from the MIMIC III training set, 5,000 publicly available medical transcriptions from the MT-Samples website and 4,777 clinical notes from the i2b2 challenges 2006-2012 (Uzuner et al., 2007, 2008, 2010a,b, 2011, 2012; Sun et al., 2013a,b).
2) Articles, composed of 9,335 case reports from PubMed Central (PMC) and 2,632 articles from (Schneider et al., 2020). We filter for section headings that indicate case reports. As i2b2 notes are less well-structured in comparison to MIMIC III discharge summaries, we use a classifier as proposed by Rosenthal et al. (2019) to determine which section a sentence belongs to. The classifier is trained on an annotated set of i2b2 notes and then applied to all other notes.

ICD+: Incorporation of ICD Hierarchy
Medical knowledge in ICD labels. Diagnosis and procedure prediction requires the model to predict ICD-9 codes in a multi-label manner. ICD-9 codes are hierarchically ordered into associated groups. Figure 3 shows the code hierarchy for Malignant hypertensive renal disease with the ICD-9 code 403.0. The diagnosis has two parent groups, namely Hypertension renal disease and Diseases of the circulatory system. Diagnoses or procedures in the same group often share similar medical characteristics, therefore the hierarchical relations of a labelled code can be valuable information. This medical information is currently not integrated into the model. The same holds for the words describing the ICD-9 codes, which often represent further important signals, such as the words renal or malignant.

Enhancing training with useful additional signals. We propose a simple method, ICD+, to incorporate both associated groups and words into the model weights: Instead of only classifying 3-digit codes (as mentioned in 3.3), we let the model additionally predict the 4-digit codes and the bag of words associated with a code and its parent groups. In order to create the bag of words per code, we use the descriptions of ICD-9 codes from MIMIC III and remove all stop words. As shown in Figure 3, the ICD+ method assigns eight additional labels to the example diagnosis and therefore supplies the model with further information about the diagnosis during training. By increasing the number of labels per sample, we integrate relevant medical knowledge and enable the model to learn implicit relations between codes and code groups that share certain words. We evaluate the effectiveness of ICD+ in Section 5.
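The ICD+ label expansion can be sketched as below. The lookup tables `parents` and `descriptions` are hypothetical stand-ins for the MIMIC III code tables, and the stop word list is illustrative:

```python
STOP_WORDS = {"of", "the", "and"}  # illustrative; a standard list would be used

def icd_plus_labels(code, parents, descriptions):
    """Expand one ICD-9 code into its ICD+ label set: the code itself,
    its 3-digit category, its parent groups, and the description words
    of all of these (prefixed to keep them disjoint from code labels).
    """
    labels = {code, code.split(".")[0]}
    labels |= set(parents.get(code, []))
    words = set()
    for c in labels:
        for w in descriptions.get(c, "").lower().split():
            if w not in STOP_WORDS:
                words.add("word:" + w)
    return labels | words

# The paper's example: Malignant hypertensive renal disease (403.0)
parents = {"403.0": ["403", "390-459"]}
descriptions = {
    "403.0": "Malignant hypertensive renal disease",
    "403": "Hypertension renal disease",
    "390-459": "Diseases of the circulatory system",
}
labels = icd_plus_labels("403.0", parents, descriptions)
```

During training, these additional labels are simply appended to the multi-label target vector, so shared words like "renal" tie related codes together in the classification layer.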

Training Clinical Outcome Representations
We pre-train the CORe model on top of BioBERT weights. We then fine-tune the model separately on the four outcome tasks. We use the same training regimen for both pre-training and fine-tuning: We tokenize the texts with WordPiece tokenization and truncate them to 512 tokens, due to the limited context length of the pre-trained models. We use early stopping and tune hyperparameters as described in Appendix C.

Baseline Models
In the following, we introduce the baseline models that we evaluate on the novel outcome prediction tasks. In order to understand the abilities of pre-trained language models we compare their performance against more traditional approaches. The first three models (BOW, word embeddings, CNN) are trained using the hyperparameters proposed by the authors for outcome prediction tasks. The language models are fine-tuned the same way as the CORe model.
Bag-of-Words. Boag et al. (2018) show that a simple bag-of-words (BOW) approach can outperform more complex models on tasks like mortality prediction. We thus include their approach in our evaluation. We adopt their training setting, except that we consider the top 200 instead of the top 20 tf-idf words in order to make the model converge.
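The feature selection behind such a BOW baseline might look like the stdlib-only sketch below. The exact tf-idf weighting of Boag et al. (2018) may differ; this variant is an assumption:

```python
import math
from collections import Counter

def top_tfidf_terms(docs, n_top=200):
    """Select the n_top terms with the highest summed tf-idf weight
    across documents; these form the bag-of-words feature vocabulary."""
    tokenized = [doc.lower().split() for doc in docs]
    df = Counter()  # document frequency per term
    for tokens in tokenized:
        df.update(set(tokens))
    n_docs = len(docs)
    scores = Counter()
    for tokens in tokenized:
        tf = Counter(tokens)
        for term, count in tf.items():
            # smoothed idf; summed over documents as an aggregate score
            idf = math.log((1 + n_docs) / (1 + df[term])) + 1
            scores[term] += (count / len(tokens)) * idf
    return [term for term, _ in scores.most_common(n_top)]
```

A classifier (e.g. logistic regression or an SVM, as in the word-embedding baseline) is then fit on the resulting count or tf-idf vectors.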
Pre-trained word embeddings. Boag et al. (2018) further propose the use of pre-computed word embeddings that were trained on MIMIC III data. We use the same setting as for the BOW approach and fit a support vector machine classifier on the clinical outcome tasks.
Convolutional Neural Network (CNN). Si and Roberts (2019) built a neural network for mortality prediction with two hierarchical convolutional layers at the word and sentence levels, aggregated into a patient-level representation. We follow their approach to evaluate the model on our four admission to discharge tasks.

Table 4 shows performances in (macro-averaged) area under the receiver operating characteristic curve (AUROC). We report scores of the CORe model trained only on Articles, only on Patients, and in a combined training setting, CORe All. We evaluate diagnosis and procedure prediction both with and without the ICD+ method on BioBERT and the CORe models. In both scenarios we evaluate on 3-digit ICD codes only, in order to maintain comparability between the methods.
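The macro-averaged AUROC metric used throughout the evaluation can be computed from scratch via the pairwise ranking (Mann-Whitney) formulation; this is a generic sketch of the metric, not the paper's evaluation code:

```python
def auroc(y_true, y_score):
    """AUROC as the probability that a random positive is scored above
    a random negative, counting ties as half."""
    pos = [s for y, s in zip(y_true, y_score) if y]
    neg = [s for y, s in zip(y_true, y_score) if not y]
    if not pos or not neg:
        raise ValueError("AUROC undefined without both classes")
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0
               for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

def macro_auroc(per_class):
    """Macro-average: unweighted mean AUROC over classes.
    `per_class` is a list of (y_true, y_score) pairs, one per label."""
    return sum(auroc(y, s) for y, s in per_class) / len(per_class)
```

Macro averaging weights every ICD code equally, so performance on the long tail of rare codes matters as much as on frequent ones.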

Results on MIMIC III Admission Notes
Pre-trained models outperform baselines. We see that the evaluated pre-trained language models clearly outperform the BOW, word embeddings and CNN approaches. We further observe that the CORe models improve scores on all tasks in comparison to the baseline models, except for DischargeBERT, which reaches a higher score in mortality prediction, probably affected by its exposure to the test data. This shows that even though the language models are trained on similar data (e.g. PubMed and/or clinical notes), the specific outcome pre-training improves the model's ability to predict clinical outcome targets. Pre-training on Patients and on Articles achieves similar improvements over the baselines, while the combined training is the most effective. An exception is procedure prediction, where pre-training on Patients achieves the highest score. A probable reason is that procedures are documented in more detail in clinical notes, especially since our selection of medical articles focuses on diseases rather than procedures.
Predicting mortality risk is easier than length of stay. We see that the models reach higher scores in the binary mortality task than in length of stay prediction. Even a simple BOW approach can reach a relatively high score, which indicates that most of the notes contain clear hints towards an increased mortality risk. On the other hand, the length of stay task is difficult due to the many factors that can contribute to the length of a patient's stay after the admission, including nonclinical factors such as the patient's insurance situation (Khosravizadeh et al., 2016).
ICD hierarchy improves diagnosis and procedure predictions. Table 4 shows an ablation test without the ICD+ method (in parentheses). We see that both the BioBERT model and the CORe models improve when incorporating code hierarchy and relations through ICD+ into the training process. This is especially visible for ICD procedures, where the hierarchical and textual information, e.g. that a Nephropexy is an operation on the kidney, can add important signals during training.

Table 5: Results on i2b2 diagnosis prediction task (5 classes) in % AUROC. The models reach similar results as on the MIMIC III data, indicating their transferability to other data sources without additional fine-tuning.

Model Transferability: Cross-Verification on i2b2 Clinical Notes
In order to verify that the fine-tuned models are transferable to ICU data from other sources, we apply them to data from the i2b2 De-identification and Heart Disease Risk Factors Challenge. We convert the clinical notes to admission notes as further described in Appendix B.2, which results in 1,118 samples labelled with up to five ICD-9 codes.
Models generalize to i2b2 data. We apply our MIMIC III-based models to predict diagnosis codes for the i2b2 notes without further fine-tuning. We then evaluate based on whether the predictions contain the five mentioned ICD-9 codes. The results in macro-averaged % AUROC are shown in Table 5. Even though the clinical notes differ from the MIMIC III notes in structure and writing style, the tested models are mostly able to identify the conditions. The scores are comparable to the MIMIC III results, which shows that the models are able to generalise to data from different sources such as other hospitals.

Discussion and Findings
Clinical outcome prediction is a sensitive task. We therefore conduct an extensive analysis of the CORe All model, including a manual error analysis by medical doctors on 20 randomly chosen samples, to understand how the model would perform in clinical practice.

Table 6: Analysis of the impact of directly mentioned diagnoses on the diagnosis prediction task. Mentioned diagnoses are detected more reliably, while scores on unmentioned diagnoses see only a small decrease compared to the overall score.

% AUROC: All Diagnoses 83.54 | Diagnoses Mentioned in Text 87.10 | Diagnoses Not Mentioned in Text 82.35

A Closer Look at the Model's Abilities
Does the model mainly extract already present diagnoses? We observe that a majority of coded diseases are already mentioned in the admission text. This is mainly due to chronic diseases (e.g. diabetes mellitus) or to conditions that were identified prior to the ICU admission (e.g. in the emergency ward). We want to know if our model is also able to predict diagnoses that are not mentioned in the text. We annotate the admission texts with ICD-9 diagnosis codes with the methodology described by Searle et al. (2020). We then evaluate on codes that were explicitly mentioned in the text and those that were not. Table 6 shows that the model indeed extracts many diagnoses directly from the text and thus reaches a higher score on mentioned diagnoses. On the other hand, we see that the performance on non-mentioned diagnoses drops only slightly, indicating that the model has also learned to predict non-mentioned diagnoses.
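A rough proxy for this mentioned vs. not-mentioned split can be sketched as a term match between code descriptions and note text. This is a simplification of the MedCAT-based methodology of Searle et al. (2020), and the `descriptions` lookup table is hypothetical:

```python
def split_by_mention(gold_codes, note, descriptions):
    """Partition gold ICD-9 codes into those whose description terms
    appear verbatim in the admission note and those that do not."""
    text = note.lower()
    mentioned, unmentioned = [], []
    for code in gold_codes:
        terms = descriptions[code].lower().split()
        if any(term in text for term in terms):
            mentioned.append(code)
        else:
            unmentioned.append(code)
    return mentioned, unmentioned
```

Evaluating AUROC separately on the two partitions yields the comparison reported in Table 6.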
How do age and gender impact predictions? Age and gender are common risk factors with significant impact on the potential clinical outcome of a patient. We want our models to learn that impact without overestimating it. We test the model's behaviour by switching age and gender throughout 20 random samples and analysing how the mortality prediction changes. For each sample we manually switch the age mention and iterate over it from 18 until [**Age over 90**]. Figure 4 shows that the analysed samples exhibit a high variation in mortality risk and that age only partially impacts the prediction. In all cases the prediction increases with age, as expected from a medical perspective. We also observe some peaks without a medical reason that are caused by the mortality of certain age groups in the original data (black dotted line). This demonstrates how the model does not follow medical reasoning but merely statistical observations. We similarly switch the gender mention and all pronouns in the texts and observe that mortality prediction for male patients is increased by 5% on average, consistent with medical rationale.
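The age perturbation described above can be sketched as a sweep over the age mention in a note. The regex for age mentions is an assumption about the note format, and `predict_mortality` stands in for any fitted model:

```python
import re

def age_sweep(note, predict_mortality, ages=range(18, 91)):
    """Replace the age mention in `note` with each value in `ages`
    and record the predicted mortality risk for each variant."""
    risks = {}
    for age in ages:
        perturbed = re.sub(r"\b\d{1,3}[- ]year[- ]old\b",
                           f"{age}-year-old", note)
        risks[age] = predict_mortality(perturbed)
    return risks
```

Plotting the resulting risk curve per sample against age reproduces the kind of analysis shown in Figure 4.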
Where is the model failing?

There is no Ground Truth in Clinical Data
Incomplete and inconsistent labels. Our error analysis reveals that 60% of the analysed samples are partially under-coded. They contain indicators for a diagnosis or procedure but miss the corresponding ICD-9 code. This is consistent with results from Searle et al. (2020) showing that MIMIC III is up to 35% under-coded. Additionally we find that procedures that are almost always performed in the ICU such as Puncture of vessel are often coded inconsistently. While a doctor can infer these labels with medical common sense, they pose a challenge to our models. We therefore suggest a critical view towards the data and welcome additional clinical datasets to compensate for noisy labels.
Multiple possible outcomes. 85% of analysed samples contain false positive predictions that the doctors still consider medically reasonable. This demonstrates that there are many possible clinical pathways and that some might not be foreseeable at admission time. We also see many cases in which the information in the clinical note is not sufficient and therefore allows multiple interpretations. For future work, we propose including further EHR data as suggested by Khadanga et al. (2019) to extend the patient representation in these scenarios.

Conclusion
We reframe the task of clinical outcome prediction to consider the admission state of a patient and support doctors in their initial decision process. We show that current state-of-the-art language models outperform selected baselines on this task and present methods for further improvement: Outcome pre-training enables our models to learn from unlabelled sources and ICD+ incorporates hierarchical and textual ICD representations into our models. For future work, we suggest considering pre-trained language models with larger context sizes (Beltagy et al., 2020;Zaheer et al., 2020) and languages other than English (Reys et al., 2020). We further encourage work on semantic encoding of negated terms and numerical data from clinical text.
A Distribution of Diagnosis and Procedure Labels

Figure 5 and Figure 6 show the distributions of labels in the diagnosis and procedure prediction training sets. Both distributions follow the power law with a long tail of rare codes.

B.1 Admission Notes From Discharge Summaries
We use MIMIC III discharge summaries that contain aggregated information about a patient such as doctor's assessments, relevant lab values, medications, and the patient's history. In order to filter the documents by admission sections, we first split all discharge summaries into sections with simple pattern matching. Together with clinical professionals, we then evaluated discharge summaries and identified sections that are known at admission time. We remove all other sections and thus hide information about the further hospital course and discharge of a patient. We exclude notes that do not contain any of the admission sections. We further apply a patient-wise split into train, validation and test set with a 70/10/20 ratio.
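The patient-wise 70/10/20 split can be sketched as follows; shuffling details and the rounding of split sizes are our assumptions:

```python
import random

def patient_wise_split(samples, ratios=(0.7, 0.1, 0.2), seed=42):
    """Split (patient_id, note) pairs into train/val/test so that all
    notes of one patient end up in the same split."""
    rng = random.Random(seed)
    patients = sorted({pid for pid, _ in samples})
    rng.shuffle(patients)
    n = len(patients)
    n_train = int(ratios[0] * n)
    n_val = int(ratios[1] * n)
    assignment = {}
    for idx, pid in enumerate(patients):
        assignment[pid] = 0 if idx < n_train else 1 if idx < n_train + n_val else 2
    train, val, test = [], [], []
    for pid, note in samples:
        (train, val, test)[assignment[pid]].append((pid, note))
    return train, val, test
```

Splitting by patient rather than by note prevents information about one patient from leaking between training and evaluation.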

B.2 Converting i2b2 Data into Admission Discharge Task
The i2b2 De-identification and Heart Disease Risk Factors Challenge introduced a dataset that contains clinical notes and discharge summaries annotated based on risk factors and disease indicators. We convert the data into an admission to discharge task by selecting five of the annotated conditions which correspond to ICD-9 codes as our labels, namely Hypertension (401), Hyperlipidemia (272), Coronary artery disease (414), Diabetes mellitus (250) and Obesity (278). As in the MIMIC III diagnosis task, samples are annotated in a multi-label fashion. In order to convert the clinical notes to admission notes, we use the dataset from Rosenthal et al. (2019) that contains section labels per sentence. We then exclude sections that are not known at admission time, analogous to Section 3.2.

C Hyperparameter Setting
We use the following setting for pre-training and fine-tuning of the introduced Transformer-based models: We use early stopping and apply a random search for tuning the hyperparameters.

D Results on Top 10 Diagnoses and Procedures
Figures 7 and 8 show the % AUROC scores of our CORe All model on the most frequent labels within the diagnosis and procedure prediction tasks. Figure 7 shows that many chronic diseases such as Essential Hypertension or Chronic ischemic heart disease are among the most common within the MIMIC III dataset and present with relatively high AUROC values. We also observe that very specific codes such as Diabetes mellitus and Bypass Anastomosis are predicted more easily compared to more general codes such as Other and unspecified anemias. Figure 8 further shows the negative influence of inconsistent labeling on standard procedures such as Puncture of Vessel.