Literature-Augmented Clinical Outcome Prediction

We present BEEP (Biomedical Evidence-Enhanced Predictions), a novel approach for clinical outcome prediction that retrieves patient-specific medical literature and incorporates it into predictive models. Based on each individual patient's clinical notes, we train language models (LMs) to find relevant papers and fuse them with information from notes to predict outcomes such as in-hospital mortality. We develop methods to retrieve literature based on noisy, information-dense patient notes, and to augment existing outcome prediction models with retrieved papers in a manner that maximizes predictive accuracy. Our approach boosts predictive performance on three important clinical tasks in comparison to strong recent LM baselines, increasing F1 by up to 5 points and precision@Top-K by a large margin of over 25%.


Introduction
Predicting the medical outcomes of hospitalized patients holds the promise of enhancing clinical decision making. With the advent of electronic health records (EHRs), more clinical data has become available to train AI models for outcome prediction (Rajkomar et al., 2018; Hashir and Sawhney, 2020). In particular, language models pretrained on biomedical and/or clinical text are demonstrating increasing proficiency when fine-tuned for the task of predicting outcomes such as in-hospital mortality or length of stay (van Aken et al., 2021).
In this work, we explore a novel approach for improving clinical outcome prediction by dynamically retrieving relevant medical literature for each patient, and incorporating this literature into language models (LMs) trained for outcome prediction from clinical notes. This is in contrast to existing outcome prediction work that uses only clinical notes (Boag et al., 2018; Hashir and Sawhney, 2020). Recent LM-based approaches (van Aken et al., 2021) have designed pretraining schemes over corpora of clinical notes and general biomedical literature. In contrast, we directly incorporate a literature retrieval mechanism into our outcome prediction model, finding papers relevant to specific patient cases. Our approach, named BEEP (Biomedical Evidence-Enhanced Predictions), is broadly inspired by Evidence Based Medicine (EBM), a leading paradigm in modern medical practice which calls for finding the "current best evidence" to support optimal clinical decisions for each individual patient (Sackett et al., 1996).
Our setting presents unique challenges. First, our approach requires retrieving literature based on noisy EHR notes containing multitudes of information (e.g., medical history, ongoing treatments), unlike orthogonal efforts on extracting and summarizing scholarly information related to well-formed questions (e.g., the efficacy of ACE inhibitors in adult patients with type-2 diabetes) (Wallace, 2019; Lehman et al., 2019; DeYoung et al., 2020, 2021).
In addition, as our end task is predicting patient outcomes, another challenge lies in aggregating the retrieved literature in a way that maximizes prediction accuracy. Toward these challenges, we make the following key contributions:
• Literature-Augmented Model. As illustrated in Figure 1, for each ICU patient and each target outcome to be predicted (e.g., mortality), our model retrieves papers from PubMed, which are encoded and fused together with the ICU admission note for making a final prediction. We present several architectures for retrieving papers and for aggregating and combining them with clinical notes. We make our code, cohort selection, paper identifiers, and models publicly available.
• Adding Literature Boosts Results. For evaluation, we measure both overall performance and precision/recall@Top-K, to account for the real-world scenario where "alarms" are only raised for high-confidence predictions to avoid alarm fatigue (Sendelbach and Funk, 2013). BEEP provides substantial improvements over baselines, with strong gains in overall classification performance and precision@Top-K. For example, we improve F1 by up to 5 points and precision@Top-K by a large margin of over 25%.
• Exploring Patient-Specific Retrieval. We explore a range of sparse and dense retrieval approaches, including language models, for the complex and underexplored task of retrieving relevant literature based on a patient's noisy, information-dense clinical note. Our final retrieval module employs a retrieve-rerank approach that effectively retrieves helpful literature, as shown in our analysis (section 5).
We hope our work opens new research directions for automatically scanning literature for patientspecific evidence, and combining it with EHR information to boost accuracy of medical predictive models. Finally, our work raises the more general prospect of building predictive models that can dynamically learn to retrieve literature for optimizing task accuracy, in medicine and other related areas.

Related Work
Patient-Specific Literature Retrieval. Since 2014, the Text REtrieval Conference (TREC) has organized a series of challenges to advance research in this area. The TREC Clinical Decision Support (CDS) tracks focused on evaluating systems on the task of retrieving biomedical articles relevant for answering generic clinical questions about patient medical records (e.g., identifying potential diagnoses, treatments, and tests) (Simpson et al., 2014; Roberts et al., 2015, 2016). TREC CDS 2014 and 2015 used short case reports as idealized representations of medical records due to the lack of available de-identified records. TREC 2016 shifted to using real-world medical records from the Medical Information Mart for Intensive Care (MIMIC) database (Johnson et al., 2016). In our work, our focus is on predicting clinical outcomes using ICU admission notes and patient-specific retrieved literature. Ueda et al. (2021) use contextualized representations on more structured retrieval tasks not involving clinical notes (Voorhees et al., 2021), leaving open the question of how large pretrained language models (LMs) would fare on long, noisy EHR text. We explore this by experimenting with LMs for retrieval based on EHR text.

Clinical Outcome Prediction. The idea of using automated outcome prediction for assisting clinical triage, workflow optimization, and hospital resource management has received much interest recently, especially given the conditions of the COVID-19 pandemic (Li et al., 2020). Predictive models based on structured (e.g., lab results) and unstructured (e.g., nursing notes) information have been built for key clinical outcomes including mortality (Jain et al., 2019; Feng et al., 2020), length of hospital stay (van Aken et al., 2021), readmission (Jain et al., 2019), sepsis (Feng et al., 2020), prolonged mechanical ventilation (Huang et al., 2020), and diagnostic coding (Jain et al., 2019; van Aken et al., 2021).
Increasingly, models have leveraged unstructured text from notes since they can contain key information for outcome prediction (Boag et al., 2018;Jin et al., 2018). Most recently, van Aken et al. (2021) attempted this using large pretrained LMs. Our work compares the performance of a broader range of state-of-the-art pretrained language models on outcome prediction tasks.

BEEP: Literature-Enhanced Clinical Predictive System
Task & Approach Overview. Our goal is to improve models for clinical outcome prediction from EHR notes by augmenting them with relevant biomedical literature. BEEP consists of two main stages: (i) literature retrieval, and (ii) outcome prediction. We also briefly experiment with a formulation that trains both jointly (details in section 4). Given a patient EHR note Q and a clinical outcome of interest y, the first stage is to identify a set of biomedical abstracts Docs(Q) = {D_1, ..., D_n} from PubMed that may be helpful in assessing the likelihood of the patient having that outcome. The next stage is to augment the input to an EHR-based outcome prediction model with these retrieved abstracts (Q ∪ Docs(Q)) and predict the final outcome. Figure 1 provides a high-level illustration of BEEP, and Figure 2 unpacks it with more detail. Next, we describe our system's main components.

Figure 2: Complete system pipeline, unpacking the high-level overview seen in Figure 1. For a given patient ICU admission note, the literature retrieval module first retrieves relevant biomedical abstracts from a clinical outcome-specific index, then reranks a top-ranked subset of abstracts. The outcome prediction module aggregates information from these reranked abstracts and fuses it with the admission note to make the final prediction.

Literature Retrieval Module
Our literature retrieval module consists of three components: (i) an index of biomedical abstracts pertaining to the outcome of interest, (ii) a retriever that retrieves a ranked list of abstracts relevant to the patient note from the index, and (iii) a reranker that reranks retrieved abstracts using a stronger document similarity computation model. For the retriever, we experiment with both sparse and dense models. We follow the standard retrieve-rerank approach, which has been shown to achieve good balance between efficiency and retrieval performance (Dang et al., 2013), and has recently also proved useful for large-scale biomedical literature search (Wang et al., 2021). In the retrieval step, we prioritize efficiency, using models that scale well to large document collections but are not as accurate, to return a set of top documents. In the reranker step, we prioritize retrieval performance by running a computationally expensive but more accurate model on the smaller set of retrieved documents.
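The retrieve-then-rerank control flow can be sketched as follows. This is a minimal illustration, not our implementation: `cheap_score` and `expensive_score` are hypothetical stand-ins for the sparse/dense retrievers and the cross-encoder reranker described in the following subsections.

```python
def retrieve_then_rerank(query, index, cheap_score, expensive_score,
                         n_retrieve=1000, k=10):
    """Two-stage retrieval: a fast scorer prunes the index to a candidate
    set, then a slower but more accurate scorer reranks the candidates."""
    # Stage 1: efficient retrieval over the full outcome-specific index.
    candidates = sorted(index, key=lambda d: cheap_score(query, d),
                        reverse=True)[:n_retrieve]
    # Stage 2: expensive rescoring, run only on the small candidate set.
    reranked = sorted(candidates, key=lambda d: expensive_score(query, d),
                      reverse=True)
    return reranked[:k]
```

The design point is that the expensive model is invoked only `n_retrieve` times per query, regardless of index size.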

Outcome-Specific Index Construction
Since we are interested in identifying information related to a specific outcome for a patient, we begin by constructing an index of all abstracts from PubMed relevant to that outcome to limit search scope. To gather all abstracts relevant to a clinical outcome, we first identify MeSH (Medical Subject Heading) terms associated with the outcome by performing MeSH linking on the outcome descriptions using scispaCy (Neumann et al., 2019). These associated MeSH terms are then used as queries to retrieve abstracts. For some MeSH terms that are too broad (e.g., "mortality"), we include additional qualifiers (e.g., "human") to make sure we do not gather articles that are not relevant to our overall patient cohort. Appendix A lists the final set of queries used for all clinical outcomes considered in this work. Abstracts retrieved via this process are used to construct the outcome-specific index.

Sparse Retrieval Model
The sparse retrieval model returns top-ranked abstracts based on cosine similarity between TF-IDF vectors of MeSH terms for the query (clinical note) and the documents (outcome-specific abstracts). MeSH terms from abstracts are extracted by running scispaCy MeSH linking over the abstract text; PubMed MeSH tagging is done only at the abstract level and does not reflect actual term frequency in the text, which is why we need our own extraction step. However, extracting MeSH terms from clinical notes requires a more elaborate pipeline, due to two major issues:
• Entity type and boundary issues: Off-the-shelf entity extractors like scispaCy and cTAKES (Savova et al., 2010) extract some entity types that are uninformative for relevant literature retrieval, e.g., hospital names, references to family members, etc. They also have a tendency to ignore important qualifiers. For example, given a sentence containing the entity "right lower extremity pain", both extractors returned "extremity" and "pain" as separate entities.
• Negated entities: Clinical notes have a high density of negated entities (up to 50% of entities (Chapman et al., 2001)). These entities must be identified and discarded prior to literature retrieval, to avoid retrieving articles about symptoms and conditions that are not exhibited by the patient.
To handle these issues, we train an entity extraction model that focuses on problems, tests, and treatments with empirically good coverage of important qualifiers (Uzuner et al., 2011). We then filter negated entities with negation detection (Harkema et al., 2009) and perform entity linking to MeSH terms. For more information and implementation details see Appendix B.
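A minimal sketch of the sparse scoring step, assuming the pipeline above has already produced a list of (non-negated, linked) MeSH terms per document: TF-IDF vectors are built from these term lists and compared by cosine similarity. This is an illustrative from-scratch version, not our actual implementation.

```python
import math
from collections import Counter

def tfidf_vectors(docs_terms):
    """docs_terms: one list of extracted MeSH terms per document.
    Returns one sparse TF-IDF vector (term -> weight) per document."""
    n = len(docs_terms)
    # Document frequency: in how many documents does each term appear?
    df = Counter(t for terms in docs_terms for t in set(terms))
    vectors = []
    for terms in docs_terms:
        tf = Counter(terms)
        vectors.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse vectors (dicts)."""
    dot = sum(u[t] * v.get(t, 0.0) for t in u)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0
```

The clinical note is scored against every abstract in the outcome-specific index with `cosine`, and the top-ranked abstracts are passed on to the reranker.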

Dense Retrieval Model
We add a dense retrieval model to complement the sparse retriever, an approach that has shown promise in recent work (Gao et al., 2021). Our dense retrieval model maps clinical notes (queries) and biomedical abstracts (documents) to a shared low-dimensional dense embedding space. Computing similarity between these encoded vectors allows for softer matching beyond surface form. For dense retrieval, we use a BERT-based bi-encoder model. We use a bi-encoder to support scaling to large document collections, as opposed to cross-encoder models, which are much slower (e.g., Gu et al. (2021)). We use PubmedBERT (Gu et al., 2021) as the encoder and train our bi-encoder using the dataset from the TREC 2016 clinical decision support task (Roberts et al., 2016). For more details, see Appendix B. Our bi-encoder achieves a mean precision@10 score of 45.67 on TREC 2016 data in 5-fold cross-validation, comparable to state-of-the-art results (Das et al., 2020).
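A toy sketch of bi-encoder-style dense retrieval: the embedding vectors below are given directly as stand-ins for PubmedBERT encodings, and documents are ranked by Euclidean distance to the query embedding, following the scoring function described in Appendix B.

```python
import math

def euclidean(u, v):
    """Euclidean distance between two embedding vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def dense_retrieve(query_vec, doc_vecs, k=10):
    """Return the indices of the k documents whose (precomputed) embeddings
    are closest to the query embedding, closest first."""
    ranked = sorted(range(len(doc_vecs)),
                    key=lambda i: euclidean(query_vec, doc_vecs[i]))
    return ranked[:k]
```

Because document embeddings are computed once offline, query time is a single encoder pass plus a nearest-neighbor search, which is what makes the bi-encoder scalable.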

Reranker Model
The reranker model takes a subset of top-ranked documents from both the sparse and dense retrieval models and rescores them. We use a BERT-based cross-encoder model for reranking, prioritizing ranking performance over efficiency on this smaller subset. Given a query clinical note Q and an abstract document D_i, we run a PubmedBERT-based encoder over the concatenation of both ([CLS] Q [SEP] D_i [SEP]) to compute an embedding E_{QD_i}. This embedding is run through a linear layer to produce a relevance score, trained using cross-entropy loss with respect to document relevance labels from the TREC 2016 dataset. Our cross-encoder achieves a mean precision@10 score of 48.33 on TREC 2016 in 5-fold cross-validation, which is also comparable to state-of-the-art performance on TREC CDS 2016 (Das et al., 2020).
From the top-ranked documents returned by the reranker, the top k are selected to be passed alongside the patient clinical note to the outcome prediction module, which we describe next.

Outcome Prediction Module
The goal of this module is to compute an aggregate representation from the set of top k abstracts relevant to the clinical note, and then predict the outcome of interest using this aggregate representation and the note representation.

Aggregation Strategies
Let Docs(Q) = {D_1, ..., D_k} be the set of relevant abstracts retrieved for clinical note Q, and let BERT(X) be the encoder function that returns an embedding E_X given a document X. We experiment with four different strategies to compute an aggregate literature representation for Docs(Q), which we denote by LR(Q).

Averaging. Average of encoder representations: LR(Q) = (1/k) Σ_{i=1..k} BERT(D_i).

Weighted Averaging. Weighted average of encoder representations: LR(Q) = Σ_{i=1..k} w_i BERT(D_i) / Σ_{i=1..k} w_i, where the weights w_i are the relevance scores computed by the reranker.

For these two strategies, the final outcome is computed by concatenating the note representation BERT(Q) with LR(Q) and running this through a linear layer. Alternatively, we concatenate the note embedding with each abstract embedding (E_{QD_i} = [BERT(Q); BERT(D_i)]), run outcome prediction per abstract, and aggregate the output probabilities as follows.

Soft Voting. Average of per-class probabilities from the k outcome prediction runs: p(y = c | Q) = (1/k) Σ_{i=1..k} p_i(y = c | Q, D_i).

Weighted Voting. Weighted average of per-class probabilities, using the reranker relevance scores w_i as weights: p(y = c | Q) = Σ_{i=1..k} w_i p_i(y = c | Q, D_i) / Σ_{i=1..k} w_i.
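Treating embeddings and per-class probability vectors as plain lists, the four aggregation strategies reduce to the following sketch (the encoder and the per-abstract classifier themselves are omitted):

```python
def average(vectors):
    """Elementwise mean of k equal-length vectors."""
    k = len(vectors)
    return [sum(col) / k for col in zip(*vectors)]

def weighted_average(vectors, weights):
    """Elementwise weighted mean, e.g. with reranker relevance scores."""
    total = sum(weights)
    return [sum(w * x for w, x in zip(weights, col)) / total
            for col in zip(*vectors)]

# Averaging / Weighted Averaging operate on abstract embeddings BERT(D_i);
# Soft Voting / Weighted Voting apply the same math to the per-class
# probability vectors produced by the k abstract-conditioned runs.
def soft_vote(prob_runs):
    return average(prob_runs)

def weighted_vote(prob_runs, weights):
    return weighted_average(prob_runs, weights)
```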

Experiments & Results
We test our system on the task of predicting clinical outcomes from patient admission notes. Predicting outcomes from admission notes can help with early identification of at-risk patients and assist hospitals in resource planning by indicating how long patients may require hospital/ICU beds, ventilators etc. (van Aken et al., 2021).

Clinical Outcomes
We evaluate our system on three clinical outcomes: • PMV: Prolonged mechanical ventilation prediction, identifying whether a patient will require ventilation for >7 days (Huang et al., 2020). • MOR: In-hospital mortality prediction, identifying whether a patient will survive their current admission (van Aken et al., 2021). • LOS: Length of stay prediction is the task of identifying how long a patient will need to stay in the hospital. We follow van Aken et al. (2021) and group patients into four major categories based on clinician recommendations: <3 days, 3-7 days, 1-2 weeks, and >2 weeks.
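The LOS label construction above maps a stay length in days to one of the four clinician-recommended buckets; a minimal sketch (the boundary handling at exactly 3, 7, and 14 days is our reading of the category definitions):

```python
def los_class(days):
    """Bucket length of stay into four classes:
    0: <3 days, 1: 3-7 days, 2: 1-2 weeks, 3: >2 weeks."""
    if days < 3:
        return 0
    if days <= 7:
        return 1
    if days <= 14:
        return 2
    return 3
```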
PMV and MOR are binary classification tasks, while LOS is a multi-class classification task. We predict these outcomes from patient admission notes extracted from the MIMIC III v1.4 database (Johnson et al., 2016), which contains de-identified EHR data including clinical notes in English from the Intensive Care Unit (ICU) of the Beth Israel Deaconess Medical Center in Massachusetts between 2001 and 2012. Admission notes are constructed by filtering discharge summary documents from MIMIC to only retain the following sections typically known at admission: Chief complaint, (History of) Present illness, Medical history, Admission medications, Allergies, Physical exam, Family history and Social history. Notes that do not contain any of these sections are excluded. For PMV, we follow the cohort selection process from Huang et al. (2020), and include all patients who were above 18 years of age and were on mechanical ventilation for at least 2 days with more than 6 hours each day. Patients transferred from other hospitals, organ donors, and patients with neuromuscular disease, head and neck cancer, and extensive burns, which always lead to PMV and may act as confounds, were excluded. For MOR and LOS, we follow the same cohort selection process as van Aken et al. (2021), and include all patients except newborns and remove duplicate admissions. Following these cohort selection processes results in the data splits shown in Table 1b. Table 1b also shows the numbers of relevant PubMed articles for all three clinical outcomes.

Selecting the Encoder Language Model
Since the encoder used for outcome prediction needs to produce representations for both clinical notes and relevant abstracts, we choose language models that have been pretrained on both biomedical and clinical text, and evaluate them on outcome prediction (without literature augmentation) to choose a suitable encoder.

Table 2: Performance of baseline and literature-augmented outcome prediction models on all clinical outcomes. We note that LOS is a multiclass target; we observe substantial gains in 2/4 of the classes.
Note that in this experiment, we predict clinical outcomes from patient admission notes only, without incorporating literature. We also use a weighted cross-entropy loss to manage class imbalance (see Appendix B). Table 5 in the Appendix shows the performance of the above language models on the validation sets for all clinical outcomes. We select the top-performing language models, BLUEBERT and UMLSBERT, for our remaining experiments.

Literature Augmentation Results
We provide two sets of results: for overall performance, and for high-confidence predictions.
Overall Performance. Table 2 shows the overall performance of our literature-augmented outcome prediction system on all three clinical outcomes. We test our system using both UMLSBERT and BLUEBERT as encoders, as well as all four literature aggregation strategies. We report three metrics for each setting: (i) area under the receiver operating characteristic curve (AUROC), (ii) micro-averaged F1 score, and (iii) macro-averaged F1 score. From Table 2, we observe that incorporating literature leads to performance improvements on two of the three clinical outcomes. (We also experiment with CORe as an encoder, but observe consistently lower scores.)

Evaluating High-Confidence Predictions. In addition to standard evaluation, we evaluate the top 10% highest-confidence predictions per class for all models (precision/recall@Top-K), which is informative for two key reasons. First, when using automated outcome prediction systems in a clinical setting, it is reasonable to only consider raising alarms for high-confidence positive predictions to avoid alarm fatigue (Sendelbach and Funk, 2013). Second, high-confidence predictions for both positive and negative classes can be used to reliably assist with hospital resource management (e.g., predicting future ventilation and hospital bed needs). Tables 3a and 10 show the precision/recall@Top-K scores for all models on prolonged mechanical ventilation, mortality, and length of stay prediction. In Table 3a, we see that our literature-augmented models achieve much higher precision scores than the baseline (∼9-12 points higher in most cases) for the PMV negative class. We also see higher precision scores than the baseline for the positive class (∼5-9 points higher in most cases). This is a strong indicator that our literature-augmented pipeline might offer more utility for PMV detection in a clinical setting than using EHR notes only. Table 3b shows similarly encouraging trends for mortality prediction.
The mortality prediction dataset is the most skewed of the three, so we do not see much performance difference across models on the negative class. However, on the positive class, our literature-augmented models show a dramatic increase in precision. In particular, BLUEBERT-based literature models show an increase in precision of ∼22-27 points, at the expense of only a ∼6-7 point drop in recall relative to non-literature models (note that since the positive mortality class is rare, even a larger recall drop translates to only a small number of missed cases). This also indicates that literature-augmented mortality prediction might be more precise and reliable in a clinical setting than using clinical notes alone. From Table 10 (Appendix H), we can see that for LOS prediction, our models show clear gains (∼2-5 points) on classes 1 and 2 (i.e., 3-7 days and 1-2 weeks), and minor gains for some variants on class 3 (>2 weeks). We also perform an alternate evaluation in which we only score predictions from our literature-augmented models that show a relative confidence increase of at least 10% over the baseline prediction, presented in Appendix H.
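The precision@Top-K computation can be sketched as follows. This is an illustrative version, assuming the simplest cutoff rule (keep the ⌊10%⌋ highest-confidence predictions for the class of interest); our actual evaluation may handle ties and rounding differently.

```python
def precision_at_top_k(confidences, predictions, labels, frac=0.10):
    """Precision over the `frac` highest-confidence predictions.
    confidences/predictions/labels are parallel lists for one class."""
    k = max(1, int(len(confidences) * frac))
    # Indices of the k most confident predictions.
    top = sorted(range(len(confidences)),
                 key=lambda i: confidences[i], reverse=True)[:k]
    correct = sum(1 for i in top if predictions[i] == labels[i])
    return correct / k
```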
Learning To Retrieve Using Outcomes. BEEP trains separate models for literature retrieval and outcome prediction. Inspired by Lee et al. (2019), we develop a learning-to-retrieve (L2R) formulation that trains both jointly, so that the retriever can learn from outcome feedback. However, our L2R model does not improve performance over BEEP (results in Table 7 in Appendix E). We discuss potential reasons in Appendix E. This is an interesting direction for future work.

Analysis and Discussion
Given BEEP's improved performance, we further assess the utility of retrieved literature and cases where adding literature is particularly helpful.
Diversity of retrieved literature. As a preliminary analysis, we evaluate the diversity of the abstracts retrieved for admission notes in our datasets, as a proxy for the degree to which the literature is personalized to specific patient cases. For the 100 most frequently retrieved abstracts for each clinical outcome, Figures 4a, 4b, and 4c in Appendix H show the proportions of patient notes for which these abstracts are judged as relevant by our retrieve-rerank pipeline. From these histograms, we see a stark difference for LOS, which is much less diverse than both PMV and MOR, indicating that the literature retrieved for length of stay prediction may be less personalized to patient cases than the literature retrieved for other outcomes. We leave to future work the exploration of diversifying retrieved papers across patients and examining the effect on outcome prediction performance.

Qualitative examination of retrieved literature. We qualitatively examine literature retrieved for cases in which our model shows large confidence increases over the baseline, to determine its utility in making the right prediction. We study increases in both directions, i.e., cases in which adding literature resulted in a confidence increase in either the correct outcome label (good) or the incorrect outcome label (bad). For each clinical outcome, a bio-NLP expert looked at the top 5 cases from each category based on the magnitude of confidence increase (10 cases per outcome in total). For each case, the expert looked at the top 5 abstracts retrieved for the case (50 abstracts per outcome in total) and assigned each abstract to one of 8 categories we define for categorizing the degree of relevance and the type of evidence provided, including retrievals considered helpful, as shown in Table 4 (evidence type column; more in the Appendix).

Table 4: Qualitative examples of retrieved literature that is helpful for increasing prediction confidence of the correct outcome. Case 1 shows an example of retrieved literature that strongly matches the patient's condition and provides direct evidence linking it to the outcome of interest. Case 2 shows an example with indirect evidence, in which retrieved literature lists outcome indicators not present in the patient. Case 3 shows an example of retrieved literature describing a link between the patient's ongoing treatment and the outcome of interest. Green: patient characteristics; blue: outcome of interest; red: known indicators of the outcome measure not present in the patient.
As seen in Table 4, for the helpful categories, retrieved literature matches patient characteristics (especially current condition) and includes evidential links between the outcome of interest and the patient's conditions/treatments. In the first case, the retrieved abstract provides evidence that patients with cirrhosis have high mortality in the first 48 hours of intubation, which suggests the patient might not undergo prolonged ventilation. In the second case, the abstract lists comorbidities associated with in-hospital mortality (the outcome of interest), but none are present in the patient under consideration, which can be taken as weak indication that the patient may survive. Similarly, for the third case, the retrieved abstract mentions that cirrhotic patients may have longer hospital stays if they are on mechanical ventilation. This matches our patient's treatment history, since she has cirrhosis and was briefly intubated and extubated before experiencing shortness of breath again. Given this, the patient might have a longer length of stay. Conversely, unhelpful retrieved literature often does not match patient characteristics or may not contain evidence relevant to the outcome. See more example explanations in Appendix I. Figure 3 presents the distribution of helpful and unhelpful categories for both kinds of cases for all outcomes. We can see that for correct outcome cases from both PMV and mortality, retrieved literature is more frequently assigned to one of the helpful categories, while for incorrect outcome cases, retrieved literature is more frequently assigned to one of the unhelpful categories. For LOS, unhelpful categories dominate both types of cases, and are especially prevalent in incorrect outcome cases.

Conclusion
In this paper, we introduced BEEP, a system that automatically retrieves patient-specific literature based on intensive care (ICU) EHR notes and uses the literature to enhance clinical outcome prediction. On three challenging tasks, we obtain substantial improvements over strong recent baselines, seeing dramatic gains in top-10% precision for mortality prediction with a boost of over 25%.
Our hope is that this work will open new research directions into bridging the gap between AI-based clinical models and the Evidence Based Medicine (EBM) paradigm in which medical decisions are based on explicit evidence from the literature. An interesting direction is to incorporate evidence identification and inference (Wallace, 2019; DeYoung et al., 2020) directly into our retrieval and predictive models. Another important question to explore relates to the implications our approach has on increasing the interpretability of clinical AI models.

Acknowledgements
This work was supported in part by NSF Convergence Accelerator Award #2132318. The authors would like to thank the members of the Semantic Scholar team, and the anonymous reviewers for their helpful feedback on this work.

Ethical Concerns
Incorporating outcome prediction models into a medical decision-making pipeline effectively will require these technologies to adhere to standards set by the core principles of medical ethics: beneficence, non-maleficence, autonomy, and justice (Beauchamp et al., 2001). These requirements may raise the following concerns when deploying outcome prediction models in clinical settings:
• Out-of-Cohort Generalization: The extent to which outcome prediction models generalize to patient cohorts that may not have been present in their training data is unclear. If model accuracy is significantly lower on "out-of-cohort" patients, using inaccurate/uncertain predictions during decision making may violate the requirement that any application of technology must be beneficent and non-maleficent to individual patients. Our proposed technique can partly mitigate the generalization issue by identifying additional supporting evidence from literature, which may be better tailored to individual patient characteristics, instead of using only cohort-level evidence. However, biomedical literature can also have blind spots, with certain cohorts and disease combinations being under-studied, and even literature-augmented prediction may not be sufficiently accurate.
• Algorithmic Biases: Since outcome prediction models are trained on historical health data, existing inequities in healthcare access may translate into models continuing to perpetuate unintentional discrimination against patients from under-served demographics. For example, models might predict poorer outcomes (e.g., high mortality, poor response to treatment, etc.) for specific demographics that have historically had worse outcomes due to poor access to care. Such issues are a clear violation of the justice requirement, and must be tackled before deployment.
• Informed Consent: Lastly, if outcome prediction models are used in clinical settings, patients and their caregivers must be made aware of their use, since the principle of autonomy emphasizes that patients must be provided all relevant medical information to support autonomous decision making. The black-box nature of these models raises another issue: how can we help patients/caregivers understand and interpret outcome predictions to further support their autonomy in decision making? We hope that literature-augmented prediction techniques can partly ease this by using evidence snippets from literature that contributed to the model's prediction as explanations.

A PubMed Queries Per Outcome
Following are the MeSH terms that we use to retrieve literature from PubMed to construct the outcome-specific index for each clinical outcome under consideration: • Prolonged Mechanical Ventilation (PMV): "Respiration, Artificial". We also query using the terms "Ventilation, Mechanical" and "Ventilator Weaning" but do not find any new results.
• In-Hospital Mortality (MOR): "Hospital Mortality", "Mortality+Humans+Risk Factors". Note that the "+" operator is interpreted as AND by PubMed search.
• Length of Stay (LOS): "Length of Stay". All other MeSH terms from the tagger are aliases of this term.

B Implementation Details
Entity Extraction. First, we extract entities from clinical notes using a model trained on the i2b2 2010 concept extraction dataset (Uzuner et al., 2011). This dataset consists of clinical notes annotated with three types of entities: problems, tests, and treatments. These entity types cover the pertinent medical information that can be used to retrieve abstracts relevant to a clinical note. Moreover, the i2b2 guidelines require annotators to include all qualifiers within an entity span, so training a model on these annotations should bias it towards including pertinent entity qualifiers. Our entity extraction model uses a BERT-based language model to compute token representations, followed by a linear layer to predict entity labels. We use ClinicalBERT (Alsentzer et al., 2019) as the language model to train our i2b2 entity extractor (see Table 6).

MeSH Linking. Finally, the set of filtered entities is linked to MeSH terms using scispaCy. Entities not linked to MeSH terms are discarded. MeSH terms linked in clinical notes and abstracts are used to compute TF-IDF vectors for the sparse retrieval model.

Bi-Encoder. Given a query clinical note Q and an abstract document D_i, a BERT-based encoder is used to compute dense embedding representations E_Q and E_{D_i}. A scoring function S is defined as the Euclidean distance between query and document embeddings: S(Q, D_i) = ||E_Q − E_{D_i}||_2. Documents closest to the query vector in the embedding space are returned as top-ranked results.
The bi-encoder is trained using a triplet loss function defined as follows: L = max(0, S(Q, D_i^+) - S(Q, D_i^-) + m), where D_i^+ is an abstract more relevant to the clinical note Q than D_i^-, and m is a margin value. We use PubMedBERT (Gu et al., 2021) as the encoder and train our bi-encoder using the dataset from the TREC 2016 clinical decision support task (Roberts et al., 2016). This dataset consists of 30 de-identified EHR notes, along with ∼1000 PubMed abstracts per note marked for relevance. We select relevant abstracts per note as positive candidates (D_i^+), and irrelevant abstracts for the same note as negative candidates (D_i^-).

Outcome prediction module training. We use a weighted cross-entropy loss function to handle class imbalance. Given a dataset with N total examples, c classes, and n_i examples in class i, class weights are computed as w_i = N / (c · n_i). We use the Adam optimizer, treating the initial learning rate as a hyperparameter. All models are implemented in PyTorch, and we use Hugging Face implementations for all pretrained language models.
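The two loss computations above can be sketched in a few lines. This is our paraphrase, not the released training code: the triplet loss pushes the positive abstract closer to the note than the negative one by at least the margin m, and the class weights follow the standard w_i = N / (c · n_i) balancing scheme.

```python
import math

def triplet_loss(e_q, e_pos, e_neg, margin=1.0):
    """Hinge triplet loss over Euclidean distances:
    max(0, S(Q, D+) - S(Q, D-) + m)."""
    d_pos = math.dist(e_q, e_pos)  # S(Q, D+)
    d_neg = math.dist(e_q, e_neg)  # S(Q, D-)
    return max(0.0, d_pos - d_neg + margin)

def class_weights(counts):
    """Weighted cross-entropy weights: w_i = N / (c * n_i)."""
    n_total, n_classes = sum(counts), len(counts)
    return [n_total / (n_classes * n_i) for n_i in counts]

# An imbalanced binary outcome (900 negative, 100 positive cases)
# up-weights the minority class: w = [~0.56, 5.0].
w = class_weights([900, 100])
```

The resulting weights can be passed directly to a weighted cross-entropy loss (e.g. the `weight` argument of PyTorch's `CrossEntropyLoss`).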

C Hyperparameter Tuning
We do a grid search over the following hyperparameter values for each aggregation:
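A grid search of this form enumerates the Cartesian product of candidate values. The sketch below illustrates the mechanics only; the actual hyperparameter values are given in the paper's corresponding table (omitted in this extraction), and the values below are hypothetical placeholders.

```python
from itertools import product

# Hypothetical grid: these values are placeholders for illustration,
# NOT the values used in the paper.
grid = {
    "learning_rate": [1e-5, 2e-5, 5e-5],
    "top_k_abstracts": [1, 5, 10],
}

# Enumerate every configuration in the Cartesian product of the grid.
configs = [dict(zip(grid, vals)) for vals in product(*grid.values())]
# 3 x 3 = 9 configurations, each trained and scored on the dev set.
```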

D Computing Infrastructure
Our experiments were carried out on 2 AWS p3.16xlarge instances, which are 8-GPU machines with 16 GB RAM per GPU. All our experiments can be run on a single 16 GB GPU.

E Results from Learning To Retrieve Model
Given a note Q, we first obtain the set of top 100 relevant abstracts Docs(Q) = {D_1, ..., D_100} from the BEEP retrieve-rerank pipeline. The retriever component then scores each abstract as S_retr(Q, D_j) = BERT_Q(Q)^T BERT_D(D_j), where BERT_Q(X) and BERT_D(X) are the query and document encoder functions. Based on the retriever scores S_retr, we select the top k abstracts and perform outcome prediction using the same structure as the BEEP outcome prediction module. We also add an early update loss term over the retriever scores to the outcome loss, in which the target y_j is set to 1 if using document D_j alongside Q results in a confidence increase in the correct outcome (as per BEEP) and 0 otherwise.

Table 7 presents results for the learning-to-retrieve model on all clinical outcomes using UMLSBERT as the encoder. From the table, we can see that while L2R improves performance over a notes-only baseline, it does not improve over BEEP. We speculate that this may partly be because the heuristic we use to assign y_j values in the early update loss is less accurate than the one used by Lee et al. (2019), which directly checks for the presence of the answer in a document for the reading comprehension task. We believe that experimenting with other sources of supervision for generating y_j values, and with weighting mechanisms to better combine the outcome and early update losses, might lead to larger improvements, but we leave these to future work.
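The heuristic for assigning early update targets y_j can be sketched as follows. This is our reading of the rule described above, with illustrative names and toy probability values (not the authors' code): a document receives y_j = 1 when conditioning on it raises the model's confidence in the correct outcome relative to the note alone.

```python
def early_update_labels(p_note_only, p_with_doc):
    """Assign y_j = 1 iff P(gold outcome | Q, D_j) > P(gold outcome | Q).

    p_note_only: model confidence in the correct outcome from the note alone.
    p_with_doc: list of confidences when each document D_j is added.
    """
    return [1 if p > p_note_only else 0 for p in p_with_doc]

# Toy example: documents 1 and 3 raise confidence in the gold outcome.
y = early_update_labels(0.40, [0.55, 0.38, 0.41, 0.20])
# -> [1, 0, 1, 0]
```

As discussed above, this supervision signal is noisier than directly checking whether a document contains the answer, which may partly explain why L2R does not improve over BEEP.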

F Literature-Augmented Outcome Prediction with CORe

G Literature-Only Outcome Prediction
To quantitatively test the quality of the retrieved literature, we run an ablation study in which we predict the clinical outcome using only the literature retrieved for a specific patient case, without incorporating any information from the patient's clinical note. Table 9 shows the results for this ablation study, using both BLUEBERT and UMLSBERT encoders. From this table, we can see that while removing the clinical note leads to performance drops, especially on mortality and length of stay, the retrieved literature does have some predictive ability. We take this as an indication that the retrieved literature contains some clinical indicators associated with the outcome that are also present in the patient's clinical note.

H Analyzing High Confidence Increases Over Baseline
Finally, we also examine an alternate way of using high-confidence predictions made by our models. We run both baseline and literature-augmented systems, and only consider predictions from the literature-augmented system that show a high increase in confidence, such as a > 10% increase relative to the baseline prediction for the same case. Tables 11a and 11b show the precision scores of all models on prolonged mechanical ventilation and mortality in this setting. We can see that precision scores in this setting are fairly high, especially for the negative class in mortality prediction. Most averaging variants also do well on the positive class in mortality prediction.
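The selection rule above amounts to a simple relative-confidence filter. The sketch below illustrates it with toy confidence values (names and numbers are ours): keep only cases where the literature-augmented confidence exceeds the baseline confidence by more than 10% in relative terms.

```python
def high_confidence_cases(base_conf, aug_conf, rel_threshold=0.10):
    """Return indices of cases where the literature-augmented confidence
    exceeds the baseline confidence by more than rel_threshold (relative)."""
    return [i for i, (b, a) in enumerate(zip(base_conf, aug_conf))
            if (a - b) / b > rel_threshold]

# Toy example: case 0 gains 20% relative, case 1 only 2.5%, case 2 ~16.7%.
idx = high_confidence_cases([0.50, 0.80, 0.60], [0.60, 0.82, 0.70])
# -> [0, 2]
```

Precision is then computed only over the retained subset of predictions, which is the setting reported in Tables 11a and 11b.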

I Examples of Literature For Incorrect Outcome Cases
We categorize examples into the following:
1. Patient condition and outcome directly related
2. Patient history and outcome related
3. Known outcome indicators not present in patient
4. Ongoing treatment and outcome related
5. No cohort match
6. No/weak condition match
7. Condition-outcome pair not studied
8. No evidence for outcome / weak evidence for direct relationship between patient condition and outcome

From Table 12, we can see that retrieved literature from the unhelpful categories often does not match patient characteristics. The first case discusses a patient who has had an ICD firing incident, but the retrieved literature discusses ICD implantation therapy. While related, it contains no discussion of the impact of ICD firing on clinical outcomes.
For the second case, we see that the retrieved article discusses strokes in general, without matching any of the patient's indications or demographic characteristics. Moreover, the outcome of interest (mortality) is mentioned only briefly, and links between the outcome and the patient's conditions are not studied. Finally, the third case provides an example of a common phenomenon we observe: a fair number of retrieved review articles do not contain strong evidential statements in the abstract. Here, the retrieved abstract discusses the need for early triage/transfer (which could lead to low length of stay) but does not provide any conclusive evidence:

"Stroke is indicated by an abrupt manifestation of neurologic deficits secondary to an ischemic or hemorrhagic insult to a region of the brain...ranked as the third leading cause of death in the United States...report shows that despite the use of antithrombotic and/or antiplatelet aggregating drugs, the key to stroke management is primary prevention."

Table 12: Qualitative examples of retrieved literature categorized as unhelpful, for cases where adding literature increases confidence in the incorrect outcome. Case 1 shows retrieved literature with a weak match to the patient condition but no evidence linking condition to outcome. Case 2 shows retrieved literature that neither matches the patient case nor contains evidence for the outcome. Case 3 shows a review article that again neither matches the patient case nor provides outcome evidence.

Figure: Proportion of admission notes associated with the 100 most highly retrieved abstracts for each clinical outcome. From these graphs, we can see that frequently retrieved abstracts for LOS are associated with a larger proportion of cases from the dataset than frequently retrieved abstracts for PMV and MOR, indicative of lower literature diversity for LOS.