Attentive Multiview Text Representation for Differential Diagnosis

We present a text representation approach that combines different views (representations) of the same input through effective data fusion and attention strategies for ranking purposes. We apply our model to the problem of differential diagnosis, which aims to find the diseases that most probably match the clinical descriptions of patients, using data from the Undiagnosed Diseases Network. Our model outperforms several ranking approaches (including a commercially supported system) by effectively prioritizing and combining representations obtained from traditional and recent text representation techniques. We elaborate on several aspects of our model and shed light on its improved performance.


Introduction
Electronic Health Records (EHRs) (Dick et al., 1997) contain a wealth of documented information and insights about patients' health and well-being. However, it is difficult to effectively process such data due to complex terminology, missing information, and imprecise clinical descriptions (Friedman et al., 2013; Rajkomar et al., 2019). In addition, orphan or rare diseases (Kodra et al., 2012; Walley et al., 2018) form an especially challenging class: they are diverse in symptoms and affect a small percentage of the population.
In this paper, we investigate how well Natural Language Processing (NLP) algorithms can reproduce the performance of clinical experts in the task of differential diagnosis: the process of distinguishing a particular disease from others that present similar clinical features, given the medical histories (descriptions) of individual patients. We formulate this task as a ranking problem where the aim is to find the most probable diseases given the medical histories of patients (Dragusin et al., 2013).
We develop a novel pairwise ranking algorithm that combines different views of patient and disease descriptions, and prioritizes effective views through an Attentive Multiview Neural Model (AMNM). We study this problem using data from the Undiagnosed Diseases Network (UDN) (Gahl et al., 2015; Ramoni et al., 2017), which includes concise medical histories of patients and their corresponding diseases in the Online Mendelian Inheritance in Man (OMIM) dataset (Amberger et al., 2015). All diagnoses, i.e. mappings between each patient and the corresponding diseases, are provided by a team of expert clinicians from the UDN.
The contributions of this paper are as follows:
• illustrating the impact of NLP in detecting the nature of illness (diagnosis) in patients with rare diseases in a real-world setting, and
• a novel neural approach that effectively combines and prioritizes different views (representations) of inputs for ranking purposes.
Our Attentive Multiview Neural Model employs traditional and recent representation learning techniques and outperforms current pairwise neural ranking approaches through effective data fusion and attention strategies. We conduct several experiments to illustrate the utility of different fusion techniques for combining patient (query) and disease (document) representations.

Method
In many domains, entities can be represented from multiple views. For example, a patient can be represented by demographic data, medical history, diagnosis codes, radiology images, etc. We propose a neural model to effectively prioritize important views and combine them for ranking purposes.
Figure 1 shows our model, which comprises three major components: (a): an attention network that estimates and weights the contribution of each view in the ranking process, (b): a fusion network that utilizes intra-view feature interactions to effectively combine query-document representations, and (c): a softmax layer at the end that estimates the query-document relevance scores given their combined representations. We first formulate the problem and then explain these components.

Problem Statement
Let (q_1, d_1) and (q_2, d_2) denote two different views of the same query and document (throughout the paper, we think of queries and documents as clinical descriptions of patients and diseases respectively); our model can incorporate any number of views, and we illustrate two here only for simplicity. These views can be obtained using traditional (Robertson and Walker, 1994) or recent (Devlin et al., 2019) representation learning techniques applied to textual descriptions or codified data of queries and documents. For example, q_1 and d_1 can indicate representations of the texts of a query and a document, and q_2 and d_2 can indicate representations of the medical concepts and codes associated with the same query and document. Our task is to determine a relevance score between each given query and document. Toward this goal, we effectively prioritize and combine these representations through Attention and Fusion neural networks, which are described below.

Attention Model
We develop an attention sub-network to explicitly capture the varying importance of views by assigning attentive weights to them. Specifically, given the embedding vectors of a query q_i ∈ R^l and a document d_i ∈ R^m in the i-th view, we use a feedforward network, i.e. function f(.) in Figure 1, to estimate the vector a that captures attention weights across views as follows:

a_i = ϕ(W_q q_i + b_q) · ϕ(W_d d_i + b_d),    a = softmax(a_1, ..., a_v),    (1)

where W_q ∈ R^{n×l} and W_d ∈ R^{n×m} are weight matrices that transform the query and document representations into the same underlying space of dimension n, b_q ∈ R^n and b_d ∈ R^n are the trainable bias vectors for the query and document respectively, and ϕ(.) is the ReLU function. The softmax activation function transforms the attention weights to the [0, 1] range. Assuming that the query-document pair of the more influential view is more similar in the underlying shared space (estimated by the dot product in (1)), a captures the attention weights of the different views.
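As a concrete illustration, the attention computation in (1) can be sketched in NumPy. This is a minimal sketch, not the paper's implementation: the function name `view_attention`, the sharing of one projection pair (W_q, W_d) across views, and the toy dimensions are all illustrative assumptions.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def softmax(x):
    e = np.exp(x - np.max(x))  # shift for numerical stability
    return e / e.sum()

def view_attention(queries, docs, W_q, b_q, W_d, b_d):
    """Attention sub-network f(.) of equation (1): project each view's
    query (R^l) and document (R^m) into a shared R^n space with a ReLU
    layer, score the view by the dot product of the projections, and
    normalize the per-view scores with a softmax."""
    scores = []
    for q_i, d_i in zip(queries, docs):
        z_q = relu(W_q @ q_i + b_q)
        z_d = relu(W_d @ d_i + b_d)
        scores.append(z_q @ z_d)  # similarity in the shared space
    return softmax(np.array(scores))

# Toy example with two views and illustrative dimensions l=8, m=6, n=4.
rng = np.random.default_rng(0)
l, m, n = 8, 6, 4
W_q, W_d = rng.normal(size=(n, l)), rng.normal(size=(n, m))
b_q, b_d = np.zeros(n), np.zeros(n)
queries = [rng.normal(size=l) for _ in range(2)]
docs = [rng.normal(size=m) for _ in range(2)]
a = view_attention(queries, docs, W_q, b_q, W_d, b_d)  # attention weights over views
```

The softmax guarantees that the view weights are non-negative and sum to one, so a can be read as a distribution over views.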

Fusion Model
Previous learning to rank approaches often concatenate query and document representations to combine their corresponding features (dos Santos et al., 2015; Amiri et al., 2016). A few approaches explicitly capture feature interactions between queries and documents (Severyn and Moschitti, 2015; Echihabi and Marcu, 2003). We extend and compare these fusion techniques. Given the attention weights from (1), we develop a fusion sub-network, function g(.) in Figure 1, to capture the intra-view feature interactions between the query and document representations of each view. Our fusion network takes as input the attentive embeddings of each view, i.e. (α × q, α × d), and combines them through one of the following tensor fusion operations:

g_dot(q, d) = (αq)^⊤(αd),   g_outer(q, d) = (αq)(αd)^⊤,   g_conv(q, d) = AvgPool(Conv1D([αq; αd])),    (2)

where g_dot, g_outer, and g_conv denote the dot product, outer product, and one-dimensional (1D) convolution with average pooling respectively. In contrast to g_dot, g_outer and g_conv are considerably more expensive operations but may better encode feature interactions. The output of function g is flattened and treated as the intra-view embedding.
Finally, we obtain the overall fused representation for each view by concatenating its intra-view and attentive embeddings. The representations of all views are then fed into a softmax to estimate the relevance between queries and documents.
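The fusion operations in (2) and the final concatenation can be sketched as follows. This is an illustrative sketch: `fuse` and `fused_view_representation` are hypothetical names, and a single fixed averaging kernel stands in for the learned Conv1D filters of the paper's model.

```python
import numpy as np

def fuse(q, d, alpha, mode="dot"):
    """Fusion function g(.) of equation (2), applied to the attentive
    embeddings (alpha*q, alpha*d) of one view."""
    q, d = alpha * q, alpha * d
    if mode == "dot":
        return np.array([q @ d])            # g_dot: a single interaction score
    if mode == "outer":
        return np.outer(q, d).ravel()       # g_outer: all pairwise interactions, flattened
    if mode == "conv":
        # g_conv stand-in: one averaging kernel of width 3 plays the role
        # of the learned Conv1D filters followed by average pooling
        x = np.concatenate([q, d])
        return np.convolve(x, np.ones(3) / 3.0, mode="valid")
    raise ValueError(f"unknown fusion mode: {mode}")

def fused_view_representation(q, d, alpha, mode="dot"):
    # Overall representation of a view: the intra-view embedding from g(.)
    # concatenated with the attentive embeddings themselves.
    return np.concatenate([fuse(q, d, alpha, mode), alpha * q, alpha * d])

# Toy check: 4-dimensional view embeddings with attention weight 0.5.
r = fused_view_representation(np.ones(4), np.ones(4), 0.5, mode="dot")
```

Note the cost difference visible in the output sizes: g_dot yields a single scalar, while g_outer grows quadratically with the embedding dimension.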

Experiments
Data: Our data includes medical histories of 257 patients provided by the Undiagnosed Diseases Network (UDN) (Gahl et al., 2015; Ramoni et al., 2017), as well as general descriptions (including clinical features) of more than 9K diseases available in the Online Mendelian Inheritance in Man (OMIM) dataset (Amberger et al., 2015). The UDN is a nationwide program that improves the level of diagnosis for individual patients (with severe clinical conditions) whose signs and symptoms have been intractable to diagnosis (Kobren et al., 2021; Amiri et al., 2021). To the best of our knowledge, this is the largest available dataset for research on patients with rare diseases. The relevance judgments between patients and diseases are provided by a team of expert clinicians at the UDN. The total number of positive patient-disease pairs is 4,746, and the number of unique diseases among these pairs is 1,131; note that different patients can match with the same disease. We split the patients into training (80%), validation (10%), and test (10%) sets. In addition, for each positive pair in the training set, we create a negative pair for the same patient through random sampling of diseases. At test time, we create all possible patient-disease pair combinations (more than 218K pairs) and use the estimated confidence scores of the classifier to rank all diseases against each test patient. In terms of views, we consider the texts of medical histories and diseases as the first view, and the medical concepts and codes extracted from histories by QuickUMLS (Soldaini and Goharian, 2016) as the second view. Access to phenotypic and genomic UDN data can be granted by submitting an online access request at dbGaP: https://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/study.cgi?study_id=phs001232.v1.p1.
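The pair-construction procedure described above (one randomly sampled negative disease per positive pair) might be sketched as follows; the function name and data layout are illustrative assumptions, not the paper's code.

```python
import random

def build_training_pairs(positive_pairs, all_diseases, seed=13):
    """For every expert-labeled (patient, disease) pair, emit the positive
    pair with label 1 and one randomly sampled negative pair (a disease not
    linked to that patient) with label 0. Assumes at least one non-matching
    disease exists for each patient."""
    rng = random.Random(seed)
    positives = set(positive_pairs)
    pairs = []
    for patient, disease in positive_pairs:
        pairs.append((patient, disease, 1))
        neg = rng.choice(all_diseases)
        while (patient, neg) in positives:   # resample until truly negative
            neg = rng.choice(all_diseases)
        pairs.append((patient, neg, 0))
    return pairs

# Toy example: two labeled patients, three candidate diseases.
pairs = build_training_pairs([("p1", "d1"), ("p2", "d2")], ["d1", "d2", "d3"])
```

At test time no sampling is needed: every patient is paired with every candidate disease and the classifier's confidence scores induce the ranking.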
We note that the concept and code view provides higher-level and more general semantic distinctions by grouping semantically similar terms, while the text view encodes other elements of semantics such as negation and hedging.
Baselines: We consider the following baselines: • BM25 (Robertson et al., 1995): An unsupervised approach that effectively predicts relevance based on term frequency, inverse document frequency, and document length.
• SVMs (Cortes and Vapnik, 1995): We develop TF/IDF weighted ngrams (n=[1-2]) as features for the text and code/concept views, and conduct exhaustive search over hyperparameters for best performance on validation data. Such features were found effective on clinical texts by previous work (Howes et al., 2012;Reuber et al., 2009).
• BERT (Devlin et al., 2019): An attentive bidirectional language model that estimates the relevance between queries and documents by generating contextual representations jointly conditioned on left and right contexts. We use BERT models developed for clinical text (Alsentzer et al., 2019).
• SVMrank (Joachims, 2002): An extension of SVMs to ranking problems which adaptively sorts documents based on their relevance to each query through empirical risk minimization. As features, we use the relevance scores or probability predictions generated by the above baselines, as well as additional features (unigram overlap and IDF-weighted unigram overlap) (Yu et al., 2014), to better establish the relevance between queries and documents.
• PhenoTips (Girdea et al., 2013): This commercial tool is currently used at the UDN to assist diagnostic efforts. It utilizes external sources such as the Human Phenotype Ontology (Köhler et al., 2017) and Orphanet data to rank candidate diseases according to their ontology-based similarity to phenotypic descriptions of patients. PhenoTips employs advanced statistical modeling to differentiate candidate diseases, accounts for disorder frequencies in the general population according to Orphanet, supports negative phenotypes (symptoms that were not observed in the patient), and utilizes both code and text views.
Settings: In (1) and (2), we set the dimension of the shared space between query and document representations to n = 100. In addition, for the CNN fusion model in (2), we use 250 filters and a kernel size of 3. Further details are provided in the supplementary materials.
Evaluation Metrics: We employ Mean Average Precision (MAP), Precision at rank K (P@K), and the Precision-Recall curve, as implemented in trec_eval (https://trec.nist.gov/trec_eval/), to compare competing systems. We use the t-test for significance testing and an asterisk (*) to indicate a significant difference at p = 0.01.
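For reference, P@K and Average Precision (the per-query quantity that MAP averages over queries) can be computed as in this small sketch, consistent with the standard trec_eval definitions:

```python
def precision_at_k(ranked, relevant, k):
    """P@K: fraction of the top-K ranked documents that are relevant."""
    return sum(1 for doc in ranked[:k] if doc in relevant) / k

def average_precision(ranked, relevant):
    """AP for one query: precision at each rank where a relevant document
    appears, averaged over the number of relevant documents."""
    hits, total = 0, 0.0
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / max(len(relevant), 1)

# Toy example: relevant documents at ranks 1 and 3.
ranked, relevant = ["a", "b", "c", "d"], {"a", "c"}
```

Here P@2 = 1/2 and AP = (1/1 + 2/3)/2 = 5/6.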

Experimental Results
We report the performance of single and multiview models separately to ease comparison between views. The overall MAP and P@K, ∀K ∈ {5, 10}, performance of baselines for each view is reported in Table 1. The results show that BERT outperforms the other baselines across almost all measures. We attribute the poor performance of BM25 and SVMs to the considerable difference between the underlying word/concept distributions in the query and document spaces, which cannot be effectively addressed through lexical features (Burgun and Bodenreider, 2001; Pedersen et al., 2007); for example, these models cannot effectively match a query containing "congestive heart failure" to relevant documents containing "cardiac decompensation," "pulmonary edema," and "ischemic cardiomyopathy." In addition, BERT (code view) shows lower performance than BERT (text view). We conjecture that this result could be explained through the following points: (a): BERT is a strong language model and is robust in retrieving noun hypernyms or in completions involving shared category or role reversal (Ettinger, 2020), and (b): replacing medical concepts in the text with their preferred concepts makes the original text less coherent, which can adversely affect the performance of BERT.
Table 2 shows the performance of SVMrank with combined features across views, PhenoTips, and our Attentive Multiview Neural Model (AMNM) with different fusion functions. AMNM combines traditional and recent representation learning techniques by using BERT representations for the text view, and BERT and SVMs representations for the code view. All model combinations except for AMNM_bert-svms (g_conv) lead to significant improvements over the best performing baseline, BERT (text view) in Table 1.
AMNM_bert-bert (g_dot) improves the best baseline by 3.4, 1.7, and 5.8 points in MAP, P@5, and P@10 respectively; the corresponding improvements for AMNM_bert-svms (g_dot) are 2.9, 5.8, and 6.2 points. We note that AMNM_bert-svms (g_dot) leads to considerably higher P@{5,10}, metrics that play a pivotal role in the practical use of search systems. In addition, PhenoTips shows MAP comparable to BERT but considerably lower P@{5,10}. The fusion functions g_dot (dot product) and g_outer (outer product) outperform the more expensive fusion function g_conv (one-dimensional convolution). The lower performance of g_conv could be attributed to average pooling, which assumes that different input dimensions contribute equally to the final representation and relevance; as a result, it may fail to eliminate noisy features or prioritize important ones.

Model Analysis
We discuss how and why AMNM achieves its improved performance through the following experiments; see the supplementary materials for details. Prediction Variance Across Views: The Pearson correlation between the Average Precision of BERT (text view) and BERT (code view) on individual test queries (patients) is 0.87, which indicates little performance variation across views at the query level. In contrast, the corresponding correlation between BERT (text view) and SVMs (code view) is only 0.34. This lack of diversity in the performance of BERT across views, compared to the greater diversity that SVMs brings, could be a source of the improvement in AMNM_bert-svms.
Attention Function: Given the test examples (more than 218K patient-disease pairs), our attention sub-network is expected to assign a higher attentive weight to the view that better estimates the corresponding relevance score. To estimate the accuracy of this sub-network, we separately apply the trained BERT (text view) and SVMs (code view) models to generate their corresponding ranked lists of diseases for test patients. Then, for each relevant patient-disease pair, we evaluate the attention function in AMNM_bert-svms by measuring whether it assigns a higher attentive weight to the better view, i.e. the view that positions the relevant disease at a higher rank than the other view. The results show that (a): our attention sub-network is 57.7% accurate in prioritizing better views, (b): BERT (text view) outperforms SVMs (code view) on 64.7% of relevant patient-disease pairs in terms of relative ranks, and our attention network accurately assigns a higher weight to BERT on 88.6% of these examples, and (c): on the remaining 35.3% of examples, where SVMs (code view) outperforms BERT (text view) in terms of relative ranks, our attention network assigns a higher weight to SVMs on only 0.9% of these examples. Improving this percentage could boost the performance of our model and is the subject of our future work.
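The accuracy measurement described above can be sketched as follows; the tuple layout (the rank each single-view model gave the true disease, plus the two attentive weights, with lower rank = better) is an assumed encoding of the evaluation data, not the paper's actual format.

```python
def attention_accuracy(examples):
    """Fraction of relevant patient-disease pairs on which the attention
    sub-network assigned the higher weight to the view that ranked the true
    disease better. Each example is a tuple (rank_text, rank_code, w_text,
    w_code); ties in rank or weight count as incorrect."""
    correct = sum(
        1 for r_text, r_code, w_text, w_code in examples
        if (r_text < r_code and w_text > w_code)
        or (r_code < r_text and w_code > w_text)
    )
    return correct / len(examples)

# Toy example: the attention is right on the first and third pairs only.
acc = attention_accuracy([(1, 5, 0.7, 0.3), (4, 2, 0.6, 0.4), (3, 9, 0.9, 0.1)])
```

In the toy example the second pair is counted as incorrect because the code view ranks the true disease better yet receives the lower weight, mirroring case (c) above.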

Related Work
The National Institutes of Health established the Undiagnosed Diseases Network (UDN) (Gahl et al., 2015; Ramoni et al., 2017) to facilitate research on undiagnosed and rare diseases. The UDN is a network of 12 clinical sites, and application to the UDN is open to all individuals who complete the application form and submit a referral letter from a health care professional (Kobren et al., 2021). A committee of experts reviews each UDN application in a review session and makes admission decisions. Walley et al. (2018) investigated major factors that may determine application outcomes of the UDN; these factors have been found effective in developing computational models for predicting admission outcomes (Amiri et al., 2021). Dragusin et al. (2013) developed a search engine for rare diseases, named FindZebra, based on information retrieval techniques available in the Indri search engine (Strohman et al., 2005). In addition, previous work has developed experimental setups to evaluate and compare search engines such as Google or Bing in predicting diseases relevant to given phenotypes (Shenker, 2014), employed medical ontologies and information content techniques (Köhler et al., 2009), and leveraged collaborative filtering (Shen et al., 2017) and ensemble techniques (Jia et al., 2018) for this purpose.
Our work departs from previous research by investigating a multiview approach to undiagnosed patients, where we show that effective attention and fusion techniques lead to better pairwise ranking for differential diagnosis.

Conclusion and Future Work
Given electronic health records of patients, we develop an attentive multiview text representation model to assist clinical experts by ranking the most probable and relevant diseases. Accurate and timely diagnosis is especially important for critically ill patients, as it helps specialists distinguish, prioritize, and accelerate treatment for such patients. Our work can be improved by (a): enriching the feature space with patient- and disease-specific information such as patient demographics and clinical synopses of diseases, (b): improving the model's attention mechanism, and (c): tackling differences in word distributions across patients (queries) and diseases (documents).

Ethics and Broader Impact Statement
This investigation included a small cohort of diagnosed patients in the Undiagnosed Diseases Network (UDN). The UDN is a network of 12 clinical sites, and application to the UDN is open to all individuals who complete the application form and submit a referral letter from a health care professional; a committee of experts reviews each UDN application in a review session and makes admission decisions. We included all data with no exclusions during data analysis and manual review, except for cases with missing data or formatting issues. The population therefore reflects the gender, race, ethnicity, age, and health status of the participating patients. In addition, all results have been presented in aggregate, and no attempt has been made to identify individuals or facilities. However, during the course of this research and beyond, there is a potential risk of loss of patient privacy and confidentiality. We have made and will make every effort to protect human subject information and minimize the likelihood of this risk (all authors with access to the data have successfully completed an education program in the protection of human subjects and privacy protection). In addition, our work is transformational in nature, and its broader impacts are first and foremost the potential to improve the well-being of individual patients in society (individuals who often find themselves on a protracted journey from one specialist to another without a diagnosis, even in this era of genomic sequencing), and to support clinicians in their diagnostic efforts.