Automated Disease Normalization with Low Rank Approximations

While machine learning methods for named entity recognition (mention-level detection) have become common, machine learning methods have rarely been applied to normalization (concept-level identification). Recent research introduced a machine learning method for normalization based on pairwise learning to rank. This method, DNorm, uses a linear model to score the similarity between mentions and concept names, and has several desirable properties, including learning term variation directly from training data. In this manuscript we employ a dimensionality reduction technique based on low-rank matrix approximation, similar to latent semantic indexing. We compare the performance of the low rank method to previous work, using disease name normalization in the NCBI Disease Corpus as the test case, and demonstrate increased performance as the matrix rank increases. We further demonstrate a significant reduction in the number of parameters to be learned and discuss the implications of this result in the context of algorithm scalability.


Introduction
The data necessary to answer a wide variety of biomedical research questions is locked away in narrative text. Automating the location (named entity recognition) and identification (normalization) of key biomedical entities (Doğan et al., 2009; Névéol et al., 2011) such as diseases, proteins and chemicals in narrative text may reduce curation costs, enable significantly increased scale and ultimately accelerate biomedical discovery (Wei et al., 2012a).
Named entity recognition (NER) techniques have typically focused on machine learning methods such as conditional random fields (CRFs), which have provided high performance when coupled with a rich feature approach. The utility of NER for biomedical end users is limited, however, since many applications require each mention to be normalized, that is, identified within a specified controlled vocabulary.
The normalization task has been highlighted in the BioCreative challenges (Hirschman et al., 2005; Lu et al., 2011; Morgan et al., 2008), where a variety of methods have been explored for normalizing gene names, including string matching, pattern matching, and heuristic rules. Similar methods have been applied to disease names (Doğan & Lu, 2012b; Kang et al., 2012; Névéol et al., 2009) and species names (Gerner et al., 2010; Wei et al., 2012b), and the MetaMap program is used to locate and identify concepts from the UMLS MetaThesaurus (Aronson, 2001; Bodenreider, 2004).
Machine learning methods for NER have provided high performance, enhanced system adaptability to new entity types, and abstracted many details of specific rule patterns. While machine learning methods for normalization have been explored (Tsuruoka et al., 2007; Wermter et al., 2009), these are far less common. This is partially due to the lack of appropriate training data, and also partially due to the need for a generalizable supporting framework.
Normalization is frequently decomposed into the sub-tasks of candidate generation and disambiguation (Morgan et al., 2008). During candidate generation, the set of concept names is constrained to a set of possible matches using the text of the mention. The primary difficulty addressed in candidate generation is term variation: the need to identify terms which are semantically similar but textually distinct (e.g. "nephropathy" and "kidney disease"). The disambiguation step then differentiates between the candidates to remove false positives, typically using the context of the mention and the article metadata.
Recently, Leaman et al. (2013a) developed an algorithm (DNorm) that directly addresses the term variation problem with machine learning, and used diseases, an important biomedical entity, as the first case study. The algorithm learns a similarity function between mentions and concept names directly from training data using a method based on pairwise learning to rank. The method was shown to provide high performance on the NCBI Disease Corpus (Doğan et al., 2014; Doğan & Lu, 2012a), and was also applied to clinical notes in the ShARe/CLEF eHealth task (Suominen et al., 2013), where it achieved the highest normalization performance out of 17 international teams (Leaman et al., 2013b). The normalization step does not consider context, and therefore must be combined with a disambiguation method for tasks where disambiguation is important. However, this method provides high performance when paired with a conditional random field system for NER, making the combination a step towards fully adaptable mention recognition and normalization systems.
This manuscript adapts DNorm to use a dimensionality reduction technique based on low rank matrix approximation. This may provide several benefits. First, it may increase the scalability of the method, since the number of parameters used by the original technique is proportional to the square of the number of unique tokens. Second, reducing the number of parameters may, in turn, improve the stability of the method and improve its generalization due to the induction of a latent "concept space," similar to latent semantic indexing (Bai et al., 2010). Finally, while the rich feature approach typically used with conditional random fields allows it to partially compensate for out-of-vocabulary effects, DNorm ignores unknown tokens. This reduces the ability of the model to generalize, due to the Zipfian distribution of text (Manning & Schütze, 1999), and is especially problematic in text which contains many misspellings, such as consumer text. Using a richer feature space with DNorm would not be feasible, however, unless the parameter scalability problem is resolved.
In this article we expand the DNorm method in a pilot study on the feasibility of using low rank approximation methods for disease name normalization. To make this work comparable to the previous work on DNorm, we again employed the NCBI Disease Corpus (Doğan et al., 2014). This corpus contains nearly 800 abstracts, split into training, development, and test sets, as described in Table 1. Each disease mention is annotated for span and concept, using the MEDIC vocabulary (Davis et al., 2012), which combines MeSH® (Coletti & Bleich, 2001) and OMIM® (Amberger et al., 2011). The average number of names for each concept in the vocabulary is 5.72. Disease names exhibit relatively low ambiguity, with an average number of concepts per name of 1.01.

Methods
DNorm uses the BANNER NER system (Leaman & Gonzalez, 2008) to locate disease mentions, and then employs a ranking method to normalize each mention found to the disease concepts in the lexicon (Leaman et al., 2013a). Briefly, we define $T$ to be the set of tokens from both the disease mentions in the training data and the concept names in the lexicon. We stem each token in both disease mentions and concept names (Porter, 1980), and then convert each mention and name to a TF-IDF vector of dimensionality $|T|$, where the document frequency for each token is taken to be the number of names in the lexicon containing it. All vectors are normalized to unit length. We define a similarity score between mention vector $m$ and name vector $n$, $\mathrm{score}(m, n)$, and each mention is normalized by iterating through all concept names and returning the disease concept corresponding to the name with the highest score.
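The vector construction and ranking loop described above can be sketched as follows. This is only an illustrative sketch under stated assumptions, not the DNorm implementation: tokens are lowercased and split on whitespace instead of Porter-stemmed, the vocabulary is built from the lexicon names only, the smoothed IDF formula is an assumption, the plain dot product stands in for the learned score, and the function names and concept identifiers are hypothetical.

```python
import math
from collections import Counter


def build_idf(lexicon_names):
    """IDF per token; document frequency = number of lexicon names containing it.

    The smoothed form log((N + 1) / (df + 1)) + 1 is an assumption of this sketch.
    """
    n_names = len(lexicon_names)
    df = Counter()
    for name in lexicon_names:
        df.update(set(name.lower().split()))
    return {tok: math.log((n_names + 1) / (count + 1)) + 1.0
            for tok, count in df.items()}


def tfidf_vector(text, idf):
    """Sparse unit-length TF-IDF vector as {token: weight}; unknown tokens are ignored."""
    tf = Counter(tok for tok in text.lower().split() if tok in idf)
    vec = {tok: count * idf[tok] for tok, count in tf.items()}
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {tok: w / norm for tok, w in vec.items()} if norm > 0 else {}


def dot(a, b):
    """Dot product of two sparse vectors stored as dictionaries."""
    return sum(w * b.get(tok, 0.0) for tok, w in a.items())


def normalize_mention(mention, lexicon, idf, score=dot):
    """Return the concept ID of the lexicon name that scores highest for the mention.

    lexicon maps each concept name to a (hypothetical) concept ID. Name vectors
    are recomputed on every call for brevity; a real system would cache them.
    """
    m = tfidf_vector(mention, idf)
    best_name = max(lexicon, key=lambda name: score(m, tfidf_vector(name, idf)))
    return lexicon[best_name]
```

Passing a learned scoring function as the `score` argument replaces the dot product baseline without changing the ranking loop.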
In previous work, $\mathrm{score}(m, n) = m^{T} W n$, where $W$ is a weight matrix and each entry $w_{ij}$ represents the correlation between token $t_i$ appearing in a mention and token $t_j$ appearing in a concept name from the lexicon. In this work, however, we set $W$ to be a low-rank approximation of the form $W = U^{T} V + I$, where $U$ and $V$ are both $R \times |T|$ matrices, $R$ being the rank (number of linearly independent rows), and $R \ll |T|$ (Bai et al., 2010). For efficiency, the low-rank scoring function can be rewritten and evaluated as $\mathrm{score}(m, n) = (Um)^{T}(Vn) + m^{T} n$, allowing the respective $Um$ and $Vn$ vectors to be calculated once and then reused. This view provides an intuitive explanation of the purpose of the $U$ and $V$ matrices: to convert the sparse, high-dimensional mention and concept name vectors ($m$ and $n$) into dense, low-dimensional vectors (as $Um$ and $Vn$). Under this interpretation, we found that performance improved if each $Um$ and $Vn$ vector was renormalized to unit length.
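To illustrate why the factored form is cheaper to evaluate, the following sketch compares a score computed with an explicit $|T| \times |T|$ matrix against the factored computation. It assumes the low-rank form $W = U^{T} V + I$ of Bai et al. (2010); the function names, the dense list-of-lists representation, and the `renormalize` flag are choices made for this example only.

```python
import math


def matvec(M, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, x)) for row in M]


def dot(x, y):
    return sum(a * b for a, b in zip(x, y))


def unit(x):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(dot(x, x))
    return [a / norm for a in x] if norm > 0 else list(x)


def dense_score(m, n, U, V):
    """m^T (U^T V + I) n with W materialised explicitly: O(|T|^2) memory."""
    size, rank = len(m), len(U)
    W = [[sum(U[r][i] * V[r][j] for r in range(rank)) + (1.0 if i == j else 0.0)
          for j in range(size)] for i in range(size)]
    return dot(m, matvec(W, n))


def low_rank_score(m, n, U, V, renormalize=False):
    """(Um)^T (Vn) + m^T n, never forming W: O(R * |T|) work per vector."""
    um, vn = matvec(U, m), matvec(V, n)
    if renormalize:
        # Optional unit-length renormalisation of the projected vectors.
        um, vn = unit(um), unit(vn)
    return dot(um, vn) + dot(m, n)
```

Without renormalization the two functions agree up to floating-point error; in practice the projected vectors for all concept names would be computed once and cached.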
This model retains many useful properties of the original model, such as the ability to represent both positive and negative correlations between tokens, to represent both synonymy and polysemy, and to allow the token distributions between the mentions and the names to be different. The new model also adds one important additional property: the number of parameters is linear in the number of unique tokens, potentially enabling greater scalability.

Model Training
Given any pair of disease names where one ($n^{+}$) is for $c^{+}$, the correct disease concept for mention $m$, and the other, $n^{-}$, is for $c^{-}$, an incorrect concept, we would like to update the weight matrix $W$ so that $m^{T} W n^{+} > m^{T} W n^{-}$. Following Leaman et al. (2013a), we iterate through each $\langle m, c^{+}, c^{-} \rangle$ tuple, selecting $n^{+}$ and $n^{-}$ as the name for $c^{+}$ and $c^{-}$, respectively, with the highest similarity score to $m$, using stochastic gradient descent to make updates to $W$. With a dense weight matrix $W$, the update rule is: if $m^{T} W n^{+} - m^{T} W n^{-} < 1$, then $W$ is updated as $W \leftarrow W + \lambda (m (n^{+})^{T} - m (n^{-})^{T})$, where $\lambda$ is the learning rate, a parameter controlling the size of the change to $W$. Under the low-rank approximation, the update rules are: if $m^{T} W n^{+} - m^{T} W n^{-} < 1$, then $U$ is updated as $U \leftarrow U + \lambda V (n^{+} - n^{-}) m^{T}$, and $V$ is updated as $V \leftarrow V + \lambda U m (n^{+} - n^{-})^{T}$, noting that the updates are applied simultaneously (Bai et al., 2010). Overfitting is avoided using a holdout set, using the average of the ranks of the correct concept as the performance measurement, as in previous work.
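A single low-rank update step might look like the sketch below. It assumes the margin-based rules of Bai et al. (2010) with $W = U^{T} V + I$, namely $U \leftarrow U + \lambda V (n^{+} - n^{-}) m^{T}$ and $V \leftarrow V + \lambda U m (n^{+} - n^{-})^{T}$ whenever the margin is below 1; the function names and the dense list representation are hypothetical, and the selection of $n^{+}$ and $n^{-}$ is left to the caller.

```python
def matvec(M, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, x)) for row in M]


def dot(x, y):
    return sum(a * b for a, b in zip(x, y))


def margin(U, V, m, n_pos, n_neg):
    """score(m, n+) - score(m, n-) under W = U^T V + I."""
    diff = [p - q for p, q in zip(n_pos, n_neg)]
    return dot(matvec(U, m), matvec(V, diff)) + dot(m, diff)


def sgd_step(U, V, m, n_pos, n_neg, lr):
    """Apply one in-place update to U and V; return True if an update was made."""
    diff = [p - q for p, q in zip(n_pos, n_neg)]
    um = matvec(U, m)          # U m, from the current U
    v_diff = matvec(V, diff)   # V (n+ - n-), from the current V
    if dot(um, v_diff) + dot(m, diff) >= 1.0:
        return False           # margin already satisfied, no update
    # Both projections were computed before any change, so updating U and V
    # in the same loop is equivalent to a simultaneous update.
    for r in range(len(U)):
        for t in range(len(m)):
            U[r][t] += lr * v_diff[r] * m[t]
            V[r][t] += lr * um[r] * diff[t]
    return True
```

If $V$ is initialized from $U$, it must be a separate copy, since both matrices are modified in place.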
We initialize $U$ using values chosen randomly from a normal distribution with mean 0 and standard deviation 1. We found it useful to initialize $V$ as $U$, since this causes the representation for disease mentions and disease names to initially be the same.
We employed an adaptive learning rate using the schedule $\lambda_i = \lambda_0 \frac{\tau}{\tau + i}$, where $i$ is the iteration, $\lambda_0$ is the initial learning rate, and $\tau$ is the discount (Finkel et al., 2008). The initial learning rate we used is much lower than that reported by Leaman et al. (2013a), since we found that higher values caused the training to diverge. We used a discount parameter of $\tau = 5$, so that the learning rate is equal to one half the initial rate after five iterations.
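The halving behavior can be checked with a small sketch of the schedule, assuming the form $\lambda_i = \lambda_0 \tau / (\tau + i)$ used by Finkel et al. (2008). The function name is hypothetical and the initial rate passed in is a placeholder, not the value used in the experiments.

```python
def learning_rate(iteration, initial_rate, discount):
    """Decayed learning rate: initial_rate * discount / (discount + iteration)."""
    return initial_rate * discount / (discount + iteration)


# With a discount of 5, the rate after five iterations is half the initial rate.
# The initial rate of 1.0 below is a placeholder for illustration only.
rates = [learning_rate(i, 1.0, 5) for i in range(6)]
```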

Results
Our results were evaluated at the abstract level, allowing comparison to the previous work on DNorm (Leaman et al., 2013a). This evaluation considers the set of disease concepts found in the abstract, and ignores the exact location(s) where each concept was found. A true positive ($tp$) consists of the system returning a disease concept annotated within the NCBI Disease Corpus, and the number of false negatives ($fn$) and false positives ($fp$) are defined similarly. We calculated the precision, recall and F-measure as follows: $p = \frac{tp}{tp + fp}$, $r = \frac{tp}{tp + fn}$, $f = \frac{2pr}{p + r}$. We list the micro-averaged results in Table 2.
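The abstract-level, micro-averaged evaluation described above can be sketched as follows, assuming the standard definitions of precision, recall and F-measure. The function name, the dictionary-of-sets input format, and the identifiers in the usage example are hypothetical.

```python
def micro_prf(gold, predicted):
    """Micro-averaged precision, recall and F-measure over abstracts.

    gold and predicted map an abstract ID to the set of concept IDs
    annotated in, or returned for, that abstract. Mention positions are
    ignored; counts are pooled over all abstracts before the ratios are taken.
    """
    tp = fp = fn = 0
    for abstract_id in gold.keys() | predicted.keys():
        g = gold.get(abstract_id, set())
        p = predicted.get(abstract_id, set())
        tp += len(g & p)   # concepts both annotated and returned
        fp += len(p - g)   # returned but not annotated
        fn += len(g - p)   # annotated but not returned
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure
```

For example, with `gold = {"a1": {"D1", "D2"}, "a2": {"D3"}}` and `predicted = {"a1": {"D1"}, "a2": {"D3", "D4"}}`, there are two true positives, one false positive and one false negative.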

Discussion
There are two primary trends to note. First, the performance of the low rank models is about 10%-15% lower than the full rank model. Second, there is a clear trend towards higher precision and recall as the rank of the matrix increases. This trend is reinforced in Figure 1, which shows the learning curve for all models. These curves describe the performance on the holdout set after each iteration through the training data, and are measured using the average rank of the correct concept in the holdout set, which is dominated by a small number of difficult cases. Using the low rank approximation, the number of parameters is equal to $2 \times R \times |T|$. Since $R$ is fixed and independent of $|T|$, the number of parameters is now linear in the number of tokens, effectively solving the parameter scalability problem. Table 3 lists the number of parameters for each of the models used in this study.
There are two trade-offs for this improvement in scalability. First, there is a substantial performance reduction, though this might be mitigated somewhat in the future by using a richer feature set, a possibility enabled by the use of the low rank approximation. Second, training and inference times are significantly increased; training the largest low-rank model required approximately 9 days, though the full-rank model trains in under an hour.
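The difference in parameter growth discussed above can be made concrete with a short sketch, assuming a full-rank model with one weight per token pair and a low-rank model with two matrices of $R$ rows each. The function names and the vocabulary sizes in the usage note are hypothetical and are not the values from Table 3.

```python
def full_rank_params(num_tokens):
    """One weight per (mention token, name token) pair: |T|^2."""
    return num_tokens * num_tokens


def low_rank_params(num_tokens, rank):
    """Two rank x |T| matrices: 2 * R * |T|."""
    return 2 * rank * num_tokens


def reduction_factor(num_tokens, rank):
    """How many times fewer parameters the low-rank model needs."""
    return full_rank_params(num_tokens) / low_rank_params(num_tokens, rank)
```

For an illustrative vocabulary of 1,000 tokens and rank 50, the full-rank model has 1,000,000 parameters and the low-rank model 100,000; doubling the vocabulary quadruples the former but only doubles the latter.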
The view that the $U$ and $V$ matrices convert the TF-IDF vectors to a lower dimensional space suggests that the function of $U$ and $V$ is to provide word embeddings or word representations: a vector space where each word vector encodes its relationships with other words. This further suggests that one way to provide higher performance may be to take advantage of unsupervised pre-training (Erhan et al., 2010). Instead of initializing $U$ and $V$ randomly, they could be initialized using a set of word embeddings trained on a large amount of biomedical text, such as with neural network language models (Collobert & Weston, 2008; Mikolov et al., 2013).

Conclusion
We performed a pilot study to determine whether a low rank approximation may increase the scalability of normalization using pairwise learning to rank. We showed that the reduction in the number of parameters is substantial: it is now linear in the number of tokens, rather than proportional to the square of the number of tokens. We further observed that the precision and recall increase as the rank of the matrices is increased.
We believe that further performance increases may be possible through the use of a richer feature set, unsupervised pre-training, or other dimensionality reduction techniques including feature selection or L1 regularization (Tibshirani, 1996). We also intend to apply the method to additional entity types, using recently released corpora such as CRAFT (Bada et al., 2012).