A Neural Transition-based Joint Model for Disease Named Entity Recognition and Normalization

Disease is one of the fundamental entities in biomedical research. Recognizing such entities from biomedical text and then normalizing them to a standardized disease vocabulary offers a tremendous opportunity for many downstream applications. Previous studies have demonstrated that joint modeling of the two sub-tasks achieves superior performance compared to the pipelined counterpart. Although the neural joint model based on the multi-task learning framework has achieved state-of-the-art performance, it suffers from the boundary inconsistency problem due to its separate decoding procedures. Moreover, it ignores the rich information (e.g., the text surface form) of each candidate concept in the vocabulary, which is quite essential for entity normalization. In this work, we propose a neural transition-based joint model to alleviate these two issues. We transform the end-to-end disease recognition and normalization task into an action sequence prediction task, which not only jointly learns the model with shared representations of the input, but also jointly searches the output by state transitions in one search space. Moreover, we introduce attention mechanisms to take advantage of the text surface form of each candidate concept for better normalization performance. Experimental results on two publicly available datasets show the effectiveness of the proposed method.


Introduction
Disease is one of the fundamental entities in biomedical research; thus, it is one of the most searched topics in the biomedical literature (Dogan et al., 2009) and on the internet (Brownstein et al., 2009). Automatically identifying diseases mentioned in a text (e.g., a PubMed article or a health webpage) and then normalizing these identified mentions to their mapping concepts in a standardized disease vocabulary (e.g., with primary name, synonyms, definition, etc.) offers a tremendous opportunity for many downstream applications, such as mining chemical-disease relations from the literature (Wei et al., 2015) and providing more relevant resources for search queries (Dogan et al., 2014). Examples of such disease vocabularies include MeSH (http://www.nlm.nih.gov/mesh/) and OMIM (http://www.ncbi.nlm.nih.gov/omim).
Previous studies (Lou et al., 2017; Zhao et al., 2019) show the effectiveness of joint methods for the end-to-end disease recognition and normalization (aka linking) task in alleviating the error propagation problem of the traditional pipelined solutions (Strubell et al., 2017; Leaman et al., 2013; Xu et al., 2016). Although TaggerOne and the discrete transition-based joint model (Lou et al., 2017) successfully alleviate the error propagation problem, they heavily rely on hand-crafted feature engineering. Recently, Zhao et al. (2019) proposed a neural joint model based on the multi-task learning framework (i.e., MTL-feedback) which significantly outperforms previous discrete joint solutions. MTL-feedback jointly shares the representations of the two sub-tasks (i.e., joint learning with shared representations of the input); however, it suffers from the boundary inconsistency problem due to its separate decoding procedures (i.e., separate search in two different search spaces). Moreover, it ignores the rich information (e.g., the text surface form) of each candidate concept in the vocabulary, which is quite essential for entity normalization.
In this work, we propose a novel neural transition-based joint model named NeuJoRN for disease named entity recognition and normalization, to alleviate these two issues of the multi-task learning based solution (Zhao et al., 2019). We transform the end-to-end disease recognition and normalization task into an action sequence prediction task. More specifically, we introduce four types of actions (i.e., OUT, SHIFT, REDUCE, SEGMENT) for the recognition purpose and one type of action (i.e., LINKING) for the normalization purpose. Our joint model not only jointly learns the model with shared representations, but also jointly searches the output by state transitions in one search space. Moreover, we introduce attention mechanisms to take advantage of the text surface form of each candidate concept for better linking action prediction.
We summarize our contributions as follows.
• We propose a novel neural transition-based joint model, NeuJoRN, for disease named entity recognition and normalization, which not only jointly learns the model with shared representations, but also jointly searches the output by state transitions in one search space.
• We introduce attention mechanisms to take advantage of the text surface form of each candidate concept for better normalization performance.
• We evaluate our proposed model on two public datasets, namely the NCBI and BC5CDR datasets. Extensive experiments show the effectiveness of the proposed model.

Task Definition
We define the end-to-end disease recognition and normalization task as follows. Given a sentence x from a document d (e.g., a PubMed abstract) and a controlled vocabulary KB (e.g., MeSH and OMIM) which consists of a set of disease concepts, the task of end-to-end disease recognition and normalization is to identify all disease mentions M = {m_1, m_2, ..., m_{|M|}} mentioned in x and to link each identified disease mention m_i to its mapping concept in KB.

Neural Transition-based Joint Model
We first introduce the transition system used in the model, and then describe the neural transition-based joint model for this task.

Transition System
We propose a novel transition system, inspired by the arc-eager transition-based shift-reduce parser (Watanabe and Sumita, 2015; Lample et al., 2016), which constructs the output for each given sentence x and controlled vocabulary KB through state transitions with a sequence of actions A. We define a state as a tuple (σ, β, O), which consists of the following three structures:
• stack (σ): the stack is used to store tokens being processed.
• buffer (β): the buffer is used to store tokens to be processed.
• output (O): the output is used to store the recognized and normalized mentions.
We define a start state with the stack σ and the output O both empty, and the buffer β containing all the tokens of a given sentence x. Similarly, we define an end state with the stack σ and buffer β both empty, and the output O storing the recognized and normalized entity mentions. The transition system begins with a start state and ends with an end state. The state transitions are accomplished by a set of transition actions A, which consume the tokens in β and build the output O step by step.
As shown in Table 1, we define 5 types of transition actions for state transitions, and their logics are summarized as follows:
• OUT pops the first token β_0 from the buffer, which indicates that this token does not belong to any entity mention.
• SHIFT moves the first token β_0 from the buffer to the stack, which indicates that this token is part of an entity mention.
• REDUCE pops the top two tokens (or spans) σ_0 and σ_1 from the stack and concatenates them as a new span, which is then pushed back onto the stack.
• SEGMENT-t pops the top token (or span) σ_0 from the stack and creates a new entity mention σ_0^t with entity type t, which is then added to the output.
• LINKING-c links the previously recognized but unnormalized mention σ_0^t in the output to its mapping concept with id c and updates the mention to σ_0^{t,c}.

Table 2 shows an example of state transitions for the recognition and normalization of disease mentions given the sentence "Most colon cancers arise from mutations" and the controlled vocabulary MeSH. State 0 is the start state, where φ denotes that the stack σ and output O are initially empty, and the buffer β is initialized with all the tokens of the given sentence. State 9 is the end state, where φ denotes that the stack σ and buffer β are finally empty, and (colon cancers)_{disease, D003110} in the output O denotes that the mention "colon cancers" is a disease mention normalized to the concept with id D003110 in MeSH. More specifically, state 5 creates a new disease mention (colon cancers)_{disease} and adds it to the output. State 6 links the previously recognized but unnormalized disease mention in the output to its mapping concept with id D003110 in MeSH.
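To make the action semantics concrete, the walk-through above can be simulated with a minimal sketch. The action sequence and the concept id follow the paper's example; the implementation details are illustrative, not the authors' code.

```python
def run_transitions(tokens, actions):
    """Apply a sequence of actions to the start state (empty stack and
    output, full buffer) and return the final output list."""
    stack, buffer, output = [], list(tokens), []
    for act in actions:
        if act == "OUT":                      # token belongs to no mention
            buffer.pop(0)
        elif act == "SHIFT":                  # token is part of a mention
            stack.append(buffer.pop(0))
        elif act == "REDUCE":                 # merge top two spans on the stack
            s0, s1 = stack.pop(), stack.pop()
            stack.append(s1 + " " + s0)
        elif act.startswith("SEGMENT-"):      # finish a mention with type t
            t = act.split("-", 1)[1]
            output.append({"mention": stack.pop(), "type": t, "id": None})
        elif act.startswith("LINKING-"):      # normalize the last mention to c
            output[-1]["id"] = act.split("-", 1)[1]
    assert not stack and not buffer           # end state: both must be empty
    return output

tokens = "Most colon cancers arise from mutations".split()
actions = ["OUT", "SHIFT", "SHIFT", "REDUCE", "SEGMENT-disease",
           "LINKING-D003110", "OUT", "OUT", "OUT"]
result = run_transitions(tokens, actions)
```

The nine actions correspond to states 1 through 9 in Table 2, ending with the single recognized and normalized mention "colon cancers".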

Action Sequence Prediction
Based on the introduced transition system, the end-to-end disease recognition and normalization task becomes a new sequence-to-sequence task, i.e., the action sequence prediction task. The input is a sequence of words x_1^n = (w_1, w_2, ..., w_n) and a controlled vocabulary KB, and the output is a sequence of actions A_1^m = (a_1, a_2, ..., a_m). The goal of the task is to find the most probable output action sequence A* given the input word sequence x_1^n and KB, that is:

A* = argmax_A P(A | x_1^n, KB).    (1)

Formally, at each step t, the model predicts the next action based on the current state S_t and the action history A_1^{t-1}. Thus, the task is modeled as:

P(A | x_1^n, KB) = ∏_{t=1}^{m} p(a_t | S_t, A_1^{t-1}),    (2)

where a_t is the generated action at step t, and S_{t+1} is the new state obtained by applying a_t to S_t.
Let r_t denote the representation for computing the probability of the action a_t at step t; thus:

p(a_t | S_t, A_1^{t-1}) = exp(w_{a_t}^T r_t + b_{a_t}) / ∑_{a' ∈ A(S_t)} exp(w_{a'}^T r_t + b_{a'}),    (3)

where w_a and b_a denote the learnable parameter vector and bias term, respectively, and A(S_t) denotes the set of valid actions that may be taken given the current state S_t. Finally, the overall optimization function of the action sequence prediction task can be written as the negative log-likelihood of the gold action sequence:

L = −∑_{t=1}^{m} log p(a_t | S_t, A_1^{t-1}).    (4)
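The action probability amounts to a softmax restricted to the valid actions A(S_t). A minimal sketch, where the 2-dimensional state representation and the per-action weights are hypothetical toy values:

```python
import math

def action_probs(r_t, weights, biases, valid_actions):
    """p(a) ∝ exp(w_a · r_t + b_a) for a in A(S_t); actions outside
    A(S_t) receive zero probability (they are never scored)."""
    scores = {a: sum(w * x for w, x in zip(weights[a], r_t)) + biases[a]
              for a in valid_actions}
    m = max(scores.values())                  # subtract max for stability
    exps = {a: math.exp(s - m) for a, s in scores.items()}
    z = sum(exps.values())
    return {a: e / z for a, e in exps.items()}

# hypothetical toy parameters for three action types
weights = {"SHIFT": [1.0, 0.0], "OUT": [0.0, 1.0], "REDUCE": [0.5, 0.5]}
biases = {"SHIFT": 0.0, "OUT": 0.0, "REDUCE": 0.0}
# REDUCE is invalid in this toy state, so it is excluded from the softmax
p = action_probs([2.0, 1.0], weights, biases, valid_actions=["SHIFT", "OUT"])
```

Masking invalid actions out of the normalization, rather than zeroing them afterwards, keeps the probabilities over A(S_t) summing to one.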

Dense Representations
We now introduce neural networks to learn the dense representations of an input sentence x and each state in the whole transition process to predict the next action.
Input Representation We represent each word x_i in a sentence x by concatenating its character-level word representation, non-contextual word representation, and contextual word representation:

v_i = [v_i^{char}; v_i^{w}; ELMo_i],    (5)

where v_i^{char} denotes the character-level word representation learned with a CNN network (Ma and Hovy, 2016), v_i^{w} denotes the non-contextual word representation initialized with GloVe (Pennington et al., 2014) embeddings, pre-trained on 6 billion tokens from Wikipedia and Gigaword, and ELMo_i denotes the contextual word representation initialized with ELMo (Peters et al., 2018). We could also derive contextual word representations from BERT (Devlin et al., 2018) by averaging the embeddings of the subwords of each word; we leave this for future work.
We then run a BiLSTM (Graves et al., 2013) to derive the contextual representation of each word in the sentence x.
The buffer β_t is represented with a BiLSTM (Graves et al., 2013) over the words remaining in the buffer:

(b_t^0, b_t^1, ...) = BiLSTM(β_t).    (6)

The stack σ_t and the action history A_t are represented with StackLSTMs (Dyer et al., 2015):

(s_t^0, s_t^1, ...) = StackLSTM(σ_t),  (..., a_t^{-1}) = StackLSTM(A_t).    (7)

We classify all the actions defined in Table 1 into two categories corresponding to their two different purposes, i.e., recognition and normalization. OUT, SHIFT, REDUCE, and SEGMENT-t serve the recognition purpose, and LINKING-c serves the normalization purpose. As shown in Figures 1(a) and 1(b), we define two different state representations for predicting the actions of the two purposes.
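The key property of a StackLSTM is that popping an element also reverts the encoder to its previous hidden state, so the summary always reflects exactly the current stack contents. A toy sketch of this stack-structured encoding, where the simple recurrence is a stand-in for a real LSTM cell, not the model's actual update:

```python
class StackRNN:
    """Minimal stack-structured RNN: push advances the recurrence,
    pop reverts it to the previous hidden state."""

    def __init__(self, dim):
        self.states = [[0.0] * dim]           # history of hidden states

    def push(self, x):
        h = self.states[-1]
        # toy recurrence standing in for an LSTM cell update
        new_h = [0.5 * hi + 0.5 * xi for hi, xi in zip(h, x)]
        self.states.append(new_h)

    def pop(self):
        self.states.pop()                     # revert to the previous summary

    def summary(self):
        return self.states[-1]                # encodes current stack contents

s = StackRNN(dim=2)
s.push([1.0, 0.0])
s.push([0.0, 1.0])
s.pop()                                       # summary reverts after popping
```

After the pop, the summary is identical to what it was after the first push, which is what lets REDUCE and SEGMENT actions restructure the stack without re-encoding it from scratch.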
Specifically, for predicting the recognition actions, we represent the state as

r_t = ReLU(W[s_t^0; s_t^1; b_t^0; a_t^{-1}] + d),    (8)

where ReLU is an activation function, W and d denote the learnable parameter matrix and bias term, respectively, and
• s_t^0 and s_t^1 denote the representations of the first and second elements of the stack σ.
• b_t^0 denotes the representation of the first element of the buffer β.
• a_t^{-1} denotes the representation of the last action in the action history A.
For predicting the normalization actions, we represent the state as

r_t = ReLU(W[l_m; r_m; m; c; a_t^{-1}] + d),    (9)

where ReLU is an activation function, W and d denote the learnable parameter matrix and bias term, respectively, and
• l_m and r_m denote the left-side and right-side context representations of the mention, obtained by (i) first applying attention with the concept representation c to highlight the relevant parts of the mention's local context, and (ii) then applying a max-pooling operation to aggregate the reweighted representations of all the context words.
• m and c are the representations of the mention and the candidate concept, obtained by applying a co-attention mechanism (Tay et al., 2018; Jia et al., 2020).
• c denotes the candidate concept representation, obtained by (i) first running a BiLSTM (Graves et al., 2013) to derive the contextual representation of each word in the candidate concept name, and (ii) then applying a max-pooling operation to aggregate the representations of all concept words.
• a_t^{-1} denotes the representation of the last action in the action history A.
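The attention-then-max-pooling construction of l_m and r_m can be sketched as follows, using plain dot-product attention over toy 2-dimensional vectors; the real model uses learned contextual representations and projections.

```python
import math

def attend_and_pool(context, concept):
    """Re-weight context word vectors by their softmax-normalized
    dot-product similarity to the concept vector, then element-wise
    max-pool over the re-weighted vectors."""
    scores = [sum(w * c for w, c in zip(vec, concept)) for vec in context]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    weighted = [[a / z * x for x in vec] for a, vec in zip(exps, context)]
    return [max(col) for col in zip(*weighted)]   # max over words, per dim

left_ctx = [[1.0, 0.0], [0.0, 2.0]]   # hypothetical context word vectors
concept = [1.0, 0.0]                   # hypothetical concept vector
l_m = attend_and_pool(left_ctx, concept)
```

The attention step lets the concept highlight which context words matter before pooling, so two candidate concepts yield different context summaries for the same mention.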

Search and Training
Decoding, which searches for the best output structure (i.e., action sequence) under the current model parameters, is the key step in both training and testing. In this work, we use two different search strategies with different optimizations.
Greedy Search For efficient decoding, a widely-used greedy search algorithm can be adopted to minimize the negative log-likelihood of the local action classifier in Equations (3), (8), and (9).

Beam Search
The main drawback of greedy search is error propagation. An incorrect action causes the following actions to fail, leading to an incorrect output sequence. One solution to alleviate this problem is to apply beam search. In this work, we use the Beam-Search Optimization (BSO) method with the LaSO update (Wiseman and Rush, 2016) to train our beam-search model, where the max-margin loss is adopted.
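A generic beam search over action sequences can be sketched as follows. The toy scoring tree is hypothetical; it merely illustrates how a larger beam can recover a sequence that greedy search misses.

```python
import math

def beam_search(init_state, step_fn, is_final, beam_size):
    """Keep the beam_size highest-scoring partial action sequences at
    each step; scores are summed log-probabilities."""
    beam = [(0.0, [], init_state)]
    while not all(is_final(s) for _, _, s in beam):
        candidates = []
        for score, actions, state in beam:
            if is_final(state):               # finished hypotheses carry over
                candidates.append((score, actions, state))
                continue
            for act, logp, nxt in step_fn(state):
                candidates.append((score + logp, actions + [act], nxt))
        beam = sorted(candidates, key=lambda c: -c[0])[:beam_size]
    return beam[0][1]                         # best-scoring action sequence

# toy scoring tree (hypothetical probabilities): greedy commits to "A"
# first, but the globally best two-action sequence starts with "B"
TREE = {"":  [("A", 0.6), ("B", 0.4)],
        "A": [("x", 0.5), ("y", 0.5)],
        "B": [("x", 0.9), ("y", 0.1)]}

def step_fn(state):
    return [(a, math.log(p), state + a) for a, p in TREE[state]]

def is_final(state):
    return len(state) == 2

best = beam_search("", step_fn, is_final, beam_size=2)
```

With beam size 1 the search behaves exactly like greedy decoding, which is the error-propagation failure mode described above.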

Datasets
We use two publicly available datasets in this study, namely NCBI, the NCBI disease corpus (Dogan et al., 2014), and BC5CDR, the BioCreative V CDR task corpus (Li et al., 2016b). The NCBI dataset contains 792 PubMed abstracts, split into 692 abstracts for training and development and 100 abstracts for testing. Each disorder mention in each PubMed abstract was manually annotated with its mapping concept identifier in the MEDIC lexicon. The BC5CDR dataset contains 1,500 PubMed abstracts, equally split into three parts for training, development, and testing, respectively. Each disease mention in each abstract is manually annotated with the concept identifier in a controlled vocabulary to which it refers. In this study, we use the July 6, 2012 version of MEDIC, which contains 7,827 MeSH identifiers and 4,004 OMIM identifiers, grouped into 9,664 disease concepts. Table 3 shows the overall statistics of the two datasets.
To facilitate the generation of candidate linking actions, we perform some preprocessing steps on each candidate mention and each concept in KB with the following strategies: (i) Spelling Correction: for each candidate mention in the datasets, we replace all misspelled words using a spelling check list as in previous work (D'Souza and Ng, 2015). (ii) Abbreviation Resolution: we use the Ab3P (Sohn et al., 2008) toolkit to detect and replace abbreviations with their long forms within each document, and also expand all possible abbreviated disease mentions using a dictionary collected from Wikipedia as in previous work (D'Souza and Ng, 2015). (iii) Numeric Synonym Resolution: we replace all numerical words in the mentions and concepts with their corresponding Arabic numerals as in previous work (D'Souza and Ng, 2015). We generate candidate linking actions (i.e., candidate concepts) for each mention with the commonly used information retrieval based method, which includes the following two steps. We first index all the concept names and training mentions with their concept ids. Then, the widely-used BM25 model provided by Lucene is employed to retrieve the top 10 candidate concepts {c_i}_{i=1}^{10} for each mention m.
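The retrieval step can be approximated with a self-contained BM25 scorer; this is a stand-in for Lucene's implementation, and the mini-vocabulary of concept names below is illustrative.

```python
import math
from collections import Counter

def bm25_rank(query, names, k1=1.2, b=0.75, top_k=10):
    """Score each concept name against the mention with BM25 and
    return the top_k highest-scoring names."""
    docs = [n.lower().split() for n in names]
    N = len(docs)
    avgdl = sum(len(d) for d in docs) / N
    df = Counter(t for d in docs for t in set(d))   # document frequencies
    scored = []
    for name, d in zip(names, docs):
        tf = Counter(d)
        score = 0.0
        for t in query.lower().split():
            if t not in tf:
                continue
            idf = math.log(1 + (N - df[t] + 0.5) / (df[t] + 0.5))
            norm = tf[t] + k1 * (1 - b + b * len(d) / avgdl)
            score += idf * tf[t] * (k1 + 1) / norm
        scored.append((score, name))
    return [n for _, n in sorted(scored, key=lambda x: -x[0])[:top_k]]

# hypothetical mini-vocabulary of concept names
concepts = ["colonic neoplasms", "breast neoplasms", "colonic diseases"]
candidates = bm25_rank("colonic neoplasms", concepts, top_k=2)
```

The ranked candidates then define the set of LINKING-c actions considered for the mention.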

Evaluation Metrics and Settings
Following previous work (Lou et al., 2017; Zhao et al., 2019), we utilize the evaluation kit 1 to evaluate model performance. We report the F1 score for the recognition task at the mention level, and the F1 score for the normalization task at the abstract level.
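Both metrics reduce to a set-based F1 between gold and predicted items. A minimal sketch; the tuple formats for mention-level and abstract-level items are assumptions, not the evaluation kit's exact format:

```python
def f1(gold, pred):
    """Micro F1 over two sets of items. For mention-level evaluation an
    item could be (doc_id, start, end, type); for abstract-level
    normalization it could be a (doc_id, concept_id) pair."""
    tp = len(gold & pred)                    # items matched exactly
    if tp == 0:
        return 0.0
    precision = tp / len(pred)
    recall = tp / len(gold)
    return 2 * precision * recall / (precision + recall)

# toy example: one of two gold mentions is predicted correctly
gold = {("d1", 0, 2, "disease"), ("d1", 5, 6, "disease")}
pred = {("d1", 0, 2, "disease"), ("d1", 7, 8, "disease")}
score = f1(gold, pred)   # precision = recall = 0.5, so F1 = 0.5
```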
We use the AdamW optimizer (Loshchilov and Hutter, 2019) for parameter optimization. Most of the model hyper-parameters are listed in Table 4. Since increasing the beam size increases the decoding time, we only report results with beam sizes 1, 2, and 4. Table 5 shows the overall comparisons of different models for the end-to-end disease named entity recognition and normalization task. The first part shows the performance of different pipelined methods for the task. DNorm (Leaman et al., 2013) is a traditional method, which needs feature engineering. IDCNN (Strubell et al., 2017) is a neural model based on BiLSTM-CRF, which requires little feature engineering effort. The second part shows the performance of different joint models for the task. TaggerOne (Leaman et al., 2013) is a joint solution based on semi-CRF. The Transition-based Model (Lou et al., 2017) is a joint solution based on a discrete transition-based method. Both of these models rely heavily on feature engineering. MTL-feedback (Zhao et al., 2019) is a neural joint solution based on multi-task learning. NeuJoRN is our neural transition-based joint model for the whole task.

Main results
From the comparisons, we find that (1) IDCNN does not perform well enough, although it requires little feature engineering effort. (2) All the joint models significantly outperform the pipelined methods. (3) The deep learning based joint models significantly outperform the traditional machine learning based methods. (4) Our proposed NeuJoRN outperforms MTL-feedback by at least 0.57% and 0.59% on the recognition and normalization tasks, respectively.

Effectiveness of different search strategies

Table 6 shows the comparisons of different search strategies for our proposed NeuJoRN. From the results, we find that (1) The methods based on beam search outperform the greedy search strategy, which indicates that beam search alleviates the error propagation problem of the greedy search solution. (2) The model with beam size 4 achieves the best performance: the larger the beam size, the better the performance, but the lower the decoding speed. (3) Our greedy search based solution does not outperform the MTL-feedback method.

Effectiveness of attention mechanisms

Table 7 shows the effectiveness of the proposed attention mechanisms. When we remove the attention mechanism for representing the left-side and right-side local context, the performance drops slightly. However, when we remove the co-attention mechanism, which directly models the matching between the mention and the candidate concept, the performance drops significantly. This group of comparisons indicates the importance of modeling the matching between the mention and the candidate concept for the entity normalization task.

Related Work

Disease Named Entity Recognition DNER has been widely studied in the literature. Early studies (Leaman et al., 2013; Xu et al., 2015, 2016) transform this task into a sequence labeling task, and conditional random fields (CRF) based methods are widely adopted to achieve good performance. However, these methods heavily rely on hand-crafted feature engineering. Recently, neural models such as BiLSTM-CRF based methods (Strubell et al., 2017) and BERT-based methods (Kim et al., 2019) have achieved state-of-the-art performance.
Joint DNER and DNEN Several studies (Lou et al., 2017; Zhao et al., 2019) show the effectiveness of joint methods in alleviating the error propagation problem. Although TaggerOne and the discrete transition-based joint model (Lou et al., 2017) successfully alleviated the error propagation problem, they heavily rely on hand-crafted feature engineering. Recently, Zhao et al. (2019) proposed a neural joint model based on the multi-task learning framework (i.e., MTL-feedback) which significantly outperforms previous discrete joint solutions. However, their method suffers from the boundary inconsistency problem due to the separate decoding procedures (i.e., separate search in two different search spaces). Moreover, it ignores the rich information (e.g., the text surface form) of each candidate concept in the vocabulary, which is quite essential for entity normalization. In this work, we propose a neural joint model to alleviate these two issues.
Transition-based Models Transition-based models are widely used in parsing and translation (Watanabe and Sumita, 2015; Wang et al., 2018; Meng and Zhang, 2019). Recently, these models have been successfully applied to information extraction tasks, such as joint POS tagging and dependency parsing (Yang et al., 2018) and joint entity and relation extraction (Li and Ji, 2014; Li et al., 2016a; Ji et al., 2021). Several studies propose discrete transition-based joint models for entity recognition and normalization (Qian et al., 2015; Ji et al., 2016; Lou et al., 2017). In this work, we propose a neural transition-based joint model for disease named entity recognition and normalization.

Conclusions
In this work, we proposed a novel neural transition-based joint model for disease named entity recognition and normalization. Experimental results on two publicly available datasets show the effectiveness of the proposed method. In the future, we will apply this joint model to more types of data, such as clinical notes, drug labels, and tweets.