An End-to-End Progressive Multi-Task Learning Framework for Medical Named Entity Recognition and Normalization

Medical named entity recognition (NER) and normalization (NEN) are fundamental for constructing knowledge graphs and building QA systems. Existing implementations for medical NER and NEN suffer from error propagation between the two tasks: mispredicted mentions from NER directly influence the results of NEN, so the NER module becomes the bottleneck of the whole system. Besides, features learnable by both tasks are beneficial to model performance. To avoid the disadvantages of existing models and exploit the generalized representation across the two tasks, we design an end-to-end progressive multi-task learning model that jointly models medical NER and NEN in an effective way. The framework contains three tasks of progressive difficulty. The progressive tasks reduce error propagation through incremental task settings, which means the lower-level tasks gain supervised signals, rather than errors, from the higher-level tasks to improve their performance. Besides, context features are exploited to enrich the semantic information of entity mentions extracted by NER, and the performance of NEN profits from the enhanced entity mention features. Standard entities from knowledge bases are introduced into the NER module to extract the corresponding entity mentions correctly. Empirical results on two publicly available medical literature datasets demonstrate the superiority of our method over nine typical methods.

Introduction
To dig into the large amount of electronic medical records, there has been increasing interest in applying information extraction to them. These techniques can generate tremendous benefit for corresponding research and applications, such as medical knowledge graphs (Wu et al., 2019) and QA systems (Lamurias and Couto, 2019). Among medical text mining tasks, medical named entity recognition and normalization are the most fundamental. Named entity recognition tries to find the boundaries of mentions in medical texts, and named entity normalization maps the mentions extracted from the medical text to standard identifiers, such as MeSH and OMIM. The initial pipeline implementations for medical NER and NEN have a main limitation: erroneous extractions from NER cascade into NEN, which results in normalization errors. Besides, the mutual benefit between recognition and normalization is not utilized in pipeline models. To alleviate these limitations and achieve higher performance, some researchers focused on jointly modeling the two tasks. A joint scoring function for medical NER and NEN was proposed, and Lou et al. (2017) cast the output construction process of the two tasks as a state transition process to perform medical named entity recognition and normalization. To capture the semantic features of the two tasks, a multi-task learning framework with an explicit feedback strategy for medical NER and NEN was proposed.
As shown in Figure 1, there are two common frameworks: the pipeline and the parallel multi-task framework. The former is formulated to maximize the posterior probabilities p(y_NER | x) and p(y_NEN | m, e), where x is the medical text, m is a medical mention extracted by a recognition model, e is the standard entity, and y_NER and y_NEN are the labels. The latter tries to maximize the posterior probability p(y_NER, y_NEN | x). Both of these struggle with the same bottleneck: named entity recognition. In the above frameworks, the NER module is trained to memorize the medical mentions in the training set. However, medical mentions are various, and there is a gap between the training and test sets. Naturally, mentions unseen in the training set are hard to recognize during the testing phase. Therefore, the conventional frameworks do not achieve ideal generalization ability.
To overcome the disadvantage mentioned above, we reconsider the process of medical named entity recognition and normalization. The ultimate goal is to map the extracted medical mentions to the standard entity base. Therefore, the target standard entity base can be regarded as a dictionary, and the initial process of NEN and NER can be recast as detecting whether the medical text contains a candidate standard entity and finding the mentions that should be replaced. Based on this idea, we propose an end-to-end progressive multi-task learning framework for medical named entity recognition and normalization (E2EMERN; when ready, the code will be published at https://github.com/zhoubaohang/E2EMERN). Compared with ordinary multi-task learning, progressive multi-task learning focuses on the aggregation logic of tasks' specific features (Hong et al., 2020). A difficult target is divided into a few tasks that are interconnected through the combination of features. To take full advantage of the data attributes, we propose a framework including three tasks of progressive difficulty extended from the conventional NER and NEN tasks. The low-level task is traditional NER, which tries to extract all entities in the medical text. The mid-level task is defined to identify whether there exist medical mentions in the text that should be mapped to the candidate standard entity. The high-level task combines the first two tasks and targets extracting the mentions that should be mapped to the candidate standard entity.
Unlike the existing frameworks, E2EMERN exploits the progressive tasks to learn fine-grained representations. The mid-level and high-level tasks facilitate the framework in learning the corresponding features between medical mentions and standard entities. The low-level task gains supervised signals from the higher-level tasks so as to extract medical mentions corresponding to standard entities in the knowledge bases more accurately. Our contributions in this manuscript can be summarized as follows: 1. We reconsider the process of the NER and NEN tasks, and are the first to propose exploiting three tasks of progressive difficulty to train an end-to-end medical named entity recognition and normalization framework.
2. The experimental results on two medical benchmarks demonstrate that our framework outperforms existing medical named entity recognition and normalization models. We also conduct a detailed analysis of the framework to demonstrate its superiority.
Related Work

Medical Named Entity Recognition and Normalization
Medical named entity recognition and normalization are two basic tasks in medical text mining. The conventional pipeline frameworks contain separate NER and NEN models (Vázquez et al., 2008; Sahu and Anand, 2016; Zhou et al., 2020): NER models extract medical mentions in texts, and then NEN models map these mentions to standard entity identifiers. To reduce the error propagation in pipeline frameworks, some researchers proposed to model NER and NEN jointly. Leaman et al. (2015) combined two traditional machine learning models into an ensemble NER and NEN model, and a semi-Markov based model was proposed to learn the joint probability distribution of the NER and NEN tasks. However, traditional methods depend on human-based feature engineering. With the development of deep learning, recurrent neural networks (RNN) have replaced human effort and been utilized to extract features from raw texts. An RNN-based network architecture with a feedback strategy was designed to model the two tasks jointly. Recently, pre-trained models, such as BERT (Devlin et al., 2019) and BioBERT (Lee et al., 2020), have made impressive progress in the natural language processing (NLP) area. Xiong et al. (2020) used BERT as the base module and proposed a machine reading comprehension framework to solve the NER and NEN problems jointly.

Sequence Labeling
Named entity recognition can be regarded as a sequence labeling problem. Sequence labeling has been explored extensively as a basic task in NLP.
Probabilistic graphical models, such as the hidden Markov model (Xiao et al., 2005) and conditional random fields (CRF) (Lafferty et al., 2001), are the typical methods for this problem. With deep learning modules gradually replacing manual feature engineering, the long short-term memory (LSTM) (Hochreiter and Schmidhuber, 1997) network stacked with a CRF (Xu et al., 2008) has become a benchmark model for sequence labeling (Lample et al., 2016). Some researchers utilized multi-task learning to model relevant NLP tasks and gained better performance on these tasks, including sequence labeling (Aguilar et al., 2017; Cao et al., 2018). Besides, the attributes of the data themselves have been used to design multi-task learning models. Considering whether sentences contain entities, a multi-task learning model was proposed to predict whether the input data contain entities and then extract the corresponding entities. Kruengkrai et al. (2020) exploited sentence-level labels and token-level labels to propose a joint model supporting multi-class classification.

Short Text Matching
Named entity normalization can be formulated as a short text matching problem. Information retrieval methods, such as BM25 (Robertson et al., 1994), are universal models for this problem. With the development of neural language models, text semantics have been exploited to model the similarity between two short texts. Distributed representations of texts, such as Word2Vec (Mikolov et al., 2013) and GloVe (Pennington et al., 2014), are utilized to calculate the similarity distance between two texts, and some medical named entity normalization models are based on this method (Zhou et al., 2020). Considering that local texts are more important than global ones, some researchers utilized convolutional neural networks (CNN) to extract local features and exploited an interactive attention mechanism to match the semantic similarity of two texts (Yin et al., 2016).

Methodology
We introduce the notation for NER and NEN before getting into the details of the framework.
We denote D = {(X_i, y_i)}_{i=1}^{N_s} as a training set with N_s samples, where X_i is the medical text and y_i is the NER label. Given a sentence with N_w words, the medical text can be formulated as X = {x_1, x_2, ..., x_{N_w}} and the NER label is y = {y_1, y_2, ..., y_{N_w}}. To solve the NER task, we try to maximize the posterior probability p(y | X). According to the NER label, we can extract the medical mentions {m_i}_{i=1}^{N_m} from the medical text, where N_m is the number of mentions. For the NEN task, we need to map each mention m to a standard entity e in the entity base B. We formulate the objective of the NEN task as a posterior probability p(e | m, B), where e is the standard entity to which the mention m should be mapped.

Progressive Tasks
With the help of NER and NEN, we can map medical mentions in raw texts to the corresponding standard entities. Traditional pipeline implementations of the two tasks are composed of individual NER and NEN models. The simple partitioning into two models leads to error propagation between them. Considering the correlation between the two tasks, a parallel task framework was proposed to improve model performance. However, the intuitive feedback strategy applied to the output layers of the two tasks is not conducive to modeling the fine-grained features between them. The above implementations lack consideration of the learning process. The process of human learning often goes from easy to difficult (Xu et al., 2020). Especially for correlated tasks, humans can dig out the hidden knowledge from the easy tasks to complete the hard ones. Based on this idea, we reconsider the process of the conventional NER and NEN tasks and propose three correlated tasks of progressive difficulty. As shown in Figure 2, we take a medical text from the real dataset NCBI (Dogan et al., 2014)

Figure 2: The end-to-end progressive multi-task learning framework for medical named entity recognition and normalization. The left part shows the implementation details of the framework. The right part is a real example describing the three progressive tasks.
as an example to describe the tasks. The medical text is "Familial Mediterranean fever is a recessive disorder", and its corresponding NER label is "B-Disease I-Disease I-Disease O O B-Disease I-Disease". Among the tokens, the medical mentions "Familial Mediterranean fever" and "recessive disorder" are mapped to the standard entity identifiers "D010505" and "D030342" respectively. The low-level task is defined to memorize all medical mentions seen in the training set. Given the medical text mentioned above, this task needs to predict the NER label and extract the mentions "Familial Mediterranean fever" and "recessive disorder". Similar to the process of humans learning vocabulary, the low-level task forces the framework to learn the medical mentions indiscriminately. However, the final target is to map mentions to standard entities; we should continue to bridge the gap between medical mentions in raw texts and standard entities in the database.
The mid-level task aims to determine whether a medical text implicitly contains the query standard entity. With the above medical text and the standard entity "D010505" as input, this task should infer that the text contains this entity. Through this task, the framework establishes a coarse-grained relationship between the mentions with their contexts and the query standard entities. However, the mentions do not correspond one-to-one with the query standard entities, because there can be more than one mention in the raw text that should be extracted and mapped to corresponding standard entities. We need to specify which mention in the text should be mapped to the input standard entity.
The high-level task is proposed to extract the mentions that should be mapped to the query standard entity. Given the above medical text and the standard entity "D030342", this task should extract the mention "recessive disorder". If the input text contains no mention that should be mapped to the query entity, the output of this task is empty. The effect of this task is the same as that of NEN, but it is harder than NEN. To accomplish the high-level task, we need to build on the first two tasks. The low-level task provides the representations of medical mentions with their contexts, which is beneficial to locating them in raw texts. The mid-level task forces the model to learn the correlated features between mentions and standard entities. With the help of the two pre-tasks, the high-level task can be accomplished in an effective way.
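To make the three targets concrete, the label construction described above can be sketched in a few lines of Python (a minimal illustration under our own naming; `progressive_labels` is a hypothetical helper, not part of the released code):

```python
def progressive_labels(tokens, ner_labels, mention_to_entity, query_entity):
    """Build the low-, mid-, and high-level targets for one (sentence, entity) pair.

    tokens            : list of words in the sentence
    ner_labels        : BIO tags; the low-level target itself
    mention_to_entity : dict mapping mention text -> standard entity identifier
    query_entity      : the candidate standard entity identifier, e.g. "D030342"
    """
    low = list(ner_labels)  # low-level task: plain NER tags

    # Collect (start, end) mention spans from the BIO tags.
    spans, start = [], None
    for i, tag in enumerate(ner_labels + ["O"]):
        if tag.startswith("B-"):
            if start is not None:
                spans.append((start, i))
            start = i
        elif tag == "O" and start is not None:
            spans.append((start, i))
            start = None

    # High-level task: keep only the tags of spans mapped to the query entity.
    high = ["O"] * len(tokens)
    for s, e in spans:
        mention = " ".join(tokens[s:e])
        if mention_to_entity.get(mention) == query_entity:
            high[s:e] = ner_labels[s:e]

    # Mid-level task: does the text contain any mention of the query entity?
    mid = int(any(tag != "O" for tag in high))
    return low, mid, high
```

For the example sentence with query entity "D030342", the sketch keeps the full NER tags as the low-level target, returns 1 for the mid-level target, and keeps only the tags covering "recessive disorder" in the high-level target.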

Implementation Details
We build on the progressive tasks to implement the framework E2EMERN, as shown in Figure 2. Considering the logic of feature aggregation and the strategies for training the different tasks, we explain the details level by level.
For a given sentence X = {x_1, x_2, ..., x_{N_w}}, we need to map it to dense vector representations. Given the impressive performance of pre-trained models, we utilize BERT (Devlin et al., 2019) as the feature extractor to acquire the distributed representations of sentences. The BERT architecture is composed of transformer networks, and its weights are trained on a large corpus. The feature extraction process is simplified as BERT(X) = {h_1, h_2, ..., h_{N_w}}, where h ∈ R^{1024×1}. The low-level task is defined the same as NER, and we utilize the NER labels as the target. The sentence features {h_i}_{i=1}^{N_w} are fed into the softmax layer, and we compute the prediction probabilities of the low-level task as ŷ_i = softmax(W_l h_i + b_l), where W_l and b_l are trainable parameters. For training, we utilize the cross-entropy loss as the objective function. The loss function of the low-level task is defined as:

L_l = - Σ_{i=1}^{N_w} y_i log ŷ_i . (1)

The sample for the mid-level task is defined as a tuple (X, e, y_m). If the text X contains mentions that should be mapped to the entity e, y_m is assigned 1, otherwise 0. To bridge the gap between mentions and standard entities in the mid-level task, we also need to extract the features of standard entities. A standard entity e is described by its specific name and some medical content. We feed the name (or content) of the entity into BERT and perform average pooling on the output. The feature vector of the i-th standard entity in the database is denoted h_{e_i}. Considering that the words of mentions in raw texts are more correlated with the standard entity, we adopt the attention mechanism to focus on the local words of sentences. The attention-weighted average feature is calculated as h_a = Σ_{i=1}^{N_w} α_i x_i, and the attention score α is defined as:

α_i = exp(s(x_i, h_e)) / Σ_{j=1}^{N_w} exp(s(x_j, h_e)),

where s(x_i, h_e) = W_a [x_i ; h_e] + b_a, and W_a and b_a are trainable weights in the attention module.
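The entity-attention step can be sketched as follows (a minimal numpy illustration; the scalar score parameterization, parameter names, and shapes are assumptions for clarity, not the exact implementation):

```python
import numpy as np

def entity_attention(X, h_e, W_a, b_a):
    """X: (N_w, d) token features x_i; h_e: (d,) standard-entity feature.
    Returns the attention-weighted feature h_a and the weights alpha."""
    N_w, d = X.shape
    # s(x_i, h_e) = W_a [x_i ; h_e] + b_a -- one scalar score per token
    concat = np.concatenate([X, np.tile(h_e, (N_w, 1))], axis=1)  # (N_w, 2d)
    scores = concat @ W_a + b_a                                   # (N_w,)
    alpha = np.exp(scores - scores.max())
    alpha = alpha / alpha.sum()        # softmax over the N_w tokens
    h_a = alpha @ X                    # h_a = sum_i alpha_i x_i, shape (d,)
    return h_a, alpha
```

The softmax normalization guarantees that the weights α sum to one, so h_a stays in the convex hull of the token features and emphasizes the tokens most correlated with the query entity.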
After acquiring the entity-attention feature h_a and the standard entity feature h_e, we can calculate the prediction probability ŷ_m = σ(W_m [h_e ; h_a] + b_m), where σ is the sigmoid function. The loss function for the mid-level task is formulated as the cross-entropy:

L_m = - (y_m log ŷ_m + (1 - y_m) log(1 - ŷ_m)). (2)

We define the tuple (X, e, y_h) as the sample for the high-level task, where y_h = {y_h_i}_{i=1}^{N_w}. Given that the medical text X is "Familial Mediterranean fever is a recessive disorder." and the standard entity is "D030342", the sentence features H implicitly contain the medical mentions, while the entity-attention feature h_a contains clearer locations of the corresponding mentions. Therefore, we propose a gate mechanism to focus on the fine-grained feature dimensions. The gate is formulated as G(H, H_a) = σ(W_g [H ; H_a] + b_g). Considering the semantic difference between the mentions and the corresponding standard entities, we exploit the gate mechanism to fuse the standard entity feature with the sentence feature. The fused sentence feature is formulated as H_f = H ⊙ (1 - G(H, H_a)) + H_e ⊙ G(H, H_a), where ⊙ is the element-wise product and H_e = [h_e ; ... ; h_e] ∈ R^{1024×N_w}. We feed the fused feature into the softmax layer to predict the probabilities ŷ_h_i = softmax(W_h h_f_i + b_h). As in the low-level task, we utilize the cross-entropy loss:

L_h = - Σ_{i=1}^{N_w} y_h_i log ŷ_h_i . (3)
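The gated fusion could be sketched as follows (the sigmoid parameterization of the gate is an assumption; only the element-wise interpolation form is stated in the text):

```python
import numpy as np

def gated_fusion(H, H_a, H_e, W_g, b_g):
    """H, H_a, H_e: (d, N_w) sentence, entity-attention, and tiled entity features."""
    # Gate G in (0, 1), computed from the concatenated features (assumed form).
    G = 1.0 / (1.0 + np.exp(-(W_g @ np.concatenate([H, H_a], axis=0) + b_g)))
    # Element-wise interpolation: keep sentence features where G is small,
    # inject standard-entity features where G is large.
    return H * (1.0 - G) + H_e * G
```

Because the gate lies strictly between 0 and 1, each dimension of the fused feature is a convex combination of the corresponding sentence and entity feature values.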

Training Process
For the framework, we denote the training sample as (X, y, e, y_m, y_h). According to the definitions of the three tasks, we can generate the task labels corresponding to the input sentence; an example is shown in Figure 3. Given the medical text X, the label y for the low-level task is the same as the original NER label. We use the standard entities to which the mentions {m_i}_{i=1}^{N_m} should be mapped as the input entity e, respectively. The high-level task label y_h is based on y: it keeps only the labels of y that are correlated with the input e. Besides, we adopt a negative sampling strategy to select standard entities that are not related to the input sentence X as the input entity e.
To tackle the three tasks at once, we introduce two hyper-parameters to sum Eqn. 1, Eqn. 2 and Eqn. 3. The overall loss function of the framework is defined as:

L = L_l + λ L_m + µ L_h , (4)

where λ and µ are hyper-parameters balancing the different task losses. After generating samples, we feed them into the model and then calculate the loss according to Eqn. 4. Following the backpropagation method, we update the weights of the networks with the acquired loss. After every epoch of training, we re-sample the training samples for better generalization of the model.
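The combination in Eqn. 4 is a plain weighted sum, sketched below with the hyper-parameter values used in our experiments (λ = 0.125, µ = 0.1); the scalar arguments stand in for the losses of Eqn. 1-3:

```python
def overall_loss(loss_low, loss_mid, loss_high, lam=0.125, mu=0.1):
    """Eqn. 4: L = L_l + lambda * L_m + mu * L_h."""
    return loss_low + lam * loss_mid + mu * loss_high
```

The small weights on the mid- and high-level losses keep the low-level NER objective dominant while still propagating supervised signals from the harder tasks.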

Datasets and Experiment Settings
We compare our framework with existing methods on two medical benchmark datasets. Table 1 presents the detailed statistics of the two datasets. There are 798 public medical abstracts in the NCBI dataset (Dogan et al., 2014); each medical mention in the text is annotated with MeSH/OMIM identifiers. The BC5CDR dataset contains 1500 public medical abstracts, which are also annotated with MeSH identifiers. We split each abstract into sentence samples with an average of 40 words according to the ends of sentences, and a padding character is used to fill samples of unequal length to the fixed length. During the training process, we first train the model on the training set and evaluate it on the development set to search for the best hyper-parameters. Then, we fix the best hyper-parameters and train the model on the union of the training and development sets. Before the model reaches the searched maximum number of epochs, we report the F1 score at the point where the loss is lowest. In our experiments, we set the hyper-parameters λ, µ and the learning rate to 0.125, 0.1 and 1e-5 respectively. To train the model, we use the ADAM (Kingma and Ba, 2015) algorithm to update the weights, and all experiments are accelerated by two NVIDIA GTX 2080Ti devices.
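The sentence splitting and padding can be sketched as follows (a rough illustration; the "[PAD]" token and the exact sentence-splitting rule are assumptions, not the released preprocessing code):

```python
def split_and_pad(abstract, max_len=40, pad="[PAD]"):
    """Split an abstract at sentence ends and pad each sample to a fixed length."""
    samples = []
    for sent in abstract.split(". "):
        tokens = sent.strip(". ").split()
        if not tokens:
            continue
        # Truncate overlong sentences, pad short ones to the fixed length.
        tokens = tokens[:max_len] + [pad] * max(0, max_len - len(tokens))
        samples.append(tokens)
    return samples
```

Fixed-length samples allow the BERT encoder to process batches of sentences without further dynamic padding.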

Compared Methods
To demonstrate the effectiveness of our framework, we adopt competitive models as compared methods, including traditional machine learning methods and impressive deep learning models.
Dnorm (Leaman et al., 2013) is a pipeline model for medical NER and NEN; it utilizes TF-IDF features to learn a bilinear mapping matrix for the normalization task. LeadMine (Lowe et al., 2015) uses Wikipedia as dictionary features for normalizing medical mentions. TaggerOne is a semi-Markov based model for jointly modeling medical NER and NEN. The Transition-based model (Lou et al., 2017) consists of a state transformation function for the output of NER and NEN.
To reduce human feature engineering, researchers have focused on deep learning for modeling NER and NEN. IDCNN (Strubell et al., 2017) was proposed with an improved CNN module for NER. MCNN (Zhao et al., 2017) is composed of multiple-label CNN modules for better performance on NER. CollaboNet (Yoon et al., 2019) exploited multi-source datasets to train a multi-task model and gained better results on all benchmark datasets. MTL-MERN consists of a parallel NER and NEN framework and utilizes a feedback strategy to improve the performance on the two tasks.
Among pre-trained models with impressive performance, BioBERT (Lee et al., 2020) is built on BERT (Devlin et al., 2019) and trained with a large medical corpus, and it achieves state-of-the-art results on medical NER datasets. Therefore, we use BioBERT as the feature extractor and compare it with our framework.

Experimental Results
We compare E2EMERN with the baseline methods on named entity recognition and normalization. The detailed experimental results on NCBI and BC5CDR are shown in Table 2. The first four methods in the table are traditional machine learning methods. Among them, the joint models, such as TaggerOne and the Transition-based Model, outperform the pipeline ones, including Dnorm and LeadMine. When deep learning was introduced into the pipeline frameworks, IDCNN made progress over conventional methods such as Dnorm. Compared with MCNN, CollaboNet utilizes multi-source datasets as input and performs multi-task learning to improve performance on the NER task. MTL-MERN takes full advantage of multi-task learning and deep semantic representations and outperforms the above methods. By virtue of dynamic language features, BioBERT can better model language semantics and outperform the above NER models.
Compared with the baseline methods, E2EMERN consistently achieves the best results on NER and NEN. The NER results of E2EMERN increase by 1% ∼ 2% over BioBERT, because our framework takes full advantage of the correlation between NER and NEN. Unlike the simple strategy of MTL-MERN, E2EMERN consists of three progressive tasks that are well designed for modeling the fine-grained features between medical mentions in raw texts and standard entities. The standard entity information of NEN is introduced into the NER module by the mechanisms in our framework. With the help of dynamic language features and progressive multi-task learning, the framework can extract medical mentions more accurately and map them to standard entities. The semantic correlation between medical mentions and standard entities is built on the three progressive tasks from low to high, and the rich semantics captured by the progressive tasks are beneficial to both NER and NEN.

Further Discussion
To dig into the framework, we conduct detailed analyses to present it from different aspects. An ablation study is conducted to present the effectiveness of the mechanisms proposed in the framework. Beyond supervised learning, our framework exploits the standard entity information in the NER task and shows potential in a zero-shot scenario compared with BioBERT. We conduct a case study to analyze the prediction results and visualize the attention mechanism to prove its effectiveness.

Ablation Study
As shown in Table 2, we conduct the ablation study to present the effectiveness of the progressive tasks and the different mechanisms. When freed from completing the mid- or high-level tasks, E2EMERN gains worse results on NER and NEN. The progressive tasks improve the ability of the framework to learn multi-grained features between original texts and standard entities. Besides, we replace the gate and attention mechanisms with a simple feature concatenation strategy as compared methods. When the attention mechanism is removed, E2EMERN achieves worse results on the two tasks. This proves that the supervised signals from the mid-level task are beneficial to the low-level task, and that the entity-attention feature generated by the mechanism contributes to the high-level task. E2EMERN without the gate mechanism gains worse results on NEN, because the mechanism aggregates the features from the lower-level tasks, which provides multi-grained information between mentions and standard entities. The ablation study proves the importance of the two mechanisms to E2EMERN.

Table 3: The case study results on NCBI and BC5CDR. "Text1" and "Text2" are from NCBI, and the other two are from BC5CDR. "Text2" and "Text4" are unseen samples from the test sets of the two datasets. The standard entities coupled with each text are the input of E2EMERN.

Results on Unseen Samples
We conduct a statistical analysis on the test sets of NCBI and BC5CDR. As shown in Figure 4, about 40% ∼ 50% of the samples contain words or medical mentions that do not appear in the training set. Therefore, we need to evaluate the generalization ability of the models on unseen samples. We compare E2EMERN with BioBERT on the unseen samples in the test set. To a certain extent, our framework outperforms the existing state-of-the-art NER model. Compared with BioBERT, E2EMERN introduces the standard entity base into the framework, and the fine-grained location information of medical mentions from the high-level task is propagated to the low-level task. With the help of standard entity information and progressive multi-task learning, E2EMERN gains better generalization ability on unseen samples.

Case Study
We present the case study results in Table 3. Compared with BioBERT, our framework can extract medical mentions that BioBERT cannot. We draw the label results of E2EMERN as a heat map: as the color deepens, the importance of the token in the sentence increases. The visualization results prove that the attention mechanism in E2EMERN focuses on the tokens that make up medical mentions. Although "Text2" and "Text4" are unseen samples, E2EMERN can also extract the mentions in them. The token "convulsions" receives more attention than "seizures" in "Text3", but convulsion is a symptom of seizures; with the help of the medical correlation between them, E2EMERN can extract the token "seizures" as a medical mention. To some extent, the case study proves the effectiveness of E2EMERN.

Conclusion
In this paper, we reconsider the process of NER and NEN and propose an end-to-end progressive multi-task learning framework for medical named entity recognition and normalization. Compared with existing methods, the framework consists of three tasks of progressive difficulty, which contributes to modeling the fine-grained features between medical mentions in raw texts and standard entities. Furthermore, the detailed analysis of E2EMERN proves its effectiveness. Considering the diversity of the medical domain, we will try to adapt the framework to cross-domain problems.