Neural Entity Recognition with Gazetteer based Fusion

Incorporating external knowledge into Named Entity Recognition (NER) systems has been widely studied in the generic domain. In this paper, we focus on the clinical domain, where only limited data is accessible and interpretability is important. Recent advances in technology and the acceleration of clinical trials have resulted in the discovery of new drugs, procedures, and medical conditions. These factors motivate building robust zero-shot NER systems that can quickly adapt to new medical terminology. We propose an auxiliary gazetteer model and fuse it with an NER system, which results in better robustness and interpretability across different clinical datasets. Our gazetteer-based fusion model is data efficient, achieving +1.7 micro-F1 gains on the i2b2 dataset using 20% of the training data, and brings +4.7 micro-F1 gains on novel entity mentions never seen during training. Moreover, our fusion model is able to quickly adapt to new mentions in gazetteers without re-training, and the gains from the proposed fusion model are transferable to related datasets.


Introduction
Named entity recognition (NER) (Lample et al., 2016; Ma and Hovy, 2016) aims to identify text mentions of specific entity types. In clinical domains, it is particularly useful for automatic information extraction, e.g., of diagnosis information and adverse drug events, which can be applied to a variety of downstream tasks such as clinical event surveillance, decision support (Jin et al., 2018), pharmacovigilance, and drug efficacy studies.
We have witnessed rapid progress on NER models using deep neural networks. However, applying them to the clinical domain (Bhatia et al., 2019) is hard due to the following challenges: (a) limited accessible data, (b) the discovery of new drugs, procedures, and medical conditions, and (c) the need for interpretable and explainable models. Motivated by these, we attempt to incorporate external name or ontology knowledge, e.g., Remdesivir is a DRUG and COVID-19 is a Medical Condition, into neural NER models for clinical applications.
Recent work on leveraging external knowledge falls into two categories: gazetteer embeddings and gazetteer models. Most of it has focused on gazetteer embeddings. One approach feeds the concatenation of BERT output and gazetteer embedding into a Bi-LSTM-CRF. Peshterliev et al. (2020) use self-attention over gazetteer types to enhance the gazetteer embedding and then concatenate it with ELMo, char-CNN, and GloVe embeddings. By contrast, the basic idea of a gazetteer model is to treat ontology knowledge as a new clinical modality. Magnolini et al. (2019) combine the outputs of a Bi-LSTM and a gazetteer model and feed them into a CRF layer. Liu et al. (2019a) apply a hybrid semi-Markov conditional random field (HSCRF) to predict a set of candidate spans and rescore them with a pre-trained gazetteer model.
In this paper, we combine the advantages of both worlds. Unlike the work of Peshterliev et al. (2020), we build self-attention over entity mentions and their context rather than over different gazetteer types. For example, in "Take Tylenol 3000 (NUM) mg (METRIC) per day", Tylenol is more likely to be a DRUG given the NUM and METRIC labels in its context. Moreover, we study two fusion methods to integrate information from the two modalities.
• Early fusion. Similar to Magnolini et al. (2019), the NER model and the gazetteer model apply a shared tagger, as shown in Fig. 1a.
• Late fusion. For better interpretability and flexibility, we allow the NER and gazetteer models to apply separate taggers and fuse them before taking the softmax, as shown in Fig. 1b.
Unlike the work of Liu et al. (2019a), the NER and gazetteer models are jointly learned end-to-end.
Our contributions are as follows.
(1) We propose to augment NER models with an auxiliary gazetteer model via late fusion, which provides better interpretability and flexibility. Interestingly, the NER model can preserve the gains even if the gazetteer model is unplugged at inference time.
(2) Our thorough analysis shows that the fusion model is data efficient, explainable, and able to quickly adapt to novel entity mentions in gazetteers.
(3) Experiments show that the fusion model consistently brings gains across different clinical NER datasets.

Approach
NER model
NER is a sequence tagging problem: the model maximizes the conditional probability of the tags y given an input sequence x. We first encode x into hidden vectors $r_t$ and apply a tagger to produce the output $o^r_t$, from which the tags y are predicted.

Gazetteer model
We embed gazetteers into $E \in \mathbb{R}^{M \times K \times d}$, where $M$ is the number of gazetteers (e.g., drugs, medical conditions), $K$ is the number of gazetteer labels (e.g., B-Drug, E-Drug), and $d$ is the embedding size. Each token $x_t$ is assigned a gazetteer label in each gazetteer $j$. To model the association of name knowledge between entity mentions and their contexts, we compute a context-aware gazetteer embedding $g_t$ using scaled dot-product self-attention. Similar to the NER model, we apply a tagger to produce the output $o^g_t$.
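As a rough illustration of the gazetteer embedding and attention step, the following NumPy sketch looks up one embedding per token and gazetteer and applies scaled dot-product self-attention over the sequence. Several details are assumptions made for brevity and are not taken from the paper: the per-gazetteer embeddings are summed, attention uses a single head with no learned projections and no window restriction, and the function name `context_aware_gazetteer_embedding` and the toy label ids are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def context_aware_gazetteer_embedding(label_ids, E):
    """label_ids: (T, M) int array, gazetteer label of each token in each gazetteer.
    E: (M, K, d) gazetteer embedding table.
    Returns (g, weights) with g of shape (T, d) and weights of shape (T, T)."""
    T, M = label_ids.shape
    d = E.shape[-1]
    # One d-dimensional vector per (token, gazetteer) pair: shape (T, M, d).
    e = E[np.arange(M)[None, :], label_ids]
    # Assumption: combine gazetteers by summation.
    h = e.sum(axis=1)
    # Scaled dot-product self-attention with Q = K = V = h (assumption).
    scores = h @ h.T / np.sqrt(d)
    weights = softmax(scores, axis=-1)
    return weights @ h, weights

# Toy example: "Take Tylenol 3000 mg per day" with two hypothetical
# gazetteers (drug, metric) and K = 5 labels indexed as O, B, I, E, S.
rng = np.random.default_rng(0)
E = rng.normal(size=(2, 5, 8))
label_ids = np.array([[0, 0], [4, 0], [0, 0], [0, 4], [0, 0], [0, 0]])
g, attn = context_aware_gazetteer_embedding(label_ids, E)
```

Here each row of `attn` describes how much a token draws on the gazetteer labels of the other tokens, which is the mechanism that lets a drug match be informed by a nearby metric match.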

Fusion: NER + gazetteer
To better use information from both modalities, we investigate two different fusion methods to combine information from NER and gazetteer.
• Early fusion. In Fig. 1a, we concatenate $r_t$ with $g_t$ and feed the result into a shared tagger.
• Late fusion. In Fig. 1b, we directly fuse $o^r_t$ and $o^g_t$ by performing element-wise max pooling.

Experiments

Experimental setup
LM pre-training. We continue to pre-train RoBERTa base (L=12, H=768, A=12) (Liu et al.,   consists of 1,500 de-identified, annotated clinical notes with medications (Med) and medical conditions (DS). We follow i2b2 challenge guidelines for data annotation.
We extract medical condition and drug dictionaries from UMLS (Bodenreider, 2004), an ontology knowledge graph, based on the graph as well as semantic meanings. We follow different steps to prune the dictionaries based on different medical ontologies, such as RxNorm for medications (∼100k concepts) and ICD-10 CM and SNOMED for medical conditions (∼500k concepts). We employ the Inside, Outside, Begin, End and Singleton (IOBES) format for both tags and gazetteers (see footnote 1). We minimize the cross-entropy loss during training and report the micro-F1 score at test time. We use RoBERTa mimic as the NER encoder and parameterize the taggers via multi-layer perceptrons (MLPs). We use the BertAdam optimizer with a learning rate of 5e-5 and dropout of 0.1. We tune the hyper-parameters d ∈ [2, 12] (best: 8) and w ∈ [2, 10] (best: 5) on the validation set.
Footnote 1: We do string matching for gazetteers following Chiu and Nichols (2016). For example, if A, B and AB are all in the gazetteers, we label the span AB as a single match AB rather than as separate matches A and B. The basic idea is to start from bigger spans: we first check for ABC; if it is not found, we check AB; if that is not found, we check A and B.

Results
We report overall results in Table 1. We observe that incorporating name knowledge consistently boosts performance on all datasets, with gains of 0.18 to 0.59 micro-F1. Overall, the two fusion methods achieve comparable results.
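For readers less familiar with the metric, the sketch below shows one common way to compute micro-F1 over entity mentions: true positives, false positives and false negatives are pooled over all sentences and entity types before precision and recall are computed. It assumes exact span-and-type matching, which the paper does not spell out, and the function name `micro_f1` and the toy spans are hypothetical.

```python
def micro_f1(gold, pred):
    """gold, pred: lists (one item per sentence) of sets of
    (start, end, entity_type) tuples. Counts are pooled over all
    sentences and types, which is what makes the score 'micro'."""
    tp = fp = fn = 0
    for g, p in zip(gold, pred):
        tp += len(g & p)   # predicted spans that exactly match gold
        fp += len(p - g)   # predicted spans with no gold match
        fn += len(g - p)   # gold spans that were missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Toy example with made-up spans.
gold = [{(1, 2, "Drug"), (3, 5, "Dosage")}, {(0, 1, "Drug")}]
pred = [{(1, 2, "Drug")}, {(0, 1, "Drug"), (4, 5, "Frequency")}]
score = micro_f1(gold, pred)  # tp=2, fp=1, fn=1, so P = R = F1 = 2/3
```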

Analysis
We investigate the effectiveness of late fusion in handling three challenges: novel entity mentions, limited data access, and interpretability.

Novel entity mentions
New drugs and medical conditions appear very frequently, for example "remdesivir" and "Baricitinib" for COVID-19. To investigate the effect of late fusion on unseen entity mentions, we focus on two questions: whether it generalizes well to unseen entity mentions, and whether it is able to correct its predictions once novel entity names are added to the gazetteer, without re-training.
Zero-shot. We report results on unseen entity mentions that do not appear in the training and validation sets.
"One"-shot in gazetteer. We evaluate the ability of late fusion to quickly adapt to non-stationary gazetteers, e.g., specialists might add new entity mentions to the gazetteers or give feedback when models make incorrect predictions. For this analysis, we split the entity mentions in the training set into two parts, 70% labelled and 30% in the gazetteer, and compare several model configurations, where R denotes the NER model and G the gazetteer model.
In Table 3, we observe that G plays two roles: (1) $R > R_0$: G can regularize R to gain better generalization ability; and (2) $R_0G > R_0$ and $RG > R$: besides serving as a regularizer, G provides extra information at test time.
Moreover, we evaluate late fusion by varying the number of unseen entity mentions included in gazetteers. In Fig. 2, without re-training models, late fusion can adapt to new mentions and obtain linear gains, which enables effective user feedback.
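Adaptation without re-training works because only the gazetteer lookup changes, while the model weights stay fixed. The sketch below illustrates this with a greedy longest-match lookup that emits IOBES labels, in the spirit of the string matching used for gazetteers. Lower-casing, whitespace tokens, the `max_len` cap, the function name `gazetteer_labels`, and the toy dictionaries are all assumptions for illustration.

```python
def gazetteer_labels(tokens, gazetteer, name, max_len=5):
    """Greedy longest-match lookup over one gazetteer, emitting IOBES labels."""
    entries = {tuple(entry.lower().split()) for entry in gazetteer}
    lowered = [tok.lower() for tok in tokens]
    labels = ["O"] * len(tokens)
    i = 0
    while i < len(tokens):
        matched = 0
        # Try the longest candidate span first, then shorter ones.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            if tuple(lowered[i:i + n]) in entries:
                matched = n
                break
        if matched == 1:
            labels[i] = f"S-{name}"
        elif matched > 1:
            labels[i] = f"B-{name}"
            for j in range(i + 1, i + matched - 1):
                labels[j] = f"I-{name}"
            labels[i + matched - 1] = f"E-{name}"
        i += max(matched, 1)
    return labels

tokens = ["Patient", "started", "remdesivir", "for", "COVID-19"]
drugs = {"tylenol"}                       # hypothetical drug gazetteer
before = gazetteer_labels(tokens, drugs, "Drug")
drugs.add("remdesivir")                   # a specialist adds a new mention
after = gazetteer_labels(tokens, drugs, "Drug")

# Longest match wins when both "COPD" and "COPD flare" are listed.
conditions = {"COPD", "COPD flare"}       # hypothetical condition gazetteer
span = gazetteer_labels(["COPD", "flare", "resolved"], conditions, "Condition")
```

In this toy run, `before` contains only "O" labels, while `after` marks "remdesivir" as "S-Drug"; the updated labels would then be fed to the already trained gazetteer model.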
Overall, the ability to detect and adapt to novel entity mentions without re-training models is useful given the accelerated growth in drug development, as well as in practical settings where entity extraction is one of the components used to build knowledge graphs and search engines (Wise et al., 2020; Bhatia et al., 2020). One example is linking new drugs discovered in clinical trials of COVID-19 to standardized codes in ICD-10 or SNOMED.

Limited data access
Data available for use in the clinical domain is typically quite limited. In this section, we evaluate the fusion model in low-resource settings and investigate whether the gain is transferable across related datasets. Here we present results with the late fusion methodology.
Low-resource setting. We evaluate late fusion by reducing the training data size from 100% to 20%. Fig. 3 shows that late fusion gains more when less training data is available.
Figure 3: Accuracy vs. training data size on i2b2 Med. We randomly sample 20%, 40%, ..., 100% of the training data and report the micro-F1 score averaged over 3 random seeds.
Transfer learning. To verify the generalization ability of late fusion, we train models on one dataset and report evaluation on another data source. We re-train models on i2b2 Med and DCN Med using their common entity types: Dosage, Medication, Frequency, and Mode. Table 4 shows that the gains from gazetteer-enhanced fusion models are preserved in i2b2 → DCN and DCN → i2b2.

Interpretability
Explainable and controllable models are very important for clinical applications. Unfortunately, both properties are extremely challenging to achieve with deep neural networks. We illustrate two qualitative examples in Table 5. Late fusion models are trained on i2b2 Med using 20% of the training data.
(1) Late fusion correctly predicts flare as I-R (Reason) since COPD flare is a Medical Condition.
(2) By looking into the individual predictions from R and G, we notice that the correct prediction is caused by name knowledge in the gazetteers.
Overall, late fusion provides a tool for diagnosing the system: it helps answer whether the NER or the gazetteer model failed and explains why a mention belongs to a particular entity type.
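The kind of diagnosis described in this section can be sketched as follows: the tagger outputs of R and G are fused by element-wise max before the softmax, and for each token we record which branch supplied the winning logit. The tag set, the logit values, and the function name `late_fusion` are made up for illustration and are not taken from Table 5.

```python
import numpy as np

TAGS = ["O", "B-R", "I-R", "E-R"]  # hypothetical tag subset for Reason

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def late_fusion(o_r, o_g):
    """o_r, o_g: (T, num_tags) tagger outputs of the NER and gazetteer models."""
    fused = np.maximum(o_r, o_g)          # element-wise max pooling
    probs = softmax(fused, axis=-1)
    pred = probs.argmax(axis=-1)
    idx = np.arange(len(pred))
    # Which branch produced the logit of the predicted tag?
    source = np.where(o_g[idx, pred] > o_r[idx, pred], "G", "R")
    return pred, probs, source

# Made-up logits for the two tokens "COPD" and "flare".
o_r = np.array([[0.2, 2.0, 0.1, 0.0],
                [1.5, 0.0, 0.3, 0.4]])
o_g = np.array([[0.0, 1.0, 0.0, 0.0],
                [0.0, 0.1, 2.5, 0.2]])

pred, probs, source = late_fusion(o_r, o_g)
fused_tags = [TAGS[i] for i in pred]
# With G unplugged, predictions come from R alone.
r_only_tags = [TAGS[i] for i in o_r.argmax(axis=-1)]
```

In this toy case R alone tags "flare" as "O", while the fused model predicts "I-R" and `source` attributes that decision to G, mirroring the type of explanation the separate taggers make possible.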

Conclusion
We studied fusion methods to improve NER systems by leveraging name knowledge from gazetteers. We conducted a thorough analysis of the effectiveness of fusion methods in handling limited data and non-stationary gazetteers. In addition, we demonstrated that fusion models are explainable and can be used to improve NER systems. Future research should extend our approach to structured knowledge to further improve NER systems and gain better interpretability.