TEBNER: Domain Specific Named Entity Recognition with Type Expanded Boundary-aware Network

To alleviate label scarcity in the Named Entity Recognition (NER) task, distantly supervised NER methods are widely applied to automatically label data and identify entities. Although the human effort is reduced, the generated incomplete and noisy annotations pose new challenges for learning effective neural models. In this paper, we propose a novel dictionary extension method which extracts new entities through a type expanded model. Moreover, we design a multi-granularity boundary-aware network which detects entity boundaries from both local and global perspectives. We conduct experiments on different types of datasets; the results show that our model outperforms previous state-of-the-art distantly supervised systems and even surpasses supervised models.


Introduction
Named Entity Recognition (NER) is the task of detecting mentions in text and classifying them into predefined types. It is a fundamental task in the field of natural language processing (NLP) that can facilitate many other tasks, such as entity linking (Fang et al., 2020), machine translation (Gekhman et al., 2020), and question answering (Li et al., 2020a). However, most existing NER methods require large amounts of manually annotated text for training supervised models, which is difficult to obtain in specific domains because domain-expert annotation is expensive and time-consuming.
To alleviate the label scarcity problem, distant supervision methods (Fries et al., 2017; Shang et al., 2018b; Liang et al., 2020) have been applied to automatically generate labeled data and recognize entities. Given a raw corpus and a dictionary, these methods first label entities by exact string matching and then use the annotated dataset to train well-designed neural models to recognize entities. Although the human effort is reduced, the labels generated by the string matching method pose two challenges.

* Corresponding authors: Yanan Cao and Yuhai Lu.
The first challenge is incomplete annotations. Because most existing dictionaries have limited coverage of domain entities, using only the given dictionary leaves many out-of-dictionary entities unmatched and generates a large number of false-negative labels. By analyzing several commonly used datasets (e.g., BC5CDR, NCBI, MeSH), we find that the original dictionary only covers about 50% of domain entities, which may weaken the performance of the subsequent NER model. To increase the number of labeled entities, previous works (Fries et al., 2017) attempt to expand the dictionary with heuristic rules. However, these rule-based methods are difficult to migrate to other domains. So, how to extend the dictionary with a more general pattern is the first problem we need to solve.

The second challenge is the difficulty of recalling new entities. Actually, even for supervised models, new entities that have not been annotated are difficult to recall (Shang et al., 2018b) because of limited model capability. Most previous NER methods, such as sequence labeling models (Chiu and Nichols, 2016; Ma and Hovy, 2016) and boundary detection models (Wang et al., 2018; Li et al., 2020b), only utilize context information to recognize entities. However, the tight internal connections among entity tokens and the global statistical features of the domain corpus, which could contribute to identifying entities, are usually ignored by these methods. Therefore, how to recall new entities with multi-granularity information is the second problem we need to solve.
To address the two issues mentioned above, we propose a new distantly supervised method named TEBNER (Type Expanded Boundary-aware NER) for specific domains. To expand the original dictionary, we extract high-quality phrases from the raw corpus and view them as potential entities. Considering that these mined phrases lack corresponding type information and may even contain noisy results, we use an entity typing model to classify and filter them based on their context information. Then these typed phrases are added to the original dictionary to resolve the incomplete annotation problem. To recall more new entities, we design multi-granularity boundary labeling strategies, which can capture boundary information from different perspectives. Specifically, we utilize a token interaction tagger to find the internal connections between entity tokens, a sequence labeling strategy to distinguish explicit entity boundaries in the sentence, and global statistical features of the whole corpus to recall potential entities. After getting the boundary results, we reuse the trained entity typing model to further classify entities and filter out noisy results. In this way, we achieve a trade-off between recall and precision for new entity detection.
In summary, the main contributions of this paper are as follows:

• We propose a novel dictionary extension method which relies on semantic context rather than ambiguous strings or hand-crafted rules. Experiments show that our dictionary extension method significantly improves the quality of distantly supervised annotations.
• We propose a multi-granularity boundary-aware network which integrates information at the word, sentence and corpus level. Experiments show that fusing different granularity boundary results can significantly improve the recall rate of the NER model.
• We conduct extensive experiments on three benchmark datasets, and our TEBNER model achieves the best performance with dictionaries only and no human effort. On several datasets, our approach even outperforms the supervised models.

Related Work
As a fundamental task, named entity recognition (NER) has drawn much attention from researchers. Most previous approaches model the NER problem as a sequence labeling task and use popular architectures like NN-CRF (Ma and Hovy, 2016; Chiu and Nichols, 2016). Recently, to recognize nested entities, many studies propose to detect entity boundaries individually (Wang et al., 2018; Zheng et al., 2019; Li et al., 2020b). With the burgeoning popularity of pre-training methods, large-scale language models such as ELMo (Peters et al., 2018) and BERT (Devlin et al., 2019) are also utilized in the NER task, yielding state-of-the-art performance. However, all of the above supervised models require a great quantity of manually annotated data, which is usually labor-intensive and time-consuming to obtain. To alleviate human efforts, distant supervision methods are widely used in the NER task (Shang et al., 2018b; Cao et al., 2019; Xue et al., 2020; Liang et al., 2020; Lison et al., 2020). For example, Shang et al. (2018b) mark out-of-dictionary phrases as potential entities with a special "unknown" type and propose a neural model, AutoNER, with a token interaction tagger. However, these untyped phrases are less helpful for identifying new entities. HAMNER expands the dictionary with headwords and designs a span-level model which predicts entity boundaries with an entity classification model. But for sentences with complex structures, it is difficult to detect boundaries using entity type information alone. Unlike these works, TEBNER annotates phrases with semantic context and distinguishes entity boundaries by fusing multi-granularity information, which can generate labels with high precision and recall.

Problem Definition
Formally, given a sequence of words X = [x_1, x_2, ..., x_n], we denote an entity as e_t = [x_i, ..., x_j] (1 ≤ i ≤ j ≤ n), where <i, j> represents its boundary and t indicates the entity type. Specifically, entity types include pre-defined types (e.g., Disease, Chemical) and a none type which denotes a non-entity. In the distantly supervised NER task, we only need a dictionary D as input in addition to the original text. Each dictionary entry contains a surface name and an entity type. In the training phase, we use the dictionary matching method to generate annotations on the training corpus.
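As a concrete illustration of the distant labeling step, the exact string matching over a token sequence can be sketched as follows. This is a minimal sketch, not the paper's implementation: the longest-match-first tie-breaking and the non-overlap rule are assumptions.

```python
def distant_annotate(tokens, dictionary):
    """Return (start, end, type) spans for every dictionary surface name
    that appears verbatim in `tokens`, preferring longer matches so that
    e.g. "lung cancer" shadows "cancer" (an assumed tie-breaking rule)."""
    # Sort surface names by length so longer entities are matched first.
    entries = sorted(dictionary.items(), key=lambda kv: -len(kv[0]))
    spans, used = [], set()
    for name, etype in entries:
        name_toks = name.split()
        n = len(name_toks)
        for i in range(len(tokens) - n + 1):
            j = i + n - 1
            # Skip positions already covered by a longer match.
            if tokens[i:j + 1] == name_toks and not used & set(range(i, j + 1)):
                spans.append((i, j, etype))
                used |= set(range(i, j + 1))
    return sorted(spans)
```

In training, the spans produced this way become the (incomplete, noisy) supervision signal that the rest of the pipeline is designed to correct.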

The Proposed Method
The overall structure of our TEBNER model is shown in Figure 1. The proposed framework mainly includes two parts: the Dictionary Extender, which enriches entities by assigning types to extracted high-quality phrases, and the Entity Recognizer, which identifies entity boundaries by fusing multi-granularity information and predicts entity types through the trained classifier. In the following, we introduce the technical details of these modules.

Figure 1: The process of our distant supervision method. Blue arrows show the dictionary expansion procedures, including high-quality phrase extraction, entity classification, and entity filtering. Red arrows show the entity recognition procedures, including annotation generation, entity boundary detection and entity type prediction.

Dictionary Extender
As the training annotations in the distantly supervised NER task are only generated from the entity dictionary, the coverage and quality of the dictionary become the key factors for improving model performance, especially for neural networks. Therefore, we use a dictionary extender module to generate high-quality labels with high coverage of the target corpus. Our dictionary extender mainly consists of three parts: high-quality phrase extraction, entity classification and entity filtering.
High-quality phrase extraction As in previous work (Shang et al., 2018b), we utilize AutoPhrase (Shang et al., 2018a) to extract high-quality phrases from the domain corpus. AutoPhrase is a distantly supervised phrase mining tool which generates frequent phrase candidates according to a popularity requirement and estimates phrase quality based on features of concordance and informativeness. The main input to the tool is a corpus and a dictionary, and the output is a list of phrases ranked by decreasing quality score.
To obtain high-quality phrases, we only select phrases with scores higher than a certain threshold (e.g., 0.5 for multi-word phrases and 0.9 for single-word phrases). After getting the out-of-dictionary phrases, we treat them as potential entities and propose an entity classification model to predict phrase types.
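The score-threshold selection described above can be sketched as follows; the thresholds are the example values from the text, and the function name is illustrative:

```python
def select_candidates(scored_phrases, multi_thresh=0.5, single_thresh=0.9):
    """Keep multi-word phrases scoring above `multi_thresh` and
    single-word phrases scoring above `single_thresh`.
    `scored_phrases` is a list of (phrase, quality_score) pairs,
    as produced by a phrase mining tool such as AutoPhrase."""
    keep = []
    for phrase, score in scored_phrases:
        is_single = len(phrase.split()) == 1
        if score > (single_thresh if is_single else multi_thresh):
            keep.append(phrase)
    return keep
```

The stricter single-word threshold reflects that single tokens are far more likely to be spurious matches than multi-word phrases.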
Entity classification The entity classification model is used to classify and filter the mined phrases and candidate entities. It is trained on the annotated corpus generated by the original dictionary matching. Considering that some mined phrases are not real entities, we further add non-entities to the training corpus to help the classification model recognize noisy entities. Specifically, we label the phrases in the corpus with lower scores (e.g., less than 0.3) as the none entity type. We use the pre-trained language model BERT as the backbone, which has been proven able to capture rich language information from text. Given an entity e_t = [x_i, ..., x_j] and its context, we construct the input of each entity as:

input = [CLS] ctxt_l [x_i] w_i, ..., w_j [x_j] ctxt_r [SEP]

where [x_i], [x_j] denote the tokens at the beginning and the end of the entity respectively, w_i, ..., w_j are the word-piece tokens of the entity, and ctxt_l and ctxt_r denote the context before and after the entity respectively. After getting the BERT output of each token in the sentence, we concatenate the representations of [CLS], [x_i], [x_j] and input them into a fully-connected layer. Then the representation of the entity is sent into a softmax layer to predict the type label:

p(t | e) = softmax(W [h_[CLS]; h_[x_i]; h_[x_j]] + b)

where h denotes the BERT representation of the corresponding token, and W and b are learnable parameters.
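A minimal sketch of the input assembly and the final softmax step in plain Python; the marker strings are illustrative assumptions, and in the actual model the logits come from a fully-connected layer over the BERT representations rather than being supplied directly:

```python
import math

def build_classifier_input(ctxt_l, entity_toks, ctxt_r,
                           b_marker="[x_i]", e_marker="[x_j]"):
    """Assemble [CLS] + left context + begin marker + entity word
    pieces + end marker + right context + [SEP] (marker names assumed)."""
    return (["[CLS]"] + ctxt_l + [b_marker] + entity_toks
            + [e_marker] + ctxt_r + ["[SEP]"])

def softmax(logits):
    """Numerically stable softmax over a list of type logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]
```

The predicted type is then the argmax over the softmax output, with one logit per entity type (including none).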

Figure 2: The two-stage pipeline framework. In Stage I, the "Break or Tie" schema, "BIO" schema and "Phrase Matching" schema are utilized to identify entity boundaries at the word, sentence and corpus level separately; entities marked in red represent results correctly predicted by the model, while entities marked in blue are not recognized correctly. In Stage II, the entity classification model trained in the dictionary expansion procedure is used to predict entity types.
Entity filtering We first filter out the phrases that are predicted as the none type by the entity classification model. Moreover, some phrases are predicted as several different types; we skip phrases that are assigned multiple types when training the entity recognizer module. As for phrases with consistent results, we add them to the original dictionary to improve entity coverage. Finally, the extended dictionary is used to generate more annotations for training the entity recognizer.
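The filtering rules above — drop none-typed phrases, skip phrases with conflicting type predictions, keep consistently typed ones — can be sketched as:

```python
from collections import defaultdict

def filter_phrases(predictions):
    """`predictions`: list of (phrase, predicted_type) pairs over all
    classified mentions. Drop any phrase typed `none` anywhere, skip
    phrases with conflicting types, and keep the rest with their single
    consistent type, to be merged into the dictionary."""
    types = defaultdict(set)
    for phrase, t in predictions:
        types[phrase].add(t)
    kept = {}
    for phrase, ts in types.items():
        if "none" in ts or len(ts) > 1:
            continue  # noisy or ambiguous phrase
        kept[phrase] = next(iter(ts))
    return kept
```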

Entity Recognizer
Previous works (Ma and Hovy, 2016; Shang et al., 2018b; Cao et al., 2019) always jointly model the entity boundary detection and classification tasks. In general, this is effective for supervised models using manually annotated data. However, there are many labeling errors in data annotated by the distantly supervised method. Joint learning can easily lead to overfitting of the NER model, which makes it difficult to identify unlabeled entities. With this in mind, we utilize a pipeline framework which learns entity boundaries and entity types separately. The overall structure of our framework is shown in Figure 2.
Our entity recognizer contains two components: boundary detector and entity classifier. As the final results in NER are only generated from the boundary detector, we will recall candidate entities as comprehensively as possible to ensure that the target entity can be input into the entity classifier. To achieve this goal, three kinds of tagging schemas are utilized to identify entity boundaries at the word, sentence and corpus level. In the following, we will describe each tagging schema in detail.
"Break or Tie" Tagging Schema To capture boundary information at word granularity, we construct a token interaction tagger to distinguish whether two adjacent tokens are tied in the same entity mention or not. The key motivation is that domain entities usually contain the specific words, modeling the connection between adjacent tokens can help to find new entities with the same domain words. For example, in the field of biology, many disease entities may contain kidney, lung and other organ nouns; in the field of finance, many institutional entities may contain insurance, bank, etc. Inspired by this instuition, we utilize a "Break or Tie" tagging schema to recognize domain entities. As shown in Figure 2, (i) T (Tie) indicates that both of the two adjacent tokens belong to the same entity. (ii) B (Break) means that the ties between adjacent tokens are broken into two parts.
Specifically, we build a binary classifier to distinguish whether the current token is connected to the next one in the sentence. Given the output from BERT, the representations of the i-th token and the (i+1)-th token are concatenated into a new feature vector, which is referred to as V_i for the i-th token. The model predicts the probability of each token being connected with the next one as follows:

p(c_i | V_i) = softmax(W_c V_i + b_c)

where c_i is the label between the i-th token and its next token, C is the set of connection modes (i.e., "Tie" and "Break"), and W_c and b_c are learnable parameters. To train the model, we adopt the following cross-entropy loss function:

L = - Σ_i [ y_i log ŷ_i + (1 − y_i) log(1 − ŷ_i) ]

where y_i ∈ {0, 1} represents the label of each token, and ŷ_i ∈ [0, 1] indicates the predicted result of our model. In the inference stage, we connect the tokens between every two consecutive "Break" labels to form a candidate entity.
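The inference rule — connect the tokens between every two consecutive "Break" labels — can be sketched as follows (the segment output format is an assumption of this sketch):

```python
def decode_break_tie(tokens, labels):
    """`labels[i]` in {"T", "B"} describes the link between token i and
    token i+1, so len(labels) == len(tokens) - 1. Tokens between two
    consecutive breaks form one candidate span (start, end), inclusive."""
    spans, start = [], 0
    for i, lab in enumerate(labels):
        if lab == "B":
            spans.append((start, i))
            start = i + 1
    spans.append((start, len(tokens) - 1))  # close the final segment
    return spans
```

Every segment, including non-entity ones, becomes a candidate here; the downstream entity classifier is responsible for discarding segments typed as none.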
"BIO" Tagging Schema It is worth noting that above token interaction tagger cannot detect entities which just contain single token. Moreover, besides the tight internal connection between words, context information is also important for identifying entities. Therefore, we follow the sequence labeling framework using "BIO" tagging scheme to detect entities at sentence level. Concretely, we tag the beginning token of an entity by "B", the other token of this entity by "I", and the non-entity tokens by "O". As shown in Figure 2, we first tokenize each word in sentence and pass it through BERT Transformer stacks. Then we use a Dense layer with the softmax activation function as the entity classifier to get probability of the labels from the contextualized representation. Similar to the token interaction tagger mentioned above, we use the cross entropy loss function in the model training process and connect the tokens between "B" and "O" to form a candidate entity in the inference process.
"Phrase Matching" Tagging Schema Most of the previous distantly supervised models (Shang et al., 2018b;Cao et al., 2019; use deep neural network to identify entity boundaries at the word or sentence level. However, the statistical features of domain entities in the corpus are often ignored by them. For example, the part-of-speech tagging, the term frequency, and the probabilities of an entity in quotes and brackets are all helpful to identify boundaries. Therefore, we use a phrase mining tool which can capture multiple features Assign types to E all 18: end for to extract high-quality phrases and detect candidate entities from the testing data set through exact string matching. Finally, the results of the above three methods are fused and input into the entity classification model (mentioned in 4.1). In this stage, the entity classification model is trained on annotated corpus labeled by the extended dictionary. If the candidate entity does belong to a given entity type (not none), we output it as the final result. The details of the overall process of our model are presented in Algorithm 1.

Experiments
To evaluate the effectiveness of our method, we conduct experiments on a series of popular distantly supervised NER datasets which are also used by (Fries et al., 2017; Shang et al., 2018b). The results show that our model achieves state-of-the-art performance.

Experiment Setup
Dataset We train and evaluate our model on three benchmark datasets. The statistics of the datasets are shown in Table 1.

• LaptopReview (Pontiki et al., 2014) consists of 3,845 review sentences, including 3,012 AspectTerm mentions. As in previous work (Giannakopoulos et al., 2017), it is split into three subsets: 2,445 sentences for training, 600 for validation and 800 for testing.

Dictionary and High-Quality Phrase For a fair comparison with previous methods, we use the same dictionary and high-quality phrases as (Shang et al., 2018b). Specifically, for the BC5CDR and NCBI-Disease datasets, the dictionary is a combination of the MeSH database 1 and the CTD Chemical and Disease vocabularies 2. For the LaptopReview dataset, the dictionary is crawled from a public website 3. Moreover, the phrase mining tool AutoPhrase is pre-trained on same-domain text and then applied to the small datasets. In the biomedical domain, it is pre-trained on the titles and abstracts of 686,568 PubMed papers. In the laptop review domain, it is pre-trained on the Amazon laptop review dataset (Wang et al., 2011).
Training Details In the boundary detection and entity classification models, we use a fine-tuned BERT to encode entity context. During fine-tuning, we use "biobert-base-cased-v1.1" (Lee et al., 2020) and "bert-base-cased" (Devlin et al., 2019) as our pre-trained models for the biomedical and technical domains respectively. We set a maximum sentence length of 256 tokens. The dimension of hidden representations is set to 768, the learning rate to 3e-5, and the dropout probability to 0.15, and AdamW (Loshchilov and Hutter, 2019) is used as the optimizer. The multi-layer perceptron in the entity classifier has a depth of 2 and a hidden size of 256. All of the above modules are trained on an Nvidia Tesla V100 GPU and implemented in the PyTorch framework.

Comparing with Previous Work
Baselines We compare TEBNER with a series of NER models which report state-of-the-art results on the test datasets. There are two types of baseline methods: supervised models (BiLSTM-CRF, ELMo-NER, BERT-NER) and distantly supervised models (Dictionary Match, SwellShark, AutoNER, HAMNER).
• BiLSTM-CRF (Lample et al., 2016) adopts bi-directional LSTM with character-based representations to produce token embeddings, which are fed into a CRF layer to predict token labels.
• ELMo-NER uses pre-trained word embeddings, a character-based CNN representation, and two BiLSTM layers with ELMo to train the NER model.
• BERT-NER adopts BERT-base model with sequence labeling framework to perform token-level prediction.
• Dictionary Match recognizes entities by performing string matching with the given dictionary. It can be viewed as a baseline for distantly supervised models, testing the improvement of other methods over distant supervision itself.
• SwellShark (Fries et al., 2017) is a distantly supervised method designed for the biomedical domain. It needs regular expressions, and hand-tuning for special cases.
• AutoNER (Shang et al., 2018b) uses a BiLSTM network to learn connections between adjacent tokens and extracts high-quality phrases to reduce false-negative labels.

• HAMNER is the previous best distantly supervised method. It extends the dictionary with headword-based matching and infers entity spans with an entity typing model.

Results
The comparative results on the three benchmark datasets are shown in Table 2. We observe that TEBNER achieves the best performance on all datasets. It should be emphasized that we use the same dictionary and phrases as AutoNER and HAMNER. Due to differences in data processing, our dictionary matching results are slightly different from theirs. Although SwellShark is designed for the biomedical domain and utilizes much more expert effort, TEBNER easily surpasses it without human effort. Since the original AutoNER model uses all the raw texts for training (i.e., the training dataset is the union of the training, development, and test sets), Liu et al. retrained the model with ELMo. To make a fair comparison, we use the AutoNER+ELMo (trained on the training set only) results reported by Liu et al., which are slightly lower than the original results in (Shang et al., 2018b). Moreover, we also train the AutoNER model with BERT and report the evaluation results in our paper. Compared with our proposed model, which integrates multi-granularity boundary information, AutoNER only focuses on the ties between adjacent tokens and has poor performance on the benchmark datasets. In particular, TEBNER outperforms the previous state-of-the-art method HAMNER by {3.02%, 5.98%, 8.06%} in terms of F1 score and surpasses the supervised models on the BC5CDR and NCBI datasets, which demonstrates the significant superiority of our proposed model.

Impact of Different Modules
To analyze the performance of different modules and investigate their impact on the final results, we also conduct experiments on following aspects.

Effectiveness of Dictionary Extension
To evaluate the effectiveness of our dictionary extender, we compare three extension methods and report their distantly supervised annotation quality on the training set. As shown in Table 3, our method achieves significant relative improvements of 10.84% and 6.60% in recall and F1 score. It is worth noting that previous methods utilize complex strategies (i.e., headword matching, semantic similarity calculation, annotation weight setting) to improve the entity recall rate. Unlike them, our method mainly depends on contextual semantic information, and can therefore be applied to any domain corpus. Moreover, we also try to extend the dictionary with a KNN model. Specifically, for each phrase, the closest entity is recalled from the source dictionary based on the cosine similarity between the corresponding word embeddings, and the type of the recalled entity is assigned to the phrase. Limited by the size of the original dictionary, some entities with similar semantics but different types are easily recalled by the KNN model, which leads to a decline in F1 score.
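The KNN baseline discussed above can be sketched as a nearest-neighbor lookup by cosine similarity; the two-dimensional toy embeddings are illustrative assumptions, with real word embeddings used in practice:

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def knn_assign_type(phrase_vec, dict_entries):
    """`dict_entries`: list of (embedding, entity_type) pairs for the
    source dictionary. Return the type of the most similar entry."""
    return max(dict_entries, key=lambda ev: cosine(phrase_vec, ev[0]))[1]
```

This illustrates the failure mode noted above: when a Chemical and a Disease entity have nearby embeddings, the nearest neighbor can carry the wrong type, which is why the context-based typing model is preferred.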

Influence of the number of Annotations
To evaluate the robustness of our model, we study the influence of the annotated data size on the final results. Concretely, we randomly select sentences from the distant annotations and evaluate our model trained on the selected texts. From Figure 3, we can observe that increasing the size of the annotations generally improves the performance of the model, and the improvement tends to flatten out at 80% of the data. In particular, our model achieves an 86.70% test F1 score on the BC5CDR dataset with only 60% of the data, which demonstrates that TEBNER can significantly reduce the human effort needed to create NER taggers.

Ablation studies To better explore the contribution of different modules to the overall performance, we conduct ablation studies on the BC5CDR dataset. From the results shown in Table 4, we can observe that: (1) The dictionary extender is a necessary component that contributes a 3.89% gain in F1 to the ultimate performance; we attribute this gain to the contextual semantic information.
(2) Removing "Break or Tie" tagging scheme degrades the performance by 1.42% F1, which shows that the connection information reflecting the interdependence between adjacent tokens is useful for NER.
(3) The "BIO" tagging scheme contributes much to the overall performance, since the F1 drops by 2.54% if it is removed. (4) When we remove phrase matching result, the score drops by 2.75%, which indicates that the participation of multi-aspect statistical information is important for our model.

Conclusion
In this paper, we propose a new dictionary extension method and design a boundary-aware model in specific domains using distant supervision. Our dictionary extender combines phrase mining method with entity classification model, which can be easily applied to any other domain corpus. By utilizing different tagging schemes to extract candidate entities from sentence and introducing AutoPhrase tool to extract high-quality phrases from corpus, our distantly supervised NER model can detect entities from both local and global perspectives. In experiments, we evaluate our method on different domain datasets and the results demonstrate the effectiveness of our model.