Toward Fully Exploiting Heterogeneous Corpus: A Decoupled Named Entity Recognition Model with Two-stage Training

Named Entity Recognition (NER) is a fundamental and widely used task in natural language processing (NLP), and NER models are generally trained on human-annotated corpora. However, data annotation is costly and time-consuming, which restricts the scale of training data and leads to a performance bottleneck for NER models. In practice, we can conveniently collect large-scale entity dictionaries and distantly supervised data. However, the collected dictionaries lack semantic context, and the distantly supervised training instances contain considerable noise, which brings uncertain effects to NER models when they are directly incorporated into the high-quality training set. To address this issue, we propose a BERT-based decoupled NER model with two-stage training that appropriately exploits a heterogeneous corpus consisting of dictionaries, distantly supervised instances, and human-annotated instances. Our decoupled model consists of a Mention-BERT and a Context-BERT that respectively learn from the context-deficient dictionaries and the noisy distantly supervised instances at the pre-training stage. At the unified-training stage, the two BERTs are trained together on human-annotated data to predict the correct labels for candidate regions. Empirical studies on three Chinese NER datasets demonstrate that our method achieves significant improvements over several baselines, establishing new state-of-the-art performance.


Introduction
Named entity recognition is a fundamental natural language processing task that labels each word in a sentence with a predefined type, such as Person (PER), Location (LOC), or Organization (ORG). The results of NER can be used in many downstream NLP tasks, e.g., relation extraction (Bunescu and Mooney, 2005), information retrieval (Chen et al., 2015), and question answering (Yao and Van Durme, 2014). Supervised methods are the mainstream approach to NER, including CRF (Lafferty et al., 2001) and neural network models (Collobert et al., 2011; Lample et al., 2016; Ma and Hovy, 2016). Recently, large-scale pre-trained language models fine-tuned on a limited amount of annotated data have achieved competitive or better performance on the NER task (Peters et al., 2018; Devlin et al., 2019; Yang et al., 2019b).
Supervised NER methods require a sufficient amount of sentence-level annotated data, even when they are built on pre-trained language models. However, obtaining sentence-level annotations is expensive, which keeps training sets small and creates a performance bottleneck for supervised models. In practice, entity dictionaries (or gazetteers) and unlabeled corpora can be obtained at low cost. Furthermore, distantly supervised data can be generated automatically by matching the unlabeled data against entity dictionaries. Together, these resources form a heterogeneous corpus with the potential to improve NER. However, dictionaries contain only entity mentions without context, and distantly supervised data can be highly noisy, with both wrong labels and wrong boundaries. As a result, it is unwise to treat dictionaries and distantly supervised data the same as human-annotated data.
To better utilize the heterogeneous corpus, we propose a BERT-based decoupled NER model with two-stage training. The model decouples mention information and context information with a Mention-BERT and a Context-BERT, which can better exploit the information in entity dictionaries and distantly supervised data respectively. In the pre-training stage, the Mention-BERT is pre-trained on the entity dictionary with a classification task, and the Context-BERT is pre-trained on the distantly supervised data with two auxiliary tasks (masked language modeling and classification). During inference, the decoupled model uses the mention information and context information together to make the final prediction. We evaluate our method on three Chinese NER datasets. Experimental results show that our method outperforms baseline methods and achieves the best results, demonstrating its effectiveness. The contributions of our work can be summarized as follows:
• We propose a decoupled NER model with two-stage training, which can fully exploit a heterogeneous corpus consisting of dictionaries, distantly supervised instances, and human-annotated instances.
• Our model achieves state-of-the-art results on three common Chinese NER datasets, significantly outperforming the current SOTA by 1.51% on OntoNotes and 1.7% on Weibo, as well as obtaining a slight but noticeable gain on MSRA.

Named Entity Recognition
The task of named entity recognition is to find entities in sentences with predefined types, such as PER, LOC, and so on. Given an input sentence $X = \{x_1, x_2, \ldots, x_n\}$, where $x_i$ denotes the $i$-th token, and a predefined tag set $Y$, NER can be modeled as a sequence labeling task or a region-based classification task. In sequence labeling approaches, the model assigns a label $y \in Y$ to each token $x_i$. In region-based approaches, the model examines each candidate region $\{x_i, x_{i+1}, \ldots, x_{i+k}\}$ and attempts to assign a label $y \in Y$ to it, where $i$ is the starting position of the region in the sentence and $k$ is the length of the region. Our model follows the framework of region-based approaches.
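To make the region-based formulation concrete, here is a minimal sketch; the maximum span length and the character-level tokenization are illustrative assumptions, not details from the paper:

```python
# Enumerate all candidate regions {x_i, ..., x_{i+k}} up to a maximum
# length; each region would then be classified with a label in Y or O.
def enumerate_regions(tokens, max_len=4):
    """Yield (start, end_inclusive, span) for every candidate region."""
    n = len(tokens)
    for i in range(n):
        for k in range(min(max_len, n - i)):
            yield i, i + k, tokens[i:i + k + 1]

sentence = list("澳门是自由经济体")  # character-level input, as in Chinese NER
for start, end, span in enumerate_regions(sentence):
    pass  # each candidate span is scored independently against the tag set
```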

BERT-NER Model
Recently, large-scale pre-trained language models, such as BERT (Devlin et al., 2019) and ELMo (Peters et al., 2018), have been widely used in NLP and yield state-of-the-art performance on many tasks. These models follow a two-stage paradigm: they are first pre-trained on large-scale unlabeled text via self-supervised tasks such as masked language modeling and next sentence prediction, and then fine-tuned on relatively small labeled data for downstream tasks. The BERT-NER model is easily adapted from a pre-trained BERT model and achieves competitive performance. Given a sentence $X$, BERT first outputs the sentence representation $H = \{h_1, h_2, \ldots, h_n\}$, where $h_i$ is the representation of token $x_i$. Then, $H$ is passed through a feed-forward network (FFN) to obtain the label sequence $\{y_1, y_2, \ldots, y_n\}$:

$y_i = \mathrm{softmax}(W h_i + b)$

where $W$ and $b$ are parameters of the FFN, and $y_i$ is the predicted label of $x_i$. Our model is built on top of the BERT model. Compared with BERT-NER, we propose a new decoupled architecture to better utilize heterogeneous data. Moreover, unlike the original training tasks of BERT, our model introduces task-aware pre-training tasks within a two-stage training framework.
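As a reference point, the following is a minimal sketch of such a BERT-NER baseline in PyTorch with the HuggingFace transformers library; the model name, label count, and example sentence are assumptions for illustration:

```python
import torch
from transformers import BertModel, BertTokenizerFast

NUM_LABELS = 9  # e.g., BIO tags over four entity types -- an assumption

tokenizer = BertTokenizerFast.from_pretrained("bert-base-chinese")
bert = BertModel.from_pretrained("bert-base-chinese")
ffn = torch.nn.Linear(bert.config.hidden_size, NUM_LABELS)

inputs = tokenizer("足协宣布了新的计划", return_tensors="pt")
H = bert(**inputs).last_hidden_state        # H = {h_1, ..., h_n}
logits = ffn(H)                             # y_i = softmax(W h_i + b)
pred = logits.argmax(dim=-1)                # one predicted label id per token
```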

Model Architecture
Generally, an effective NER model should capture two types of information for determining an entity: mention information and context information. In traditional NER models, the two are coupled in the annotated data. Our proposed model decouples them, making each type of information explicit and easier to learn from the heterogeneous corpus.
Overview. As shown in Figure 1, our model consists of three main parts: a Mention-BERT, a Context-BERT, and a Global-Classifier. The input is a sentence along with a region denoting a mention candidate. The model decouples the mention from the context and feeds the two parts into the Mention-BERT and the Context-BERT respectively. The outputs of the two BERTs are then concatenated and passed through the Global-Classifier to obtain the final label prediction. Additionally, the two BERT outputs are also passed through a mention-focused and a context-focused classifier respectively to provide auxiliary supervision during training, which we elaborate on later.
Mention-BERT. The Mention-BERT is used to capture the representation of the mention to be recognized. Its input is an entity mention from the input sentence, and its output is the representation of that mention. The architecture of the Mention-BERT is the same as the original BERT, a multi-layer bidirectional Transformer encoder. As shown in Figure 1(a), the output at the [CLS] position is used as the mention representation, denoted as $h_m$.

Context-BERT. The Context-BERT aims to encode the context around an entity mention. It has the same architecture as the Mention-BERT. Its input $c$ is just the context of the candidate mention, where the mention is replaced by a special [MASK] token. The output corresponding to the [MASK] position is used as the representation of the context, denoted as $h_c$. For example, in Figure 1(b), we have $h_c = h_1$. Note that at inference time we use only one [MASK] even for multi-token entities, since the Context-BERT is not allowed to use any information about the mention.
Global-Classifier. The Global-Classifier determines the tag of the input mention by considering both the mention representation and the context representation. In the implementation, we concatenate the output $h_m$ of the Mention-BERT and the output $h_c$ of the Context-BERT and pass them into an FFN:

$y_g = \mathrm{softmax}(W_g [h_m; h_c] + b_g)$

where $W_g$ and $b_g$ are parameters of the Global-Classifier, and $y_g$ is the final prediction.
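A sketch of how the three parts could fit together in PyTorch is given below; reading $h_m$ from the [CLS] position and $h_c$ from the [MASK] position follows the description above, while the class and argument names are illustrative:

```python
import torch
import torch.nn as nn

class DecoupledNER(nn.Module):
    def __init__(self, mention_bert, context_bert, hidden, num_labels):
        super().__init__()
        self.mention_bert = mention_bert  # may be the same module if shared
        self.context_bert = context_bert
        self.global_clf = nn.Linear(2 * hidden, num_labels)  # Global-Classifier
        self.mention_clf = nn.Linear(hidden, num_labels)     # auxiliary
        self.context_clf = nn.Linear(hidden, num_labels)     # auxiliary

    def forward(self, mention_ids, context_ids, mask_pos):
        # h_m: output at the [CLS] position of the Mention-BERT
        h_m = self.mention_bert(mention_ids).last_hidden_state[:, 0]
        # h_c: output at the [MASK] position of the Context-BERT
        ctx = self.context_bert(context_ids).last_hidden_state
        h_c = ctx[torch.arange(ctx.size(0)), mask_pos]  # mask_pos: (batch,)
        y_g = self.global_clf(torch.cat([h_m, h_c], dim=-1))
        return y_g, self.mention_clf(h_m), self.context_clf(h_c)
```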

Two-stage Training
Pre-trained language models such as BERT aim to model general patterns of language and treat entity and non-entity words indiscriminately, so they cannot be expected to produce ideal representations for the NER task. To better utilize external heterogeneous data for NER, we design a two-stage training framework: (1) pre-training the Mention-BERT and the Context-BERT on entity dictionaries and distantly supervised data, and (2) training the unified model on human-annotated data.
Heterogeneous Training Data. Despite the limited size of human-annotated data for NER, we can easily collect large-scale entity dictionaries and unlabeled text corpora, and hence generate distantly supervised data.

Figure 2: The Mention-BERT is pre-trained on the entity dictionary using a label classification task. For example, we try to predict that "足协" (Football Association) on its own is an organization (ORG), as is "乒协" (Table Tennis Association).

Dictionary entries often carry rich entity structure information; for example, a person name often consists of a first name and a last name. Distantly supervised data often contains rich context information but is highly noisy, the most common mistakes being wrong labels and wrong boundaries. As a result, these data are not suitable for direct incorporation into NER training. However, they can naturally serve as pre-training data for learning high-coverage, task-aware representations of entity mentions and contexts. On the one hand, previous research showed that further pre-training BERT with language modeling on an in-domain corpus can improve downstream task performance (Gururangan et al., 2020). On the other hand, either the entity or the context by itself can be a strong indicator of the entity type.
Mention-BERT Pre-Training. To better capture the regularities of entities, the Mention-BERT is pre-trained on entity dictionaries. As shown in Figure 2, we add a feed-forward classifier, denoted the Mention-Classifier for Pre-Training, on top of the Mention-BERT. The task is to classify each input term into its most probable label according to the dictionaries; for example, the output for the term "足协" (Football Association) should be ORG. Moreover, to enable the model to learn discriminative representations for non-entity terms as well, we sample items from a common dictionary that never appear in any of the entity dictionaries and assign the O label to them.
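The construction of these pre-training examples from dictionaries might look like the following sketch (all dictionary contents are made up for illustration):

```python
# Each dictionary term becomes a (term, label) classification instance;
# common-dictionary terms unseen in any entity dictionary get label O.
entity_dicts = {"ORG": ["足协", "乒协"], "PER": ["李明"]}
common_dict = ["高兴", "足协", "学习"]

examples = [(term, label)
            for label, terms in entity_dicts.items()
            for term in terms]

entity_terms = {t for terms in entity_dicts.values() for t in terms}
examples += [(t, "O") for t in common_dict if t not in entity_terms]
# [('足协', 'ORG'), ('乒协', 'ORG'), ('李明', 'PER'), ('高兴', 'O'), ('学习', 'O')]
```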

Context-BERT Pre-Training. As shown in Figure 3, the Context-BERT is pre-trained on distantly supervised data with a hybrid task of masked language modeling and entity label prediction. For each input sentence, we pick one entity mention at a time and replace all of its tokens with [MASK] tokens. Given only the context, with the mention masked out, the model is trained to predict both the masked tokens and the entity label. We also randomly pick some non-entity regions and assign the O label to them. To this end, we use two classifiers, the Masked Language Model and the Context-Classifier for Pre-Training. The Masked Language Model is the same as in the original BERT. The Context-Classifier for Pre-Training is fed with the average pooling of the Context-BERT's outputs over all masked token positions.
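One distantly supervised instance for this hybrid task could be built as in the sketch below; the sentence, span, and label are illustrative:

```python
# The mention tokens are replaced by [MASK]; the masked LM head must
# recover them, and the Context-Classifier for Pre-Training predicts the
# entity label from the average-pooled outputs at the masked positions.
sentence = list("在上海合作组织成立大会上")
start, end, label = 1, 7, "ORG"  # span of "上海合作组织" from dictionary matching

mlm_targets = sentence[start:end]  # tokens the masked LM must recover
masked_input = sentence[:start] + ["[MASK]"] * (end - start) + sentence[end:]
# masked_input: ['在', '[MASK]' * 6, '成', '立', '大', '会', '上']
```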

Unified-Training. After pre-training, we perform unified-training, in which the pre-trained Mention-BERT and Context-BERT are put together and further trained on human-annotated data. To construct training examples, we iterate over all entity mentions in the annotated sentences and obtain pairs of 〈MENTION, CONTEXT〉 as the input of our model (see Figure 1). We also select non-entity regions and label them O. Given the correct label $y$, we define the loss of the Global-Classifier as

$L_g = \mathrm{CE}(y_g, y)$

where CE is the cross-entropy loss. Furthermore, to avoid catastrophic forgetting in the pre-trained Mention-BERT and Context-BERT during unified-training, we add two auxiliary feed-forward classifiers on top of them, denoted the Mention-Classifier and the Context-Classifier respectively (see (d) and (e) in Figure 1). Both have the same structure and objective as the Global-Classifier, differing only in their inputs:

$y_m = \mathrm{softmax}(W_m h_m + b_m)$, $y_c = \mathrm{softmax}(W_c h_c + b_c)$

where $W_m, b_m, W_c, b_c$ are parameters of the Mention-Classifier and the Context-Classifier, and $y_m, y_c$ are their respective predictions. The losses of the two classifiers are defined analogously:

$L_m = \mathrm{CE}(y_m, y)$, $L_c = \mathrm{CE}(y_c, y)$

The final loss $L$ at the unified-training stage combines the three parts:

$L = L_g + \alpha L_m + \beta L_c$

where $\alpha, \beta \in [0, 1]$ are hyper-parameters.
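Under the loss definitions above, the unified-training objective could be computed as in this sketch (function and variable names are illustrative):

```python
import torch.nn.functional as F

def unified_loss(y_g, y_m, y_c, gold, alpha=0.5, beta=0.5):
    """L = L_g + alpha * L_m + beta * L_c, with CE the cross-entropy."""
    L_g = F.cross_entropy(y_g, gold)   # Global-Classifier loss
    L_m = F.cross_entropy(y_m, gold)   # auxiliary Mention-Classifier loss
    L_c = F.cross_entropy(y_c, gold)   # auxiliary Context-Classifier loss
    return L_g + alpha * L_m + beta * L_c
```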

Pre-training Corpora
Entity Dictionary. There are four entity types in our experiments: PER, ORG, GPE, and LOC. We extend the dictionary used in prior work with more gazetteers collected from the Sougou Dictionary and the Baidu Dictionary. In total, our gazetteer contains 50k person names, 143k organization names, 43k geopolitical entities, and 33k location names (see Appendix 1.1).
Distantly Supervised Data. The entity dictionary above is used to match unannotated sentences to obtain distantly supervised data. For the OntoNotes and MSRA datasets, we collect news documents from the People's Daily published from 1949 to 2010. For the Weibo dataset, we use the unannotated Weibo data from Peng and Dredze (2015). In total, we obtain 893k sentences of distantly supervised data for news and 837k for Weibo.
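Generating distant labels by dictionary matching could be as simple as the following greedy longest-match sketch; real pipelines would handle overlaps and segmentation more carefully, and the dictionary here is illustrative:

```python
dictionary = {"上海合作组织": "ORG", "上海": "GPE"}

def distant_label(sentence, dictionary):
    """Greedy longest-first matching; returns (start, end, label) spans."""
    spans, i = [], 0
    while i < len(sentence):
        for term in sorted(dictionary, key=len, reverse=True):
            if sentence.startswith(term, i):
                spans.append((i, i + len(term), dictionary[term]))
                i += len(term)
                break
        else:
            i += 1  # no dictionary term starts here
    return spans

print(distant_label("在上海合作组织成立大会上", dictionary))
# -> [(1, 7, 'ORG')]  -- noisy by nature: labels/boundaries may be wrong
```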

Training Setting
Hyper-parameters for training can be found in Appendix 1.2; we set α = 0.5 and β = 0.5 for unified training based on experiments. To better exploit the common knowledge of the Mention-BERT and the Context-BERT, and also to reduce model size, the two BERTs share their parameters. We do not share the parameters of the classifiers, because the label sets and output dimensions of the classifiers may differ across the two training stages.
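In code, this sharing amounts to passing one encoder instance for both roles, as in this sketch built on the DecoupledNER class sketched earlier (the model name and label count are assumptions):

```python
from transformers import BertModel

shared = BertModel.from_pretrained("bert-base-chinese")
model = DecoupledNER(mention_bert=shared, context_bert=shared,
                     hidden=shared.config.hidden_size, num_labels=5)
# Both roles now update the same BERT weights; the three classifiers
# (global, mention, context) keep separate parameters.
```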

Baselines
We use the following models as baselines:

BiLSTM-CRF from Lample et al. (2016), which is a classical baseline for NER.
Lattice LSTM from Zhang and Yang (2018), which uses a dictionary and word embeddings to enhance a character-based Chinese NER model.

BERT-NER from Devlin et al. (2019), which uses the outputs of the last layer of the BERT model as feature representations and performs token classification to extract entities.
Incomplete-NER from Jie et al. (2019), which is based on BERT-CRF and uses cross-validation to estimate the distribution of missing labels in distant supervision. (We use the code from https://github.com/ZhuiyiTechnology/AutoIE and combine human-annotated data with an equal amount of distantly supervised data for training.)
MRC-NER from Li et al. (2020b), which considers NER as machine reading comprehension.
SoftLexicon from Ma et al. (2020), which proposes a simple but effective method for incorporating the word lexicon into the character representations in Chinese NER.
FLAT from Li et al. (2020a), which uses a Transformer to model the relations between every character and word in the sentence.
ERNIE from Sun et al. (2019), which enhances BERT through knowledge integration, using an entity-level masked LM task and additional raw text from Web resources.
CoFEE from Xue et al. (2020), which proposes an NER-specific pre-training framework to inject coarse-to-fine, automatically mined entity knowledge into pre-trained models.

Main Results
Following the evaluation metrics of previous work, we report entity-level (exact entity match) micro Precision (P), Recall (R), and F1. Table 2 presents the comparison between our model and the baselines. Our decoupled model with two-stage pre-training significantly outperforms recent models, establishing a new state of the art for supervised NER. On OntoNotes, our model outperforms the SoftLexicon model by +1.51% F1. On Chinese MSRA, the proposed method outperforms the FLAT model. On Weibo, we improve the F1 from 70.94% to 72.64%. We can also see that the Mention-BERT pre-trained on the entity dictionary outperforms the plain decoupled model without two-stage pre-training by 0.89% on OntoNotes, 0.52% on MSRA, and 1.55% on Weibo, which shows the effectiveness of mention pre-training for the NER task. The results also show that context pre-training improves performance (by 0.46% on OntoNotes, 0.34% on MSRA, and 0.57% on Weibo). Moreover, further pre-training the Context-BERT on top of the Mention-BERT using distantly supervised data yields an additional F1 gain (0.89% on OntoNotes, 0.52% on MSRA, and 1.55% on Weibo).
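For reference, the entity-level micro metrics used above can be computed as in this sketch, where a prediction counts as correct only if both span and label exactly match the gold annotation:

```python
def micro_prf(pred, gold):
    """pred/gold: sets of (sentence_id, start, end, label) tuples."""
    tp = len(pred & gold)                       # exact span + label matches
    p = tp / len(pred) if pred else 0.0         # micro precision
    r = tp / len(gold) if gold else 0.0         # micro recall
    f1 = 2 * p * r / (p + r) if p + r else 0.0  # micro F1
    return p, r, f1
```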

Effect of Introducing External Data
In our experiments, it is not immediately clear what drives the final improvement: the decoupled model, the additional data, or both. To answer this question and show that our model design better utilizes the heterogeneous corpus, we choose BERT-NER and SoftLexicon as base models and explore the effect of external data on each. For each base model, we experiment with two settings. First, we simply expand the training set by adding the entity dictionary data and the distantly supervised data. Second, we adopt a two-stage training strategy similar to the methods in Section 3.2, using the large external data to further pre-train the BERT part of BERT-NER and SoftLexicon and then fine-tuning the whole models on human-annotated data. The results are shown in Table 3. Our decoupled model achieves the best results. We observe a large performance drop when external training data is directly incorporated into BERT-NER and SoftLexicon, since the distantly supervised data is noisy and its large size is unbalanced against the human-annotated data. Unexpectedly, the two base models also perform worse in the two-stage training setting. We conjecture that the span-classification pre-training task is not suitable for sequence labeling models.

Effect of Human-annotated Data Scale
To compare performance under different amounts of human-annotated training data, we randomly sample training sentences of various sizes from the OntoNotes dataset. As shown in Figure 4, our model consistently outperforms the BERT-NER model, demonstrating its effectiveness with small training data. Notably, with only 20% of the training data, our model already outperforms the BERT-NER model trained on the full data, which shows that our model requires less sentence-level annotated data than the original BERT-NER model. Beyond the model structure and external data, two other factors contribute to this improvement. First, the 20% subset still contains over 3k examples in the news domain. Second, we leverage the mention boundary predictions from LTP, which provide high-quality candidates.
We also experiment with an even smaller training set of only 1k sentences. As shown in Table 4, our model outperforms BERT-NER on all datasets.

Effect of Model Parameter Sharing
In practice, we share the parameters of the Mention-BERT and the Context-BERT. As Table 7 shows, the model with parameter sharing slightly outperforms the model without it. A possible reason is that parameter sharing lets the two BERTs exploit their common knowledge.

Case Study

Table 6: Case study. "Our model" refers to the decoupled model with two-stage training. The text in brackets is the candidate mention, followed by the gold label. Predicted labels in red denote wrong answers.

Table 6 shows two cases from OntoNotes. In the first example, the BERT-NER model misclassifies "九江" (Jiujiang) as LOC. We find that "九江" (Jiujiang) is in our dictionaries with the label GPE. Benefiting from incorporating entity dictionaries into pre-training, our model correctly recognizes "九江" (Jiujiang) as a city. In the second example, BERT-NER misclassifies "东盟" (Association of Southeast Asian Nations) as GPE. We find that the distantly supervised data contains the sentence "在上海合作组织成立5周年大会上" (At the 5th anniversary meeting of the Shanghai Cooperation Organization), in which the context of "上海合作组织" (Shanghai Cooperation Organization) is similar to that of "东盟" (Association of Southeast Asian Nations), and the label of "上海合作组织" (Shanghai Cooperation Organization) is GPE. With the context information from the Context-BERT, our model obtains the correct answer for "东盟" (Association of Southeast Asian Nations).
Related Work

Supervised NER Models

NER models trained on human-annotated data generally achieve good performance. Sequence labeling methods are widely used for NER. Traditional methods use the CRF model (Lafferty et al., 2001). With the advantages of eliminating feature engineering and delivering significant performance gains, neural network models have become prevalent in NER research, e.g., models based on FFNs (Collobert et al., 2011), CNNs (Ma and Hovy, 2016), LSTMs (Lample et al., 2016), and pre-trained language models (Devlin et al., 2019). Recent work also proposes ways to model NER other than sequence labeling, such as machine reading comprehension (Li et al., 2020b), dependency parsing, and span classification (Sohrab and Miwa, 2018). These approaches achieve promising results but rely heavily on human-annotated data.

Enhancing NER with External Data
Entity dictionaries or gazetteers have long been regarded as an easily obtainable and useful resource for NER. Previous methods commonly incorporate gazetteers as additional features (Ghaddar and Langlais, 2018; Al-Olimat et al., 2018; Liu et al., 2019a; Lin et al., 2019; Rijhwani et al., 2020). For languages without explicit word boundaries, such as Chinese, incorporating a universal dictionary of common words in addition to gazetteers can further help NER (Liu et al., 2019b; Sui et al., 2019; Gui et al., 2019b,a; Ma et al., 2020; Li et al., 2020a; Jia et al., 2020). Dictionaries can also be used to construct distantly supervised data from unlabeled corpora. Previous work on reducing the noise in distantly supervised data includes new labeling schemes (Shang et al., 2018), reinforcement learning, cross-training (Jie et al., 2019), positive-unlabeled learning, HMMs (Lison et al., 2020), and consensus networks (Lan et al., 2020a). In other NLP tasks, such as relation extraction, a few works have explored using human-annotated data and distantly supervised data together (Angeli et al., 2014; Beltagy et al., 2019). Compared with previous work, ours focuses on designing a new model architecture and training approach to better exploit heterogeneous data for the NER task.

Two-stage Training Paradigm for NLP
Recently, large-scale pre-trained language models, such as BERT (Devlin et al., 2019) and ELMo (Peters et al., 2018), have been widely used and yield state-of-the-art performance on many NLP tasks. These two-stage methods use large-scale unlabeled data for pre-training and small labeled datasets for fine-tuning. To adapt to specific tasks or domains, many variants of BERT have been proposed, including small and practical BERTs (Tsai et al., 2019; Lan et al., 2020b; Jiao et al., 2020), domain-adaptive BERTs (Yang et al., 2019a; Gururangan et al., 2020), and task-adaptive BERTs (Xue et al., 2020; Jia et al., 2020). Our work further pre-trains BERT with task-aware training objectives to improve NER.

Conclusion
In this work, we focus on fully exploiting a heterogeneous corpus for NER, consisting of entity dictionaries, distantly supervised instances, and human-annotated instances. We propose a decoupled NER model with two-stage training. The model first learns task-aware representations during pre-training from large-scale context-deficient dictionaries and noisy distantly supervised data. Then, after unified-training, the model predicts entity labels according to both mention and context information. Experimental results show that our method outperforms previous state-of-the-art methods on three Chinese datasets. In the future, we will exploit more types of data, such as knowledge bases, and extend our approach to other languages.

Table 7: Coverage rate and conflict rate of the entity dictionary. We use the entity dictionary to directly match the test set, and compute the coverage rate and conflict rate. The coverage rate is the number of entities appearing in both the dictionary and the test set divided by the number of entities in the test set. The conflict rate is the number of entities with inconsistent labels divided by the number of entities appearing in both the dictionary and the test set.