Type Enhanced BERT for Correcting NER Errors

We introduce the task of correcting named entity recognition (NER) errors without re-training the model. After a NER model is trained and deployed in production, it makes prediction errors, which usually need to be fixed quickly. To address this problem, we first construct a gazetteer containing named entities and their possible entity types. We then propose type-enhanced BERT (TyBERT), a method that integrates a named entity's type information into BERT through an adapter layer. When errors are identified, we can repair the model by updating the gazetteer. In other words, the gazetteer becomes a trigger that controls the NER model's output. Experimental results on multiple corpora show the effectiveness of our method, which outperforms strong baselines.


Introduction
Named entity recognition (NER) is the task of identifying spans that belong to particular categories, such as person, location, organization, etc. The NER task is important in the information extraction area, and NER models are widely deployed in real production systems (Yadav and Bethard, 2019). In recent years, many neural-based methods have been proposed to push NER accuracy by designing novel network architectures (Lample et al., 2016; Devlin et al., 2018; Straková et al., 2019; Xue et al., 2022) or incorporating external knowledge (Liu et al., 2019; Wang et al., 2021). Unfortunately, all approaches are still far from perfect. When the model is served in production, we may still encounter recognition errors (i.e., bad cases).
Typically, to fix those bad cases, model developers need to (1) annotate the input sentences causing errors with correct labels, (2) combine the newly annotated sentences with existing training data, (3) train and tune a new model with the new training data and held-out evaluation data, and finally (4) deploy the new model in production. As one can tell, the above process is time-consuming and cannot meet the requirement of fixing urgent errors quickly in a real production environment. Therefore, in this paper, we aim to tackle the problem of how to correct NER errors without re-training models. Taking cases 1 and 2 from Figure 1 as examples, there are two kinds of common NER errors when we train and evaluate a model on the English Few-NERD (Ding et al., 2021) corpus: (1) the model fails to recognize the span "XJ220" as a named entity; (2) the model correctly identifies the boundary of the named entity "Nicaragua", but assigns a wrong entity type to it.

For the first error, we find that the span "XJ220" never appears in the training dataset. Therefore, it is difficult for the model to classify this span as a named entity with limited context. For the second error, the mention "Nicaragua" is found in the training dataset, but it is labeled with a different type, location. Because of this incomplete type information, the model mistakenly classifies the mention as type location, though the correct label should be organization_sportsteam.
The above examples suggest that if we have proper type information about the span, the model may correct its mistakes, even without re-training. This motivates us to propose the Type Enhanced BERT (TyBERT) method, which combines BERT with type information from a gazetteer.
As shown in Figure 1, the gazetteer is a list of pairs of spans and their possible entity types. During training, we first look up spans from the gazetteer in training examples, and then integrate the matched spans' type information into BERT layers through an adapter layer. In the inference stage, test examples are processed in the same way. In this manner, the model is tied to the gazetteer, which plays an important role when the model makes predictions. When encountering the aforementioned two kinds of errors, we can update the gazetteer: we insert a new named entity "XJ220" with the expected type product_car, and add a new type organization_sportsteam for the existing named entity "Nicaragua". Moreover, we introduce a noise rate parameter λ that randomly adds some noise to the gazetteer. This parameter serves as an adjuster that balances the strength of the gazetteer against the generalization ability of the model.
To our knowledge, this is the first work to systematically study how to improve NER models without re-training them. When evaluated on four NER corpora in English and Chinese, the proposed method performs well in fixing errors and outperforms strong baselines. Our code and data will be released after publication.

Related Work
Our work is influenced by existing methods that combine neural networks with lexicons or gazetteers for NER. For example, Zhang and Yang (2018) proposed a lattice-structured LSTM encoding both a sequence of input characters and potential words that match a pre-gathered lexicon. Sui et al. (2019) presented a Collaborative Graph Network to solve the challenges of self-matched lexical words and the nearest contextual lexical words. Gui et al. (2019) aimed to alleviate the word ambiguity issue with a lexicon-based graph neural network with global semantics. Lin et al. (2019) designed an attentive neural network to explicitly model the mention-context association, along with a gazetteer network to effectively encode the name regularity of mentions using only gazetteers. Li et al. (2020) introduced a flat-lattice Transformer to incorporate lexicon information for Chinese NER. Meng et al. (2021) invented GEMNET, which includes a Contextual Gazetteer Representation encoder combined with a novel Mixture-of-Experts gating network to conditionally utilize this information alongside any word-level model. Fetahu et al. (2022) proposed using a token-level gating layer to augment pretrained multilingual transformers with gazetteers from a target domain. Finally, Liu et al. (2021) proposed Lexicon Enhanced BERT (LEBERT) for Chinese sequence labeling, which integrates external lexicon knowledge into BERT layers directly through a Lexicon Adapter layer.
It is worth noting that none of the previous works can be directly applied to correcting NER models without re-training. For example, LEBERT requires learning lexicon embeddings in the adapter layer. If we want to add a new span to the lexicon to fix a bad case, the model has to be re-trained to learn the new span's embedding.

Gazetteer Construction
As noted before, the gazetteer contains a list of named entities and their possible entity types. In this paper, we collect the gazetteer solely from the NER annotations in the dataset. For instance, given the following two annotated sentences from the Few-NERD corpus:

London [art-music] is the fifth album by the British [location-gpe] rock band.

He is domiciled in London [location-gpe].

we will construct the following gazetteer: London → {art-music, location-gpe}, British → {location-gpe}. We employ this simple approach because it is applicable to NER tasks in any language or domain. One can also use external resources such as Wikipedia to construct a larger gazetteer (Fetahu et al., 2021). We leave exploring a larger gazetteer to future work because it is not the focus of this paper.
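This construction can be sketched as follows. The input format (token lists plus exclusive-end span annotations) is our assumption, but the span-to-types mapping mirrors the description above:

```python
from collections import defaultdict

def build_gazetteer(annotated_sentences):
    """Map each annotated span to the set of entity types it is
    labeled with anywhere in the data."""
    gazetteer = defaultdict(set)
    for tokens, spans in annotated_sentences:
        for start, end, entity_type in spans:  # end is exclusive
            mention = " ".join(tokens[start:end])
            gazetteer[mention].add(entity_type)
    return gazetteer

# The two example sentences above:
data = [
    (["London", "is", "the", "fifth", "album", "by", "the",
      "British", "rock", "band"],
     [(0, 1, "art-music"), (7, 8, "location-gpe")]),
    (["He", "is", "domiciled", "in", "London"],
     [(4, 5, "location-gpe")]),
]
gaz = build_gazetteer(data)
# gaz["London"] == {"art-music", "location-gpe"}
# gaz["British"] == {"location-gpe"}
```

Because the gazetteer is a plain span-to-types mapping with no learned parameters, fixing an error later amounts to a dictionary update, e.g. `gaz["Nicaragua"].add("organization_sportsteam")`.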
Furthermore, although the generated gazetteer is quite accurate, a downside is that when we integrate such a high-quality gazetteer into the model, the model tends to put too much trust in the gazetteer. In turn, this hurts the model's generalization ability. Therefore, we intentionally add some noise to the gazetteer. Specifically, with probability λ, we choose one of the following three strategies to add noise: (1) randomly select a span that is not labeled as a named entity, and add it to the gazetteer with a random entity type; (2) for a labeled named entity span, add it to the gazetteer with a randomly assigned wrong entity type; (3) skip adding a labeled named entity span to the gazetteer. In practice, we set λ to a small value, so that the gazetteer retains strong control over the final predictions while the model's generalization ability is still preserved to some degree.
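A minimal sketch of the noise injection follows. The paper does not specify how random spans, types, or strategies are sampled, so those details (uniform choices, a per-entry coin flip with probability λ) are illustrative assumptions:

```python
import random
from collections import defaultdict

def build_noisy_gazetteer(annotated_sentences, all_types, lam=0.05, seed=0):
    """Build a gazetteer, corrupting each labeled entry with probability
    lam using one of the three noise strategies described in the text."""
    rng = random.Random(seed)
    gaz = defaultdict(set)
    for tokens, spans in annotated_sentences:
        labeled = {(s, e) for s, e, _ in spans}
        for s, e, t in spans:
            if rng.random() >= lam:
                gaz[" ".join(tokens[s:e])].add(t)  # clean entry
                continue
            strategy = rng.choice([1, 2, 3])
            if strategy == 1:
                # (1) add a random non-entity span with a random type
                i = rng.randrange(len(tokens))
                j = rng.randrange(i + 1, len(tokens) + 1)
                if (i, j) not in labeled:
                    gaz[" ".join(tokens[i:j])].add(rng.choice(all_types))
            elif strategy == 2:
                # (2) add the real span with a deliberately wrong type
                wrong = rng.choice([x for x in all_types if x != t])
                gaz[" ".join(tokens[s:e])].add(wrong)
            # (3) skip: the labeled span is simply not added
    return gaz
```

With `lam=0` this reduces to the clean construction; raising `lam` trades gazetteer control for generalization, as studied in the noise experiments later in the paper.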
Note that during training, the gazetteer is constructed using training and development data. When we want to fix errors in test data, the gazetteer is updated using test data.

Model Architecture
TyBERT is built on standard BERT with two modifications: (1) given a sentence, the input word sequence is converted to a word-type pair sequence that serves as the input to TyBERT; (2) a type adapter that integrates type information into BERT is attached between Transformer layers.

Word-Type Pair Sequence. Given a gazetteer G and a sentence with a sequence of words s_w = {w_1, w_2, ..., w_n}, we match the word sequence against G to find all potential named entities inside the sentence. This yields a word-type pair sequence s_wt = {wt_1, wt_2, ..., wt_n}. When the word w_i is not part of any potential named entity, wt_i is simply w_i. Otherwise, wt_i is (w_i, t_i), where t_i holds all matched entities' types with a B- or I- prefix indicating whether the word begins or is inside a named entity.
Taking the sentence "London Bridge is famous" as an example, the word "London" is part of two potential named entities: (1) "London" with types art-music and location-gpe, and (2) "London Bridge" with type building. Therefore, t_i for the word "London" is {[B-art-music, B-location-gpe], [B-building]}.
Formally, we have t_i = {Type(x_ij)}, where x_ij is the j-th potential named entity that contains the word w_i, and Type(x) = [et_1, et_2, ..., et_k] represents all possible entity types of named entity x according to G, with each et_i being one of the possible labels, such as B-art-music.

Type Adapter. Our Type Adapter (TA) is shown in Figure 2 and is inspired by the Lexicon Adapter proposed by Liu et al. (2021). Specifically, as discussed above, t_i has a two-level structure, so we propose a two-level attention mechanism.
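The word-type matching above can be sketched as follows. The gazetteer-as-dictionary representation and the maximum span length are our assumptions:

```python
def match_word_types(tokens, gazetteer, max_span_len=5):
    """For each word, collect the BIO-prefixed types of every gazetteer
    span covering it: t_i is a list with one entry per matched entity,
    each entry listing that entity's possible labels."""
    types = [[] for _ in tokens]
    n = len(tokens)
    for i in range(n):
        for j in range(i + 1, min(n, i + max_span_len) + 1):
            mention = " ".join(tokens[i:j])
            if mention in gazetteer:
                ent_types = sorted(gazetteer[mention])
                for k in range(i, j):
                    prefix = "B-" if k == i else "I-"
                    types[k].append([prefix + t for t in ent_types])
    return list(zip(tokens, types))

gaz = {"London": {"art-music", "location-gpe"},
       "London Bridge": {"building"}}
pairs = match_word_types("London Bridge is famous".split(), gaz)
# pairs[0] == ("London", [["B-art-music", "B-location-gpe"], ["B-building"]])
# pairs[1] == ("Bridge", [["I-building"]])
```

This reproduces the running example: "London" receives the two-level structure t_i described in the text, while "is" and "famous" carry no type information.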
First, at position i, we compute cross attention between the hidden state h_i and the embeddings of the possible entity types Type(x_ij) of a potential named entity x_ij to obtain m_ij. Then we compute another cross attention between the hidden state h_i and the m_ij, and finally obtain the new hidden state h̃_i.
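A toy NumPy sketch of this two-level cross attention follows. The paper does not give the exact projection layout, so the single-head scaled dot-product form, the projection matrices, and the final residual connection are all illustrative assumptions:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def type_adapter(h_i, type_embs_per_entity, Wq1, Wk1, Wv1, Wq2, Wk2, Wv2):
    """Two-level cross attention for position i.
    type_embs_per_entity: one (k_j, d) array of type embeddings per
    matched entity x_ij."""
    d = len(h_i)
    # Level 1: attend from h_i over each entity's possible type
    # embeddings, producing one summary m_ij per matched entity.
    m = []
    for E in type_embs_per_entity:
        q = Wq1 @ h_i
        K, V = E @ Wk1.T, E @ Wv1.T
        a = softmax(K @ q / np.sqrt(d))
        m.append(a @ V)
    M = np.stack(m)  # (num_entities, d)
    # Level 2: attend from h_i over the per-entity summaries m_ij.
    q = Wq2 @ h_i
    K, V = M @ Wk2.T, M @ Wv2.T
    a = softmax(K @ q / np.sqrt(d))
    return h_i + a @ V  # new hidden state (residual assumed)
```

The two attention levels mirror the two-level structure of t_i: the first aggregates over a single entity's candidate types, the second over all entities covering the word.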
Compared with BERT, the only extra parameters of TyBERT are the embeddings of the entity types et_k and the related weights in the two cross attentions, all of which are fully learned at training time. Thus, when updating the gazetteer at test time, we do not have to update any parameters in TyBERT. Following Liu et al. (2021), we insert a TA only after the first Transformer layer.

Experimental Setup
Datasets. For evaluation, we employ four datasets, two in English and two in Chinese. For English, we use the widely used OntoNotes 5.0 corpus (Pradhan et al., 2013) and the challenging Few-NERD corpus (Ding et al., 2021) with 66 fine-grained types. For Chinese, we use the OntoNotes 4.0 corpus (Weischedel et al., 2011) and the Weibo corpus (Peng and Dredze, 2015, 2016) from the social media domain. Detailed statistics of the four corpora are shown in Table 1.

Evaluation measures. Following previous NER work, standard F1-score (F1), Precision (P), and Recall (R) are used as evaluation metrics.

Hyperparameter tuning. We tune training-related hyperparameters on the development set and report results on the test set. The tuned hyperparameter values are shown in Appendix A.

Implementation details. The implementation details are explained in Appendix B.

Results
Baseline systems. To compare with our proposed method, we use BERT (Devlin et al., 2018) as a baseline. Because standard BERT cannot correct errors without model re-training, we further design two additional baseline systems. These two baselines ensemble BERT with a rule-based method that uses a gazetteer, as follows. We construct the gazetteer using all of the training, development, and test data. The gazetteer is then used to match the sentences in the test data and identify named entities. When a span has multiple entity types, we randomly assign one. Depending on whether we intersect or union the outputs of BERT and the rule-based method, we name the two baseline systems BERT+Intersect and BERT+Union, respectively.

Discussion. Results of BERT, the two extra baseline systems, and our proposed TyBERT on the four corpora are shown in Table 2. Compared with BERT, BERT+Intersect improves it by only a small margin, and BERT+Union improves it only slightly on the Few-NERD corpus. In contrast, with λ=0.05 (tuned on the development set), our proposed TyBERT improves over BERT by a large margin, i.e., 6.63% and 18.91% on the two English corpora, and 3.56% and 6.05% on the two Chinese corpora. We notice that the improvement on the Chinese corpora is smaller than on the English corpora. The reason is that there are many more named entities with multiple types in the Chinese corpora; e.g., the confusion between location and gpe causes many errors. In future work, we plan to consider a named entity's context to fix errors. We separately analyze the gains brought by our solution on the OntoNotes 4.0 dataset in Appendix D.
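The two ensembling baselines can be sketched as follows on span-level predictions. This is a toy illustration: the paper does not specify how overlapping spans are reconciled in the union case, so the "BERT wins on overlap" rule here is our assumption:

```python
def combine_predictions(bert_spans, gaz_spans, mode):
    """Combine two sets of (start, end, type) predictions.
    'intersect' keeps only spans both systems agree on; 'union' keeps
    all BERT spans plus gazetteer spans on tokens BERT left untagged."""
    if mode == "intersect":
        return bert_spans & gaz_spans
    # Union: gazetteer spans may not overlap tokens BERT already tagged.
    covered = {k for s, e, _ in bert_spans for k in range(s, e)}
    extra = {(s, e, t) for s, e, t in gaz_spans
             if not any(k in covered for k in range(s, e))}
    return bert_spans | extra

bert = {(0, 2, "person"), (5, 6, "location")}
gaz = {(0, 2, "person"), (7, 8, "org")}
# intersect -> {(0, 2, "person")}
# union     -> bert plus (7, 8, "org")
```

As the results suggest, such post-hoc combination is weak: intersection discards correct BERT-only spans, and union cannot fix type errors on spans BERT already tagged.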

Impact of gazetteer noise
We further conduct experiments to study the impact of gazetteer noise on the Chinese OntoNotes corpus.
Results are shown in Table 3. For each λ, we show the results of TyBERT before and after updating the gazetteer using test data. A few observations can be made. When λ is set to 0, the model before updating the gazetteer loses generalization ability and hence performs poorly. When λ is set to a nonzero value, the model before updating the gazetteer improves considerably, and many errors are fixed after updating the gazetteer using test data.

Conclusions
We introduced a new task of correcting NER errors without re-training models. We proposed TyBERT, which extends the standard BERT model with an adapter layer to incorporate spans' type information stored in a gazetteer. We further introduced a noise rate parameter to balance the strength of the gazetteer against the model's generalization ability. Extensive results justify the effectiveness of the proposed method. We hope our work will inspire future studies on NER error correction without model re-training.


Limitations

One limitation is that our gazetteer is constructed solely from the dataset annotations. Following prior work (Fetahu et al., 2022), we will construct a larger gazetteer using external resources such as Wikipedia or knowledge bases. As mentioned in Section 3, we leave this for future work.
Another limitation is that the gazetteer contains many spans associated with multiple entity types. Taking the running example in Section 3.1, the span "London" has type location-gpe in most cases, while it is sometimes labeled as type art-music. However, in our current design, given a named entity, there is no way to explicitly distinguish between its different types. In future work, we will consider the context of a named entity when fixing errors.

A Hyperparameter
The tuned hyperparameters are shown in Table 4.

B Implementation Details
We implemented the models using PyTorch. All models are initialized from BERT-base English or Chinese checkpoints (Devlin et al., 2018), which have about 110M parameters. Each experiment is trained on a single V100 GPU for about 1 to 4 hours, depending on the corpus size.

C Corpus License
The Few-NERD corpus is under the CC BY-SA 4.0 license, the Weibo corpus is under the CC BY-SA 3.0 license, and the OntoNotes corpora are used under an LDC license. These corpora do not contain any personally identifiable information or offensive content.

D Correction and Recall Details on the OntoNotes Dataset
Comparing BERT and TyBERT, the analysis mainly covers the following aspects: (1) the number of errors for each entity type; (2) the kind of error for each entity type (substitution or deletion); (3) the number of corrections for data unseen in training; (4) the number of corrections for data seen in training. More details can be found in Tables 5 and 6.

Figure 1: Two motivating examples and the overall process of fixing errors by updating the gazetteer.

Table 1: Statistics of the four corpora.

Figure 2: Structure of the Type Adapter (TA).

Table 3: Results of TyBERT with different λ.

Table 4: The hyperparameters used for the four corpora.

Table 5: The distribution of error labels corrected by the model.

Table 6: The distribution of newly recalled labels by the model.