Named Entity Recognition via Noise Aware Training Mechanism with Data Filter

Named entity recognition (NER) is a fundamental task in natural language processing, and there is a long-held belief that more data benefits the model. However, not all data help with generalization: some samples contain ambiguous entities or noisy labels. Existing methods cannot distinguish hard samples from noisy samples well, which becomes particularly challenging in the case of overfitting. This paper proposes a new method called Noise-Aware-with-Filter (NAF) to address these issues from two sides. From the perspective of the data, we design a Logit-Maximum-Difference (LMD) mechanism, which maximizes the diversity between different samples to help the model identify noisy samples. From the perspective of the model, we design an Incomplete-Trust (In-trust) loss function, which boosts L_CRF with a robust Distrust-Cross-Entropy (DCE) term. In-trust effectively alleviates the overfitting caused by previous loss functions. Experiments on six real-world Chinese and English NER datasets show that NAF outperforms previous methods and obtains state-of-the-art (SOTA) results on the CoNLL2003 and CoNLL++ datasets.


Introduction
Named entity recognition (NER) is a primary task that identifies both entity types and entity spans in sentences. Although NER models are becoming more and more accurate, the potential improvement of existing architectures in real-world applications is often inherently limited by data quality (Pleiss et al., 2020). Not all samples in NER datasets are completely correct (Nooralahzadeh et al., 2019; Lange et al., 2019), and many real-world datasets contain samples that are "weakly-labeled" (Derczynski et al., 2017; Peng and Dredze, 2015).

Figure 1: An overview of Noise-Aware-with-Filter (NAF). The In-trust loss prevents the model from overfitting and helps it generate logit matrices, which enter the LMD mechanism together with the labels to filter out noisy data and produce a cleaner training set.

Specifically, some
datasets annotated by distant supervision (Yang et al., 2018; Liang et al., 2020) contain more noisy labels, and manual annotators, especially on crowdsourcing platforms, are prone to making labeling mistakes. Meanwhile, labeling a huge dataset is an expensive and fallible process.
As the number of training epochs increases, the model overfits noisy samples, which hinders its generalization (Pleiss et al., 2020). In the NER task, it is impractical to obtain an absolutely clean dataset: existing datasets generally contain mislabeled samples (Flor et al., 2019) and ambiguous entities (Nadeau et al., 2006), and even classical datasets (e.g., CoNLL2003 (Tjong Kim Sang and De Meulder, 2003)) still contain noisy samples (Wang et al., 2019b). It makes sense to obtain a cleaner dataset, but correcting these real-world datasets manually would be very difficult, and existing methods cannot solve the issue automatically, especially when the sentences contain ambiguous entities (Wang et al., 2019b).
Following the CoNLL2003 NER dataset, we divide samples into three categories: easy samples, which are correctly labeled and do not contain ambiguous entities; hard samples, which are correctly labeled but contain ambiguous entities; and noisy samples, which are mislabeled (Wang et al., 2019b). The boundary between easy samples and noisy samples can easily be obtained by utilizing loss values (Lin et al., 2017), but distinguishing hard samples from noisy samples is still a challenge (Wang et al., 2019b; Pleiss et al., 2020), and it becomes particularly challenging in the case of overfitting (Wang et al., 2019b; Liu et al., 2020).
We propose a new method called Noise-Aware-with-Filter (NAF) to solve these issues from two sides. From the perspective of the data, we design a Logit-Maximum-Difference (LMD) mechanism, which maximizes the diversity between different samples to help the model identify noisy samples. The difference between easy samples and noisy samples is very obvious in the LMD score, and hard samples and noisy samples can also be well distinguished. From the perspective of the model, we propose a noise-tolerant term named Distrust-Cross-Entropy (DCE), which is combined with L_CRF to form the Incomplete-Trust (In-trust) loss function. In-trust not only improves the robustness of the model but also helps LMD identify noisy samples more accurately. Experiments on six real-world Chinese and English datasets show that NAF is more accurate than other methods in identifying noisy samples, and the filtered datasets are cleaner.
In summary, our major contributions are the following:
• We propose a new method called Noise-Aware-with-Filter (NAF) to distinguish hard samples from noisy samples, especially in the case of overfitting.
• To distinguish hard samples from noisy samples, we design a Logit-Maximum-Difference (LMD) mechanism. To alleviate the negative impact of overfitting, we propose the Incomplete-Trust (In-trust) loss function, which exploits both the incomplete correctness of the labels and the relative correctness of the model output.
• We conduct extensive experiments on six real-world Chinese and English NER datasets, showing that NAF outperforms previous methods and obtains state-of-the-art (SOTA) results on the CoNLL2003 and CoNLL++ datasets. We release the source code publicly for further research 1 .

Related Work
Various approaches have been proposed to obtain a robust model. We summarize them into three categories: 1) robust loss methods, 2) training architecture methods, and 3) label correction methods.
Robust loss methods design loss functions that are robust to noise. They include Mean Absolute Error (MAE) (Ghosh et al., 2017) and Improved MAE (Wang et al., 2019a), a reweighted MAE. Symmetric cross entropy (Wang et al., 2019b) adds a symmetric reverse cross entropy term after the cross entropy, giving the model a certain noise tolerance, and generalized cross entropy (Zhang and Sabuncu, 2018) is a new evolutionary form of MAE. Label smoothing regularization (LSR) (Szegedy et al., 2016) uses soft labels in place of one-hot labels to alleviate overfitting to noisy labels. ELR (Liu et al., 2020) makes full use of the early-learning phenomenon to keep a large learning gradient for clean samples. However, these methods cannot effectively distinguish hard samples from noisy samples and easily confuse them.
Training architecture methods identify noisy samples from the perspective of the model framework. Co-Teaching (Han et al., 2018; Yu et al., 2019) exploits the "early learning" phenomenon by maintaining two networks during training. All samples are sorted by their loss values, and the noisy samples are deleted according to a forgetting ratio (Jiang et al., 2018; Malach and Shalev-Shwartz, 2017). AUM (Pleiss et al., 2020) uses the model output to distinguish hard samples from noisy samples. CrossWeigh (Wang et al., 2019b) covers the labels of a certain category in the dataset and observes whether the model predicts the samples into another category. However, these methods always make the model learn the easy samples and do not consider the problem of overfitting (Chang et al., 2017).

Figure 2: The logit value Z_y^(t) corresponding to tag y during training, for easy, hard and noisy samples. The x-axis refers to the number of training epochs, and the y-axis refers to the logit value.
Label correction methods improve the quality of the raw labels. New labels are set either to the probabilities estimated by the model (soft labels) or to one-hot vectors representing the model predictions (hard labels) (Tanaka et al., 2018; Yi and Wu, 2019). Another option is to set the new labels to a convex combination of the noisy labels and the soft or hard labels (Reed et al., 2015). However, these methods require support from extra clean data or an expensive detection process to estimate the noise model.

Logit Maximum Difference Mechanism (LMD)
In this section, we propose the novel LMD mechanism. LMD exploits the tiny differences between hard samples and noisy samples in the model output, accumulating and amplifying these differences to identify noisy samples.

Preliminary
Easy samples and noisy samples are easy to distinguish (e.g., by utilizing loss values (Han et al., 2018)), because noisy samples are always contrary to correctly tagged samples. However, hard samples with ambiguous entities are difficult to distinguish from noisy samples, because hard samples also produce large loss values (Song et al., 2020) in the early stage of training. This has become a major challenge in the denoising task (Song et al., 2020).
Utilizing the Logit Matrix Neural network models output a logit matrix during training, which goes through the softmax layer and then enters the loss function. The softmax layer is a normalized exponential function, which nonlinearly increases the weight of the maximum value in the logit matrix and introduces unfairness when identifying noisy samples. LMD therefore distinguishes hard samples from noisy samples directly from the logit matrix instead of from loss values.
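To make the distortion concrete, the following minimal Python sketch (toy logit values only, no model; the helper name `softmax` is ours) shows how the exponential normalization changes the margin between the top two tags even when their logit gap is identical:

```python
import math

def softmax(logits):
    # Numerically stable softmax: subtract the max before exponentiating.
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

# Two tokens with an IDENTICAL logit gap (1.0) between the gold tag
# and its strongest competitor; only an irrelevant third logit differs.
logits_a = [2.0, 1.0, 0.0]
logits_b = [2.0, 1.0, -10.0]

gap_a = logits_a[0] - logits_a[1]   # 1.0
gap_b = logits_b[0] - logits_b[1]   # 1.0

p_a = softmax(logits_a)
p_b = softmax(logits_b)

# After softmax, the probability margins differ, even though the
# evidence separating the top two tags is exactly the same.
print(p_a[0] - p_a[1])   # ~0.42
print(p_b[0] - p_b[1])   # ~0.46
```

Comparing samples by their raw logit differences avoids this confound, which is the motivation for working on the logit matrix directly.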
Given a sentence x = [x_1, x_2, ..., x_n] and its tag sequence y = [y_1, y_2, ..., y_n], where n is the sentence length, every token x_i obtains a corresponding logit matrix Z_i = [z_{i,1}, z_{i,2}, ..., z_{i,m}], where m denotes the total number of tags. LMD utilizes the difference between z_{i,j}, the logit corresponding to class j, and the other values in the logit matrix.
Observing the Difference In Figure 2, Z_y^(t) denotes the logit value corresponding to tag y at epoch t. For easy samples (left), Z_y^(t) is evidently higher than the other values. For hard samples (middle), Z_y^(t) is small at the beginning of training, then begins to increase and becomes the maximum in the logit matrix as epoch t grows. For noisy samples (right), Z_y^(t) is relatively smaller than the other values, yet it becomes the maximum in the logit matrix once the epoch exceeds 5, even though y is a negative tag, which indicates that overfitting occurs.

Identify Noisy Samples
Defining LMD We propose a new statistic, the LMD score, which averages the difference between Z_y^(t) and the largest of the other logit values at each epoch t. The tiny differences between hard samples and noisy samples are thereby gradually accumulated and maximized, which effectively helps the model identify noisy samples. The LMD score of a token x_i with tag y_i is defined as:

LMD(x_i, y_i) = (1/T) Σ_{t=1}^{T} ( Z_{y_i}^{(t)} − max_{j ≠ y_i} Z_j^{(t)} )   (1)

where T is the total number of epochs. A sentence is the minimum unit of input in the NER task: if the tag of even a single token in a sentence is mislabeled, we consider the whole sentence negative. Therefore, since every token in a sentence obtains an LMD score, we take the smallest token-level LMD score as the LMD score of the sentence.
Working Mechanism To steadily enhance the discrimination between easy, hard and noisy samples, we stack the LMD scores of multiple epochs and take their average. With the LMD mechanism, every sample gets an LMD score, and the LMD scores of easy, hard and noisy samples are clearly separated in Figure 3. We then sort the samples by their LMD scores, define the samples below the noise ratio µ as noisy samples, and delete them to obtain a cleaner training set; the noise ratio µ is a hyperparameter. The model is then trained again on the cleaned training set, obtaining better performance without the interference of noisy samples.
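The mechanism can be sketched in a few lines of plain Python. This is an illustrative sketch, not the released implementation: the function names are ours, the largest competing logit is used as the "other" value, and per-epoch logit snapshots are assumed to have been recorded during training.

```python
def token_lmd(logit_history, gold):
    """LMD score of one token: the per-epoch gap between the gold-tag
    logit and the largest competing logit, averaged over epochs."""
    total = 0.0
    for logits in logit_history:             # one logit vector per epoch
        z_gold = logits[gold]
        z_other = max(z for k, z in enumerate(logits) if k != gold)
        total += z_gold - z_other
    return total / len(logit_history)

def sentence_lmd(token_histories, gold_tags):
    # A single mislabeled token makes the whole sentence suspect,
    # so the sentence score is the minimum token score.
    return min(token_lmd(h, y) for h, y in zip(token_histories, gold_tags))

def filter_noisy(sentences, scores, mu):
    """Drop the fraction mu of sentences with the lowest LMD scores."""
    order = sorted(range(len(sentences)), key=lambda i: scores[i])
    n_drop = int(mu * len(sentences))
    keep = set(order[n_drop:])
    return [s for i, s in enumerate(sentences) if i in keep]

# Toy usage: a token whose gold logit stays on top gets a high score.
history = [[3.0, 1.0], [2.0, 0.0]]           # two epochs, two tags
print(token_lmd(history, 0))                  # (2 + 2) / 2 = 2.0
```

After filtering, the model is retrained from scratch on the kept sentences.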

The Influence of Overfitting
Overfitting Appears We further explore the influence of overfitting on the denoising task from two sides. From the perspective of the LMD scores, the scores of hard samples and noisy samples tend to become consistent as the epoch increases toward 60 in Figure 4. From the perspective of the logit values, the logit value Z_y^(t) of a mislabeled tag eventually becomes the maximum in the logit matrix.
Phenomenon Analysis As the number of training epochs increases, the model overfits noisy samples. The model outputs for hard samples and noisy samples then become almost identical, which makes them difficult to distinguish. This poses another challenge: identifying noisy samples in the case of overfitting.

Incomplete-Trust Loss Function
In this section, we propose the Incomplete-Trust (In-trust) loss function. Previous loss functions (e.g., cross entropy) easily overfit noisy samples (Wang et al., 2019b) because they absolutely trust the tags even if the tags are mislabeled. Meanwhile, neural networks have a strong fitting ability: they can achieve zero training error even on datasets with randomly assigned labels (Zhang et al., 2016). Deep neural networks have also been observed to first fit the samples with clean tags during an "early learning" phase, before eventually memorizing the samples with mislabeled tags (Arpit et al., 2017; Zhang et al., 2016). By exploiting the early-learning phenomenon, In-trust utilizes both the model output, which becomes relatively correct after sufficient training, and the tags, which are only incompletely correct since they may be mislabeled. We also provide a theoretical analysis of the formulation and behavior of In-trust.

Definition
KL-divergence Given two distributions p and q, the relationship between entropy, cross entropy and KL-divergence is as follows:

H(q, p) = H(q) + D_KL(q || p)   (2)

In the NER task, q = q(k|x) is the one-hot distribution of the label of sample x, and p = p(k|x) is the prediction distribution of the model for sample x. Training makes p(k|x) gradually approach q(k|x), which is equivalent to minimizing the KL-divergence between the two distributions.
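The identity can be checked numerically in a few lines of Python (toy distributions; the helper names are ours):

```python
import math

def entropy(q):
    # H(q); terms with q_k = 0 contribute 0 by convention.
    return -sum(qk * math.log(qk) for qk in q if qk > 0)

def cross_entropy(q, p):
    # H(q, p)
    return -sum(qk * math.log(pk) for qk, pk in zip(q, p) if qk > 0)

def kl(q, p):
    # D_KL(q || p)
    return sum(qk * math.log(qk / pk) for qk, pk in zip(q, p) if qk > 0)

q = [0.7, 0.2, 0.1]   # "label" distribution
p = [0.5, 0.3, 0.2]   # model prediction

# H(q, p) = H(q) + D_KL(q || p)
assert abs(cross_entropy(q, p) - (entropy(q) + kl(q, p))) < 1e-12
```

For a one-hot q, H(q) = 0, so minimizing the cross entropy is exactly minimizing the KL-divergence.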
Proposing the DCE Term However, if sample x is a noisy sample, q(k|x) is an incorrect distribution and will negatively affect the model, so the label distribution q(k|x) is not worthy of full trust. According to the early-learning phenomenon, the model tends to learn the correct samples in the early stage of training. This means that even if some samples are mislabeled, the model may still predict the correct results early in training. We exploit this phenomenon to trust not only the label distribution q(k|x) but also the prediction distribution p(k|x) obtained before the model overfits noisy samples. We therefore design the robust Distrust-Cross-Entropy term L_DCE as follows:

L_DCE = − Σ_k p(k|x) log( δ p(k|x) + (1 − δ) q(k|x) )   (3)

where δ is a hyperparameter whose size determines whether the model trusts the labels or the model output. When δ is larger, the model trusts the prediction distribution p(k|x) more; conversely, when δ is smaller, the model trusts the label distribution q(k|x) more.
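A plain-Python sketch of the term, assuming the mixture form implied by the a_k = δp_k + (1−δ)q_k notation used in the robustness analysis (variable names are ours):

```python
import math

def dce(p, q, delta):
    """Distrust-Cross-Entropy: cross entropy of the prediction p against
    a convex mixture of p itself and the (possibly noisy) label q.
    delta -> 1 trusts the model; delta -> 0 trusts the label."""
    return -sum(pk * math.log(delta * pk + (1 - delta) * qk)
                for pk, qk in zip(p, q))

p_correct = [0.9, 0.05, 0.05]   # confident prediction agreeing with q
p_wrong   = [0.05, 0.9, 0.05]   # confident prediction disagreeing with q
q = [1.0, 0.0, 0.0]             # one-hot (possibly mislabeled) tag

# When a confident prediction disagrees with the tag, a larger delta
# keeps the penalty bounded instead of exploding like cross entropy.
print(dce(p_wrong, q, 0.1))
print(dce(p_wrong, q, 0.9))
```

Because the log argument mixes in the model's own confidence, the loss on a suspected-noisy tag stays finite even as p(k|x) for the labeled class approaches zero.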
Forming In-trust We propose the Incomplete-Trust (In-trust) loss function, which boosts L_CRF with the L_DCE term:

L_In-trust = α L_CRF + β L_DCE   (4)
The L_DCE term acts as an acceleration regulator, which effectively prevents the model from overfitting noisy samples; this is proved in Appendix A. The L_CRF term is not noise tolerant (Ghosh et al., 2017), but it benefits the convergence of the model (Zhang and Sabuncu, 2018). α and β are two decoupled hyperparameters: α regulates the overfitting issue of L_CRF, while β flexibly explores the robustness of L_DCE.
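A hedged sketch of the combination: `crf_loss` stands in for the scalar negative log-likelihood produced by an existing CRF layer (not implemented here), and all names are illustrative.

```python
import math

def dce_term(p, q, delta):
    # Distrust-Cross-Entropy (assumed mixture form, per the DCE definition).
    return -sum(pk * math.log(delta * pk + (1 - delta) * qk)
                for pk, qk in zip(p, q))

def in_trust_loss(crf_loss, p, q, alpha=1.0, beta=1.0, delta=0.5):
    """L_In-trust = alpha * L_CRF + beta * L_DCE.

    alpha and beta are decoupled: alpha scales the (noise-sensitive but
    convergence-friendly) CRF term, beta scales the robust DCE term."""
    return alpha * crf_loss + beta * dce_term(p, q, delta)

# Toy usage: a confident prediction disagreeing with a (possibly wrong)
# one-hot tag adds only a bounded DCE contribution on top of the CRF loss.
loss = in_trust_loss(crf_loss=1.2,
                     p=[0.05, 0.9, 0.05],
                     q=[1.0, 0.0, 0.0])
print(loss)
```

In a real training loop, `p` would be the per-token softmax distribution and `crf_loss` the batch CRF objective; the two terms are simply summed with their weights.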
Contrasting Logit Values Figure 5 shows the result of the comparative experiment with Figure 2. The logit value Z_y^(t) corresponding to a mislabeled tag y is no longer the maximum in the logit matrix, which means that L_In-trust effectively prevents the model from overfitting noisy samples.
Contrasting LMD Scores Figure 6 shows the result of the comparative experiment with Figure 4. There is still an obvious discrimination between hard samples and noisy samples when the epoch reaches 60, which indicates that L_In-trust helps the LMD mechanism identify noisy samples accurately.

Robustness Analysis
L_DCE Robustness Analysis: To simplify the calculation, we set α and β to 1 and derive the gradient of L_DCE. For brevity, we write p_k and q_k for p(k|x) and q(k|x), and denote a_k = δ p_k + (1 − δ) q_k and b_k = p_k log a_k + δ p_k² / a_k. The gradient of the L_DCE loss with respect to the logit Z_j can be derived as:

∂L_DCE / ∂Z_j = − Σ_k ( log a_k + δ p_k / a_k ) ∂p_k / ∂Z_j = p_j Σ_k b_k − b_j   (5)

where ∂p_k / ∂Z_j can be further derived based on whether k = j:

∂p_k / ∂Z_j = p_k (1 − p_j) if k = j;  ∂p_k / ∂Z_j = − p_k p_j if k ≠ j   (6)

Since the gradient in Eq. (5) is a function of p_k, we denote it L_{p_k} and derive its first and second derivatives with respect to p_k in Appendix A; L'_{p_k} is a monotone increasing function. When q_j = 1 and δ ∈ {0.0, 0.1, ..., 1.0}, we obtain the corresponding relation between L_{p_k} and p_j shown in Figure 7. We conclude that L_{p_k} first decreases and then increases, which shows that the acceleration of L_DCE first decreases and then increases as the p_j corresponding to the label grows. When p_j approaches 1 and the q distribution is close to the p distribution, the model believes the correct tags more, and L_DCE provides a larger acceleration for learning correct samples, which benefits learning clean samples and prevents overfitting. On the contrary, for noisy samples the L_DCE term regards the model prediction, shaped by having learned the other, correct labels, as relatively correct; the acceleration is small, which also effectively prevents overfitting and improves noise tolerance. More detailed proofs are given in Appendix A.
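The closed form p_j Σ_k b_k − b_j follows from the chain rule applied to the definitions of a_k and b_k above; the following sketch (a plain softmax head and the assumed mixture form of L_DCE, names ours) checks it against central finite differences:

```python
import math

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [x / s for x in e]

def dce_from_logits(z, q, delta):
    # L_DCE = -sum_k p_k * log(delta*p_k + (1-delta)*q_k), p = softmax(z)
    p = softmax(z)
    return -sum(pk * math.log(delta * pk + (1 - delta) * qk)
                for pk, qk in zip(p, q))

def analytic_grad(z, q, delta):
    # dL/dZ_j = p_j * sum_k b_k - b_j, with a_k, b_k as defined in the text.
    p = softmax(z)
    a = [delta * pk + (1 - delta) * qk for pk, qk in zip(p, q)]
    b = [pk * math.log(ak) + delta * pk * pk / ak for pk, ak in zip(p, a)]
    total_b = sum(b)
    return [pj * total_b - bj for pj, bj in zip(p, b)]

z = [2.0, -1.0, 0.5]
q = [1.0, 0.0, 0.0]
delta = 0.4
eps = 1e-6
for j in range(3):
    z_plus = list(z); z_plus[j] += eps
    z_minus = list(z); z_minus[j] -= eps
    num = (dce_from_logits(z_plus, q, delta)
           - dce_from_logits(z_minus, q, delta)) / (2 * eps)
    # Numerical and analytic gradients agree to high precision.
    assert abs(num - analytic_grad(z, q, delta)[j]) < 1e-4
```

Such a gradient check is a cheap safeguard when re-implementing a custom loss term.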
L_In-trust Robustness Analysis: According to Eq. (4), L_In-trust consists of L_CRF and L_DCE. When q_j = 1, the L_DCE term provides a robust acceleration value, which helps L_In-trust obtain a correct loss value. Specifically, L_In-trust obtains a greater loss value when p_j approaches q, which, like L_CRF, encourages the model to learn the sample; otherwise, L_DCE prevents the model from learning the sample, unlike L_CRF. When q_j = 0, other loss functions prevent the model from learning in that direction even if the model output p is large in that direction. We believe that the model is relatively correct after learning a large number of samples, so the L_DCE term provides L_In-trust with an acceleration that helps the model learn in that direction. In addition, the L_DCE term also prevents the model from learning when p is small. Therefore, the L_DCE term has no negative effect on the convergence of the model.

Experiments
In this section, we verify the advantages of the Noise-Aware-with-Filter (NAF) method by comparing it with other denoising methods. For these six real-world datasets, we use the same data splits as the original authors. Since WNUT-17 has no development set, we randomly select 10% of the samples from the training set as the development set.
Evaluation Our primary evaluation metric is the F1 score on the test set, which we use to compare the results of the different methods.

Experimental Settings
In our experiments, we set the initial learning rate to lr = 1e−5 for all datasets. Since the scale of each dataset varies, we use different training batch sizes: 40, 40, 40, 32 and 32 for Weibo NER, OntoNotes4, WNUT-17, CoNLL2003 and CoNLL++ with BERT, and 2, 2 and 2 for WNUT-17, CoNLL2003 and CoNLL++ with LUKE. We stop training when we find the best result on the development set. Table 1 presents the results for the baselines and our methods with BERT. Compared with other methods, NAF shows an obvious advantage on the six real-world datasets, outperforming them by 1.58%, 0.61%, 0.29%, 0.87% and 3.77% in F1 score on the Weibo NER, OntoNotes4, CoNLL2003, CoNLL++ and WNUT-17 datasets. Table 2 presents the results with LUKE: our method achieves new state-of-the-art (SOTA) F1 scores of 94.51% and 96.25% on the CoNLL2003 and CoNLL++ datasets.

Robustness Performance
Specifically, NAF makes more obvious progress on the Weibo NER, WNUT-17 and CoNLL++ datasets. Our analysis shows that the noise ratios of Weibo NER and WNUT-17 are greater than those of the other datasets, and that CoNLL++ has a cleaner test set after manual correction.

Manual Verification
Results Statistics To prove the effectiveness of our method, we manually verify the noisy samples selected from CoNLL2003, OntoNotes4 and WNUT-17 (Table 4). We randomly select 100 samples from the "Original" training set and manually verify the proportion of noisy samples. After applying the LMD or NAF method, we obtain a new dataset and again randomly select 100 samples from it for manual verification. Examples of noisy samples:

CoNLL2003: Soccer - KEANE signs four-years contract with Manchester United{LOC} .
CoNLL2003: Soccer - sharpshooter knup back in swiss{MISC} squad .
WNUT-17: Federal lawyers{O} fly to Minneapolis to investigate shooting
WNUT-17: @janzensational at least may date ka na hahaha{O} . Goodluck zen ! : *

The probability of true negative samples is 5% in the "Original" CoNLL2003 dataset, rising to 76% and 82% with the LMD and NAF methods respectively. For OntoNotes4, the probability is 8% in the "Original" dataset and reaches 72% and 80% with LMD and NAF. For WNUT-17, the probability is 18% in the "Original" dataset and reaches 58% and 71% with LMD and NAF.
Result Analysis The accuracy of identifying noisy samples is greater on datasets where the model performs better; in turn, higher-quality datasets help the model identify noisy samples.

Table 4: Proportion of true noisy samples among 100 manually verified samples, for each Original dataset and after filtering with LMD (ours) and NAF (ours). Additional noisy samples found in the datasets are shown in Appendix C.

Ablation Experiment
As mentioned in Section 4.1, δ provides flexibility between the model output distribution p(k|x) and the label distribution q(k|x). In this section, we explore the influence of the hyperparameter δ by conducting experiments on Weibo NER and CoNLL2003 with α = 1 and β = 1 to see how it manipulates the tradeoff. The experimental results are shown in Table 5. The highest F1 on the CoNLL2003 dataset is 91.72% when δ is set to 0.5, while for Weibo NER the highest F1 is 62.55% when δ is set to 0.7. The optimal value of δ differs across datasets with different noise ratios: when a dataset contains more noisy samples, δ should be set larger. Because the noise ratio of Weibo NER is larger than that of CoNLL2003, its optimal δ is larger. The experimental results for the other hyperparameters α and β are shown in Appendix B.

Conclusion
In this paper, we observe that existing denoising methods cannot effectively distinguish hard samples from noisy samples, and we propose a new method called Noise-Aware-with-Filter (NAF), which combines the LMD mechanism and the In-trust loss function to solve these issues. NAF effectively improves the discrimination between hard samples and noisy samples even in the case of overfitting. The Logit-Maximum-Difference (LMD) mechanism maximizes the diversity between different samples to help the model identify noisy samples, while the Incomplete-Trust (In-trust) loss function boosts L_CRF with a noise-robust Distrust-Cross-Entropy (DCE) term. To verify the effectiveness of our method, we also conduct manual verification of noisy samples, and the results show that our method identifies noisy samples with higher accuracy. Experiments on six real-world Chinese and English NER datasets show that NAF outperforms previous methods and obtains state-of-the-art (SOTA) results on the CoNLL2003 and CoNLL++ datasets.

A L_In-trust Robustness Proof

We know that L_In-trust is:

L_In-trust = α L_CRF + β L_DCE,  L_DCE = − Σ_k p_k log( δ p_k + (1 − δ) q_k )

and we make:

a = δ p + (1 − δ) q,  b = p log a + δ p² / a

Meanwhile, for the softmax we know:

∂p_k / ∂Z_j = p_k (1 − p_j) when k = j;  ∂p_k / ∂Z_j = − p_k p_j when k ≠ j

Gradient Calculation: For the convenience of calculation, we view the gradient as a function L of p = p_j, whose derivative satisfies:

L' = 2 b' + (p_j − 1) b''

Meanwhile we can get (noting a' = δ):

b' = log a + p a' / a + (2δp·a − δp²·a') / a²
   = log a + δp / a + (2δpa − δ²p²) / a²
   = log a + 3δp / a − (δp / a)²

When q_j = 0, we have a = δp, so b' = log(δp) + 2, b'' = 1/p, and:

L' = 2( log(δp) + 2 ) + (p − 1) / p

When p approaches 0, L' < 0. When p approaches 1, L' = 2 log δ + 4, and there exists δ1 ∈ [0, 1] for which L'(p=1) > 0. Because L' is a continuous, monotone increasing function of p, there then exists a point at which L' = 0. So L is a decreasing-then-increasing function, and the inflection point is related only to δ.
In summary, when q = 0 or q = 1, L is a decreasing-then-increasing function, and the inflection point is related only to δ.
We know that L(p=0) > 0 and L(p=1) > 0, and there exists µ ∈ (0, 1) such that L(p=µ) < 0. We observe that when δ is larger, the model tends to learn from the model output p, and when δ is smaller, the model tends to learn from the label q. Moreover, when the p_j corresponding to the label is larger and the model output is close to the label distribution, the acceleration of the L_DCE term is larger, which makes the model more inclined to learn the sample and helps it learn clean samples. When p_j is small, there is a big gap between the model output and the label distribution; we consider that such a sample may be noisy, and the acceleration of the L_DCE term is smaller, which makes the model more inclined to give up learning the sample and prevents it from overfitting the noisy sample.