Improving Distantly-Supervised Named Entity Recognition with Self-Collaborative Denoising Learning

Distantly supervised named entity recognition (DS-NER) efficiently reduces labor costs but intrinsically suffers from label noise due to the strong assumption of distant supervision. Typically, the wrongly labeled instances comprise both incomplete and inaccurate annotations, while most prior denoising works are concerned with only one kind of noise and fail to fully explore useful information in the whole training set. To address this issue, we propose a robust learning paradigm named Self-Collaborative Denoising Learning (SCDL), which jointly trains two teacher-student networks in a mutually beneficial manner to iteratively refine noisy labels. Each network is designed to exploit reliable labels via self denoising, and the two networks communicate with each other to explore unreliable annotations via collaborative denoising. Extensive experimental results on five real-world datasets demonstrate that SCDL is superior to state-of-the-art DS-NER denoising methods.


Introduction
Named Entity Recognition (NER) is the task of detecting entity spans and classifying them into predefined categories, such as person, location and organization. Because it extracts entity information and benefits many NLP applications (e.g., relation extraction (Lin et al., 2017), question answering), NER appeals to many researchers. Traditional supervised methods for NER require a large amount of high-quality annotated data for model training, which is extremely expensive and time-consuming since NER requires token-level labels.
Therefore, in recent years, distantly supervised named entity recognition (DS-NER) has been proposed to automatically generate labeled training sets by aligning entities in knowledge bases (e.g., Freebase) or gazetteers to corresponding entity mentions in sentences. This labeling procedure is based on a strong assumption that each entity mention in a sentence is a positive instance of the corresponding type according to the extra resources. However, this assumption is far from reality. Due to the limited coverage of existing resources, many entity mentions in the text cannot be matched and are wrongly annotated as non-entity, resulting in incomplete annotations. Moreover, two entity mentions with the same surface name can belong to different entity types, so simple matching rules may fall into the dilemma of labeling ambiguity and produce inaccurate annotations. As illustrated in Figure 1, the entity mention "Jack Lucas" is not recognized due to the limited coverage of extra resources and "Amazon" is wrongly labeled with the organization type owing to labeling ambiguity.

Figure 1: A noisy sample generated by distantly-supervised methods, where "Jack Lucas" is an incomplete annotation and "Amazon" is an inaccurate one.
Recently, many denoising methods (Shang et al., 2018b; Yang et al., 2018; Cao et al., 2019; Peng et al., 2019; Li et al., 2021) have been developed to handle noisy labels in DS-NER. For example, Shang et al. (2018b) obtained high-quality phrases through AutoPhrase (Shang et al., 2018a) and designed AutoNER to model these phrases as potential entities. Peng et al. (2019) proposed a positive-unlabeled learning algorithm to unbiasedly and consistently estimate the NER task loss, and Li et al. (2021) used negative sampling to eliminate the misguidance brought by unlabeled entities. Though achieving good performance, most studies mainly focus on solving incomplete annotations under the strong assumption that no inaccurate ones exist in DS-NER. Meanwhile, these methods aim to reduce the negative effect of noisy labels by weakening or abandoning the wrongly labeled instances. Hence, they can at most alleviate the noisy supervision and fail to fully mine useful information from the mislabeled data. Intuitively, if we can rectify those unreliable annotations into positive instances for model training, higher data utilization and better performance will be achieved. We argue that an ideal DS-NER denoising system should be capable of handling both kinds of label noise (i.e., incomplete and inaccurate annotations) and making full use of the whole training set.
In this work, we strive to reconcile this gap and propose a robust learning framework named SCDL (Self-Collaborative Denoising Learning). SCDL co-trains two teacher-student networks to form inner and outer loops that cope with label noise without any assumption, while fully exploring the mislabeled data. The inner loop inside each teacher-student network is a self denoising scheme that selects reliable annotations from the two kinds of noisy labels, and the outer loop between the two networks is a collaborative denoising procedure that rectifies unreliable instances into useful ones. Specifically, in the inner loop, each teacher-student network selects consistent and high-confidence labeled tokens generated by the teacher to train the student, and then gradually updates the teacher via an exponential moving average (EMA) of the re-trained student. As for the outer loop, the high-quality pseudo labels generated by one network's teacher are used to update the noisy labels of the other network, thanks to the stability of EMA and the different noise sensitivities of the two networks. The inner and outer loop procedures are performed alternately. Obviously, a successful self denoising process (inner loop) generates high-quality pseudo labels which greatly benefit the collaborative learning procedure (outer loop), and a promising outer loop promotes the inner loop by refining noisy labels, thus handling the label noise in DS-NER effectively.
We evaluate our method on five DS-NER datasets. Experimental results indicate that SCDL consistently achieves superior performance over previous competing approaches. Extensive validation studies demonstrate the rationality and robustness of our self-collaborative denoising framework.

2 EMA is a momentum technique that has been explored in several studies, e.g., Adam (Kingma and Ba, 2015), semi-supervised (Tarvainen and Valpola, 2017) and self-supervised (Grill et al., 2020) learning.

Related Work
Many studies have achieved reliable performance in NER. For example, BiLSTM-CRF (Lample et al., 2016) and BERT-based (Devlin et al., 2019) methods have become the paradigm in NER due to their promising performance. However, most of these works rely on high-quality labels, which are quite expensive. To address this issue, several studies attempted to annotate tokens via distant supervision (Liang et al., 2020), matching unlabeled sentences against external gazetteers or knowledge graphs (KGs). Despite the success of distant supervision, it still suffers from noisy labels (i.e., incomplete and inaccurate annotations in NER).
DS-NER Denoising. Many studies (Shang et al., 2018b; Cao et al., 2019; Jie et al., 2019) tried to modify the standard CRF to adapt to the scenario of label noise, e.g., Fuzzy CRF. Ni et al. (2017) selected high-confidence labeled data from noisy data to train NER models. Many new training paradigms have also been proposed to resist label noise in DS-NER, such as AutoNER (Shang et al., 2018b), reinforcement learning (Yang et al., 2018; Nooralahzadeh et al., 2019), AdaPU (Peng et al., 2019) and negative sampling (Li et al., 2021). In addition, some studies (Mayhew et al., 2019; Liang et al., 2020) performed iterative training procedures to mitigate noisy labels in DS-NER. However, most studies either focus on incomplete annotations while ignoring inaccurate ones, or depend on manually labeled data. What's more, most prior methods are insufficient since they can at most alleviate the negative effect caused by label noise and fail to mine useful information from the whole training set. Different from previous studies, we propose two denoising learning procedures that mutually enhance each other through the devised teacher-student network and co-training paradigm, mitigating both kinds of label noise and making full use of the whole training set.
Teacher-Student Network. The teacher-student network is well known in knowledge distillation (Hinton et al., 2014). A teacher is generally a complicated model whose output a lightweight student imitates. Recently, many variations of the teacher-student network have emerged. For example, self-training copies the student as a new teacher to generate pseudo labels (Xie et al., 2020). Liang et al. (2020) applied self-training with a teacher-student network to handle label noise in DS-NER. In contrast, in the teacher-student network of our framework, the teacher selects reliable annotations with the devised strategies to train the student, and we then use EMA to update the teacher based on the re-trained student. With this loop, our method can learn entity knowledge effectively.
Co-Training. The co-training paradigm, which jointly trains two models, is used to improve model robustness (Blum and Mitchell, 1998; Nigam and Ghani, 2000; Kiritchenko and Matwin, 2011). Many previous frameworks (Han et al., 2018; Yu et al., 2019; Wei et al., 2020; Li et al., 2020) have adopted co-training for denoising, but they mainly rely on the diversity of two single models, where a single model has no denoising ability of its own and the supervision signals from the peer model are not always clean. Instead, we train two teacher-student networks, each of which can also perform label denoising effectively, which further improves the co-training paradigm.

Task Definition
Given the training corpus D where each sample is of the form (X_i, Y_i), X_i = x_1, x_2, ..., x_N represents a sentence with N tokens and Y_i = y_1, y_2, ..., y_N is the corresponding tag sequence. Each entity mention e = x_i, ..., x_j (0 ≤ i ≤ j ≤ N) is a span of the text associated with an entity type, e.g., person or location. In this paper, we use the BIO scheme following Liang et al. (2020). In detail, the begin token of an entity mention is labeled B-type, the other tokens of the mention are labeled I-type, and non-entity tokens are annotated as O.
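The BIO conversion described above can be sketched as follows; the function name and the (start, end, type) span format are ours, for illustration only:

```python
def spans_to_bio(n_tokens, entities):
    """Convert entity spans to BIO tags.

    `entities` is a list of (start, end, type) spans with inclusive
    token indices, mirroring the e = x_i, ..., x_j notation above.
    """
    tags = ["O"] * n_tokens
    for start, end, etype in entities:
        tags[start] = f"B-{etype}"          # begin token of the mention
        for k in range(start + 1, end + 1):
            tags[k] = f"I-{etype}"          # inside tokens of the mention
    return tags

# "Jack Lucas was born in the Amazon region ." from Figure 1,
# with the correct annotations PER and LOC:
tokens = ["Jack", "Lucas", "was", "born", "in", "the", "Amazon", "region", "."]
tags = spans_to_bio(len(tokens), [(0, 1, "PER"), (6, 6, "LOC")])
# → ['B-PER', 'I-PER', 'O', 'O', 'O', 'O', 'B-LOC', 'O', 'O']
```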
The traditional NER problem is a supervised learning task by fitting a sequence labeling model based on the training dataset. However, we mainly explore the practical scenario when the labels of training data are contaminated due to the distant supervision. In other words, the revealed tag y i may not correspond to the underlying correct one. The challenge posed in this setting is to reduce the negative influence of noisy annotations and generate high-confidence labels for them to make full use of the training data.

Methodology
In this section, we give a detailed description of our self-collaborative denoising learning framework, which consists of two interactive teacher-student networks to address both the incomplete and inaccurate annotation issues. As illustrated in Figure 2, each teacher-student network contributes to an inner loop for self denoising and the outer loop between two networks is a collaborative denoising scheme. These two procedures can be optimized in a mutually-beneficial manner, thus improving the performance of the NER system.

Self Denoising Learning
It is widely known that deep neural networks have a high capacity for memorization (Arpit et al., 2017). When noisy labels become prominent, deep neural NER models inevitably overfit the noisy labeled data, resulting in poor performance. The purpose of self denoising learning is to select reliable labels to reduce the negative influence of noisy annotations. To this end, self denoising learning involves a teacher-student network, where the teacher first generates pseudo labels to guide labeled token selection, then the student is optimized via back-propagation on the selected tokens, and finally the teacher is updated by gradually shifting towards the weights of the student over training with an exponential moving average (EMA). We take two neural NER models with the same architecture as the teacher and the student respectively.

Labeled Token Selection
This subsection illustrates our labeled token selection strategy based on the consistency and high confidence predictions.
Consistency Predictions. Previous studies have observed that a model's predictions on wrongly labeled instances fluctuate drastically (Huang et al., 2019). A mislabeled instance is supervised both by its wrong label and by similar instances. For example, Amazon is wrongly annotated as organization in Figure 1. The wrong label organization pushes the model to fit this supervision signal, while other clean tokens with similar contexts encourage the model to predict it as location. Therefore, we can take advantage of this property to separate clean tokens from noisy ones.
Based on the above analysis, quantifying this fluctuation becomes the key issue. One straightforward solution is to integrate predictions from different training iterations, but this incurs extra time and space costs. Motivated by the wide adoption of EMA, we instead use it to update the teacher's parameters.

Figure 2: Overview of SCDL on the example sentence "Jack Lucas was born in the Amazon region ." (1) Each teacher-student network forms an inner loop (i.e., self denoising). (2) The interplay between the two teacher-student networks is an outer loop (i.e., collaborative denoising): the pseudo labels are applied to update the noisy labels of the peer network periodically.
In this way, the teacher can be viewed as the temporal ensembling of the student models at different training steps, and its prediction is then the ensemble of predictions from past iterations. Therefore, the pseudo labels predicted by the teacher naturally quantify the fluctuation of noisy labels. Subsequently, we devise the first token selection strategy, which identifies the correctly labeled tokens $(\bar{X}_i, \bar{Y}_i)$ via the consistency between noisy labels and predicted pseudo labels:

$(\bar{X}_i, \bar{Y}_i) = \{(x_j, y_j) \mid y_j = \tilde{y}_j,\ 1 \le j \le N\} \quad (1)$

where $y_j \in Y_i$ is the noisy label of the j-th token in the i-th sentence and $\tilde{y}_j$ is the pseudo label predicted by the teacher $\theta_t$.
High Confidence Predictions. As studied in previous works (Bengio et al., 2009; Arpit et al., 2017), hard samples cannot be learnt effectively at first, so the predictions of mislabeled hard samples may not fluctuate, and they are then mistakenly believed to be reliable. To alleviate this issue, we propose the second selection strategy, which picks tokens with high confidence predictions, as formulated in Equation 2:

$\max_c \tilde{p}_{j,c} > \delta \quad (2)$

where $\tilde{p}_j$ is the label distribution of the j-th token predicted by the teacher and $\delta$ denotes the confidence threshold.
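Combining the two strategies, a minimal sketch of the labeled token selection might look like the following; this is a plain-Python stand-in for the actual tensor implementation, and the dict-based probability format is an assumption for illustration:

```python
def select_reliable_tokens(noisy_labels, teacher_probs, delta=0.9):
    """Pick token indices satisfying both selection strategies:
    (i) consistency: the noisy label equals the teacher's pseudo label;
    (ii) high confidence: the teacher's max probability exceeds delta.
    `teacher_probs[j]` is a dict mapping each label to its probability.
    """
    selected = []
    for j, (y, p) in enumerate(zip(noisy_labels, teacher_probs)):
        pseudo = max(p, key=p.get)   # pseudo label = argmax of the distribution
        if y == pseudo and p[pseudo] > delta:
            selected.append(j)
    return selected

probs = [
    {"O": 0.97, "B-PER": 0.03},   # consistent and confident -> kept
    {"O": 0.55, "B-PER": 0.45},   # consistent but low confidence -> dropped
    {"O": 0.05, "B-LOC": 0.95},   # confident but inconsistent -> dropped
]
print(select_reliable_tokens(["O", "O", "O"], probs))  # [0]
```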

Optimization
Loss Function of the Student. Standard supervised NER methods fit the outputs of a model to hard labels (i.e., one-hot vectors) to optimize the parameters. However, when the model is trained on tokens with mismatched hard labels, wrong information is provided to the model. Compared with hard labels, supervision with soft labels is more robust to noise because it carries the uncertainty of the predicted results. Therefore, we modify the standard cross entropy loss into a soft-label form. Let

$T_i = \{ j \mid y_j = \tilde{y}_j \ \wedge\ \max_c \tilde{p}^i_{j,c} > \delta \} \quad (3)$

denote the tokens in the i-th sentence meeting the consistency and high confidence selection strategies simultaneously. The loss is defined as:

$\mathcal{L} = -\frac{1}{M}\sum_{i=1}^{M}\frac{1}{|T_i|}\sum_{j=1}^{N} \mathbb{I}_{i,j} \sum_{c=1}^{C} \tilde{p}^i_{j,c} \log p^i_{j,c} \quad (4)$

where $p^i_{j,c}$ is the probability of the j-th token belonging to the c-th class in the i-th sentence predicted by the student and $\tilde{p}^i_{j,c}$ is from the teacher. $\mathbb{I}$ is the indicator function: $\mathbb{I}_{i,j} = 1$ when the j-th token is in $T_i$, otherwise $\mathbb{I}_{i,j} = 0$.
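A minimal sketch of this soft-label loss for a single sentence; the dict-based distribution format and the averaging over the selected tokens are assumptions of this sketch, not the paper's exact tensor implementation:

```python
import math

def soft_label_loss(student_probs, teacher_probs, selected):
    """Soft-label cross entropy over the selected token set T_i:
    L = -(1/|T_i|) * sum_{j in T_i} sum_c p~_{j,c} * log p_{j,c},
    where p~ is the teacher's distribution and p the student's.
    The 1/|T_i| normalization is an assumption for this sketch."""
    if not selected:
        return 0.0
    total = 0.0
    for j in selected:
        total -= sum(p_t * math.log(student_probs[j][c])
                     for c, p_t in teacher_probs[j].items())
    return total / len(selected)

# If the teacher is certain of "O" but the student assigns it 0.5,
# the loss is -log(0.5) = log 2.
loss = soft_label_loss([{"O": 0.5, "B-PER": 0.5}], [{"O": 1.0}], [0])
```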
Then the parameters of the student model can be updated via back-propagation:

$\theta_s \leftarrow \theta_s - \gamma \nabla_{\theta_s} \mathcal{L} \quad (5)$

Update of the Teacher. Different from the optimization of the student model, we apply EMA to gradually update the parameters of the teacher, as shown in Equation 6, where $\alpha$ denotes the smoothing coefficient:

$\theta_t \leftarrow \alpha \theta_t + (1-\alpha)\, \theta_s \quad (6)$
Although the clean token selection strategies indeed alleviate noisy annotations, they still suffer from unreliable token choices, which misguide the model into generating biased predictions. As formulated in Equation 7, the update of the teacher $\theta^i_t$ in the i-th iteration can be converted into the form of back-propagation (derivations in Appendix A.1):

$\theta^i_t = \alpha^i \theta^0_t + (1-\alpha) \sum_{k=1}^{i} \alpha^{i-k}\, \theta^k_s \quad (7)$

Substituting the student updates $\theta^k_s = \theta^{k-1}_s - \gamma g_k$, where $\gamma$ is the learning rate and $g_k$ is the student's gradient at step k, shows that $(1-\alpha)$ is a small number because $\alpha$ is generally assigned a value close to 1 (e.g., 0.995), equivalent to multiplying a small coefficient on the weighted sum of the student's past gradients. Therefore, with this conservative and ensemble property, the application of EMA largely mitigates the bias. As a result, the teacher tends to generate more reliable pseudo labels, which can be used as new supervision signals in the collaborative denoising phase.
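The equivalence between the iterative EMA update and its unrolled form can be checked numerically, with scalars standing in for parameter tensors (a sanity check of the algebra, not part of the method itself):

```python
def ema_unrolled_equivalence(theta0_t, students, alpha=0.995):
    """Run i iterative EMA updates
        theta_t <- alpha * theta_t + (1 - alpha) * theta_s
    and compare with the closed form
        theta_t^i = alpha^i * theta_t^0
                    + (1 - alpha) * sum_k alpha^(i-k) * theta_s^k.
    Scalars stand in for parameter tensors."""
    theta_t = theta0_t
    for s in students:                       # iterative EMA updates
        theta_t = alpha * theta_t + (1.0 - alpha) * s
    i = len(students)
    closed = alpha ** i * theta0_t + (1.0 - alpha) * sum(
        alpha ** (i - k) * s for k, s in enumerate(students, start=1))
    return theta_t, closed

iterative, closed = ema_unrolled_equivalence(1.0, [0.5, 0.25, 0.125])
# the two forms agree up to floating-point rounding
```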

Collaborative Denoising Learning
Based on the devised clean token selection strategy in self denoising learning, the teacher-student network can utilize the correctly labeled tokens in an ideal situation to alleviate the negative effect of label noise. However, merely filtering unreliable labeled tokens inevitably loses useful information in the training set, since the wrongly labeled tokens are never corrected or explored. Intuitively, if we can change a wrong label to the correct one, it is transformed into a useful training instance. Inspired by some co-training paradigms (Han et al., 2018; Yu et al., 2019; Wei et al., 2020), we propose collaborative denoising learning, which deploys two teacher-student networks with different architectures to update noisy labels mutually and mine more useful information from the dataset. As stated in (Bengio, 2014), a human brain can learn more effectively if guided by the signals produced by other humans. Similarly, the pseudo labels predicted by the teacher are applied to update the noisy labels of the peer teacher-student network periodically, since the two teacher-student networks have different learning abilities based on different initial conditions and network structures. With this outer loop, the noisy labels can be improved continuously and the training set can be fully explored.
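A simplified sketch of the periodic label exchange; the confidence gate and its threshold are our assumptions for illustration, not the paper's exact relabeling rule:

```python
def collaborative_update(noisy_labels, peer_teacher_probs, delta=0.9):
    """Relabel each token with the peer teacher's pseudo label when the
    peer is sufficiently confident; otherwise keep the current label.
    `peer_teacher_probs[j]` maps each label to its probability."""
    updated = []
    for y, p in zip(noisy_labels, peer_teacher_probs):
        pseudo = max(p, key=p.get)
        updated.append(pseudo if p[pseudo] > delta else y)
    return updated

# "Jack" was missed by distant supervision (incomplete annotation);
# the peer teacher confidently predicts B-PER, so the label is refined.
print(collaborative_update(["O"], [{"B-PER": 0.96, "O": 0.04}]))  # ['B-PER']
```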

Algorithm Workflow
In this subsection, we introduce the overall procedure of our SCDL framework. Algorithm 1 gives the pseudocode.

Algorithm 1 Training Procedure of SCDL
Input: Training corpus D = {(X_i, Y_i)}_{i=1}^{M} with noisy labels
Parameter: Two network parameters θ_t1, θ_s1, θ_t2, and θ_s2
Output: The best model
1: Pre-train two models θ_1, θ_2 with D.
2: Initialize θ_t1 = θ_s1 = θ_1 and θ_t2 = θ_s2 = θ_2.
3: while not reaching the maximum training epoch do
4:    Generate pseudo labels with the teachers θ_t1 and θ_t2.  ▷ Self Denoising Learning
5:    Select tokens whose noisy labels are consistent with the pseudo labels.
6:    Keep only tokens with high-confidence predictions.
7:    Update the students θ_s1, θ_s2 on the selected tokens via back-propagation.
8:    Update the teachers θ_t1, θ_t2 via EMA.
9:    if the update cycle is reached then  ▷ Collaborative Denoising Learning
10:      Update the noisy labels of network 1 with pseudo labels from θ_t2.
11:      Update the noisy labels of network 2 with pseudo labels from θ_t1.
12:   end if
13: end while
14: Evaluate models θ_t1, θ_s1, θ_t2, θ_s2 on Dev set.
15: return The best model θ ∈ {θ_t1, θ_s1, θ_t2, θ_s2}

To summarize, the training process of SCDL can be divided into three procedures. (1) Pre-Training with Noisy Labels. We warm up two NER models θ_1 and θ_2 on the noisy labels to obtain a better initialization, and then duplicate the parameters for both the teacher θ_t and the student θ_s (i.e., θ_t1 = θ_s1 = θ_1, θ_t2 = θ_s2 = θ_2). The training objective in this stage is the cross entropy loss:

$\mathcal{L}_{pre} = -\frac{1}{M}\sum_{i=1}^{M}\frac{1}{N}\sum_{j=1}^{N} \log p(y^i_j \mid X_i; \theta) \quad (8)$

where $y^i_j$ denotes the label of the j-th token of the i-th sentence in the noisy training corpus and $p(y^i_j \mid X_i; \theta)$ denotes its probability produced by model θ. M and N are the size of the training corpus and the length of a sentence respectively. (2) Self Denoising Learning. In this stage, we select correctly labeled tokens to train the two teacher-student networks respectively. (3) Collaborative Denoising Learning. Self denoising can only utilize correct annotations, so this phase updates the noisy labels mutually to relabel tokens for the two teacher-student networks. The initial noisy labels of both networks come from distant supervision. The second and third phases are conducted alternately and promote each other to perform label denoising. It is worth noting that only the best model θ ∈ {θ_t1, θ_s1, θ_t2, θ_s2} is adopted for prediction.
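The alternation of the two phases can be sketched as a skeleton loop, with the concrete selection, training, EMA, and pseudo-labeling routines abstracted into placeholder callables (a structural sketch, not the actual training code):

```python
def scdl_loop(nets, labels, steps, update_cycle,
              select, train_student, ema_update, pseudo_labels):
    """Skeleton of the SCDL alternation. `nets` holds the two
    teacher-student networks and `labels` their (noisy) label sets.
    The four callables abstract the real routines: token selection,
    student training, EMA teacher update, and pseudo-label generation."""
    for step in range(1, steps + 1):
        for k, net in enumerate(nets):        # inner loop: self denoising
            reliable = select(net, labels[k])
            train_student(net, reliable)
            ema_update(net)
        if step % update_cycle == 0:          # outer loop: collaborative
            labels[0], labels[1] = pseudo_labels(nets[1]), pseudo_labels(nets[0])
    return labels
```

Plugging in trivial stubs shows the control flow: after an update cycle, each network trains on labels produced by its peer.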

Experiments
In this section, we evaluate the performance of SCDL against several comparable baselines. Additionally, we conduct extensive auxiliary experiments and provide comprehensive analyses to justify the effectiveness of SCDL.
Implementation Details. For fair comparison, we adopt RoBERTa (θ_1) and DistilRoBERTa (θ_2) as the basic models. The maximum number of training epochs is 50, and the confidence threshold δ is 0.9. The batch size is set to 16 or 32 and the learning rate to 1e-5 or 2e-5 according to the dataset. We tune the EMA parameter α over {0.9, 0.99, 0.995, 0.998} and tune the update cycle according to the size of the dataset (e.g., 6000 iterations, about 7 epochs, for CoNLL03) on the development set. We implement our code in PyTorch based on huggingface Transformers. Detailed hyperparameter settings for each dataset and the tuning procedures are listed in Appendix A.3.

Table 1 shows the results of our proposed method compared with the baselines, with the best overall performance highlighted in bold. Obviously, SCDL achieves the best performance and improves the precision as well as the F1 score significantly compared with previous state-of-the-art models. Compared to our basic models (i.e., DistilRoBERTa and RoBERTa), SCDL improves the F1 score by an average of 8.33% and 6.37% respectively, which demonstrates the necessity of label denoising in the distantly-supervised NER task and the effectiveness of the proposed method.

Experimental Results
In addition, SCDL performs much better than previous studies that consider noisy labels in NER, including AutoNER, LRNT, NegSampling and BOND. The reason is that they mainly focus on one kind of label noise in DS-NER or fail to make full use of the mislabeled data with their strategies. In contrast, our method can not only exploit correctly labeled tokens but also explore the valuable information in wrongly labeled ones by correction. Compared to the popular denoising methods in computer vision, Co-teaching+ and JoCoR, SCDL gains up to 12.05 absolute percentage points in F1 score. We conjecture this is because most computer vision denoising studies focus on instance-level classification, whereas NER is a token-level task in which the non-entity category accounts for the majority, a case these methods do not fully consider. Thus, for them, corruption easily occurs in the DS-NER denoising task as training proceeds.

Analysis
Ablation Study. To evaluate the influence of each component of our method, we conduct an ablation study for further exploration (see Table 2). Overall, although SCDL is not optimal in precision or recall, it achieves the best F1 score, which indicates that our method balances well between handling the two kinds of annotation noise and exploring the full training set. Based on these ablations, we observe that: (1) The token selection strategy with consistency and high confidence predictions indeed promotes the overall performance (F1 score) by improving the precision while only marginally lowering the recall. The recall does not decrease sharply in our framework because of the unbiased predictions generated by the teacher model and the alternate optimization.
(2) When we keep only one teacher-student network (i.e., w/o θ_t2, θ_s2), both recall and F1 decrease visibly, which validates the effectiveness of collaborative denoising learning, since more wrongly labeled tokens (e.g., false negative tokens) can be explored via the peer networks' dynamic update of noisy labels. (3) Meanwhile, removing the two teacher models (i.e., w/o θ_t1, θ_t2) leads to a decline in both precision and recall, because this simplification impairs the devised teacher-student network: the predictions of each student must then support the token selection strategies and the mutual update of noisy labels, which loses the stable optimization ability of EMA and leads to unreliable token selection. (4) Learning from noisy annotations benefits from soft labels, since they carry the uncertainty of the predicted results and are more tolerant to noise than hard ones.
Learning Curve of SCDL. To evaluate the advantage of the proposed framework in handling noisy labels during training, we show the F1 score vs. training iterations on the CoNLL03 test set in Figure 3. Compared to RoBERTa and DistilRoBERTa, the performance of SCDL and BOND remains stable as training proceeds. Because of the memorization effect of networks, the F1 scores of RoBERTa and DistilRoBERTa first reach a high level and then gradually decrease. Moreover, SCDL consistently achieves better performance than the other baselines at almost any training stage, which again confirms the effectiveness of our denoising framework.
Robustness to Different Noise Ratios. To study the robustness of the proposed method under different noise ratios, we randomly replace k% of the entity mention labels in the corpus with other entity types or non-entity to construct different proportions of label noise, and report the test F1 score on CoNLL03 in Figure 4. The pre-trained language models (e.g., RoBERTa) are robust to low-level noise (less than 20%) due to their strong expressive power. When the noise ratio is between 30% and 80%, SCDL is more robust and exhibits satisfactory denoising ability, since the training data still carries reasonable entity type knowledge and SCDL can learn from it to refine noisy labels. However, both SCDL and BOND degenerate to the basic model in the hardest case (more than 80%), which may not exist in reality and needs further study in the future.

Table 4 (excerpt): Two cases from CoNLL03. Case 1 sentence: "The girl , Abyss DeJesus , suffers ⋯ the St. Christopher Children 's Hospital said ." with reliable labels O O O B-PER I-PER O O ⋯ O B-ORG I-ORG I-ORG I-ORG I-ORG. Case 2 sentence: "Thai poll shows military wants PM Banharn out ."
Effectiveness of Noisy Label Refinery. As the noisy labels are updated dynamically during training to explore the full dataset, we compare the F1 score on the training set before and after denoising, as shown in Table 3. In detail, SCDL refines the noisy labels on the CoNLL03 and Twitter training sets from 70.97 to 81.22 and from 37.73 to 50.82 respectively, which surpasses BOND. The reason may be that BOND mainly depends on self-training, which suffers from confirmation bias, while SCDL bypasses this issue via the devised teacher-student network and co-training paradigm and thus improves both precision and recall significantly. Overall, the comparison before and after denoising demonstrates that SCDL indeed refines the noisy training labels to a certain extent, leading to better use of the mislabeled data and outstanding performance at test time.
Case Study. Different from most prior denoising studies on DS-NER, our proposed framework SCDL can not only handle two kinds of label noise (i.e., inaccurate and incomplete annotations) without any assumption, but also make full use of the whole training set. The high F1 scores in Table 1 and the effectiveness of noisy label refinery in Table 3 have proved the feasibility of SCDL quantitatively. For a more intuitive understanding, we present two samples from CoNLL03 after two periodic updates to show the denoising ability of SCDL in Table 4. In case 1, which contains both kinds of label noise, the person name "Abyss DeJesus" and the organization name "St. Christopher Children 's Hospital" are not correctly annotated by distant supervision. After denoising, "Abyss DeJesus" is corrected and transformed into a useful instance. Though the hospital name is still not corrected in teacher-student network 2, network 2 successfully selects reliable annotations for training its student. This shows that SCDL can not only exploit reliable instances but also explore unreliable ones. A similar situation occurs in case 2, where network 2 shows better capability, which demonstrates the validity of the co-training paradigm.
Efficiency Analysis. In the training stage, with the same batch size, the serial throughput of our method is about 1.5 batches per second on a single Tesla T4 GPU, while baselines such as BOND reach 2.6, Co-teaching+ 1.8, and JoCoR 1.9. The memory usage of our method is equivalent to that of co-training models (e.g., Co-teaching+). Although there are two student and two teacher models in our method, only the two students need back-propagation, which accounts for the main computational overhead (time and memory usage), while the two teachers updated with EMA only need forward-propagation, which is much cheaper. It is worth noting that the two teacher-student networks in our framework can be trained in parallel, which will further accelerate training. What's more, compared with other baselines, the test efficiency of our method is the same because we only use one model for prediction.

Conclusion and Future Work
This paper proposes SCDL to handle two kinds of label noise in DS-NER without any assumption. With the devised teacher-student network and co-training paradigm, SCDL can not only exploit more reliable annotations to avoid the negative effect of noisy labels but also explore more useful information from the mislabeled data. Experimental results confirm its effectiveness and robustness in dealing with label noise. For future work, data augmentation is worth exploring in our framework. Besides, SCDL can also be adapted to other NLP denoising tasks, e.g., classification and matching.

A.3 Hyperparameter Settings
We first tune the hyperparameters of the two student models (i.e., θ_s1 and θ_s2), e.g., the learning rate chosen from {1e-5, 2e-5, 5e-5, 1e-4}, the number of training epochs from {20, 50, 100}, and the batch size from {16, 32}. The pre-training epoch is determined when the F1 score on the development set no longer increases.
The number of warmup steps for the scheduler is chosen from {100, 200, 500}. Then we tune the EMA coefficient α from {0.9, 0.99, 0.995, 0.998} for the teacher models (i.e., θ_t1 and θ_t2). Finally, we tune the update cycle from 100 to 8000 iterations according to the size of the dataset. The confidence threshold is set to 0.9. The remaining parameters are the defaults in huggingface Transformers. For fair comparison, NegSampling and BOND adopt RoBERTa as the basic model, and Co-teaching+ and JoCoR adopt RoBERTa and DistilRoBERTa as the basic models. For NegSampling, we run the officially released code using the suggested hyperparameters from the original paper. For Co-teaching+ and JoCoR, the noise rate τ is calculated by comparing the distantly supervised and the original training sets.
We conduct the experiments on NVIDIA Tesla T4 GPU. It is worth noting that only the best model θ ∈ {θ t 1 , θ s 1 , θ t 2 , θ s 2 } is adopted for predicting in our SCDL framework. Therefore, the complexity of our model is not increased during the test stage.