Learning from Noisy Labels for Entity-Centric Information Extraction

Recent information extraction approaches have relied on training deep neural models. However, such models can easily overfit noisy labels and suffer from performance degradation. While it is very costly to filter noisy labels in large learning resources, recent studies show that such labels take more training steps to be memorized and are more frequently forgotten than clean labels, and are therefore identifiable in training. Motivated by these properties, we propose a simple co-regularization framework for entity-centric information extraction, which consists of several neural models with identical structures but different parameter initialization. These models are jointly optimized with the task-specific losses and are regularized to generate similar predictions based on an agreement loss, which prevents overfitting on noisy labels. Extensive experiments on two widely used but noisy benchmarks for information extraction, TACRED and CoNLL03, demonstrate the effectiveness of our framework. We release our code to the community for future research.


Introduction
Deep neural models have achieved significant success on various information extraction (IE) tasks. However, when training labels contain noise, deep neural models can easily overfit the noisy labels, leading to severe performance degradation (Arpit et al., 2017; Zhang et al., 2017a). Unfortunately, labeling large corpora, whether through human annotation (Raykar et al., 2010) or automated heuristics (Song et al., 2015), inevitably suffers from labeling errors. This problem has even drastically affected widely used benchmarks, such as CoNLL03 (Sang and De Meulder, 2003) and TACRED (Zhang et al., 2017b), where a notable portion of labels are incorrect due to annotation errors, which has largely hindered the performance of SOTA¹

¹ Our code is publicly available at https://github.com/wzhouad/NLL-IE

Figure 1: Illustration of our co-regularization framework. The base models are jointly optimized with the task-specific loss from label y and an agreement loss, which regularizes the models to generate similar predictions to the aggregated soft target probability q.
systems (Reiss et al., 2020; Alt et al., 2020). Hence, developing a robust learning method that better tolerates noisy supervision represents an urgent challenge for emerging IE models.
So far, few research efforts have been made toward developing noise-robust IE models, and existing work mainly focuses on the weakly supervised or distantly supervised setting (Surdeanu et al., 2012; Ratner et al., 2016; Huang and Du, 2019; Mayhew et al., 2019). Most such methods either depend on multi-instance learning over bags of instances provided by distant supervision (Surdeanu et al., 2012; Zeng et al., 2015; Ratner et al., 2016) or require an additional clean and sufficiently large reference dataset to develop a noise filtering model (Qin et al., 2018). Accordingly, those methods may not readily adapt to supervised training settings, where the aforementioned auxiliary learning resources are not always available. CrossWeigh (Wang et al., 2019c) is a representative work that denoises a natural language dataset without using extra learning resources. This method trains multiple independent models on different partitions of the training data and down-weights instances on which the models disagree. Though effective, a method of this kind requires training tens of redundant neural models, leading to excessive computational overhead for large models. As far as we know, the problem of noisy labels in supervised learning for IE tasks has not been well investigated.
In this paper, we aim to develop a general denoising framework that can easily incorporate existing supervised learning models for entity-centric IE tasks. Our method is motivated by studies (Arpit et al., 2017;Toneva et al., 2019) showing that noisy labels often have delayed learning curves, as incorrectly labeled instances are more likely to contradict the inductive bias captured by the model. Hence, noisy label instances take a longer time to be picked up by neural models and are frequently forgotten in later epochs. Therefore, predictions by more than one model tend to disagree on such instances. Accordingly, we propose a simple yet effective co-regularization framework to handle noisy training labels, as illustrated in Fig. 1. Our framework consists of two or more neural classifiers with identical structures but different initialization. In training, all classifiers are optimized on the training data with the task-specific loss and jointly regularized with regard to an agreement loss that is defined as the Kullback-Leibler (KL) divergence among predicted probability distributions. Then for instances where a classifier's predictions disagree with labels, the agreement loss encourages the classifier to give similar predictions to the other classifier(s) instead of the actual (possibly noisy) labels. In this way, the framework prevents the incorporated classifiers from overfitting noisy labels.
We apply the framework to two important entity-centric IE tasks, named entity recognition (NER) and relation extraction (RE). We conduct extensive experiments on two prevalent but noisy benchmarks, CoNLL03 for NER and TACRED for RE, and apply the proposed learning framework to train various models from prior studies for these two tasks. The results demonstrate the effectiveness of our method in noise-robust training, leading to promising and consistent performance improvements. We present our contributions as follows: • We propose a general co-regularization framework that can effectively learn supervised IE models from noisy datasets without the need for any extra learning resources.
• We discuss in detail the different design strategies of the framework and the trade-off between efficiency and effectiveness.
• Extensive experiments on NER and RE demonstrate that our framework yields promising improvements on various SOTA models and outperforms existing denoising frameworks.

Method
In this paper, we focus on developing a noise-robust learning framework that improves supervised models for entity-centric IE tasks. In such tasks, (noisy) labels can be assigned to either individual tokens (NER) or pairs of entities (RE) in natural language text. Specifically, D = {(x, y)} is a noisily labeled dataset, where each data instance consists of a lexical sequence or context x and a label y. y is annotated either on tokens of x for NER or on a pair of entity mentions in x for RE. For some instances in D, the labels are incorrect. Our objective is to learn a noise-robust model f in the presence of such noisily labeled instances from D, without using external resources such as a clean development dataset (Qin et al., 2018).

Learning Process
Our framework is motivated by the delayed learning curve of a neural model on noisy data, compared with learning on clean data. On noisy data, neural models tend to fit easy and clean instances that are more consistent with the well-represented patterns of the data in early steps, but need more steps to capture the noise (Arpit et al., 2017). Moreover, learned noisy examples tend to be frequently forgotten in later epochs (Toneva et al., 2019), since they conflict with the general inductive bias represented by the clean data majority. Therefore, model predictions are likely to be consistent with the clean labels, while they are often inconsistent or oscillate on noisy labels over different training epochs. As a result, labels that differ from the model's predictions in the later epochs of training are likely to be noisy and should be down-weighted or rectified so as to reduce their impact on optimization. The proposed framework incorporates several copies of a task-specific IE model with the same architecture but different (random) parameter initialization. These IE models are jointly optimized on the noisy dataset based on their task-specific losses as well as an agreement loss. During training, the predicted probability distributions from the models are aggregated into a soft target probability, which represents the models' estimate of the true label. The agreement loss is responsible for encouraging these models to generate similar predictions to the soft target probability. In this learning process, models starting their training from varied initialization generate different decision boundaries. By aggregating their predictions, the soft target probability can better separate noisy labels from clean labels that have not yet been learned.
The learning process of our framework is described in Alg. 1. It consists of M (M ≥ 2) copies of the task-specific model, denoted {f_k}_{k=1}^M, with different initialization. Regarding initialization, for models that are trained from scratch, all parameters are randomly initialized. Otherwise, for those built upon pre-trained language models, only the parameters external to the language models (e.g., those of a downstream softmax classifier) are randomly initialized, while the pre-trained parameters are the same. Once initialized, our framework trains those models in two phases. The first α% of training steps undergo a warm-up phase, where α is a hyperparameter. This phase seeks to help the models reach initial convergence on the task. When a new batch comes in, we first calculate the task-specific training losses of the M models, {L_sup^(k)}_{k=1}^M, average them as L_T, and update the model parameters w.r.t. L_T. After the warm-up phase, an agreement loss L_agg is further introduced to measure the distance from the predictions of the M models to the soft target probability q. Parameters are then updated based on the joint loss L, encouraging the models to generate predictions that are consistent with both the training labels and the soft target probability. The formalization of the loss function is described next (§2.2). In the end, we can either use the model f_1 or select the best-performing model for inference.
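As a concrete illustration, the warm-up gating of the joint loss can be sketched in a few lines of Python. This is a minimal sketch under our own naming: `joint_loss`, its arguments, and the scalar-loss representation are illustrative assumptions, not the paper's implementation.

```python
def joint_loss(task_losses, agreement_loss, gamma, step, total_steps, alpha):
    """Average the M task-specific losses into L_T; add the agreement
    loss only after the warm-up phase (the first alpha% of steps)."""
    l_t = sum(task_losses) / len(task_losses)   # L_T
    if step < total_steps * alpha / 100.0:      # still in warm-up
        return l_t
    return l_t + gamma * agreement_loss         # L = L_T + gamma * L_agg
```

For example, with M = 2 task losses of 0.5 and 0.7, γ = 10, and an agreement loss of 0.2, the joint loss is 0.6 during warm-up and 2.6 afterwards.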

Co-regularization Objective
In our framework, the influence of noisy labels in training is decreased by optimizing the agreement loss. Specifically, given a batch of data instances B = {x_i}_{i=1}^N, we first feed the instances to the M incorporated models to get their predictions on B, where p_i^(k) ∈ R^C is the predicted probability distribution over C classes. Then we calculate the soft target probability q by averaging the predictions:

q_i = (1/M) · Σ_{k=1}^M p_i^(k),    (1)

which represents the models' estimate of the true label. Finally, we calculate the agreement loss L_agg as the average KL divergence from q to each p^(k), k = 1, ..., M:

L_agg^(k) = (1/N) · Σ_{i=1}^N KL(q_i ∥ p_i^(k) + ε),    (2)

L_agg = (1/M) · Σ_{k=1}^M L_agg^(k),    (3)

where ε is a small positive number to avoid division by zero. Accordingly, each iteration of Alg. 1 proceeds as follows: get the probability distributions over classes {p^(k)}_{k=1}^M from the M models; calculate the soft target probability q by Eq. 1; calculate the agreement loss L_agg by Eq. 2 and Eq. 3; set L = L_T + γ · L_agg; update model parameters w.r.t. L; finally, return f_1 or the best-performing model. We can easily tell that the agreement loss encourages the models to make similar predictions on the same input. As the KL divergence is non-negative, the agreement loss is minimized only when all p_i^(k) are equal, because we use the average probability for q. We may also use other aggregates for q, as long as they preserve the property that L_agg is minimized only when all p_i^(k) are equal. We consider the following alternatives for q: • Average logits. Given the logits {l_i^(k)}_{k=1}^M of the M models, we average the logits and apply softmax to obtain q_i. • Max-loss probability. A noise-robust model will disagree with noisy labels and produce large training losses. Therefore, for each instance i in the batch, we assume the prediction p_i^* that has the largest task-specific loss among the M models to be more reliable and use it as the soft target probability for instance i.
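A numerical sketch of Eq. 1 through Eq. 3 is given below in pure Python for clarity. The KL direction from q to each p^(k) follows our reading of the text, and the function names are ours.

```python
import math

def soft_target(preds):
    """Eq. 1: element-wise average of the M predicted distributions."""
    m = len(preds)
    return [sum(p[c] for p in preds) / m for c in range(len(preds[0]))]

def agreement_loss(preds, eps=1e-12):
    """Eq. 2 and Eq. 3: mean KL divergence from the soft target q to
    each model's prediction p^(k); eps guards against division by zero."""
    q = soft_target(preds)
    def kl(q_dist, p_dist):
        return sum(qc * math.log(qc / (pc + eps))
                   for qc, pc in zip(q_dist, p_dist) if qc > 0)
    return sum(kl(q, p) for p in preds) / len(preds)
```

When all copies agree (identical distributions), the loss is zero; any disagreement makes it positive, which is exactly the property the aggregate for q must preserve.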
In experiments, we observe that all aggregate functions generally achieve similar performance. We present the results of different q in §4.6.

Joint Training
The main learning objective of our framework is then to optimize the joint loss L = L_T + γ · L_agg, where γ is a positive hyperparameter and L_T is the average of the task-specific classification losses {L_sup^(k)}_{k=1}^M. For classification problems such as NER and RE, the task-specific loss is defined as the following cross-entropy loss, where I denotes an indicator function:

L_sup^(k) = −(1/N) · Σ_{i=1}^N Σ_{c=1}^C I(y_i = c) · log p_{i,c}^(k),

where N is the number of tokens for NER and the number of sentences for RE. The joint training can be interpreted as a "soft-pruning" scheme. For clean labels, where the models' predictions are usually close to the labels, the agreement loss and its gradient are both small, so they have a small impact on training. For noisy labels, where the model predictions disagree with the training labels, the agreement loss incurs gradients of large magnitude in training, which prevents the model from overfitting the noisy labels.
Besides co-regularization, denoising may also be attempted by "hard-pruning" the noisy labels. Small-loss selection (Jiang et al., 2018; Han et al., 2018) assumes that instances with large task-specific losses are noisy and excludes them from training. However, some clean instances, especially those from long-tail classes, can also have large task-specific losses and will be incorrectly pruned, while for the frequent classes, some noisy instances can have smaller task-specific losses and fail to be identified. Such errors can accumulate during training and may hinder model performance. In our framework, as we use the agreement loss instead of hard pruning, such errors are not easily propagated (see §4.6).

Tasks
We evaluate our framework on two fundamental entity-centric IE tasks, namely RE and NER. Our framework can incorporate any kind of neural model that is dedicated to either task. Particularly, in this paper, we adopt off-the-shelf SOTA models that are mainly based on Transformers. This section introduces the two attempted tasks and the design of task-specific models.
Relation extraction. RE aims at identifying the relation between a pair of entities in a piece of text from a given vocabulary of relations. Specifically, given a sentence x and two entities e_s and e_o, identified as the subject and object entities respectively, the goal is to predict the relation between e_s and e_o. Following Shi and Lin (2019), we formulate this task as a sentence classification problem. Accordingly, we first apply the entity masking technique (Zhang et al., 2017b) to the input sentence, replacing the subject and object entities with their named entity types. For example, the short sentence "Bill Gates founded Microsoft" becomes "[SUBJECT-PERSON] founded [OBJECT-ORGANIZATION]" after entity masking. We then feed the sentence to the pre-trained language model and use a softmax classifier on the representation of the [CLS] token to predict the relation.
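The entity masking step can be sketched as follows. This is an illustrative helper: the function name and token-span interface are our assumptions, while the real preprocessing uses the dataset's annotated spans and NER types.

```python
def mask_entities(tokens, subj_span, subj_type, obj_span, obj_type):
    """Replace subject/object entity tokens with typed placeholders.
    Spans are half-open (start, end) token indices."""
    out, i = [], 0
    while i < len(tokens):
        if i == subj_span[0]:
            out.append(f"[SUBJECT-{subj_type}]")
            i = subj_span[1]          # skip past the subject mention
        elif i == obj_span[0]:
            out.append(f"[OBJECT-{obj_type}]")
            i = obj_span[1]           # skip past the object mention
        else:
            out.append(tokens[i])
            i += 1
    return out
```

For the example above, `mask_entities(["Bill", "Gates", "founded", "Microsoft"], (0, 2), "PERSON", (3, 4), "ORGANIZATION")` yields `["[SUBJECT-PERSON]", "founded", "[OBJECT-ORGANIZATION]"]`.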
Named entity recognition. NER seeks to locate and classify named entities in text into pre-defined categories. Following Devlin et al. (2019), we formulate the task as a token classification problem. In detail, a Transformer-based language model first tokenizes an input sentence into a sub-token sequence. To classify each token, the representation of its first sub-token is sent into a softmax classifier. We use the BIO tagging scheme (Ramshaw and Marcus, 1995) and output the tag with the maximum likelihood as the predicted label.
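The first-sub-token selection can be sketched as follows. This is illustrative: the helper name is ours, and real models obtain the word-to-sub-token grouping from a subword tokenizer such as WordPiece.

```python
def first_subtoken_indices(word_subtokens):
    """Given each word's sub-token list, return the flat index of the
    first sub-token of every word, i.e. the positions whose
    representations are fed to the softmax classifier."""
    indices, pos = [], 0
    for subs in word_subtokens:
        indices.append(pos)
        pos += len(subs)
    return indices
```

For instance, if "EU rejects German" tokenizes to `[["EU"], ["rej", "##ects"], ["German"]]`, the classified positions are `[0, 1, 3]`.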

Experiment
In this section, we evaluate the proposed learning framework based on two (noisy) benchmark datasets for the two entity-centric IE tasks ( §4.1- §4.4). In addition, a noise filtering analysis is presented to show how our framework prevents an incorporated neural model from overfitting noisy training data ( §4.5), along with a detailed ablation study about configurations with varied model copies, alternative noise filtering strategies, target functions, and different noise rates ( §4.6).

Datasets
The experiments are conducted on TACRED (Zhang et al., 2017b) and CoNLL03 (Sang and De Meulder, 2003). TACRED is a crowdsourced dataset for relation extraction. A recent study by Alt et al. (2020) found a large portion of examples to be mislabeled and rectified some incorrect labels in the development and test sets. CoNLL03 is a human-annotated dataset for NER. Another study by Wang et al. (2019c) found that in 5.38% of the sentences in CoNLL03, at least one token is mislabeled. Accordingly, Wang et al. also relabeled the test set². We summarize the statistics of both datasets in Tab. 1. For all compared methods, we report results on both the original and relabeled evaluation sets.

Base Models
We evaluate our framework by incorporating the following SOTA models: • C-GCN (Zhang et al., 2018) is a graph-based model for RE. It prunes the dependency graph and applies graph convolutional networks to obtain the representations of entities.
• BERT (Devlin et al., 2019) is a Transformerbased language model that is pre-trained from large-scale text corpora. Both Base and Large versions of the model are considered in our experiments.
• LUKE (Yamada et al., 2020) is a Transformerbased language model that is pre-trained on both large-scale text corpora and knowledge graphs. It achieves SOTA performance on various entityrelated tasks, including RE and NER.
We report the performance of the base models trained with and without our co-regularization framework. We also compare our framework to CrossWeigh (Wang et al., 2019c), another noisy-label learning framework. Specifically, CrossWeigh partitions the training set into equal-sized chunks, reserves one chunk at a time, and trains several models on the remaining chunks. After training, the models predict on the reserved chunk, and instances on which the models disagree are down-weighted. In the end, the chunks are combined and used to train a new model for inference. Learning by CrossWeigh incurs a high computation cost: Wang et al. (2019c) split the CoNLL03 dataset into 10 chunks and train 3 models on each partition, resulting in a total of 30 models.
In this paper, we follow their settings and train 30 models on both TACRED and CoNLL03.
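For illustration, the partition scheme we re-implement can be sketched as follows. This is a high-level sketch with our own helper name; the disagreement-based weighting and per-fold model training follow Wang et al. (2019c) and are omitted here.

```python
def crossweigh_folds(instances, n_chunks):
    """Split data into n_chunks; each fold reserves one chunk (for
    disagreement-based down-weighting) and trains on the rest."""
    chunks = [instances[i::n_chunks] for i in range(n_chunks)]
    folds = []
    for k, held_out in enumerate(chunks):
        train = [x for j, c in enumerate(chunks) if j != k for x in c]
        folds.append((train, held_out))
    return folds
```

With 10 chunks and 3 models trained per fold, this yields the 30 models mentioned above.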

Model Configurations
For the base models C-GCN (Zhang et al., 2018) and LUKE (Yamada et al., 2020), we rerun the officially released implementations using the hyperparameters recommended in the original papers. We implement BERT-Base and BERT-Large based on Huggingface's Transformers (Wolf et al., 2020). For CrossWeigh (Wang et al., 2019c), we re-implement the framework using the compared base models. All models are optimized with Adam (Kingma and Ba, 2015) using a learning rate of 6e−5 for TACRED and 1e−5 for CoNLL03, with a linear learning rate decay to 0. The batch size is fixed at 64 for all models. We finetune the TACRED model for 5 epochs and the CoNLL03 model for 50 epochs. The best model checkpoint is chosen based on the F1 score on the development set. We tune γ from {1.0, 2.0, 5.0, 10.0, 20.0} and α from {10, 30, 50, 70, 90}. We report the median F1 of 5 runs using different random seeds. For efficiency, we use the simplest setup of our framework with two model copies (M = 2) in the main experiments (§4.4), and q is set as the average probability. Performance with more model copies and alternative aggregates is studied in §4.6.

Main Results
The experimental results on TACRED and CoNLL03 are reported in Tab. 2 and Tab. 3 respectively, where methods incorporated in our learning framework are marked with "CR". As stated, the results are reported under the setup where M = 2. For a fair comparison, the results are based on the predictions of model f_1 in the framework. On TACRED, our framework yields an absolute improvement of 2.5−4.1% in F1 on the relabeled test set for Transformer-based models, and a relatively smaller gain (0.8% in F1) for C-GCN. In particular, our framework enhances the SOTA method LUKE by 2.5% in F1, leading to a very promising F1 score of 83.1%. On CoNLL03, where the noise rate is smaller than in TACRED, our framework leads to a performance gain of 0.28−0.82% in F1 on the relabeled test set. On both IE tasks, our framework also leads to a consistent improvement on the original test set. Compared to CrossWeigh, except for C-GCN where the results are similar, our framework consistently outperforms it by 0.9−2.2% on TACRED and by 0.13−0.45% on CoNLL03. Moreover, as our framework trains M models concurrently while CrossWeigh trains many redundant models (30 in our experiments), the computation cost of our co-regularization framework is much lower. In general, these results show the effectiveness and practicality of the proposed framework.

Noise Filtering Analysis
The main experiments show that our framework can improve the overall performance of models trained with noisy labels. In this section, we further demonstrate how our framework prevents overfitting on noisy labels. To do so, we extract the 2,526 noisy instances from the development and test sets of TACRED where the relabeling by Alt et al. (2020) disagrees with the original labels. Accordingly, we obtain a noisy set containing those examples with their original labels and a clean set with the rectified labels. We train a relation classifier on the union of the training set and the noisy set and then evaluate the model on the clean set. In this case, worse performance on the clean set indicates more severe overfitting on noisy labels.

Table 3: F1 score (%) on the dev and test sets of CoNLL03. ♣ marks results obtained using the originally released code.

Fig. 2 shows the results of C-GCN-CR and BERT-Base-CR on the clean set, where we observe that: (1) Compared to the original base models (γ = 0.0), those trained with our framework achieve higher F1 scores, indicating improved robustness against label noise; (2) Comparing different base models, the large classifier BERT-Base is typically less noise-robust than a smaller model like C-GCN, which explains why the performance gain from our framework is more notable on BERT-Base; (3) For both models, the F1 score first increases and then decreases, consistent with the delayed learning curves that neural models exhibit on noisy instances (Arpit et al., 2017).

Ablation Study
Using extra model copies. The main results show that using two copies of a model in the co-regularization framework already improves the performance by a remarkable margin. Intuitively, more models may generate higher-quality soft target probabilities and thus further improve the performance. We therefore study the performance on TACRED when incorporating more model copies, reporting the relabeled test F1 in Tab. 4. We observe that increasing the number of copies does not necessarily lead to a notable increase in performance. On BERT-Large, increasing the number of model copies from 2 to 4 gradually improves the performance from 82.0% to 82.7%, while on BERT-Base, increasing the number of model copies does not improve the performance. We notice that the increased number of copies leads to a significant increase in the agreement loss for BERT-Base, indicating that the copies of BERT-Base fail to reach a consensus on the same input. This may be due to the relatively small model capacity of BERT-Base. Overall, this study shows that the optimal M depends on the model and needs to be tuned for the specific task. Note that as the models can be trained in parallel, increasing the number of models does not necessarily increase the training time, though at the cost of more computational resources.
Alternative strategies for noise filtering. Besides co-regularization, we also experiment with other noise-filtering strategies. Small-loss selection (Jiang et al., 2018;Han et al., 2018;Lee and Chung, 2020) prunes the instances with the largest training losses in the training batches. This method is motivated by the fact that the noisy instances take a longer time to be memorized and usually cause a large training loss. We further try another strategy named relabeling. Instead of pruning the large-loss training instances, we relabel them with the most likely labels from model predictions.
We evaluate the two noise filtering strategies on TACRED using BERT-Base as the base model. For both strategies, following Han et al. (2018), we prune/relabel δ_t = δ · t/T percent of the examples with the largest training loss in each training batch, where t is the current training step, T is the total number of training steps, and δ is the maximum pruning/relabeling rate. These hyperparameters are tuned on the development set. The training loss is defined as the average task-specific loss of the M models, where we set M = 2, consistent with the main experiments (§4.4). We try δ from {2%, 5%, 8%} and report the best results in Table 5. We find that δ = 2% achieves the best performance for both strategies. The small-loss selection strategy underperforms the base model without noise filtering. Relabeling outperforms the base model slightly, but the improvement is smaller than that of the proposed co-regularization method. We observe that these two strategies do not work well on imbalanced datasets, mostly pruning or relabeling training examples from long-tail classes. Specifically, on the TACRED dataset, where the NA class accounts for 80% of the labels, only 20% of the pruned labels are from NA while the remaining 80% are from other classes. This is because the model's predictions are biased towards the frequent classes on imbalanced datasets, leading to large training losses on long-tail instances. Once pruned or relabeled, such long-tail instances are excluded from training, causing further error propagation that can lead to more biased predictions. Our framework, on the contrary, adopts an agreement loss instead of hard pruning or relabeling, which reduces such error propagation.
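The linear schedule and loss-based pruning used in this comparison can be sketched as follows. These are our illustrative helpers: Han et al. (2018) define the schedule, while the function names and the integer rounding are our assumptions.

```python
def num_pruned(batch_size, step, total_steps, max_rate):
    """delta_t = delta * t / T: the pruning/relabeling rate grows
    linearly from 0 to the maximum rate (in percent) over training."""
    rate = max_rate * step / total_steps
    return int(batch_size * rate / 100.0)

def prune_largest_losses(losses, k):
    """Drop the k instances with the largest training loss; return the
    indices of the instances kept for training."""
    order = sorted(range(len(losses)), key=lambda i: losses[i])
    return sorted(order[:len(losses) - k])
```

For example, with a batch of 64, δ = 8%, and training halfway done, `num_pruned(64, 50, 100, 8)` prunes 2 instances.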
Alternative aggregates for q. Besides the average probability, we evaluate two other aggregates for q, i.e., the average logits and the max-loss probability (§2.2). This experiment is conducted with M = 2. F1 results on the relabeled TACRED test set (Tab. 6) suggest that different aggregates generally achieve comparable performance, with a marginal difference of up to 0.6% in F1. Therefore, we suggest the average probability, which is easier to implement, as the default setup.

Performance under different noise rates. We further evaluate our framework on training data with different noise rates. To do so, we create noisy training data by randomly flipping 10%, 30%, 50%, 70%, or 90% of the labels in the training set of TACRED. We then use these synthetic noisy training sets to train RE models and evaluate them on the relabeled test set of TACRED. We use BERT-Base as the base model and report the median F1 score of 5 trials. Results are given in Tab. 7, which show that our co-regularization framework consistently outperforms both the base model and CrossWeigh under different noise rates, and the gain generally becomes larger as the noise rate increases. Compared to BERT-Base trained on training sets from which all flipped labels are removed, our framework, even trained on synthetic noise, achieves comparable or better results when the noise rate is below 50%.
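The synthetic-noise construction can be sketched as follows. This is illustrative: the uniform choice among the other labels and the seeding are our assumptions, consistent with "randomly flipping" a fraction of the labels.

```python
import random

def flip_labels(labels, noise_rate, label_set, seed=0):
    """Randomly replace a noise_rate fraction of the labels with a
    different label from label_set."""
    rng = random.Random(seed)
    n_flip = int(len(labels) * noise_rate)
    flip_idx = set(rng.sample(range(len(labels)), n_flip))
    noisy = []
    for i, y in enumerate(labels):
        if i in flip_idx:
            # Pick any label other than the original one.
            noisy.append(rng.choice([c for c in label_set if c != y]))
        else:
            noisy.append(y)
    return noisy
```

Since every flipped label differs from the original, a noise rate of 0.3 over 100 instances yields exactly 30 corrupted labels.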

Related Work
We discuss two lines of related work, each with a large body of literature of which we can only provide a highly selective summary.
Distant supervision. Distant supervision (Mintz et al., 2009) generates noisy training data by heuristically aligning unlabeled data with labels, and much effort has been devoted to reducing the labeling noise. Multi-instance learning (Zeng et al., 2015; Lin et al., 2016; Ji et al., 2017) creates bags of noisily labeled instances, assumes at least one instance in each bag is correct, and then uses heuristics or auxiliary classifiers to select the correct labels. However, such instance bags may not exist in a general supervised setting. Reinforcement learning (Qin et al., 2018; Yang et al., 2019; Wang et al., 2020) and curriculum learning (Jiang et al., 2018; Huang and Du, 2019) methods use a clean validation set to obtain an auxiliary model for noise filtering, but constructing a perfectly labeled validation set is expensive. Our framework can learn noise-robust IE models without extra learning resources and can easily incorporate existing supervised IE models.
Supervised learning with noisy labels. A deep neural network can memorize noisy labels, and its generalizability severely degrades when trained with noisy labels (Zhang et al., 2017a). In computer vision, supervised image classification with noise has been investigated extensively, producing techniques such as robust loss functions (Zhang and Sabuncu, 2018; Wang et al., 2019b), noise filtering layers (Sukhbaatar et al., 2015; Goldberger and Ben-Reuven, 2017), label re-weighting, robust regularization (Krogh and Hertz, 1992; Srivastava et al., 2014; Müller et al., 2019), and sample selection (Malach and Shalev-Shwartz, 2017; Jiang et al., 2018; Han et al., 2018; Wei et al., 2020). Robust loss functions and noise filtering layers require modifying model structures and may not be easily adapted to IE models. Sample selection methods assume data instances with large training losses to be noisy and exclude them from training. However, some clean instances, especially those from long-tail classes, can also have large training losses and be wrongly pruned, leading to propagated errors.
In NLP, few efforts have focused on learning with denoising. CrossWeigh (Wang et al., 2019c), a label re-weighting method, partitions the training data into multiple folds and trains multiple models on each fold. Instances on which the models disagree are regarded as noisy and down-weighted in training. However, this method requires training many models and is computationally expensive. Our framework only requires training several models concurrently, which is more computationally efficient and achieves better performance. NetAb (Wang et al., 2019a) assumes that noisy labels are created by randomly flipping clean labels and uses a CNN to model the noise transition matrix. However, this assumption does not hold for real datasets, where the noise rate varies among data instances (Cheng et al., 2020).

Conclusion
This paper presents a co-regularization framework for learning supervised IE models from noisy data. This framework consists of two or more identically structured models with different initialization, which are encouraged to give similar predictions on the same inputs by optimizing an agreement loss.
On noisy examples where model predictions usually differ from the labels, the agreement loss prevents the model from overfitting noisy labels. Experiments on NER and RE benchmarks show that our framework yields promising improvements on various IE models. For future work, we plan to extend the use of the proposed framework to other tasks such as event-centric IE (Chen et al., 2021) and co-reference resolution (Peng et al., 2015).

Ethical Consideration
This work does not present any direct societal consequence. The proposed work seeks to develop a general learning framework that learns more robust neural models for entity-centric information extraction under noisy-label settings. We believe this leads to intellectual merits that benefit the information extraction community, where learning resources often suffer from noisy labeling issues. It also potentially has broad impact, since the tackled issues widely exist in tasks of other areas.