H-FND: Hierarchical False-Negative Denoising for Distant Supervision Relation Extraction

Although distant supervision automatically generates training data for relation extraction, it also introduces false-positive (FP) and false-negative (FN) training instances into the generated datasets. Whereas both types of errors degrade the final model performance, previous work on distant supervision denoising focuses more on suppressing FP noise and less on resolving the FN problem. We here propose H-FND, a hierarchical false-negative denoising framework for robust distant supervision relation extraction, as an FN denoising solution. H-FND uses a hierarchical policy which first determines whether non-relation (NA) instances should be kept, discarded, or revised during the training process. For those learning instances which are to be revised, the policy further reassigns appropriate relations to them, making them better training inputs. Experiments on SemEval-2010 and TACRED were conducted with controlled FN ratios that randomly turn the relations of training and validation instances into negatives to generate FN instances. In this setting, H-FND revises FN instances correctly and maintains high F1 scores even when 50% of the instances have been turned into negatives. A further experiment on NYT10 shows that H-FND is applicable in a realistic setting.


Introduction
Relation extraction (Zelenko, Aone, and Richardella 2003; Mooney and Bunescu 2006; Zhou et al. 2005) is a core task in information extraction. Its goal is to determine the relation between two entities in a given sentence. For instance, given the sentence "Jobs was born in San Francisco", with head and tail entities "Jobs" and "San Francisco", the relation to be extracted is "Place of Birth". Relation extraction has many applications, such as question answering and knowledge graph completion.
A major difficulty in supervising relation extraction models is the cost of collecting training data, which distant supervision (DS) (Hoffmann et al. 2011; Surdeanu et al. 2012) was proposed to address. DS obtains relational facts from a knowledge base and aligns these facts to all sentences in the corpus to generate learning instances. Specifically, if a relation triple r(h, t) exists in a knowledge base, then a sentence s which mentions both the head entity h and the tail entity t is tagged with relation r to form a learning instance (r, h, t, s).
* Equal contribution.
Although datasets for relation extraction can be generated using distant supervision, they contain considerable noise (Roth et al. 2013). Under distant supervision, there are two types of noisy instances: false positives (FP) and false negatives (FN). Table 1 shows an example of each. The FP instance "Jobs moved back to San Francisco" does not express the relation "Place of Birth", yet it is labeled with it. The FN instance "Manuela was born in New York" is wrongly labeled as a non-relation (NA) under the closed-world assumption, because the knowledge base contains no relation between "Manuela" and "New York". Both FPs and FNs degrade model performance if they are treated as correct labels at training time. FPs harm prediction precision, while excessive FNs lead to low recall rates.
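The labeling procedure and the two kinds of noise can be illustrated with a minimal sketch. The toy knowledge base, the substring-based entity matching, and the function name `distant_label` are assumptions made for illustration only; real DS pipelines use entity linking rather than string matching.

```python
from typing import Dict, Optional, Tuple

# Toy knowledge base mapping (head, tail) to a relation. Illustrative contents only.
KB: Dict[Tuple[str, str], str] = {("Jobs", "San Francisco"): "Place of Birth"}


def distant_label(sentence: str, head: str, tail: str) -> Optional[Tuple[str, str, str, str]]:
    """Build a learning instance (r, h, t, s) by aligning the KB to a sentence.

    Returns None when the sentence does not mention both entities.
    Entity pairs absent from the KB receive NA (closed-world assumption).
    """
    if head not in sentence or tail not in sentence:
        return None
    relation = KB.get((head, tail), "NA")
    return (relation, head, tail, sentence)


examples = [
    ("Jobs was born in San Francisco", "Jobs", "San Francisco"),    # correct positive
    ("Jobs moved back to San Francisco", "Jobs", "San Francisco"),  # false positive
    ("Manuela was born in New York", "Manuela", "New York"),        # false negative
]
labeled = [distant_label(s, h, t) for s, h, t in examples]
for instance in labeled:
    print(instance)
```

The second sentence inherits "Place of Birth" only because the entity pair is in the knowledge base, and the third falls back to NA only because the pair is missing from it.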
In addition to denoising methods for learning robustly with noisy data (Han et al. 2018; Northcutt, Jiang, and Chuang 2019), many works focus on alleviating the FP problem in DS datasets, including those on pattern-based extraction (Alfonseca et al. 2012; Jia et al. 2019), multiple-instance learning (Surdeanu et al. 2012; Lin et al. 2016; Zeng et al. 2018), and sentence-level denoising with adversarial training or reinforcement learning (Qin, Xu, and Wang 2018a,b; Feng et al. 2018). However, few investigate the FN problem for distant supervision (Xu et al. 2013; Roller et al. 2015). To the best of our knowledge, there is no previous study on this problem for deep neural networks.
In this paper, we investigate the impact of FNs on neural-based models and propose H-FND, a hierarchical false-negative denoising framework for robust distant supervision. Specifically, this framework integrates a deep reinforcement learning agent, which keeps, discards, or revises probable FN instances, with a relation classifier, which generates the revised relations. In addition, to constrain the study to the FN problem and to construct ground-truth relations for further analysis of model behavior, we conduct our research on the following two human-annotated datasets: SemEval-2010 (Hendrickx et al. 2010) and TACRED (Zhang et al. 2017), with controlled FN ratios that randomly flip relations of training/validation instances into negatives to generate FN instances. Then, we further conduct an experiment on the distantly supervised dataset NYT10 (Riedel, Yao, and McCallum 2010), with its positive set fixed, to demonstrate that our framework is applicable to resolving the FN problem in a realistic setting. In summary, our contributions are three-fold:
• We propose a denoising framework focused on false negatives in relation extraction.
• We present a special transfer learning scheme for pretraining the denoising agent, as no training data is available for this pretraining task.
• We show that our method revises correctly and maintains high F1 scores even under a high percentage of false negatives, and is applicable in a realistic setting.
We organize the rest of this paper as follows: Section 2 discusses the related work, Section 3 describes our H-FND framework, Section 4 shows the experimental results, and Section 5 concludes.

Related Work
Mintz et al. (2009) propose distant supervision (DS) to automatically generate labeled data for relation classification, a new paradigm that synthesizes positive training data by aligning a knowledge base to an unlabeled corpus, and produces negatives with a closed-world assumption. Although this method requires no human effort for sentence labeling, it introduces FPs and FNs into the generated data and degrades the performance of relation extraction models.
Many previous works have attempted to solve the FP problem. Among these works, denoising methods that utilize reinforcement learning (RL) are the most relevant to ours. Feng et al. (2018) propose a sentence-level denoising mechanism that trains a positive instance selector using RL, and set the RL reward to the prediction probability of the relation classifier. Qin, Xu, and Wang (2018b) also utilize RL, but in a different way: they learn a denoising agent that redistributes FPs to NA, using the prediction accuracy of the classifier as the RL reward.
To solve the FN problem, one method is to align the KB to the corpus after performing KB completion using inference (Roller et al. 2015). Although this does reduce the number of FNs in DS datasets, it helps little when the FN relations cannot be inferred from the KB, e.g., when the entities mentioned in the FN are not in the KB. IRMIE (Xu et al. 2013), another method, constructs a negative set in a more conservative sense, in which the head or tail entities have already participated in other relation triples in the KB. Other sentences outside the positive and negative sets are left unlabeled (labeled as RAW in the original paper) to prevent FNs.
After training on the positive and negative sets, positive relation triples are retrieved from the unlabeled set to expand the KB, after which the original DS is performed to improve the quality of relation extraction. The final performance of this method depends heavily on the heuristic for constructing the negative set, which may not be applicable for all possible relation types.
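A loose sketch of the conservative negative-set construction summarized above is given below. It is only one reading of that summary, not the IRMIE implementation: the function name `conservative_label`, the toy knowledge base, and the exact membership test are all assumptions.

```python
from typing import Dict, Tuple


def conservative_label(head: str, tail: str, kb: Dict[Tuple[str, str], str]) -> str:
    """Assign a relation, NA, or RAW to an entity pair.

    Assumed reading of the heuristic: a pair absent from the KB is a negative
    (NA) only if its head or tail already takes part in some KB triple;
    every other pair is left unlabeled (RAW).
    """
    if (head, tail) in kb:
        return kb[(head, tail)]
    known_entities = {entity for pair in kb for entity in pair}
    if head in known_entities or tail in known_entities:
        return "NA"
    return "RAW"


toy_kb = {("Jobs", "San Francisco"): "Place of Birth"}
print(conservative_label("Jobs", "New York", toy_kb))     # head is known to the KB
print(conservative_label("Manuela", "New York", toy_kb))  # neither entity is known
```

Under this reading, a pair such as ("Manuela", "New York") is kept out of the negative set, which is how the heuristic avoids turning unknown entities into false negatives.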
To address the FN problem in DS datasets more generally, we propose a hierarchical denoising method to mitigate the negative effect of FNs, ensuring a more robust relation extraction model when the presence of FN instances is unavoidable.

H-FND Framework
We propose H-FND, a hierarchical false-negative denoising framework that determines whether to keep, discard, or revise negative instances. As illustrated in Fig. 1, H-FND is composed of the denoising agent and relation classifier modules. The denoising agent makes a ternary decision on the action to take on each negative instance, and after discarding, the relation classifier predicts a new relation for each to-be-revised instance to produce a cleaned dataset.

Convolutional Neural Network
Convolutional neural networks (CNN) are commonly adopted for sentence-level feature extraction (Kim 2014) in language understanding tasks, such as relation extraction (Zeng et al. 2014; Nguyen and Grishman 2015). PCNNs (Zeng et al. 2015), a variation of CNN that applies piecewise max pooling, are also widely used for extracting sentence features (Lin et al. 2016; Qin, Xu, and Wang 2018a). We included both as base models in our experiments to show that our framework is base-model agnostic. In our implementation, the extracted features of a learning instance s are fed into a fully connected softmax classifier to compute the output distribution over relations: O(r) = softmax(FC(CNN(s))).
For detailed mathematical descriptions of CNN and PCNN, please refer to the Appendix.

Hierarchical Denoising Policy
The proposed hierarchical denoising policy is a framework using policy-based reinforcement learning (RL). Previous work utilizing RL to suppress noise from FPs (Feng et al. 2018; Qin, Xu, and Wang 2018b) can be categorized into two types of strategies: the first decides whether to remove the input instance, and the second decides whether to revise the input instance to be negative. Both policies make a binary decision on each input instance, and successfully reduce FP instances in DS datasets.
While applicable to the FP problem, these strategies are risky to apply directly to the FN problem. First, discarding a negative instance even when it is most likely positive can result in a loss of useful learning instances. Second, changing a negative instance to positive is not enough for the training process: we must also know which type of positive relation to revise to. Therefore, we propose a hierarchical denoising policy that performs FN denoising in two steps. The first step, a soft policy that combines the two above-mentioned denoising methods, is an agent that takes an action from the action set {Keep, Discard, Revise} for a negative instance s:
• Keep: maintain s as a negative instance for training/validation;
• Discard: remove s to prevent it from misleading the model;
• Revise: predict a new relation type for s and treat it as a positive for the following training/validation.
The policy $\pi(a|s)$ of this ternary decision is computed from the sentence feature extracted from s by the base-model CNN encoder: $\pi(a|s) = \mathrm{softmax}(\mathrm{FC}_1(\mathrm{CNN}_1(s)))$; each action a is taken by the denoising agent with probability $\pi(a|s)$. Then, if the negative instance s is to be revised, the hierarchical policy goes on to the second step and gives the revised relation by selecting the most likely relation (excluding NA) predicted by the relation classifier: $\hat{r} = \arg\max_{r \neq \mathrm{NA}} O(r)$.
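The two-step decision can be sketched in a few lines, assuming the action probabilities π(a|s) and the relation classifier's output distribution are already available as dictionaries. The function name `hierarchical_decision` and the example probabilities are illustrative and do not come from the released code.

```python
import random
from typing import Dict, Optional, Tuple

ACTIONS = ("Keep", "Discard", "Revise")


def hierarchical_decision(
    action_probs: Dict[str, float],
    relation_probs: Dict[str, float],
    rng: random.Random,
) -> Tuple[str, Optional[str]]:
    """Return (action, label) for one negative instance.

    Step 1 samples an action from pi(a|s). Step 2 runs only for Revise and
    picks the most likely relation other than NA.
    """
    weights = [action_probs[a] for a in ACTIONS]
    action = rng.choices(ACTIONS, weights=weights, k=1)[0]
    if action == "Keep":
        return action, "NA"   # instance stays a negative
    if action == "Discard":
        return action, None   # instance is dropped from the cleaned set
    candidates = {r: p for r, p in relation_probs.items() if r != "NA"}
    return action, max(candidates, key=candidates.get)


rng = random.Random(0)
relation_probs = {"NA": 0.6, "Place of Birth": 0.3, "Employer": 0.1}
decision = hierarchical_decision(
    {"Keep": 0.0, "Discard": 0.0, "Revise": 1.0}, relation_probs, rng
)
print(decision)
```

Note that the revised relation is chosen even when NA has the highest classifier probability, because NA is excluded from the second step.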

Pretraining
Supervised pretraining (Qin, Xu, and Wang 2018b), commonly used to accelerate RL agent training, is easily performed for the relation classifier on the original DS dataset (Han et al. 2018). For the denoising agent, however, there is no available training data. Therefore, we propose a special transfer learning scheme that utilizes the knowledge learnt by the relation classifier (source domain) to help generate action labels for pretraining the denoising agent (target domain) (see Fig. 2).
First, we select the positives for which the pretrained relation classifier correctly predicts the relation, and tag these with Revise. This prepares the denoising agent to identify positive instances in the negative set in future training, and then pass these kinds of instances to the relation classifier to predict the correct positive relations for them. Similarly, we tag with Keep those negatives correctly predicted by the relation classifier. Lastly, for instances in which the relation classifier wrongly predicts their relation, we tag them with Discard, encouraging the denoising agent to discard such instances to avoid incorrect revisions.
In summary, our pretraining strategy is as follows:
1. Relation classifier pretraining: pretrain the relation classifier (RC) directly on the original training set with the categorical loss $ls_{RC} = \text{cross-entropy}(O, G)$, where G represents the distantly supervised relations in the training set. Then, fix the parameters of the relation classifier for the next step.
2. Label generation: generate the action labels H from the predictions of the relation classifier.
3. Denoising agent pretraining: supervise the denoising agent (DA) with the categorical loss $ls_{DA} = \text{cross-entropy}(\pi, H)$.
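The label-generation step can be written out directly from the tagging rules given above. This is a minimal sketch: the function name `pretraining_action_labels` is hypothetical, and relations are represented as plain strings.

```python
from typing import List


def pretraining_action_labels(ds_labels: List[str], rc_predictions: List[str]) -> List[str]:
    """Generate action labels H for pretraining the denoising agent.

    ds_labels: distantly supervised relations G of the training instances.
    rc_predictions: relations predicted by the pretrained relation classifier.
    """
    actions = []
    for gold, predicted in zip(ds_labels, rc_predictions):
        if predicted != gold:
            actions.append("Discard")  # classifier is wrong: avoid incorrect revisions
        elif gold == "NA":
            actions.append("Keep")     # correctly predicted negative
        else:
            actions.append("Revise")   # correctly predicted positive
    return actions


H = pretraining_action_labels(
    ["Place of Birth", "NA", "Employer", "NA"],
    ["Place of Birth", "NA", "NA", "Employer"],
)
print(H)
```

Because negatives usually outnumber positives, the generated labels are dominated by Keep, which biases the pretrained agent toward keeping instances.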

Co-Training
To combine the training of the relation classifier and the denoising agent, we propose the following co-training framework, which is executed in each epoch (see Fig. 1):
1. Denoising agent decision: At the beginning of each epoch, the denoising agent first executes the denoising policy on the dataset. For both training and validation sets, the policy keeps, discards, or revises NA instances.
2. Relation classifier revision: For instances to be revised, the relation classifier generates revision relations for them. Denoising yields the cleaned training and validation sets.
3. Relation classifier training: Given the cleaned training set, we train the relation classifier in a supervised fashion based on the categorical loss $ls_{RC} = \text{cross-entropy}(O, G')$, where $G'$ represents the modified training set, which contains all the positives and the kept or revised negatives. Note that discarded negatives are not included in $G'$.
4. Reward calculation: We evaluate the trained relation classifier on the cleaned validation set to obtain the F1 score, which we use as the reward R for denoising. As the validation set is cleaned by the denoising policy, R reflects the efficacy of the policy.
5. Denoising policy update: To maximize the reward R, we adopt policy gradient (Sutton et al. 2000) to optimize the denoising agent by maximizing the objective function $J(\theta) = (R - b) \sum_{a} \log p(a|\theta)$, where θ is the parameter of the denoising policy, p(a|θ) represents the softmax probability of the sampled determination or revision step, the sum runs over the sampled actions, and b is the baseline which mitigates the high variance of the REINFORCE algorithm (Williams 1992). We set b to the average reward of the previous five epochs.
For each epoch, we obtain the revised set from the original training/validation set via the denoising policy, and H-FND finds the best denoising policy adaptively between supervised training and reward maximization.

Figure 2: A special transfer learning scheme for H-FND pretraining. Symbols "P" and "N" represent positive and negative instances for relation classifier pretraining. Symbols "O" and "X" indicate the two sets of training instances which are correctly and wrongly predicted by the pretrained relation classifier, respectively.
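The baseline and the policy-gradient objective of the update step can be sketched as follows. The surrogate form (R − b) Σ log p(a|θ) is the standard REINFORCE-with-baseline objective and is assumed here rather than taken from the released code; the function names and the zero baseline for the first epoch are likewise assumptions.

```python
import math
from typing import Sequence


def moving_baseline(previous_rewards: Sequence[float], window: int = 5) -> float:
    """Average reward of the previous `window` epochs (0.0 if none exist yet)."""
    recent = list(previous_rewards)[-window:]
    return sum(recent) / len(recent) if recent else 0.0


def surrogate_objective(sampled_probs: Sequence[float], reward: float, baseline: float) -> float:
    """Assumed REINFORCE surrogate: (R - b) * sum of log-probabilities of sampled actions.

    Maximizing this raises the probability of the sampled actions when the
    reward beats the baseline and lowers it otherwise.
    """
    return (reward - baseline) * sum(math.log(p) for p in sampled_probs)


rewards_so_far = [0.5, 0.6, 0.7, 0.8, 0.9, 1.0]  # validation F1 of earlier epochs (toy values)
b = moving_baseline(rewards_so_far)
objective = surrogate_objective([0.5, 0.5], reward=0.9, baseline=b)
print(b, objective)
```

In an actual implementation the log-probabilities would be differentiable tensors and the negated objective would be passed to the optimizer; plain floats are used here only to show the arithmetic.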

Experiment
To quantify our model's performance on denoising false negatives, we first evaluated the proposed H-FND on the human-annotated datasets SemEval-2010 (Hendrickx et al. 2010) and TACRED (Zhang et al. 2017) with controlled FN ratios. We then evaluated H-FND on the DS dataset NYT10 (Riedel, Yao, and McCallum 2010) to assess its performance in a more realistic setting. Table 2 shows the statistics of each dataset used in the experiments. For more information on the datasets and the preprocessing procedure, please refer to the Appendix.
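The controlled-FN setting can be reproduced in spirit with a short sketch that turns a chosen fraction of positives into NA. The function name `inject_false_negatives`, the rounding rule, and the seeding are assumptions; the paper's exact sampling procedure may differ.

```python
import random
from typing import List, Tuple


def inject_false_negatives(labels: List[str], fn_ratio: float, seed: int = 0) -> Tuple[List[str], List[int]]:
    """Relabel a random fraction `fn_ratio` of the positive instances as NA.

    Returns the noisy labels and the sorted indices of the generated FN instances.
    Only training/validation labels should be passed in; test labels stay clean.
    """
    rng = random.Random(seed)
    positive_indices = [i for i, relation in enumerate(labels) if relation != "NA"]
    n_flip = round(fn_ratio * len(positive_indices))
    flipped = sorted(rng.sample(positive_indices, n_flip))
    noisy = list(labels)
    for i in flipped:
        noisy[i] = "NA"
    return noisy, flipped


clean = ["Cause-Effect"] * 10 + ["NA"] * 5
noisy, fn_indices = inject_false_negatives(clean, fn_ratio=0.5)
print(noisy.count("NA"), fn_indices)
```

Keeping the flipped indices makes it possible to separate true negatives from generated false negatives later, for example when analyzing which actions the agent takes on each group.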

Baselines and Experiment Settings
A simple baseline for H-FND was the original CNN and PCNN relation classifiers. To demonstrate the impact of FNs, we also included SelATT (Lin et al. 2016), an FP-noise-resistant model.
We further compared our H-FND framework with the following strong baselines: the FN denoising method IRMIE (Xu et al. 2013) and two other general-purpose denoising methods: co-teaching (Han et al. 2018) and cleanlab (Northcutt, Jiang, and Chuang 2019). Co-teaching is a general training method for deep neural networks to combat extremely noisy labels. It simultaneously maintains two networks (each with the same structure), each of which samples its small-loss instances, according to a given overall noise rate, as clean batches for its peer network to train on. Cleanlab is a state-of-the-art robust learning method which directly estimates the joint distribution of noisy observed labels and latent uncorrupted labels with a consistent estimator, filters out noisy instances based on this joint distribution, and trains the relation classifier on the cleaned dataset with the co-teaching method mentioned above. We use these denoising methods to train the base CNN and PCNN models on our simulated FN datasets. As the focus of this paper is on the FN problem, all the positives of the simulated FN datasets are kept error-free, and the H-FND framework assumes that no positives need to be changed. Hence, for a fair comparison, we kept the positive sets of the FN datasets unchanged for the two general-purpose denoising methods, preventing them from discarding error-free positives. Also, we fix the positive set of NYT10 to evaluate the applicability of H-FND to resolving the FN problem in a realistic setting.
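The small-loss selection at the heart of co-teaching, as described above, can be sketched as follows. This is a simplified illustration with a fixed keep rate and hypothetical names, not the co-teaching reference implementation, which among other things schedules the keep rate over epochs.

```python
from typing import List, Sequence


def small_loss_selection(losses: Sequence[float], noise_rate: float) -> List[int]:
    """Indices of the (1 - noise_rate) fraction of instances with the smallest loss."""
    n_keep = round((1.0 - noise_rate) * len(losses))
    ranked = sorted(range(len(losses)), key=lambda i: losses[i])
    return sorted(ranked[:n_keep])


# Per-instance losses of one mini-batch under each of the two networks (toy values).
losses_net_a = [0.1, 2.0, 0.3, 1.5]
losses_net_b = [0.2, 0.4, 1.8, 1.9]

# Each network chooses the instances its peer will be trained on.
batch_for_b = small_loss_selection(losses_net_a, noise_rate=0.5)
batch_for_a = small_loss_selection(losses_net_b, noise_rate=0.5)
print(batch_for_b, batch_for_a)
```

The dependence on `noise_rate` is the practical limitation discussed in the results: the value has to be supplied or estimated before training.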
In the experiments on SemEval and TACRED, every data point is the average of five independent runs. In the experiment on NYT10, some RL training runs were not stable, which might result from the excessive amount of FPs in NYT10. For a fair comparison, the included data points are the average of the three best results out of five independent runs for H-FND and the baselines. See the Appendix for more detailed information on the experiments and model implementation.

Quantitative Results
The quantitative SemEval results are shown in the upper part of Fig. 3, including both CNN and PCNN. Under the 50% FN ratio, for both the base CNN and PCNN models, with or without SelATT, the F1 scores are heavily influenced by FN sentences: the performance drops by nearly 20%. IRMIE and co-teaching improve the performance by more than 5% and 8%, respectively. Apart from cleanlab, H-FND remains competitive with the baselines for FN ratios from 10% to 30% and significantly outperforms them beyond 30%. Among all baselines, cleanlab's performance is the strongest and is competitive with our approach, but as cleanlab relies on a co-teaching model to train the relation classifier, a given noise rate is required. In our experiments, the noise rates are directly provided to the model. However, in practice, the noise rate (the FN ratio in our experiment) is unknown and must be estimated correctly, entailing extra effort. In contrast, H-FND has no such requirement.
The quantitative results on TACRED are shown in the lower part of Fig. 3. CNN, PCNN, and the two models with SelATT are all vulnerable to FN instances. As IRMIE fails to exclude enough FNs from the negative set on TACRED (the size of the RAW set is less than 10% of the original negative set under all FN ratios), its performance is also strongly influenced by FN instances. Although the F1 scores of H-FND are 2% behind co-teaching and cleanlab for FN ratios from 0% to 20%, it successfully maintains its performance when the FN ratio exceeds 30% and becomes competitive with these two baselines. This is similar to the experimental results on SemEval for FN ratios less than 30%. Given this, and the fact that TACRED has many more positives than SemEval, we extended the FN ratio on TACRED up to 90%. The result of this extended experiment shows that when the FN ratio exceeds 60%, the F1 scores for co-teaching drop significantly, whereas H-FND maintains a relatively high F1 score. Here, again, although cleanlab performs similarly to ours with the pre-defined FN ratios, the proposed approach needs no such information, which better fits the real-world circumstances of distantly supervised relation classification.
Fig. 4 shows the result of the ablation study, which justifies the effectiveness of the Revise action and the pretraining strategy. On SemEval, pretraining boosts the F1 score for the PCNN architecture for FN ratios from 10% to 40%, but yields no significant difference for the other ratios. On TACRED, however, the Revise action and the pretraining strategy clearly yield improved results. This improvement is particularly substantial for pretraining. As TACRED has more positive relation types and a much larger negative set, the FN denoising problem is more severe than on SemEval; thus the pretraining strategy is crucial to provide a better initial point for the denoising agent and to ensure more stable performance.

Detailed Analysis
We first analyzed the distribution of the denoising policy for TN and FN instances in the training set. Figure 5 shows the percentage of kept, discarded, or revised training instances. The left histogram under each filter ratio is for TN; the right is for FN.
On SemEval, we observe that for TN instances, H-FND mainly keeps them as NA and revises only a small portion to the wrong relation, even under the 50% filter ratio. For FN instances, H-FND prefers to discard or revise them. This difference shows that H-FND distinguishes FN instances from TN instances, and does not take arbitrary actions on them.
On TACRED, the policy distribution shares the same tendency, but the portion of kept instances is generally larger. This is due to a higher ratio of negative instances in TACRED. As more negative instances result in more Keep labels in the generated pretraining data, after pretraining, the probability of the model taking the Keep action is generally higher. This also explains why the portion of kept instances grows when the filter ratio is raised. Note that this prevents H-FND from revising too many instances at the beginning of co-training, making co-training more stable.
Table 3 shows the correctness of revisions on FN instances which are determined to be revised. The accuracy is around 90% for both CNN and PCNN architectures and for both SemEval and TACRED. This shows that H-FND accurately corrects FN instances once they are identified and determined to be revised in the first stage.
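The two statistics used in this analysis, the per-group action percentages and the revision accuracy, are simple to compute once the ground-truth relations are known. The helper names and toy inputs below are illustrative assumptions, and both helpers expect non-empty input.

```python
from collections import Counter
from typing import Dict, List, Tuple

ACTIONS = ("Keep", "Discard", "Revise")


def action_distribution(actions: List[str]) -> Dict[str, float]:
    """Percentage of Keep, Discard, and Revise actions within one group (TN or FN)."""
    counts = Counter(actions)
    return {a: 100.0 * counts[a] / len(actions) for a in ACTIONS}


def revision_accuracy(revisions: List[Tuple[str, str]]) -> float:
    """Fraction of revised FN instances whose new relation matches the true one.

    revisions: (revised_relation, true_relation) pairs for revised FN instances.
    """
    correct = sum(1 for revised, true in revisions if revised == true)
    return correct / len(revisions)


fn_actions = ["Keep", "Keep", "Revise", "Discard"]
fn_distribution = action_distribution(fn_actions)
accuracy = revision_accuracy([("A", "A"), ("B", "A"), ("C", "C"), ("D", "D")])
print(fn_distribution, accuracy)
```

Computing the distribution separately for TN and FN instances is what allows the comparison of the two histograms under each filter ratio.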

Results on Realistic Dataset
Lastly, we evaluated H-FND on NYT10 to gain an understanding of our framework's performance on real DS datasets. For baselines, apart from the base model, we included cleanlab, as it is the best-performing baseline in the controlled FN experiments. In the training set of NYT10, we conducted a human evaluation on 200 randomly sampled instances and arrived at an estimate of 14% FNs among the negative instances.
We followed Zeng et al. (2015) and plotted the precision-recall curve to demonstrate the result on NYT10 (see Fig. 6). At recall rates lower than 40%, cleanlab performs slightly worse than the base model, while H-FND remains competitive in terms of precision. This could be a result of inaccuracies in the estimation of the FN rate in the dataset. Since H-FND does not require a given FN rate, it is not encumbered by such estimation error. At higher recall rates (> 50%), H-FND retains significantly higher precision. This result shows that H-FND is applicable to real DS datasets, especially when the recall rate matters.
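A precision-recall curve of this kind is obtained by ranking predictions by confidence and sweeping a cutoff down the ranking. The sketch below assumes each prediction has already been judged correct or incorrect and that the total number of positive facts is known; the names and toy values are illustrative, not the evaluation script used in the paper.

```python
from typing import List, Tuple


def precision_recall_points(
    scored_predictions: List[Tuple[float, bool]], n_positive_facts: int
) -> List[Tuple[float, float]]:
    """Return (recall, precision) after each prediction in descending confidence order.

    scored_predictions: (confidence, is_correct) for every non-NA prediction.
    n_positive_facts: number of positive facts in the test set.
    """
    ranked = sorted(scored_predictions, key=lambda item: item[0], reverse=True)
    points = []
    n_correct = 0
    for k, (_, is_correct) in enumerate(ranked, start=1):
        n_correct += int(is_correct)
        points.append((n_correct / n_positive_facts, n_correct / k))
    return points


curve = precision_recall_points([(0.7, True), (0.9, True), (0.8, False)], n_positive_facts=2)
print(curve)
```

Reading precision at a fixed recall from such a list is how statements like "higher precision at recall above 50%" are derived from the curve.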

Conclusion and Future Work
In this work, to increase the robustness of distant supervision, we present H-FND, a hierarchical false-negative denoising framework, which keeps, discards, or revises non-relation (NA) inputs during the training and validation phases to suppress noise from FN instances and yield a clean dataset for relation classifiers to learn from. We also present a special transfer learning scheme for pretraining the denoising agent.
To investigate the effects of FN instances addressed by our approach, we generate FN instances from SemEval-2010 and TACRED by replacing relations of instances with NA under controlled ratios. The experimental results show that H-FND revises FN instances to the appropriate relations and facilitates robust relation extraction. A further experiment on NYT10 demonstrates that our framework is applicable to real-world DS denoising. This framework can be applied to tasks such as knowledge base enrichment, where a large corpus is aligned to an incomplete knowledge base.
For large distantly supervised corpora, FP and FN instances may emerge simultaneously, and both should be addressed for optimal results. This is a challenging but very practical setting, which we leave as future work. Also, we plan to try other advanced relation classification approaches such as R-BERT (Wu and He 2019) to replace CNN or PCNN in our architecture.
The source code will be released on https://github.com...

Appendices

Convolutional Neural Network
We use a convolutional neural network (CNN) (Nguyen and Grishman 2015) as our base model for both the denoising agent and the relation classifier. This architecture consists of four main layers (the first three layers compose the CNN encoder):
1. Embedding: The embedding layer transforms a word into a vector representation, which is a concatenation of a word embedding $V_w$ and a pair of positional embedding vectors $V_{p1}$, $V_{p2}$ (Lin et al. 2016). The word embedding $V_w$ is a vector that represents the semantics of a word, and the positional embedding pair $V_{p1}$, $V_{p2}$ consists of two vectors representing the relative distances from the current word to the two entities in the sentence. The final embedding vector V of dimension $d_e$ for each word is the concatenation of $V_w$, $V_{p1}$, and $V_{p2}$: $V = [V_w; V_{p1}; V_{p2}]$.
2. Convolution: The convolutional layer transforms the embedding vectors of words into local features by applying sliding filters over them. Each filter consists of a weight matrix $A_i \in \mathbb{R}^{f \times d_e}$ and a bias term $b_i \in \mathbb{R}$, to extract specific patterns in the embedding vectors. With h filters of length f, the entry in the feature map $C^f \in \mathbb{R}^{h \times (L-f+1)}$ for the i-th filter at position t is $C^f_{i,t} = A_i \cdot V_{t:t+f-1} + b_i$, where $V_{t:t+f-1} \in \mathbb{R}^{f \times d_e}$ stacks the embedding vectors of words t to t+f-1, the product denotes the sum of element-wise products, and L is the length of the input sentence. To capture information expressed in phrases of all lengths, we further use n different lengths of filters, and concatenate all $C^f$ under filter size f as the joint feature map $C \in \mathbb{R}^{nf \times d_e}$: $C = [C^{f_1}; C^{f_2}; \ldots; C^{f_n}]$.
3. Max pooling: The max pooling layer captures the most significant feature into the pooling feature $P_i$ by selecting the highest value in the feature map extracted by the i-th filter $C_i$ over all positions: $P_i = \max_t C_{i,t}$. PCNN (Zeng et al. 2015) involves piecewise max pooling, which better suits the relation extraction task. It divides an input sentence into three segments based on the two selected entities, and then extracts features from all three segments to capture fine-grained features for relation extraction. For PCNN, the extracted feature is $P_i = [\max(C_{i1}); \max(C_{i2}); \max(C_{i3})]$, where $C_{i1}$, $C_{i2}$, and $C_{i3}$ are the three feature map segments separated by the two selected entities. We also view P as the sentence feature, as it represents the essential features of the whole sentence.
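The difference between ordinary and piecewise max pooling can be seen on a single row of a feature map. In this sketch the segment boundaries (each entity position closes its segment) and the 0.0 fallback for an empty segment are assumptions, since the text above does not fix these conventions.

```python
from typing import List, Sequence


def max_pool(feature_row: Sequence[float]) -> float:
    """Ordinary max pooling: the largest activation of one filter over all positions."""
    return max(feature_row)


def piecewise_max_pool(feature_row: Sequence[float], head_pos: int, tail_pos: int) -> List[float]:
    """Piecewise max pooling: one maximum per segment delimited by the two entities."""
    first, second = sorted((head_pos, tail_pos))
    segments = [
        feature_row[: first + 1],            # up to and including the first entity
        feature_row[first + 1 : second + 1],  # between the entities, including the second
        feature_row[second + 1 :],            # after the second entity
    ]
    # Assumed convention: an empty segment contributes 0.0.
    return [max(segment) if len(segment) > 0 else 0.0 for segment in segments]


row = [0.1, 0.9, 0.3, 0.7, 0.2, 0.5]
pooled = max_pool(row)
piecewise = piecewise_max_pool(row, head_pos=1, tail_pos=3)
print(pooled, piecewise)
```

Ordinary pooling keeps only the single strongest activation, whereas the piecewise variant retains one value per segment and therefore preserves where, relative to the entities, a pattern fired.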

Datasets
1. Human-Annotated Datasets: SemEval-2010 contains nine relations with an additional NA as a non-relation, and the number of instances for each relation is roughly equal. TACRED is about 10 times larger than SemEval; it has 42 relations including NA, and negative instances account for 80% of the entire corpus. For SemEval, we used 10% of the training set for validation, and for TACRED we simply used the dev set as the validation set (see Table 2). We filtered out the training and validation instances whose relation triples appeared in the testing set, to eliminate any overlap between relation triples in the training, validation, and testing sets and thereby simulate the held-out evaluation setting in distant supervision (Mintz et al. 2009).
To simulate FN conditions, we randomly turned a fraction (10%-50%) of the training/validation positives into negatives; we refer to this fraction as the FN ratio or filter ratio. Note that this filtering process was applied only to the training/validation sets: the testing sets were well-labeled under all FN ratios. Also note that the models were not aware in advance which sentences were TN and which were FN.
2. Distantly Supervised Dataset: The NYT10 dataset uses Freebase as the knowledge base for distant supervision. The relations are extracted from a December 2009 snapshot of Freebase. Four categories of Freebase relations are used: "people", "business", "person", and "location". These types of relations are chosen because they appear frequently in the newswire corpus. All pairs of Freebase entities that are mentioned at least once in the same sentence are chosen as candidate relation instances. For consistency with previous research (Lin et al. 2016; Feng et al. 2018; Qin, Xu, and

Implementation
H-FND was implemented with PyTorch 1.6.0 (Adam et al. 2017) in Python 3.6.9. In our implementation, we used pretrained word embeddings provided by SpaCy (Honnibal and Johnson 2015) as the fixed word embeddings ($d_w = 300$).
The positional embedding ($d_p = 50$) was randomly initialized and then trained together with the subsequent network layers; therefore the overall dimension of the embedding vector is $d_e = d_w + 2d_p = 400$. In the convolutional layer, we applied four different sizes of filters ($f \in \{2, 3, 4, 5\}$) and set all of their feature sizes to h = 230. Both CNN and PCNN architectures were implemented. The total number of trainable parameters of each model is listed in Table 4. To prevent overfitting, we inserted dropout layers with a dropout rate of 0.5 before the convolutional layer and after the max pooling layer. We trained H-FND using the Adam optimizer (Kingma and Ba 2015). In addition, we used mini-batches (batch size 256) only when training the relation classifier; the prediction of the relation classifier and both the decision and policy gradient of the denoising agent were executed per epoch. Last, the revised result of H-FND in each epoch was used by the classifier only in the same epoch and did not accumulate over epochs, which means that at the beginning of each epoch, H-FND applied the denoising policy to the original dataset and not to the revised dataset of the previous epoch.
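The dimensions that follow from these hyperparameters can be checked with a few lines of arithmetic. The pooled feature sizes are derived here under the assumption that the pooled outputs of all filter sizes are concatenated, and that PCNN keeps three values per filter; neither size is stated explicitly in the text.

```python
from typing import Dict

D_W = 50 * 6                  # word embedding dimension d_w = 300
D_P = 50                      # positional embedding dimension d_p
FILTER_SIZES = (2, 3, 4, 5)   # filter lengths f
H = 230                       # number of filters per filter length

d_e = D_W + 2 * D_P           # overall embedding dimension


def feature_map_widths(sentence_length: int) -> Dict[int, int]:
    """Number of positions L - f + 1 covered by each filter length."""
    return {f: sentence_length - f + 1 for f in FILTER_SIZES}


# Assumed: pooled features of all filter lengths are concatenated.
cnn_feature_dim = len(FILTER_SIZES) * H
# Assumed: piecewise pooling keeps three values per filter.
pcnn_feature_dim = 3 * cnn_feature_dim

print(d_e, feature_map_widths(40), cnn_feature_dim, pcnn_feature_dim)
```

For a 40-word sentence, for example, the filters of length 2 produce 39 positions each, while the filters of length 5 produce 36.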
We list in Table 5 the learning rates for the base CNN and PCNN relation classifiers (RC), for RC with SelATT, and for RC with the denoising agent (DA) under the pretraining and co-training phases. The learning rate of RC is selected from {1e-4, 3e-4, 1e-3, 3e-3, 1e-2}, with the F1 score on the noise-free versions of SemEval and TACRED as the selection criterion. Except for SelATT and DA co-training, the learning rates for the other models are the same as the learning rate of the base RC. For SelATT, the learning rate is selected from {1e-6, 3e-6, 1e-5, 3e-5, 1e-4}, also with the F1 score on the noise-free versions of the two datasets as the selection criterion. For DA co-training, the learning rate is selected from {1e-6, 3e-6, 1e-5, 3e-5, 1e-4}, with the F1 score on SemEval and TACRED under a 50% FN ratio as the selection criterion.
The RCs of all methods are trained to convergence with validation-based early stopping. Specifically, we train all the models for 150 epochs on SemEval and for 200 epochs on TACRED. For NYT10, we trained all the models for 30 epochs.
The pretraining of H-FND trains the RC and DA for 5 and 20 epochs, respectively. We select these pretraining periods by the criterion that the two models achieve about 80% of the performance of the converged ones. By this means, we prevent H-FND from overfitting the noisy labels (Han et al. 2018) and initialize H-FND with good parameters for co-training.
All the implemented models are trained on an NVIDIA GTX 1080 Ti and an Intel(R) Xeon(R) Silver 4110 CPU, with 12 GB GPU memory, 128 GB RAM, a clock rate of 2.10 GHz, and Linux as the operating system. The expected running time for each model on each dataset is listed in Table 6.

Performance on Validation Set
The F1 scores of each model on the validation sets of SemEval and TACRED are provided in Figures 7 and 8. Note that the validation sets are noisy in our experiment, so the performance on validation sets does not fully reflect the robustness of each model. Also, in IRMIE and H-FND, the validation sets are modified, so their validation F1 scores can only be compared with their own across different FN ratios. For more accurate performance measurement, please refer to Figures 3 and 4, whose F1 scores are measured on noise-free testing sets.

Denoising policy with Standard Deviations
On SemEval and TACRED, the denoising policy distributions with standard deviations are provided in Table 7.