Class-Adaptive Self-Training for Relation Extraction with Incompletely Annotated Training Data

Relation extraction (RE) aims to extract relations from sentences and documents. Existing relation extraction models typically rely on supervised machine learning. However, recent studies showed that many RE datasets are incompletely annotated. This is known as the false negative problem, in which valid relations are falsely annotated as 'no_relation'. Models trained with such data inevitably make similar mistakes during inference. Self-training has been proven effective in alleviating the false negative problem. However, traditional self-training is vulnerable to confirmation bias and exhibits poor performance on minority classes. To overcome this limitation, we proposed a novel class-adaptive re-sampling self-training framework. Specifically, we re-sampled the pseudo-labels for each class based on their precision and recall scores. Our re-sampling strategy favored the pseudo-labels of classes with high precision and low recall, which improved the overall recall without significantly compromising precision. We conducted experiments on document-level and biomedical relation extraction datasets, and the results showed that our proposed self-training framework consistently outperforms existing competitive methods on the Re-DocRED and ChemDisGene datasets when the training data are incompletely annotated. Our code is released at https://github.com/DAMO-NLP-SG/CAST.


Introduction
Relation extraction (RE) (Wang et al., 2019; Chia et al., 2022a) is an important yet highly challenging task in the field of information extraction (IE). Compared with other IE tasks, such as named entity recognition (NER) (Xu et al., 2021), semantic role labeling (SRL) (Li et al., 2021), and aspect-based sentiment analysis (ABSA) (Li et al., 2018; Zhang et al., 2021b), RE typically has a significantly larger label space and requires graphical reasoning (Christopoulou et al., 2019). The complexity of the RE task inevitably increases the difficulty and cost of producing high-quality benchmark datasets for this task.
In recent years, several works that specifically focus on revising the annotation strategy and quality of existing RE datasets were conducted (Stoica et al., 2021; Alt et al., 2020; Tan et al., 2022b). For example, the DocRED (Yao et al., 2019) dataset is one of the most popular benchmarks for document-level relation extraction. This dataset is produced by the recommend-revise scheme with machine recommendation and human annotation. However, Huang et al. (2022) and Tan et al. (2022b) pointed out the false negative problem in the DocRED dataset, indicating that over 60% of the relation triples are not annotated. To provide a more reliable evaluation dataset for document-level relation extraction tasks, Huang et al. (2022) re-annotated 96 documents that are selected from the original development set of DocRED. In addition, Tan et al. (2022b) developed the Re-DocRED dataset to provide a high-quality revised version of the development set of DocRED. The Re-DocRED dataset consists of a development set that contains 1,000 documents and a silver-quality training set that contains 3,053 documents. Nevertheless, both works on DocRED revision did not provide gold-quality datasets due to the high cost of annotating the relation triples for long documents. Learning from incompletely annotated training data is crucial and practical for relation extraction. Hence, in this work, we focused on improving the training process with incompletely annotated training data.
To tackle the problem of training with incompletely annotated datasets, prior works leveraged the self-training method to alleviate the detrimental effects of false negative examples (Feng et al., 2018; Hu et al., 2021; Chen et al., 2021). However, self-training-based methods are highly susceptible to confirmation bias; that is, erroneously predicted pseudo-labels are likely to deteriorate the model's performance in subsequent rounds of training (Arazo et al., 2020; Tarvainen and Valpola, 2017; Li et al., 2020a). Furthermore, the label distribution of the relation extraction task is highly imbalanced. Therefore, the predictions made by prior self-training methods are likely to be of the majority classes. Wei et al. (2021) proposed a re-sampling strategy based on class frequencies to alleviate this problem in image classification. In this way, not all generated pseudo-labels will be used to update the training datasets. The pseudo-labels of the minority classes have higher probabilities of being preserved than those of the frequent classes. However, such a sampling strategy does not specifically address the problems caused by erroneously generated pseudo-labels. When a model is trained on incompletely annotated datasets, minority classes exhibit poor performance and frequent classes may have low recall scores, as shown in Figure 1. Merging pseudo-labels with the original labels of the training dataset without considering the correctness of the former potentially deteriorates performance in subsequent iterations.
In order to overcome confirmation bias in self-training, we proposed a class-adaptive self-training (CAST) approach that considers the correctness of the pseudo-labels. Instead of sampling the pseudo-labels based on class frequencies, we introduced a class-adaptive sampling strategy to determine how the generated pseudo-labels should be preserved. Specifically, we calculated the precision and recall scores of each class on the development set and used the calculated scores to compute the sampling probability of each class. Through such an approach, CAST can alleviate confirmation bias caused by erroneous pseudo-labels. Our proposed approach preserves the pseudo-labels from classes that have high precision and low recall scores and penalizes the sampling probability for the pseudo-labels that belong to classes with high recall but low precision scores.
Our contributions are summarized as follows.
(1) We proposed CAST, an approach that considers the correctness of generated pseudo-labels to alleviate confirmation bias in the self-training framework. (2) Our approach was evaluated with training datasets of different quality, and the experimental results demonstrated its effectiveness. (3) Although our approach is not specifically designed to favor the minority classes, the minority classes showed more significant performance improvements than the frequent classes, which is a nice property because long-tail performance is a common bottleneck in real applications.

Related Work
Neural Relation Extraction Deep neural models are successful in sentence-level and document-level relation extraction. Zhang et al. (2017) proposed position-aware attention to improve sentence-level RE and published TACRED, which became a widely used RE dataset. Yamada et al. (2020) developed LUKE, which further improved the SOTA performance with entity pre-training and entity-aware attention. Chia et al. (2022b) proposed a data generation framework for zero-shot relation extraction. However, most relations in real-world data can only be extracted based on inter-sentence information. To extract relations across sentence boundaries, recent studies began to explore document-level RE. As previously mentioned, Yao et al. (2019) proposed the popular benchmark dataset DocRED for document-level RE. Zeng et al. (2020) leveraged a double-graph network to model the entities and relations within a document. To address the multi-label problem of DocRE, Zhou et al. (2021) proposed using adaptive thresholds to extract all relations of a given entity pair. Zhang et al. (2021a) developed the DocUNET model to reformulate document-level RE as a semantic segmentation task and used a U-shaped network architecture to improve the performance of DocRE. Tan et al. (2022a) proposed using knowledge distillation and focal loss to denoise the distantly supervised data for DocRE and achieved great performance on the DocRED leaderboard. However, all preceding methods are based on a closed-world assumption (i.e., the entity pairs without relation annotation are negative instances). This assumption ignores the presence of false negative examples. Hence, even the above-mentioned state-of-the-art methods may not perform well when the training data are incompletely annotated.

Self-Training Self-training has long been applied to relation extraction with limited supervision (Erkan et al., 2007; Sun et al., 2011; Chen et al., 2021; Hu et al., 2021). However, self-training is susceptible to confirmation bias; conventional self-training suffers from error propagation and makes overwhelming predictions for frequent classes. Prior research on semi-supervised image classification (Wei et al., 2021; He et al., 2021) indicated that re-sampling of pseudo-labels can be beneficial to class-imbalanced self-training. However, existing re-sampling strategies depend only on the frequencies of the classes and do not consider the actual performance of each class. Our method alleviates confirmation bias by employing a novel re-sampling strategy that considers the precision and recall of each class on the development set. In this way, we can downsample the predictions for popular classes and maintain high-quality predictions for long-tail classes.

Problem Definition
Document-level relation extraction (DocRE) is defined as follows: given a text T and a set of n entities {e_1, ..., e_n} appearing in the text, the objective of document-level RE is to identify the relation type r ∈ C ∪ {no_relation} for each entity pair (e_i, e_j). Note that e_i and e_j denote two different entities, and C is a predefined set of relation classes. The complexity of this task is quadratic in the number of entities, and the ratio of NA instances (no_relation) is very high compared with sentence-level RE. Therefore, the resulting annotated datasets are often incomplete. The setting of this work is to train a document-level RE model with an incompletely labeled training set and then evaluate the model on a clean evaluation dataset, such as Re-DocRED (Tan et al., 2022b).
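To make the quadratic complexity concrete, the candidate set a DocRE model must classify can be enumerated as follows (a small illustration; `candidate_pairs` is our helper name, not from the paper):

```python
from itertools import permutations

def candidate_pairs(entities):
    """All ordered pairs (e_i, e_j) with i != j; each must be assigned
    a label from C or 'no_relation', so candidates grow as n * (n - 1)."""
    return list(permutations(entities, 2))

# A document with 10 entities already yields 90 ordered candidate pairs,
# the vast majority of which are NA instances.
```

For n entities there are n(n − 1) ordered candidates, which is why the NA ratio is so much higher than in sentence-level RE and why annotators inevitably miss valid triples.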
We denote the training set as S_T and the development set as S_D. Two types of training data are used in this work, each representing a different annotation quality. The first type is the training split of the original DocRED data (Yao et al., 2019), which we refer to as bronze-quality training data. This data is obtained by a recommend-revise scheme. Even though the annotation at this bronze level is precise, a significant number of triples are missing from this dataset. On the other hand, the training set of the Re-DocRED dataset adds a considerable number of triples to the bronze dataset, though a small number of triples might still be missed. We refer to this Re-DocRED training set as silver-quality training data.

Overview
The main objective of our approach is to tackle the RE problem when the training data S_T is incompletely annotated. We propose a class-adaptive self-training (CAST) framework, as shown in Figure 2.

Self-Training
In traditional self-training, models are trained on a small amount of well-annotated data, and pseudo-labels are generated on unlabeled instances (Zhu and Goldberg, 2009). However, we do not have access to well-annotated training data, and our training data contains false negative examples. Therefore, we construct an N-fold cross-validation self-training system. Given a set of training documents S_T with relation triple annotations, these documents are divided into N folds. The first N − 1 folds are used to train an RE model. Then, the trained model is used to generate pseudo-labels for the held-out N-th fold. The pseudo-labels are merged with the original labels, and the merged data is used to train a new model. The N-fold pseudo-labeling process is repeated for multiple rounds until no performance improvement is observed for the final RE system. However, because the class distribution of the document-level RE task is highly imbalanced, pseudo-labeling may favor the popular classes during prediction. This inevitably introduces a large confirmation bias toward popular classes, similar to the "rich-get-richer" phenomenon (Cho and Roy, 2004).
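The loop described above can be sketched as follows; `train_fn` and `predict_fn` are hypothetical stand-ins for training and running the backbone RE model (neither name is from the paper), and each document is assumed to carry a mutable set of labels:

```python
import random

def nfold_self_train(documents, train_fn, predict_fn, n_folds=5, n_rounds=5, seed=0):
    """N-fold cross-validation self-training: in each round, every fold is
    pseudo-labeled by a model trained on the other N - 1 folds, and the
    pseudo-labels are merged into the (incomplete) original labels."""
    rng = random.Random(seed)
    data = list(documents)
    for _ in range(n_rounds):
        rng.shuffle(data)
        folds = [data[i::n_folds] for i in range(n_folds)]
        for k, held_out in enumerate(folds):
            train_docs = [d for j, fold in enumerate(folds) if j != k for d in fold]
            model = train_fn(train_docs)
            for doc in held_out:
                doc["labels"] |= predict_fn(model, doc)  # merge pseudo-labels
    return data
```

In this plain form every generated pseudo-label is kept, which is exactly what lets the frequent classes dominate over rounds; CAST replaces the unconditional merge with class-adaptive re-sampling.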

Intuition
When the annotation of the training set is incomplete, the model trained on such data typically shows high precision and low recall for most classes. Figure 1 shows the precision and recall of each class for a model trained on the DocRED dataset and evaluated on the development set of Re-DocRED. Among the 96 classes, most obtain higher precision scores than recall scores. Only one class has a higher recall score than precision score, and some classes have precision and recall scores of 0. Given this empirical observation, boosting self-training performance by sampling more pseudo-labeled examples from the classes with high precision and low recall is a good strategy because (1) the pseudo-labels of such classes tend to have better quality and (2) the recall of these classes can be improved by adding true positive examples. For extreme cases in which all predictions for a class are wrong (i.e., its precision and recall are both 0), the logical action is to discard the corresponding pseudo-labels.
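The per-class scores behind this observation can be computed directly from predicted and gold triples. A minimal sketch, assuming triples are represented as (head, tail, relation) tuples (the helper name is ours):

```python
from collections import Counter

def per_class_pr(gold, pred):
    """Per-class precision and recall over sets of (head, tail, relation)
    triples; a class with no correct prediction gets 0 for both scores."""
    tp = Counter(r for (_, _, r) in pred & gold)  # true positives per class
    p_cnt = Counter(r for (_, _, r) in pred)      # predictions per class
    g_cnt = Counter(r for (_, _, r) in gold)      # gold triples per class
    classes = set(p_cnt) | set(g_cnt)
    return {r: (tp[r] / p_cnt[r] if p_cnt[r] else 0.0,
                tp[r] / g_cnt[r] if g_cnt[r] else 0.0) for r in classes}
```

These (precision, recall) pairs per class are exactly the inputs the class-adaptive sampling probability consumes.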

Class-Adaptive Self-Training (CAST)
As previously mentioned, traditional self-training suffers from confirmation bias, especially for RE task that has a highly imbalanced class distribution.The pseudo-labels that are generated by such an approach tend to be biased toward the majority classes.To alleviate this problem, we propose a class-adaptive self-training framework that filters the pseudo-labels by the per-class performance.Unlike existing self-training re-sampling techniques (Wei et al., 2021;He et al., 2021) that take only the class frequencies into account, our framework samples pseudo-labels based on their performance on the development sets.
First, we evaluate the pseudo-labeling model on the development set S_D and calculate the precision P and recall R for each class. Then, we define the sampling probability µ_i for each relation class i as:

µ_i = (P_i · (1 − R_i))^β,  (1)

where P_i and R_i are the precision and recall scores of class i, respectively, and β is a hyper-parameter that controls the smoothness of the sampling rates. Note that all pseudo-labels of a class are used when its sampling probability equals 1; conversely, all pseudo-labels of the class are discarded when the sampling probability equals 0. If the recall of a specific class is very small and its precision is close to 1, the sampling rate of the class will be close to 1. On the contrary, if the recall of a certain class is high, the sampling rate of the class will be low. In this way, our method is able to alleviate confirmation bias toward the popular classes, which typically have higher recall. The pseudocode of our proposed CAST framework is provided in Algorithm 1.
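The re-sampling step can be sketched as follows. The closed form µ_i = (P_i · (1 − R_i))^β used below is our reading of Eq. 1, reconstructed from the stated properties (rate near 1 for high-precision/low-recall classes, 0 when both scores are 0, low for high-recall classes, and 1 for every class when β = 0):

```python
import random

def sampling_rate(precision, recall, beta):
    """Class-adaptive sampling probability: favors high-precision,
    low-recall classes; 0.0 when both scores are 0 (discard), and 1.0
    for every class when beta == 0 (reduces to vanilla self-training)."""
    return (precision * (1.0 - recall)) ** beta

def resample_pseudo_labels(pseudo_labels, class_pr, beta, seed=0):
    """Keep each (head, tail, rel) pseudo-label with probability mu_rel;
    class_pr maps each relation class to its (precision, recall) on S_D."""
    rng = random.Random(seed)
    return [(h, t, r) for (h, t, r) in pseudo_labels
            if rng.random() < sampling_rate(*class_pr.get(r, (0.0, 0.0)), beta)]
```

Classes absent from `class_pr` default to (0, 0) and are therefore discarded, matching the extreme case discussed in the intuition.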

Experimental Setup
Our proposed CAST framework can be applied with any backbone RE model.For the experiment on DocRED, we adopted the ATLOP (Zhou et al., 2021) model as the backbone model, which is a well-established baseline for the DocRE task.
We used the PubMedBERT (Gu et al., 2021) encoder for the BioRE experiments. We use the development set of Re-DocRED in the document-level RE experiments because of its high annotation quality. Moreover, we use the distantly-supervised development set of ChemDisGene for the BioRE experiments. Our final models are evaluated on the test sets of Re-DocRED and ChemDisGene. Both test sets are human-annotated and of high quality; the statistics of the datasets can be found in Table 1.
For the hyper-parameters, we set M = 5 (i.e., the number of iteration rounds in Algorithm 1) and N = 5 for the self-training-based methods because these methods typically reach their highest performance before the fifth round and five-fold training is the conventional practice for cross-validation. For β, we grid-searched over β ∈ {0.0, 0.25, 0.5, 0.75, 1}. For evaluation, we used the micro-averaged F1 score as the evaluation metric. We also evaluate the F1 score for frequent classes and long-tail classes, denoted as Freq_F1 and LT_F1, respectively. For the DocRED dataset, the frequent classes include the top 10 most popular relation types in the label space; the rest of the classes are categorized as long-tail classes. Following Yao et al. (2019), we use an additional metric, Ign_F1, on the DocRE task. This metric calculates the F1 score for the triples that do not appear in the training data.
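The Freq_F1 / LT_F1 split reduces to computing micro-F1 over two disjoint subsets of triples. A minimal sketch (the set of frequent relation types is passed in; the helper names are ours):

```python
def micro_f1(gold, pred):
    """Micro-averaged F1 over sets of (head, tail, relation) triples."""
    tp = len(gold & pred)
    p = tp / len(pred) if pred else 0.0
    r = tp / len(gold) if gold else 0.0
    return 2 * p * r / (p + r) if p + r else 0.0

def freq_lt_f1(gold, pred, frequent_types):
    """Micro-F1 restricted to frequent vs. long-tail relation types."""
    part = lambda s, keep: {t for t in s if (t[2] in frequent_types) == keep}
    return (micro_f1(part(gold, True), part(pred, True)),
            micro_f1(part(gold, False), part(pred, False)))
```

Because the split is by relation type rather than by document, a single document can contribute triples to both Freq_F1 and LT_F1.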

Baselines
Vanilla Baselines This approach trains existing state-of-the-art RE models on incompletely annotated data and serves as our baseline method. As stated earlier, we use ATLOP as the backbone model for the DocRE experiments. In addition to ATLOP, we compare GAIN (Zeng et al., 2020), DocuNET (Zhang et al., 2021a), and KD-DocRE (Tan et al., 2022a) as our vanilla baselines. These methods are top-performing methods on the Re-DocRED dataset. However, similar to ATLOP, the performance of these models deteriorates significantly under the incomplete annotation setting.
Negative Sampling (NS) (Li et al., 2020b) This method tackles the incomplete annotation problem through negative sampling. To alleviate the detrimental effects of false negatives, it randomly selects only a subset of the negative samples for training.

Vanilla Self-Training (VST) (Peng et al., 2019; Jie et al., 2019) VST is a variant of simple self-training. In this approach, models are trained with N folds, and all pseudo-labels are directly combined with the original labels. Then, a new model is trained on the dataset with the combined labels.
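The core idea of the NS baseline can be sketched in a few lines (a simplified illustration, not the exact procedure of Li et al. (2020b); the helper name and the uniform Bernoulli sampling are our assumptions, and γ is the sampling rate reported in the appendix on hyper-parameters):

```python
import random

def sample_negatives(all_pairs, positive_pairs, gamma, seed=0):
    """Keep only a fraction gamma of the unlabeled entity pairs as
    'no_relation' training instances, so that unannotated true relations
    are less likely to enter training as false negatives."""
    rng = random.Random(seed)
    negatives = (p for p in all_pairs if p not in positive_pairs)
    return [p for p in negatives if rng.random() < gamma]
```

Lowering γ trades supervision signal for robustness: fewer negatives are seen, but fewer of them are mislabeled true relations.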
Class Re-balancing Self-Training (CREST) (Wei et al., 2021) This algorithm, the strongest existing baseline for class-imbalanced semi-supervised training, re-samples the pseudo-labels generated by the models. However, its sampling strategy considers only the frequencies of the training samples, whereas our CAST considers the per-class performance on the development set.
SSR Positive Unlabeled Learning (SSR-PU) (Wang et al., 2022) This method applies a positive-unlabeled learning algorithm for DocRE under the incomplete annotation scenario. SSR-PU utilizes a shift-and-squared ranking (SSR) loss to accommodate the distribution shifts of the unlabeled examples.
BioRE Baselines For the BioRE experiments, we compare our methods with the Biaffine Relation Attention Network (BRAN) (Verga et al., 2018) and PubMedBERT (Gu et al., 2021), a pre-trained language model in the biomedical domain.

This finding can be ascribed to the low recall score of this method, as shown in Figure 1. NS significantly improves performance compared with the baseline. Comparing vanilla self-training with the baseline, we observe that although its recall score is the highest, its precision is significantly reduced. We observe similar trends for all self-training-based methods (i.e., VST, CREST, and CAST): recall improves at the expense of precision. Notably, the performance of the simple NS baseline exceeds that of SSR-PU when trained on the DocRED data. Our proposed CAST framework consistently outperforms the competitive baselines and achieves the highest performance with both BERT and RoBERTa encoders. Our best-performing model outperforms the baseline by 16.0 F1 (49.32 vs. 65.32). Moreover, CAST obtains the highest precision score among the three self-training methods, showing that the examples added by our class-adaptive sampling strategy have better quality.

Experimental Results
The experimental results on the test set of Re-DocRED (Table 3) show that the baseline F1 score is significantly improved, owing to the large gain in recall, when the training data are switched from bronze quality to silver quality. Compared with the baseline approaches, our CAST achieves consistent improvements in F1 score. The F1 difference between the baseline and our CAST is 2.06 (72.61 vs. 74.67). However, the performance gap between our approach and the baseline is smaller than the corresponding gap when both are trained with DocRED. This indicates that the performance of existing state-of-the-art models for document-level RE is decent when high-quality training data is provided but declines when the training data are incompletely annotated. This finding verifies the necessity of developing better self-training techniques because preparing high-quality training data is costly.
Table 4 presents the experiments on biomedical RE. Our CAST model consistently outperforms strong baselines, exceeding the performance of SSR-PU by 5.47 F1 (54.03 vs. 48.56).
On the basis of the DocRE and BioRE results, self-training-based methods improve recall and consistently improve overall performance when the training data are incompletely annotated. However, our CAST maintains a better balance between increasing recall and maintaining precision. Figure 3b shows that all self-training-based methods generally improve their recall scores as the number of self-training rounds increases. On the contrary, the precision scores decline. From Figure 3c, we observe that VST outperforms CREST and CAST in the first two rounds. This is mainly because VST does not re-sample the pseudo-labels and utilizes all of them. At the beginning, these labels are of relatively good quality. However, the performance of VST drops after the second round of pseudo-labeling because, as the number of rounds increases, the growth in the number of false positive examples among the pseudo-labels outweighs the benefit. Meanwhile, the performance gains of CREST and CAST are relatively stable, and both methods produce their best-performing models at round 4. Compared with CREST, our CAST maintains higher precision scores as the number of rounds increases (Figure 3a). We also assess the F1 performance of the frequent and long-tail classes with respect to the number of rounds, and the comparison is shown in Figure 5. The results reveal that VST suffers greatly from confirmation bias on both frequent and LT classes (Figure 5a and Figure 5b), and its performance becomes very poor in round 5. In Figure 5b, we can see that the performance gains of CAST are stable across the training rounds and that CAST achieves the best LT performance.

Detailed Analysis of CAST
In this section, we analyze the performance of our CAST framework in detail. We first plot the precision and recall scores of VST and CAST for all classes in Figure 4, where the experimental results are obtained by training on the DocRED dataset. The formulation of Figure 4 is the same as that of Figure 1. Figure 4a demonstrates that VST significantly improves the recall scores of many classes compared with the baseline in Figure 1. However, the improvements in recall are accompanied by a large decline in precision. This observation shows that the pseudo-labels in VST contain a considerable number of erroneous predictions. By contrast, our CAST framework better maintains the precision scores for most classes, while the recall scores for most classes remain significantly higher than those of the baseline. This observation explains the improvements in overall F1 in Table 2 despite CAST having lower recall than VST.

Effect of β
We further analyze the effect of the sampling coefficient β on our CAST framework in Figure 6; the experiments are conducted by training on the DocRED dataset. When β is small, CAST behaves like the VST model: it exhibits some F1 improvements in the first few rounds but demonstrates diminishing positive effects in later rounds. A larger β leads to better overall improvements and smaller fluctuations across rounds. However, because the base term in Eq. 1 is smaller than 1, a higher β leads to lower sampling rates for all classes. As a result, the convergence time of self-training may be longer. The interpretation of other values of β is provided in Appendix C.
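As a concrete illustration (again assuming Eq. 1 takes the multiplicative form µ_i = (P_i · (1 − R_i))^β, our reading of the properties stated in the method section), a class with P = 0.9 and R = 0.4 has a base of 0.54, so its sampling rate decreases monotonically in β:

```python
# Sampling rate of one hypothetical class across the grid-searched betas.
base = 0.9 * (1 - 0.4)  # P_i * (1 - R_i) = 0.54 < 1
rates = {b: base ** b for b in (0.0, 0.25, 0.5, 0.75, 1.0)}
# beta = 0 keeps every pseudo-label (rate 1.0);
# beta = 1 keeps about 54% of this class's pseudo-labels in expectation.
```

This is why β = 0 reduces CAST to VST, while larger β values filter more aggressively and can slow convergence.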

Conclusions and Future Work
In this work, we study the under-explored problem of learning from incomplete annotation in relation extraction.This problem is highly important in real-world applications.We show that existing state-of-the-art models suffer in this scenario.To tackle this problem, we proposed a novel CAST framework.We conducted experiments on DocRE and BioRE tasks, and experimental results show that our method consistently outperforms competitive baselines on both tasks.For future work, we plan to extend our framework to the distant supervision scenario.From the domain perspective, we plan to apply our framework to image classification tasks.

Limitations
The proposed CAST framework carries the same limitation as other self-training-based methods: the requirement for multiple rounds and multiple splits of training. As a result, the GPU computing hours of CAST are longer than those of the vanilla baselines and NS.

A Sentence-Level RE Experiments

The experimental results on SentRE are shown in Table 5. For the TACRED dataset, the top 5 classes are included in the frequent classes. We can see that when training with bronze-quality data (i.e., the upper section), our proposed CAST still achieves the best performance in terms of F1 score. This observation shows that our method is effective across different relation extraction scenarios and backbone models. On the other hand, the baseline model achieves the highest F1 score when training with the Re-TACRED dataset (i.e., the lower section). As mentioned in the problem definition section, the Re-TACRED training set has resolved the false negative and false positive problems of TACRED. Therefore, by simply using all training samples of Re-TACRED, the baseline approach achieves the best F1. It is worth noting that our CAST is very robust and does not hurt performance, achieving slightly lower F1 but slightly better recall compared with the baseline.

B Hyper-Parameters of the Baselines
In this section, we report the hyper-parameters of the baseline experiments. For the negative sampling experiments, we used sampling rate γ = 0.1 for the DocRED experiment, γ = 0.5 for the TACRED experiment, and γ = 0.7 for the Re-TACRED and Re-DocRED experiments. γ was searched over γ ∈ {0.1, 0.3, 0.5, 0.7, 0.9}.
In CREST (Wei et al., 2021), the classes are first ranked by their frequencies, and the sampling rate for class i is calculated as:

µ_i = (X_{L+1−i} / X_1)^α,

where X_1 is the count of the most frequent positive class, X_{L+1−i} is the count of the (L+1−i)-th most frequent class, and L is the number of positive classes. We set the power α = 0.33 as reported in their paper. For all self-training-based experiments (VST, CREST, and CAST), we trained for 10 epochs per fold. All our experiments were run on an NVIDIA V100 GPU.
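Under this rank-reversed frequency ratio, where the i-th most frequent of L classes is kept with rate (X_{L+1−i}/X_1)^α, the rates can be computed as follows (`crest_rates` and the dict input format are our illustrative choices, reconstructed from the description above):

```python
def crest_rates(class_counts, alpha=0.33):
    """Frequency-based sampling rates in the style of CREST: the i-th most
    frequent class receives the frequency ratio of the i-th rarest class,
    so rare classes get rates near 1 and frequent classes are downsampled."""
    counts = sorted(class_counts.values(), reverse=True)          # X_1 >= ... >= X_L
    ranked = sorted(class_counts, key=class_counts.get, reverse=True)
    L, x1 = len(counts), counts[0]
    return {c: (counts[L - 1 - i] / x1) ** alpha for i, c in enumerate(ranked)}
```

Note the contrast with CAST: these rates depend only on training-set frequencies and never consult the per-class precision and recall on the development set.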

Figure 1 :
Figure 1: Precision and recall scores of each class (ranked by class frequency: left [high] → right [low]) on the development set of Re-DocRED when the model is trained on DocRED. P Reg. and R Reg. stand for the regression lines of the scores.

Figure 2 :
Figure 2: Illustration of training dataset update of CAST, and Algorithm 1 describes its full details.
As shown in Figure 2, we pseudo-label the potential false negative examples within the training set. First, we split the training set into N folds and train an RE model on N − 1 folds. The remaining fold S_Tk is used for inference. Next, we use a small development set S_D to evaluate the model and calculate the sampling probability for each relation class (Eq. 1). The predicted label set Y_Tk is obtained by conducting inference on S_Tk. Then, we re-sample the predicted labels based on the computed probability, which is calculated from the performance of each class. The re-sampled label set is denoted as Y'_Tk. Lastly, Y'_Tk is merged with the initial labels of S_Tk. The details of the proposed framework are discussed in the following subsections.

Algorithm 1: Class-Adaptive Self-Training
Input:
M: number of rounds
N: number of folds
S_T: an incompletely annotated training set
S_D: a task-specific development set
θ: a backbone model with parameters
β: smoothness coefficient

Figure 3: Comparison of different self-training strategies when training on DocRED.
Figure 4: Per-class precision and recall scores of VST and CAST when trained on DocRED.

Figure 5: F1 scores of frequent and long-tail classes with respect to rounds when trained on DocRED.
Figure 6: Effect of the sampling coefficient β on CAST when training on DocRED.

Figure 7 :
Figure 7: Effect of larger β when training on DocRED.

Table 2 presents the experimental results for document-level RE. The experimental results on the original DocRED dataset show that the F1 score of the ATLOP-RoBERTa model is only 49.32.

Table 3 :
Experimental results on the test set of Re-DocRED when trained on silver-quality data.

Table 4 :
Experimental results on ChemDisGene. The results with superscripts are taken from the respective papers.
† : The results are retrieved from Zhang et al. (2022).* : The results are retrieved from Wang et al. (2022).

Table 5 :
Experimental results on the test set of Re-TACRED when trained on TACRED and Re-TACRED, respectively. Model selection is based on the dev set of Re-TACRED.