Manual Evaluation Matters: Reviewing Test Protocols of Distantly Supervised Relation Extraction

Distantly supervised (DS) relation extraction (RE) has attracted much attention in the past few years as it can utilize large-scale auto-labeled data. However, its evaluation has long been a problem: previous works either took costly and inconsistent methods to manually examine a small sample of model predictions, or directly tested models on auto-labeled data, which, by our check, contains as many as 53% wrongly labeled entity pairs in the popular NYT10 dataset. This problem has not only led to inaccurate evaluation, but has also made it hard to understand where we are and what is left to improve in DS-RE research. To evaluate DS-RE models in a more credible way, we build manually-annotated test sets for two DS-RE datasets, NYT10 and Wiki20, and thoroughly evaluate several competitive models, especially the latest pre-trained ones. The experimental results show that manual evaluation can indicate very different conclusions from automatic ones, along with some unexpected observations, e.g., pre-trained models achieve dominant performance while being more susceptible to false positives than previous methods. We hope that both our manual test sets and novel observations can help advance future DS-RE research.


Introduction
Relation extraction (RE) aims at extracting relational facts between entities from text. One crucial challenge for building an effective RE system is how to obtain sufficient annotated data. To tackle this problem, Mintz et al. (2009) propose distant supervision (DS) to generate large-scale auto-labeled data by aligning relational facts in knowledge graphs (KGs) to text corpora, with the core assumption that a sentence mentioning two entities is likely to express the relational facts between those two entities in the KG.
As DS can bring hundreds of thousands of auto-labeled training instances for RE without any human labor, DS-RE has been widely explored in the past few years (Riedel et al., 2010; Hoffmann et al., 2011; Zeng et al., 2015; Lin et al., 2016; Feng et al., 2018; Vashishth et al., 2018) and has also been extended to related domains, such as biomedical information extraction (Peng et al., 2017) and question answering (Bordes et al., 2015). Although DS-RE has achieved great success, we identify one severe problem for current DS-RE research: its evaluation. Existing works usually take two kinds of evaluation methods following Mintz et al. (2009): held-out evaluation, which directly uses DS-generated test data to approximate the trend of model performance, and human evaluation, which manually checks the most confident relational facts predicted by DS-RE models. Since manual checking is costly, most works with human evaluation only examine a small proportion of the predictions. Moreover, different works may sample different splits of data, making human evaluation inconsistent across the literature. For these reasons, most recent studies even skip human evaluation and only take the held-out one.
However, the held-out evaluation can be quite noisy: there are many false-positive cases in which the sentences do not express the auto-labeled relations at all; besides, due to the incompleteness of KGs, DS may miss some relations, as shown in Figure 1. After checking 9,744 sentences in the held-out set of NYT10 (Riedel et al., 2010), the most popular DS-RE dataset, we found that about 53% of the entity pairs are wrongly labeled, emphasizing the need for a more accurate evaluation.
To make DS-RE evaluation more credible and alleviate the burden of manual checking for later work, we build human-labeled test sets for two DS-RE datasets: NYT10 (Riedel et al., 2010) and Wiki20 (Han et al., 2020). For NYT10, we manually annotate the sentences with positive DS relations in its held-out test set. We also use a fine-tuned BERT-based (Devlin et al., 2019) RE model to predict on all "N/A" (not applicable) sentences, and manually label the 5,000 sentences with the highest scores for having a relation. Additionally, we merge some unreasonably split relations and reduce the number of relation types from 53 to 25. For the Wiki20 dataset, we utilize both the relation ontology and the human-labeled instances of an existing supervised dataset, Wiki80 (Han et al., 2019), for testing, and then re-organize the DS training data accordingly.
Based on the newly-constructed benchmarks, we carry out a thorough evaluation of existing DS-RE methods, as well as incorporating recently advanced pre-trained models like BERT (Devlin et al., 2019). We find that our manually-annotated test sets can indicate very different conclusions from the held-out one, along with some surprising observations: (1) although pre-trained models lead to large improvements, they also suffer from false positives more severely, probably due to the pre-encoded knowledge they carry (Petroni et al., 2019); (2) existing DS-RE denoising strategies that have proved effective generally do not work for pre-trained models, suggesting that more effort is needed for DS-RE in the era of pre-training. To conclude, our main contributions in this work are:
• We provide large human-labeled test sets for two DS-RE benchmarks, making it possible for later work to evaluate in an accurate and efficient way.
• We thoroughly study previous DS-RE methods using both held-out and human-labeled test sets, and find that human-labeled data can reveal results inconsistent with the held-out ones.
• We discuss some novel and important observations revealed by manual evaluation, especially with respect to pre-trained models, which calls for more research in these directions.

Related Work
Relation extraction is an important NLP task and has gone through significant development in the past decades. In the early days, RE models mainly took statistical approaches, such as pattern-based methods (Huffman, 1995; Califf and Mooney, 1997), feature-based methods (Kambhatla, 2004; Zhou et al., 2005), graphical methods (Roth and Yih, 2002), etc. With increasing computing power and the development of deep learning, neural RE methods have shown great success (Liu et al., 2013; Zeng et al., 2014; Zhang and Wang, 2015; Zhang et al., 2017). Recently, pre-trained models like BERT (Devlin et al., 2019) have dominated various NLP benchmarks, including those in RE (Baldini Soares et al., 2019; Zhang et al., 2019b). All these RE methods focus on training models in a supervised setting and require large-scale human-annotated data.
To generate large-scale auto-labeled data without human effort, Mintz et al. (2009) first use DS to label sentences mentioning two entities with their relations in KGs, which inevitably brings wrongly labeled instances. To handle the noise problem, Riedel et al. (2010), Hoffmann et al. (2011), and Surdeanu et al. (2012) apply multi-instance multi-label training in DS-RE. Following their settings, later research mainly takes two paths: one aims at selecting informative sentences from the noisy dataset, using heuristics (Zeng et al., 2015), attention mechanisms (Lin et al., 2016; Han et al., 2018c; Zhu et al., 2019), adversarial training (Wu et al., 2017; Han et al., 2018a), and reinforcement learning (Feng et al., 2018; Qin et al., 2018); the other incorporates external information like KGs (Ji et al., 2017; Han et al., 2018b; Zhang et al., 2019a; Hu et al., 2019), multilingual corpora (Verga et al., 2016; Lin et al., 2017), as well as relation ontologies and aliases (Vashishth et al., 2018). Recently, pre-trained DS-RE models have also been explored, including both domain-general (Alt et al., 2019) and domain-specific (Amin et al., 2020) models. Some other recent works utilize DS data for intermediate pre-training in order to boost supervised RE tasks.
As mentioned in our introduction, the evaluation protocols of existing DS-RE work are either noisy (held-out) or costly and inconsistent (human), which motivates the manually-annotated test sets we introduce in this work.

DS-RE Datasets
In this section, we introduce how we build the manually-annotated test sets for NYT10 (Riedel et al., 2010) and Wiki20 (Han et al., 2020). We show the statistics of these datasets in Table 1.

NYT10 Dataset
NYT10 is constructed by aligning facts from Freebase (Bollacker et al., 2008) with the New York Times (NYT) corpus (Sandhaus, 2008). The original NYT10 dataset contains 53 relations (including N/A). After thoroughly examining the dataset, we found that (1) there are many duplicate instances; (2) there is no public validation set, and some previous works directly use the test set to tune models; and (3) the relation ontology is not reasonable for an RE task. Therefore, we first clean the dataset by removing duplicate sentences, split out a validation set, and then merge some of the relations as described below.
A New Relation Ontology One example of NYT10's improper relation ontology is the set of relations related to state/province capitals. There are 12 such relations in the original dataset, representing region capitals of different countries, and some of them do not even have instances in the test set. We merge these 12 relations into one unified relation /location/region/capital, and we also merge 3 relations related to organization locations into /business/location. Besides, we delete relations that only show up in either the training set or the test set (most of which have very few sentences). In the end, we get 25 relations in the new dataset. We provide the detailed relation list of the new dataset in Appendix A. Interestingly, we found that training models on the new dataset leads to a slight performance degradation, as shown in Table 2, which is counterintuitive (since merging classes usually makes the task easier). We suspect that the original relation ontology provides heuristics for the model. For example, models can learn from the original ontology that every sentence with a "US state" as the head entity expresses the fine-grained relation /location/us_state/capital, a shortcut that cannot be acquired with the merged relation /location/region/capital. This shows how biases in the original NYT10 dataset can be inappropriately exploited by models.

Annotated Test Set
The original NYT10 dataset only provides an auto-labeled test set for the held-out evaluation. Based on the original test set, we manually annotate all sentences that have a positive DS label. In addition, we fine-tune a BERT model (as described in §4.2) to predict the relations of all sentences originally labeled as N/A, and annotate the 5,000 sentences with the highest predicted scores for non-N/A relations. For each sentence, we ask 4 annotators to decide whether it expresses one or more relations among the 25 relations. Note that one sentence may have multiple relations. Notably, we do not provide annotators with any relation suggestions from DS labels or model predictions, in order to avoid biasing them with external information.
When aggregating the final annotation results, we consider each relation for each sentence independently. A sentence is regarded as expressing a relation if more than half of the annotators labeled it with that relation. If a sentence gets no such majority for any positive relation, it is labeled as N/A. For annotation conflicts (i.e., no candidate relation gets more than half of the votes), the authors manually annotate these sentences.
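The aggregation rule above can be sketched as follows (a minimal illustration with our own function and variable names; the four-annotator setting matches the paper, but how conflicts are escalated to the authors is handled outside this sketch):

```python
from collections import Counter

def aggregate_annotations(annotator_labels, num_annotators=4):
    """Aggregate per-sentence annotations by majority vote.

    annotator_labels: one set per annotator, each containing the relations
    that annotator assigned to the sentence (empty set = no relation).
    Returns the set of accepted relations, or {"N/A"} if none passes.
    """
    votes = Counter()
    for labels in annotator_labels:
        for rel in labels:
            votes[rel] += 1
    # A relation is accepted if strictly more than half the annotators chose it.
    accepted = {rel for rel, n in votes.items() if n > num_annotators / 2}
    return accepted if accepted else {"N/A"}
```

Note that because each relation is judged independently, a sentence can end up with several accepted relations, matching the multi-label nature of the task.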
Finally, we obtain a human-labeled test set with 9,744 sentences, 32% of which are N/A instances. It contains 5,174 entity pairs and 3,899 manually-verified relational facts in total. After comparing it with the corresponding original DS-generated labels, we found that at the fact level, the DS annotations only have a precision of 69.1% and a recall of 33.9%. At the entity pair level, the accuracy of DS labels is only 47.1%. This emphasizes the need to use the human-annotated test set for more accurate evaluation in DS-RE.

Wiki20 Dataset
Han et al. (2020) construct the Wiki20 dataset by aligning the English Wikipedia corpus with Wikidata (Vrandečić and Krötzsch, 2014). To provide an annotated test set for it, we utilize Wiki80 (Han et al., 2019), a supervised RE dataset with 80 relations from Wikidata. We re-organize Wiki20 by adopting the same relation ontology as Wiki80 and re-splitting the train/validation/test sets, while taking sentences in Wiki80 as the test set. We make sure that there is no overlap of entity pairs among the three splits, to avoid any information leakage.
Note that Wiki20 is quite different from NYT10: NYT10 labels a sentence as "N/A" if the entity pair in the sentence does not have a relation in the KG. On the contrary, Wiki20 labels entity pairs whose relation falls outside its relation ontology as "N/A". In other words, "N/A" instances in Wiki20 express unknown relations instead of no relation.

Figure 2: An illustration of a typical multi-instance multi-label (bag-level) model. The model aims to predict relation probabilities for entity pairs, instead of sentences, which is usually accomplished by aggregating a bag representation and doing classification over it.

DS-RE Models
In this section, we elaborate on the multi-instance multi-label framework for DS-RE, and introduce the models we evaluate, including both previous methods and the latest pre-trained ones.

Multi-instance Multi-label Evaluation
Unlike supervised RE tasks, which usually evaluate models at the sentence level, DS-RE evaluates how well models can extract relational facts from the corpus, i.e., it measures the precision and recall of extracted relational facts (a fact is an entity pair and a relation between them). It is named multi-instance multi-label because each entity pair may be mentioned in multiple sentences, and one entity pair can have more than one relation. Under this framework, models are required to predict the potential relations for each entity pair, according to all sentences mentioning the pair, during evaluation, as shown in Figure 2. The sentences associated with the same entity pair are also called a bag, and we thus interchangeably refer to the multi-instance multi-label framework as the bag-level framework.
The most popular way to compare DS-RE models is to plot precision-recall (P-R) curves and calculate the area under the curve (AUC). We also report micro F1 and macro F1 in our experiments. Since the numbers of instances for different relations are extremely imbalanced, macro F1 can better demonstrate model performance while avoiding the bias introduced by the major relations.
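The difference between the two averages can be made concrete with a small sketch (pure Python with our own toy labels; the actual evaluation is over extracted facts with a score threshold, not a flat label list):

```python
from collections import defaultdict

def per_class_stats(y_true, y_pred):
    """Count tp/fp/fn and compute F1 for each class."""
    stats = defaultdict(lambda: [0, 0, 0])  # class -> [tp, fp, fn]
    for t, p in zip(y_true, y_pred):
        if t == p:
            stats[t][0] += 1
        else:
            stats[p][1] += 1  # false positive for the predicted class
            stats[t][2] += 1  # false negative for the true class
    f1 = {}
    for c, (tp, fp, fn) in stats.items():
        denom = 2 * tp + fp + fn
        f1[c] = 2 * tp / denom if denom else 0.0
    return stats, f1

def micro_macro_f1(y_true, y_pred):
    stats, f1 = per_class_stats(y_true, y_pred)
    tp = sum(s[0] for s in stats.values())
    fp = sum(s[1] for s in stats.values())
    fn = sum(s[2] for s in stats.values())
    micro = 2 * tp / (2 * tp + fp + fn)  # pools counts across classes
    macro = sum(f1.values()) / len(f1)   # averages per-class F1 equally
    return micro, macro

# An imbalanced toy example: the model always predicts the majority class.
y_true = ["contains"] * 4 + ["capital"]
y_pred = ["contains"] * 5
micro, macro = micro_macro_f1(y_true, y_pred)
# micro = 0.8, but macro = (8/9 + 0) / 2 ≈ 0.44: the ignored minority
# class drags macro F1 down while barely affecting micro F1.
```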

Model Details
Sentence-level Training Models are trained in a sentence-level fashion (as in supervised RE), but are used as bag-level models during evaluation. As shown in Figure 2, the model first encodes each sentence, then uses an aggregator to fuse the embeddings of all sentences in the bag, and feeds the bag-level representation to the classifier. Here we take two aggregation strategies: average (AVG), which averages the representations of all the sentences in the bag; and at-least-one (ONE) (Zeng et al., 2015), which first predicts relation scores for each sentence in the bag, and then takes the highest score for each relation.
Bag-level Training Directly deploying sentence-level training for DS-RE suffers from the wrong labeling problem: a sentence mentioning two entities may not express its auto-labeled relation.
To alleviate this problem, models can also adopt the bag-level framework during training, based on the expressed-at-least-once assumption (Riedel et al., 2010): at least one sentence in the bag expresses the auto-labeled relation. Besides the AVG and ONE strategies mentioned above, which can be directly deployed for bag-level training, Lin et al. (2016) propose sentence-level attention (ATT) for aggregation: it produces the bag-level representation as a weighted average over the embeddings of the sentences, with the weights determined by attention scores between sentences and relations.
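A minimal NumPy sketch of the three aggregators over a bag of sentence embeddings (our own simplified notation: random vectors stand in for encoder outputs, and the classifier weight rows double as relation embeddings; real implementations learn all of these):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def classify(rep, W):
    """Relation probabilities from a (d,) representation and an (R, d) classifier."""
    return softmax(W @ rep)

def bag_avg(sent_embs, W):
    # AVG: average sentence embeddings, then classify the bag representation.
    return classify(sent_embs.mean(axis=0), W)

def bag_one(sent_embs, W):
    # ONE (at-least-one): classify each sentence, keep the highest score per relation.
    scores = softmax(sent_embs @ W.T, axis=-1)  # (n_sents, n_rels)
    return scores.max(axis=0)

def bag_att(sent_embs, W, rel_emb):
    # ATT: weight sentences by attention against a relation embedding,
    # then classify the weighted average of the sentence embeddings.
    alpha = softmax(sent_embs @ rel_emb)        # (n_sents,)
    return classify(alpha @ sent_embs, W)

rng = np.random.default_rng(0)
bag = rng.normal(size=(5, 8))   # a bag of 5 sentences, 8-dim embeddings
W = rng.normal(size=(3, 8))     # 3 candidate relations
p_avg = bag_avg(bag, W)
p_one = bag_one(bag, W)
p_att = bag_att(bag, W, W[0])   # attend with relation 0's embedding
```

Note that ONE produces per-relation maxima rather than a single distribution, which is why it need not sum to one across relations.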
For our experiments, we take CNN (Liu et al., 2013), PCNN (Zeng et al., 2015) and BERT (Devlin et al., 2019) as sentence encoders, all common choices for neural RE models. We evaluate combinations of different sentence encoders, training policies, and aggregation strategies, e.g., bag-level trained PCNN with the ATT aggregator (PCNN+bag+ATT) or sentence-level trained BERT with the ONE aggregator (BERT+sent+ONE). Besides, we evaluate several representative DS-RE models from the literature: RL-DSRE (Qin et al., 2018), which uses deep reinforcement learning to denoise training instances; BGWA (Jat et al., 2017), which uses both word-level and sentence-level attention; and RESIDE (Vashishth et al., 2018), which introduces side information such as relation aliases to put soft constraints on predictions.
For the BERT-based sentence encoder, there are practical challenges in adopting bag-level training: in the worst case, one bag can contain thousands of sentences, which is beyond the capacity of most computing devices given the large size of pre-trained models. To address this issue, we take a random sampling strategy during training: for each bag, we randomly sample b sentences instead of taking all of them. For evaluation, we use the same routine as for the non-pre-trained encoders and take all of the sentences into account (backpropagation is not needed there, so the bag can be split into several batches). Since this differs from the original bag-level training, we carry out a pilot experiment to examine the effect of sampled training. From Table 3, we can see that our sampling strategy does not significantly hurt the performance of bag-level training.
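The sampling trick can be sketched as follows (a simplified version with our own function name; whether undersized bags are padded by sampling with replacement or simply taken whole is an implementation choice, and we show the latter):

```python
import random

def sample_bag(sentences, b=4, seed=None):
    """Randomly keep at most b sentences of a bag for training.

    At evaluation time the whole bag is used instead (split across
    batches if needed), since no backpropagation is required there.
    """
    if len(sentences) <= b:
        return list(sentences)
    rng = random.Random(seed)
    return rng.sample(sentences, b)
```

This caps the per-bag memory cost at b forward/backward passes through the encoder, which is what makes bag-level training with a large pre-trained model feasible.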
We also add another variant, BERT-M, in our evaluation. We observe from the top predictions of BERT models (Figure 3) that BERT tends to make false-positive errors for entity pairs that have a relation in the KG but no sentence truly expressing the relation in the data, probably because the model learns shallow cues solely from the entities. Thus, following prior work, we mask entity mentions during training and inference to avoid learning biased heuristics from entities.
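Entity masking can be sketched as below (the mask tokens and the plain string replacement are our own assumptions for illustration; actual implementations typically replace token spans marked in the data rather than matching raw strings):

```python
def mask_entities(sentence, head, tail,
                  head_token="[unused1]", tail_token="[unused2]"):
    """Replace head/tail entity mentions with reserved mask tokens so the
    model cannot rely on shallow cues from the entity names themselves."""
    return sentence.replace(head, head_token).replace(tail, tail_token)

masked = mask_entities(
    "Arthur Schnitzler wrote a story set in Vienna.",
    head="Arthur Schnitzler", tail="Vienna")
# "[unused1] wrote a story set in [unused2]."
```

After masking, the model can only use the sentential context to predict the relation, which is exactly what suppresses the fact-memorization shortcut.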

Implementation Details
We use the OpenNRE toolkit (Han et al., 2019) for most of our experiments, including both sentence-level and bag-level training. For CNN and PCNN, we follow the hyper-parameters of Han et al. (2019). For BERT, we use the pre-trained checkpoint bert-base-uncased for initialization, take a batch size of 64, a bag size of 4 and a learning rate of 2 × 10−5, and train the model for 3 epochs. For RL-DSRE, RESIDE and BGWA, we directly use their original implementations.

Evaluation Settings
We take three different settings in our experiments (the batch size and learning rate above were determined by a grid search over batch sizes in {16, 32, 64} and learning rates in {1e-5, 2e-5, 5e-5}):

Table 4: Results (%) on NYT10, including the held-out evaluation, bag-level manual evaluation, and sentence-level manual evaluation. The "bag" column indicates whether the model uses bag-level training, and the "strategy" column shows the bag aggregation policy. We report the AUC, micro F1 (Micro) and macro F1 (Macro) scores.

Held-out evaluation: We take the test data of the original DS datasets for evaluation. The trend of this evaluation should be consistent with the results reported in most DS-RE literature.
Bag-level manual evaluation: We take our human-labeled test data for bag-level evaluation. Since the annotations are at the sentence level, we construct bag-level annotations as follows: for each bag, if any sentence in the bag has a human-labeled relation, the bag is labeled with that relation; if no sentence in the bag is annotated with any relation, the bag is labeled as N/A.
Sentence-level manual evaluation: As we wonder how well bag-level-trained models can handle sentence-level predictions, our human-labeled test set is also used for a sentence-level evaluation.
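The construction of bag-level labels from the sentence-level annotations can be sketched as (our own minimal version; function and variable names are illustrative):

```python
def bag_labels(sentence_annotations):
    """Derive a bag's label set from its sentences' annotated relations.

    sentence_annotations: one set of relations per sentence in the bag
    (empty set = no relation annotated). A bag carries every relation
    expressed by at least one of its sentences; an all-empty bag is N/A.
    """
    labels = set()
    for rels in sentence_annotations:
        labels |= rels
    return labels if labels else {"N/A"}
```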
We report AUC, micro F1 and macro F1 for all of the above evaluation settings. We take the best micro F1 on the P-R curve and use the corresponding threshold for calculating macro F1. Considering the differences between NYT10 and Wiki20, we evaluate models on the two datasets separately.

The Results on NYT10

Table 4 shows the main results on NYT10. We also plot P-R curves of selected models in Appendix B. Overall, across all three settings, training pre-trained models in a sentence-level style always performs the best, while applying bag-level training strategies significantly boosts the performance of non-pre-trained encoders. Feng et al. (2018) observe that bag-level training is not helpful in sentence-level evaluation, which contradicts our observation. We suspect that this is because Feng et al. (2018) only manually check a small proportion of the test data, leading to a biased result.
More importantly, by comparing the held-out test results with the manual ones, we conclude that manual evaluation matters: auto-labeled and human-labeled test data lead to very different observations. For example, the comparisons between PCNN and RL-DSRE, and between BGWA and RESIDE, are reversed under the two evaluations. Also, the performance gaps between different models become much smaller in the manual test. Since our manual test set is smaller than the original held-out one (because we did not annotate all N/A sentences), to make a clearer comparison, we evaluate two selected models on the bag-level manual test set and on the corresponding instances in the held-out test set, respectively, and plot the P-R curves in Figure 3. It shows that not only do the absolute values of the two measurements differ a lot, but the relative performance of the models is also affected. For instance, BERT+sent+ONE shows a considerable advantage over PCNN+bag+ATT at the top predictions on the held-out test set, but the situation is completely reversed in the manual test, where BERT+sent+ONE is even significantly worse than PCNN+bag+ATT. This clearly suggests that the held-out test set cannot well demonstrate the real pros and cons of the models.
Compared to the others, BGWA and RESIDE show an extreme performance change between the held-out and manual evaluations. We suspect this is because they use entity types as extra information, which leads to overfitting to biased entity heuristics. This further emphasizes the need for manually-labeled test data in DS-RE.
After checking the manual results, we further identify some interesting observations that have not been clearly demonstrated with the DS evaluation.
Pre-trained Models First of all, BERT-based models achieve superior performance across all three metrics. To thoroughly examine BERT and its variants in the DS-RE scenario, we further plot their P-R curves on the bag-level manual test in Figure 4. It is surprising to see that all bag-level training strategies, especially the ATT strategy, which brings significant improvements for PCNN-based models, do not help or even degrade the performance of pre-trained models. This observation is also consistent with that of Amin et al. (2020), though they only compare BERT+bag+AVG and BERT+bag+ATT. We hypothesize two reasons: solely using pre-trained models already makes a strong baseline, since they exploit more parameters and have gained pre-encoded knowledge from pre-training (Petroni et al., 2019), which makes it easier for them to directly capture relational patterns from noisy data; and bag-level training, which essentially increases the batch size, may raise the optimization difficulty for these large models.
Another unexpected observation is that, though the P-R curve of BERT is far above those of other models in the held-out test, it drops significantly in the manual test, as shown in Figure 3 and Appendix B. By manually checking those errors, we find that most of them come from the model predicting facts that exist in the KG but are not supported by the text (i.e., false positives). For example, Arthur Schnitzler was indeed born in Vienna, but it is wrong for the model to infer the relation place of birth from the sentence "Arthur Schnitzler wrote a story set in Vienna." We assume this is not only because of the prior knowledge of pre-trained models, but also because BERT can better learn heuristics from the entities themselves, as shown in prior studies on supervised RE. Considering that DS-RE data are noisy and in many cases the text does not support the labeled facts, this overfitting-to-heuristics phenomenon can only be more severe.
To verify the assumption and try out a simple solution to alleviate the problem, we take a BERT-M variant (as described in §4.2) and show its results in Figure 4. We can see that the P-R curves of BERT-M are above those of BERT at the beginning, demonstrating that BERT-M models have higher precisions at those top predictions. Later on, since BERT can extract more facts than BERT-M by fully utilizing information from entity names, BERT-M reduces below BERT. From these results, we highlight that how to handle the false-positives and denoise DS-RE data for pre-trained models still remains an open and challenging problem.
Imbalanced Classes Previous works on DS-RE usually take AUC, micro F1, or P-R curves to measure the abilities of models, which show the overall performance trend averaged over relational facts. However, the distribution of training instances across relations is extremely uneven. For example, in NYT10, almost half of the positive instances are /location/location/contains; on the contrary, half of the relations have fewer than 1,000 sentences. In this case, macro F1 can better show the averaged performance across different relations, without being biased by the majority class. Table 4 demonstrates that even though the conclusions of different metrics are mostly consistent, there are cases where models improve micro F1 but degrade macro F1.
To further study how models perform on each relation, we plot several representative models and their micro F1 scores for each relation in Figure 5. We can see that: (1) the top-4 relations, which account for 80% of test instances, do not vary much in performance across models, while the performance differences mostly occur outside the top-4; (2) some relations even have zero F1 scores, mostly because they have very few training or test sentences. These results further underscore the importance of looking into per-relation scores for DS-RE, and we advocate that later works include macro F1 for more comprehensive comparisons.

Table 5: Results (%) on Wiki20 of representative models. "Bag" indicates bag-level training and "Micro" and "Macro" represent micro and macro F1 respectively.

The Results on Wiki20
We choose several representative models and further evaluate them on Wiki20, as shown in Table 5.
The main observation on Wiki20 is consistent with that on NYT10: sentence-level pre-trained models perform the best, and bag-level training helps non-pre-trained ones, though the overall performance is much higher. One difference is that, on Wiki20, AVG performs better than ONE and ATT. We attribute this to the inherent difference in how the two datasets are constructed, especially in how N/A sentences are determined. Compared to NYT10, part of the N/A instances in Wiki20, instead of indicating no relation between the entities, may correspond to a specific relation outside the dataset ontology. This suggests that when dealing with N/A instances, considering their latent semantics, rather than simply treating them as one abstract class, may further benefit RE models.

Conclusion
In this paper, we study the problem of test protocols in DS-RE and build large manually-annotated test sets for two DS-RE datasets, to enable more accurate and efficient evaluation. We not only demonstrate that our manual test sets lead to different observations from previous held-out ones, but also capture some interesting reflections from the manual test, e.g., pre-trained models suffer more from false positives, and bag-level training strategies generally do not help pre-trained models. We hope that our manual test sets can mark a new starting point for DS-RE, and that these observations can motivate novel research directions towards better DS-RE models, e.g., studying denoising methods for pre-trained models or processing N/A relations in a more fine-grained way.

Acknowledgments
This work is supported by the National Key Research and Development Program of China (No. 2020AAA0106501) and Beijing Academy of Artificial Intelligence (BAAI). This work is also supported by the Pattern Recognition Center, WeChat AI, Tencent Inc.

Ethical Considerations
Our work mainly focuses on two parts: the construction of two manually-labeled test sets and the analyses of models based on the manual test. Regarding the annotation, we first approximate the workload by annotating a few examples on our own, and then determine the wages for annotators according to local standards. The two datasets are based on NYT and Wikipedia, and we did not identify any unethical content during annotation.
Concerning the analyses, we find that models tend to utilize shallow clues for classification, such as learning heuristics from the entities. This behavior can potentially create biased extraction results depending on the distribution of entities in the training set, and is worth further investigation.

Figure B.1: P-R curves of representative models in the held-out test, bag-level manual test and sentence-level manual test of NYT10. Note that the scales of the X-axis are not the same in the three figures.