Does Recommend-Revise Produce Reliable Annotations? An Analysis on Missing Instances in DocRED

DocRED is a widely used dataset for document-level relation extraction. To reduce the annotation workload at scale, its construction adopts a recommend-revise scheme: annotators are provided with candidate relation instances from distant supervision, and they then manually supplement and remove relational facts based on the recommendations. However, when comparing DocRED with a subset relabeled from scratch, we find that this scheme results in a considerable number of false negative samples and an obvious bias towards popular entities and relations. Furthermore, we observe that models trained on DocRED have low recall on our relabeled dataset and inherit the same bias present in the training data. Through an analysis of annotators' behaviors, we identify the underlying reason for the problems above: the scheme actually discourages annotators from supplementing adequate instances in the revision phase. We urge future research to take the issues of the recommend-revise scheme into consideration when designing new models and annotation schemes. The relabeled dataset is released at https://github.com/AndrewZhe/Revisit-DocRED, to serve as a more reliable test set for document RE models.


Introduction
Relation Extraction (RE) is an important task that aims to identify the relationships held between entities in a given piece of text. While most previous methods focus on extracting relations from a single sentence (Lin et al., 2016), recent studies have begun to explore RE at the document level (Peng et al., 2017; Zeng et al., 2020; Nan et al., 2020; Zhang et al., 2021), which is more challenging as it often requires reasoning across multiple sentences.
The rapid development of document-level RE in the past two years has benefited from the proposal of DocRED (Yao et al., 2019), the first large-scale, human-annotated dataset for this task. Noticeably, longer documents introduce an unprecedented difficulty in annotating relation instances: as the total number of entities increases with text length, the number of entity pairs grows quadratically, drastically increasing the workload of checking the relationship between every pair. To address this problem, Yao et al. (2019) applies a recommend-revise process: in the recommendation phase, a small set of candidate relation instances is generated through distant supervision; then, annotators are required to revise the candidate set, removing the incorrect relation instances and supplementing the instances not identified in the recommendation phase.
By shifting construction from annotation from scratch to an edit-based task, the recommend-revise scheme seems to cut down the annotation effort by a large margin. However, whether the annotation quality remains reliable in practice is in doubt. To what extent is the accuracy of annotation sacrificed for the automated recommendation? How do the provided recommendations affect the behaviors of the annotators in the revision phase? And what are the real effects on models trained on a dataset annotated with this scheme?
To answer these questions, we provide a thorough comparison between careful annotations from scratch and annotations under the recommend-revise scheme. We randomly select 96 documents from DocRED and ask two experts to relabel them from scratch independently. After annotating, the two experts reach a consensus on the gold labels via discussion. This revised dataset is publicly available at https://github.com/AndrewZhe/Revisit-DocRED, and we hope it can be used to evaluate models' performance on the real data distribution. With the help of these annotations, we discover three sobering issues regarding the effects of the recommend-revise scheme: (1) A noticeable portion of relation instances is left out, and the distributional bias in the recommendation output is inherited, even after the revision process. It is not surprising that recommendations alone fail to recognize all relation instances, since RE models are far from perfect. Ideally, these unidentified instances should be added by human annotators during the revision phase. However, it turns out that 95.7% of these missing instances are still left out even after revision. Furthermore, while the recommendations from distant supervision favor instances associated with popular entities and relations in the source knowledge base (Wikidata), this bias is maintained even after human revision, leaving less popular relations and entities neglected.
(2) Worryingly, we find that models trained on DocRED have low recall on our relabeled dataset and also inherit the same bias towards popular relations and entities. We train recent models on DocRED and test them on our relabeled dataset. All models show much lower recall on our dataset than previously reported on DocRED, due to the numerous false negatives in the training data, and they are also biased towards popular entities and relations. By comparing different negative-sampling strategies, we further show that the models' bias comes from the training set. Since one straightforward real-world application of relation extraction is to acquire novel knowledge from text, an RE model is much less useful if it has low recall or performs poorly on less popular entities and relations.
(3) The recommendations also affect the behaviors of annotators, making them unlikely to supplement the instances left out. This is the underlying reason for the two concerns above. We argue that the revision process fails to reach its goal because it puts the annotators in a dilemma: while they are supposed to "add" new instances left out by the recommendations, finding these missing instances may force the annotators to thoroughly check the entities pair by pair, which is time-consuming and against the goal of this scheme. As a result, annotators can hardly supplement effectively and tend to fall back on the easier goal of validating existing relation instances. (While we cannot guarantee that the relabeled data is totally error-free, we believe its quality is high enough to approximate the real distribution, because each entity pair is examined by two annotators.)

Recommend-Revise Annotation Scheme
The major challenge in annotating document-level RE datasets comes from the quadratic number of potential entity pairs with regard to the total number of entities in a document. As reported by Yao et al. (2019), a document in DocRED contains 19.5 entities on average, yielding roughly 360 entity pairs with potential relationships. Therefore, for the 5,053 documents to be annotated, around 1,823,000 entity pairs must be checked. This workload is around 14 times that of TACRED (Zhang et al., 2017), the biggest human-labeled sentence-level RE dataset. Exhaustively labeling the relations between each entity pair thus involves an intensive workload and is not feasible for document-level RE datasets.
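The workload arithmetic above can be sketched directly; the figures are those reported by Yao et al. (2019), and the quadratic count assumes ordered (head, tail) pairs without self-pairs:

```python
# Workload estimate for exhaustive pairwise annotation.
# Ordered (head, tail) pairs grow quadratically with entity count.
def ordered_pairs(n_entities: float) -> float:
    """Directed entity pairs in a document, excluding self-pairs."""
    return n_entities * (n_entities - 1)

avg_entities = 19.5      # average entities per DocRED document
num_documents = 5053     # documents to annotate

pairs_per_doc = ordered_pairs(avg_entities)    # 19.5 * 18.5 = 360.75
total_pairs = pairs_per_doc * num_documents    # ~1.82 million pairs
```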
To alleviate the huge burden of manual labeling, Yao et al. (2019) divides the annotation task into two steps: recommendation and revision. First, in the recommendation phase, Yao et al. (2019) takes advantage of Wikidata (Vrandecic and Krötzsch, 2014) and an off-the-shelf RE model to collect all possible relations between any two entities in the same document. This process is automated and requires no human involvement. Then, during the revision phase, the relations that exist in Wikidata or are inferred by the RE model for a specific entity pair are shown to the annotators. Rather than annotating each entity pair from scratch, the annotators are required to review the recommendations, remove incorrect triples, and supplement missing ones. To explore the supplementation in the revision phase and its influence on the released dataset, we acquired the original recommendations generated by distant supervision from the authors of DocRED. As we focus on the effect of missing instances, we do not consider the samples removed during the revision phase. The remaining annotations in the recommendations that are not removed later are denoted as D Recommend , and the annotations after human revision are denoted as D Revise .
DocRED from scratch To analyze the effect of the recommend-revise scheme, we re-annotate a subset of the documents used in DocRED from scratch and compare it with D Recommend and D Revise . We randomly select 96 documents from the validation set of DocRED and assign each document to two experts to be annotated independently. They are explicitly required to check every entity pair in the documents and to decide the relationships entirely based on the original text, with no recommendations. This turns out to be an extraordinarily difficult task, with each document taking half an hour to annotate on average. The inter-annotator Cohen's Kappa between our two experts is 0.68, indicating high annotation quality. Afterwards, the two experts discuss the inconsistent instances and reach an agreement on the final labels. As this paper focuses on the bias caused by false negatives in the recommend-revise scheme, we assume the labeled instances in DocRED are all correct.
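Agreement of this kind is measured with Cohen's kappa over the two experts' per-entity-pair labels; a minimal sketch on toy labels (the label sequences below are invented for illustration):

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators' parallel label sequences."""
    assert len(labels_a) == len(labels_b) and labels_a
    n = len(labels_a)
    # Observed agreement: fraction of items with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: from each annotator's marginal label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b.get(c, 0) for c in freq_a) / n ** 2
    return (observed - expected) / (1 - expected)

# Toy relation labels per entity pair ("none" = no relation).
ann1 = ["country", "none", "none", "cast member", "none", "none"]
ann2 = ["country", "none", "country", "cast member", "none", "none"]
kappa = cohens_kappa(ann1, ann2)  # ~0.71
```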
For the instances labeled in DocRED but not by our experts, we add them to our annotation. We denote this new annotation set as D Scratch . Table 1 shows the statistics and comparison of D Scratch , D Recommend and D Revise on the 96 randomly-selected documents in DocRED.

False Negatives in Recommendation
Comparing D Recommend with D Scratch , it is noticeable that a huge number of ground-truth relation instances are left out. While D Recommend captures 1167 relation instances in the documents, a more careful, entity-by-entity examination as done for D Scratch reveals as many as 3308 relation instances within the same documents. This sobering fact means that almost two-thirds of the relation instances are missing and wrongly labeled as negative.
Another unexpected fact is that annotators hardly added anything during the revision phase. The final version reports 1214 relation instances, a mere increase of 47 (1.4%) cases in total, or 0.49 instances on average per document. This suggests that, despite our high hopes for the revision process to make things right, it is not working to any sensible extent: the majority of the unlabeled instances, which make up nearly two-thirds of all instances, simply remain out there as they were.

Dataset Bias
Given the analysis above, an even more serious issue arises: since the changes introduced by the revision are so limited, the output after revision may still contain the same bias as the recommendations. That is, if the recommendations contain a systematic flaw, the new dataset will probably inherit it. In this section, we verify that such biases largely exist in the recommendation phase and are thus inherited by the DocRED dataset.
The recommendations of DocRED are collected from two sources: Wikidata and a relation extraction model. However, if we consider the facts retained after revision by annotators, i.e., those not removed as wrongly labeled, the majority are taken directly from Wikidata.
We suggest that, since Wikidata is a collaborative knowledge base, relation instances related to common entities and properties are more likely to be collected and added to it. In such cases, the recommendations from Wikidata will naturally favor popular entities and relations, while the less common ones are left out. We validate this hypothesis in the following sections, where we investigate the bias of DocRED from the perspective of both relations and entities.

Bias of Relations
To determine whether the dataset has a preference for popular relations, we divide the 96 relations in DocRED into two categories using Wikidata statistics and then compute their distributions. Specifically, we acquire the list of top 100 properties by quantity of item pages that link to them from Wikidata's official website and consider a relation popular if it appears on this list. Among the 96 relations in DocRED, 25 are in the top 100, including country, publication date, and so on.

Table 1: Statistics of datasets. #Instance is the total number of relation instances in the dataset. #Pop Rel and #Unpop Rel give the number of instances associated with popular and unpopular relations, respectively. The last two columns report the average entity popularity across all relation instances, with popularity max indicating the higher popularity of the head and tail entities of an instance, and popularity min indicating the lower one.

The center two columns of Table 1 illustrate the distribution of these two categories of relations across the datasets. First, in the real distribution, i.e., in D Scratch , the percentages of the two types of relations are 48.8% and 51.2%, respectively, which is close to 1:1 with slightly fewer popular relations. However, the proportion of instances belonging to popular relations reaches 56.5% in the recommendations, D Recommend , significantly higher than the 43.5% for unpopular ones. Further study of the instances mistakenly excluded during the recommendation phase, D Scratch − D Recommend , reveals that cases involving unpopular relations are more likely to be missing. This demonstrates that the recommendation phase in DocRED does have a systematic bias related to the popularity of relations. The instances supplemented during the revision phase, D Revise − D Recommend , mitigate this bias only marginally, as annotators supplement more instances of unpopular relations. However, in comparison to D Scratch , which represents the real relation distribution, D Revise still prefers popular relations, because the annotators place excessive trust in the recommendations and do not add enough missing instances during the revision phase. As the statistics in Table 1 show, the recommendations' bias toward popular relations is ultimately inherited by the dataset that passed manual inspection.
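The relation-popularity split described above can be sketched as follows; the top-100 property set and the instances here are illustrative stand-ins, not the actual Wikidata list or DocRED annotations:

```python
# Hypothetical subset of Wikidata's top-100-by-incoming-links properties.
TOP100_PROPERTIES = {"country", "publication date", "cast member"}

def popularity_split(instances):
    """Fraction of instances with a popular vs. unpopular relation."""
    popular = sum(1 for _head, rel, _tail in instances
                  if rel in TOP100_PROPERTIES)
    total = len(instances)
    return popular / total, (total - popular) / total

instances = [
    ("Michael Imperioli", "cast member", "The Sopranos"),
    ("Detroit 1-8-7", "country", "United States"),
    ("Detroit 1-8-7", "original network", "ABC"),
    ("Louis Fitch", "present in work", "Detroit 1-8-7"),
]
pop_frac, unpop_frac = popularity_split(instances)  # (0.5, 0.5)
```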

Bias of Entities
We hypothesize that instances involving very popular entities are more likely to appear in the Wikidata recommendations, whereas instances involving extremely rare entities are more likely to be disregarded. To determine whether such bias exists, we analyze the popularity of the entities engaged in relation instances across the datasets. Each named entity in DocRED is linked to a Wikidata item based on literal matching of names or aliases. The popularity of an entity is represented by how many times the matched item appears in a relation instance in Wikidata (either as head or tail); if an entity matches more than one Wikidata item, the highest count among the matched items is taken as its popularity. For entities that cannot be linked to Wikidata, we assign a popularity of -1.
For each relation instance, we compute two types of popularities. Since an instance contains a pair of entities (head and tail) usually with different popularities, we define popularity max to be the higher popularity of the pair of entities, and popularity min to be the lower one. We report the average popularity of relation instances in each dataset in Table 1.
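Under these definitions, the per-entity and per-instance popularity computation can be sketched as follows; the item identifiers and counts are fabricated for illustration:

```python
# Fabricated mapping: Wikidata item -> number of relation instances
# it participates in (as head or tail).
wikidata_degree = {"Q1": 500, "Q2": 12, "Q3": 40}

def entity_popularity(matched_items):
    """Highest count among matched items; -1 if the entity is unlinked."""
    if not matched_items:
        return -1
    return max(wikidata_degree.get(q, 0) for q in matched_items)

def instance_popularity(head_items, tail_items):
    """(popularity_max, popularity_min) for a head-tail entity pair."""
    p_head = entity_popularity(head_items)
    p_tail = entity_popularity(tail_items)
    return max(p_head, p_tail), min(p_head, p_tail)

pmax, pmin = instance_popularity(["Q1", "Q3"], ["Q2"])  # (500, 12)
```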
Comparing D Recommend and D Scratch , we find that the former's popularity max is 294.4, far higher than the latter's 266.3. This means that instances containing popular entities are more likely to be covered in the recommendation phase. As for the instances incorrectly excluded during the recommendation phase, D Scratch − D Recommend , their popularity min is 57.7, lower than the 67.4 of D Scratch . This demonstrates that instances involving uncommon entities are more likely to be ignored during the recommendation phase.
This entity-related bias is apparent in the revised dataset as well. The popularity max of D Revise remains larger than that of D Scratch , while the popularity min of D Scratch − D Revise is also lower than that of D Scratch . This is mostly because the facts supplemented in the revision phase are too few to eliminate such bias.

Model Bias
To investigate whether RE models trained on such data likewise learn the same bias, we train and select RE models on the datasets labeled under the recommend-revise scheme, D Train Revise and D Valid Revise , and then assess the models' performance on the real data distribution, D Scratch . The construction process of D Train Revise and D Valid Revise is the same as that of D Revise : the former is the original training set of DocRED, and the latter is the validation set of DocRED excluding the 96 documents in D Revise . In this setting, we examine the performance of recent models: BiLSTM (Yao et al., 2019), GAIN-BERT base (Zeng et al., 2020), SSAN-Roberta large (Xu et al., 2021), ATLOP-Roberta large (Zhou et al., 2021), and DocuNet-Roberta large (Zhang et al., 2021). The last three models are currently the most competitive on DocRED, while the first two are included to ensure that our analysis generalizes to models of smaller sizes. Table 2 summarizes the evaluation results of the five models on D Revise and D Scratch . All results are reported as micro-average F1 scores, as in prior literature (Zeng et al., 2020; Zhou et al., 2021).

Overall Performance
Notably, we observe a significant decline in F1 for all five models on D Scratch , mainly due to a dramatic drop in recall. The drop is the result of the bias in the training data: a model trained on biased data lacks the generalization ability to extract the relation instances that are systematically missing from the dataset. We validate this point in the following section.
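The mechanics of this recall drop can be illustrated with micro-averaged metrics; the counts below are invented solely to show the effect of scoring the same predictions against a fuller gold set:

```python
def micro_prf(tp, fp, fn):
    """Micro-averaged precision, recall, and F1 from instance counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Hypothetical model output scored against D_Revise (1214 gold instances).
_, r_revise, f1_revise = micro_prf(tp=900, fp=300, fn=314)

# Same predictions against D_Scratch (3308 gold instances): a few former
# "false positives" become true positives, but the unfound gold instances
# (fn) balloon, so recall and F1 collapse.
_, r_scratch, f1_scratch = micro_prf(tp=950, fp=250, fn=2358)
```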

Bias from Data to Model
To better understand the different performance on the two datasets, we analyze the models' capability over different relations and entities. Not surprisingly, we find that models trained on D Train Revise inherit the preference for popular relations and entities. Additional experiments suggest that this may be because missing instances are treated as negative samples during training. Given that a substantial proportion of unlabeled instances are associated with unpopular entities and relations, the model is forced to disregard those unpopular ones under the incorrect penalty for the missing instances.
Relation Bias Figure 1 shows the recall of the models on instances associated with popular and unpopular relations, respectively. As depicted, if an instance's relation is popular, it is almost twice as likely to be successfully extracted as an instance whose relation is unpopular. This gap does not narrow as the models' overall performance improves: the difference between the probability of successfully extracting popular and unpopular relations is 0.129 for the best model, ATLOP, which is even greater than the 0.125 for BiLSTM. This indicates that all models trained on the original DocRED favor popular relations and ignore the unpopular ones.

Figure 2 shows the models' recall curves as the popularity max of instances in D Scratch increases. We divide all instances in D Scratch into 5 groups based on the popularity max of each instance and calculate the recall for each group independently. As seen in Figure 2, all the curves exhibit a clear rising trend, indicating that the probability of discovering an instance is positively correlated with its popularity max . Additionally, the middle of ATLOP's and DocuNet's curves is nearly horizontal, which means that they are mostly sensitive to extremely popular or particularly rare entities.
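The binned-recall analysis above can be sketched as follows; the instance ids and popularity values are synthetic:

```python
def recall_by_popularity(gold, predicted_ids, n_bins=5):
    """Split gold instances into equal-sized bins by popularity_max
    (ascending) and compute recall within each bin."""
    ranked = sorted(gold, key=lambda item: item[1])
    size = len(ranked) // n_bins
    recalls = []
    for i in range(n_bins):
        lo = i * size
        hi = (i + 1) * size if i < n_bins - 1 else len(ranked)
        group = ranked[lo:hi]
        hits = sum(1 for inst_id, _pop in group if inst_id in predicted_ids)
        recalls.append(hits / len(group))
    return recalls

# Synthetic gold set: (instance_id, popularity_max), and a model that
# mostly recovers the instances with popular entities.
gold = list(enumerate([3, 5, 9, 20, 40, 80, 150, 300, 600, 1200]))
predicted = {3, 5, 6, 7, 8, 9}
recalls = recall_by_popularity(gold, predicted)  # [0.0, 0.5, 0.5, 1.0, 1.0]
```

The rising list of per-bin recalls mirrors the upward trend of the curves in Figure 2.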

Missing Instances as Negative Samples
Previous works (Zeng et al., 2020; Zhou et al., 2021; Zhang et al., 2021) assign the label no_relation to any entity pair that is not annotated with a relation, which means the missing instances are treated as negative samples during training, and a model is punished for predicting them as positive. We thus hypothesize that the model's bias originates from this incorrect penalty for missing instances in the training process. To demonstrate this, we generate the negative samples with a different approach, using only the instances manually removed during the revision phase. We denote this construction of negative samples as N Hum , and the method that treats all samples other than the positive instances as negative as N All . Since the samples generated by N Hum have been manually verified, they contain no false no_relation instances. We train the same models on D Train Revise with negative samples constructed by N Hum and N All and compare the models' preference for popular entities and relations. Figure 3 depicts the fraction of instances that correspond to popular relations among the instances accurately predicted by GAIN trained with D Train Revise + N Hum and D Train Revise + N All . Additionally, we mark the true distribution of the data in D Scratch . As can be seen, when trained with D Train Revise + N Hum , GAIN finds more instances associated with unpopular relations, and the gap between the proportion of unpopular-relation instances in the model's predictions and in D Scratch narrows.
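The two negative-sampling strategies can be sketched as below; the entity pairs are synthetic and the functions are illustrative, not the models' actual data loaders:

```python
# Sketch of the two negative-sampling strategies. 'all_pairs' enumerates
# candidate (head, tail) pairs in a document; the data is synthetic.
def negatives_all(all_pairs, positive_pairs):
    """N_All: every unlabeled entity pair becomes a no_relation sample,
    so genuinely related but unannotated pairs are wrongly penalized."""
    return [p for p in all_pairs if p not in positive_pairs]

def negatives_human(removed_in_revision):
    """N_Hum: only pairs whose recommended relation annotators removed,
    i.e., human-verified negatives."""
    return list(removed_in_revision)

all_pairs = [(h, t) for h in range(3) for t in range(3) if h != t]  # 6 pairs
positive_pairs = {(0, 1), (1, 2)}
removed_in_revision = [(2, 0)]

n_all = negatives_all(all_pairs, positive_pairs)   # 4 unlabeled pairs
n_hum = negatives_human(removed_in_revision)       # 1 verified negative
```

The trade-off is visible even in this toy case: N_All yields many more negatives, but any missing gold instance among them silently becomes a training error.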
Based on the entity popularity of each instance, we partition all instances in D Scratch into five categories and calculate the recall for each group independently. Figure 4 shows the improvement of GAIN's recall relative to the group containing the instances with the least popular entities (0-20%). In comparison to N All , using N Hum to construct negative samples markedly flattens the rising trend of the model's recall as entity popularity grows.

Annotators' Dilemma
Finally, we discuss another, more implicit influence of the recommend-revise scheme: its effect on the annotators. As discussed in Section 4.1, while we expected the revision process to help supplement the instances left out, remarkably few instances were actually added. Given that the annotators are trained to accomplish the revision task, we wonder why they still fail in such a uniform manner. We argue that it is the nature of the revision process that puts the annotators in a dilemma, where they must choose between enormous effort and insufficient supplementation.
Recall that there is a distinct difference between the settings of examining a labeled relationship and supplementing an unidentified one. For the former, annotators are required to find evidence for a recommended relation instance and remove it if there is conflicting or no evidence. This only requires checking a single entity pair and collecting the information related to the two specific entities. This is not the case for supplementing a possible, unidentified relation instance, which can exist between any entity pair. There is no clear range to search or information to indicate it; all annotators can do is check pair by pair, just as they would from scratch. This puts annotators in an awkward dilemma, especially when they understand the motivation of the scheme: if they are to be fully responsible for the missing instances at large, they will always have to complete the thorough pairwise checking one by one; however, this would make the whole recommend-revise scheme meaningless, as it is then no different from annotating from scratch. The harsh requirements of supplementing push annotators to rely excessively on the recommendation results and simply examine them. This is especially worrying in real practice, where annotators are recruited to complete a certain number of annotations and are typically paid according to the estimated number of hours or the total number of instances they devote to the annotation (Draws et al., 2021). Under this dilemma, it is a natural result that they are especially unmotivated to carry out the exhaustive checking required for supplementation in order to earn reasonable pay in the given time.
In fact, we observe an interesting phenomenon: annotators largely tend to pick some of the most obvious missing instances, convince themselves that they have accomplished the supplementation, and simply move on to the next document. This can be seen in Figure 5, where we compare the distributional characteristics of the successfully supplemented instances (D Revise − D Recommend ) and all the missing instances in general (D Scratch − D Revise ). Sub-figure (a) shows the cumulative statistics of the position of the head entity's first appearance in the document. The instances added by annotators in DocRED exhibit an extremely obvious tendency to occur earlier in the text: more than 70% of added instances are in the first 3 sentences. In contrast, the missing relation instances as a whole are distributed much more evenly across the document. Sub-figure (b) further compares the minimum distance between the mentions of the head and tail entities of a relation instance. We once again see that annotators have a strong tendency to add the "most easily identifiable" instances, where the head and tail entities are quite close. Specifically, the proportion of entity pairs mentioned in one single sentence (Interval=0) is around 20% for all missing facts, but as high as 45% for the ones chosen by annotators to be supplemented. This shows how annotators naturally avoid the reading burden brought by longer intervals, which possibly indicate more complicated inference across multiple sentences. From these observations, we see clear patterns among the very few instances added by human annotators. This reveals a serious fact: annotators, in effect, "pretend" to supplement with the least possible effort. Given the consensus behavior of annotators and the very limited number of additions, it is most likely the nature of the annotation task that pushes the annotators into this embarrassing dilemma of adding and abandoning.
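The minimum-mention-distance statistic used above can be sketched as follows, assuming each entity comes with the sentence indices of its mentions:

```python
def min_mention_distance(head_sentence_ids, tail_sentence_ids):
    """Smallest sentence gap between any head mention and any tail
    mention; 0 means the pair co-occurs within a single sentence."""
    return min(abs(h - t)
               for h in head_sentence_ids
               for t in tail_sentence_ids)

# Head entity mentioned in sentences 0 and 4, tail in sentence 5:
# the closest mentions are one sentence apart.
assert min_mention_distance([0, 4], [5]) == 1
# Co-occurrence in the same sentence gives Interval = 0.
assert min_mention_distance([2], [2, 7]) == 0
```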
Thus, we propose a call to the NLP community that researchers should always be aware that annotation schemes, like the recommend-revise scheme, can have a direct impact on the annotation workers, affecting their willingness and behaviors, and thus have a deeper influence on the collected data.

Case Study
We can summarize all the problems mentioned above with a concrete case in DocRED, shown in Figure 6. The figure depicts the annotations associated with the entity Michael Imperioli, as well as the relation added in revision. Let us first focus on the red edges, which indicate the relation triples that are neither recommended nor supplemented by humans. Regrettably, half of the total 18 relation triples remain missing, and just one triple is added during revision (the green edge). Compared with the black edges, which indicate correctly annotated instances, the red edges are more likely to be associated with less popular entities. For example, "Sopranos" [4] and "Law & Order" [7], two popular series with at least 100K+ comments on IMDB and about 200 edges in Wikidata, are connected with "Michael Imperioli" [0] via the relation "cast member" in the annotation, but "Detroit 1-8-7" [17] and "Mad Dogs" [20], which are supposed to hold the same relation to "Michael Imperioli" [0], are missed. In the text, all these series appear in similar contexts, and the only difference is that the latter ones are not recommended to the annotators, essentially because of their lower popularity (less than 10K comments on IMDB, and less than 50 edges in Wikidata). We can also see the effect of relation popularity in the connection between [7] and [9]: "present in work" and "characters" should occur symmetrically according to their definitions, but the latter is missed in the recommendation. Correspondingly, in Wikidata, the latter relation has 19057 links, fewer than the former's 82250 links. The last point to notice is the only green edge, between Louis Fitch [15] and ABC [16], which is not recommended but supplemented by annotators. Among all the instances missed in the recommendation, the annotators supplement only this one, which is easy to identify in the text because both the head and tail entities are mentioned in the same sentence.
This is consistent with our analysis of annotators' behavior above.

Related Work
With the advance of deep learning models, annotation sometimes becomes the bottleneck of a machine learning system, and analyzing annotation quality has recently received increasing attention. Northcutt et al. (2021) collect and analyze the label errors in the test sets of several popular benchmarks, showing that label errors are ubiquitous and destabilize machine learning benchmarks. More specific to the RE task, Alt et al. (2020) address the annotation problems in TACRED (Zhang et al., 2017), a popular sentence-level RE dataset. They find that label errors account for 8% absolute F1 test error and that more than 50% of the examples need to be relabeled. Stoica et al. (2021) expand this research to the whole dataset, producing a completely re-annotated version, Re-TACRED, and conduct a thorough analysis of models' performance. Our work differs from theirs in that we delve into the nature of the document-level RE task and, in particular, explore how error is systematically introduced into the dataset through the recommend-revise scheme.
Methodologies for handling incomplete annotations in information extraction tasks have been widely discussed in previous works. Unlike classification tasks, information extraction requires annotators to actively retrieve positive samples from texts, rather than just assigning a label to a given text. The problem is also attributed to the use of distant supervision (Reiplinger et al., 2014), where the linked KG is not perfect. Some works apply general approaches like positive-unlabeled learning or inference learning (Roller et al., 2015). Task-specific models have also been designed, such as the Partial CRF (Tsuboi et al., 2008) for NER (Yang et al., 2018) and novel paradigms for joint RE. However, none of them examine the distributional bias in the training data, and those methods have not been validated in the context of the document-level RE task.
Prevalent effective methods for document-level RE include graph-based and transformer-based models. Graph-based models like Zeng et al. (2020) and Zhang et al. (2021) are designed to conduct relational reasoning over the document, while transformer-based models (Zhou et al., 2021; Xu et al., 2021) are good at recognizing long-distance dependencies. However, all previous models treat unlabeled samples in the dataset as negative samples and do not address the problems in the annotations. We believe our analysis and re-annotated dataset will help future work focus more on the discrepancy between the annotation and the real-world distribution, instead of just overfitting to the dataset.

Conclusion
In this paper, we show how the recommend-revise scheme used for DocRED causes bias and false negative issues in the annotated data. These flaws lower the model's recall on real data and also teach the model the same bias present in the training data. As this scheme cannot substantially reduce human labor without a loss of annotation quality, more efficient annotation strategies remain to be explored. On the other hand, considering that building a reliable training set for document RE is extremely expensive, it is also a meaningful topic how to alleviate the dataset shift problem (Moreno-Torres et al., 2012) by injecting appropriate inductive bias into the model's structure, instead of inheriting the bias of the training data. We believe the in-depth analysis provided in this paper can benefit future designs of document-level RE models, and that our Scratch dataset can serve as a fairer test set.