Revisiting the Negative Data of Distantly Supervised Relation Extraction

Distant supervision automatically generates plenty of training samples for relation extraction. However, it also incurs two major problems: noisy labels and imbalanced training data. Previous works focus more on reducing wrongly labeled relations (false positives), while few explore the missing relations caused by the incompleteness of the knowledge base (false negatives). Furthermore, the quantity of negative labels overwhelmingly surpasses that of positive ones in previous problem formulations. In this paper, we first provide a thorough analysis of the above challenges caused by negative data. Next, we formulate relation extraction as a positive-unlabeled learning task to alleviate the false negative problem. Thirdly, we propose a pipeline approach, dubbed RERE, that first performs sentence classification with relational labels and then extracts the subjects/objects. Experimental results show that the proposed method consistently outperforms existing approaches and maintains excellent performance even when learned with a large quantity of false negative samples. Source code is available online at https://github.com/redreamality/RERE-relation-extraction.


Introduction
Relation extraction is a crucial step towards knowledge graph construction. It aims at identifying relational triples from a given sentence in the form ⟨subject, relation, object⟩, in short ⟨s, r, o⟩. For example, given S1 in Figure 1, we hope to extract ⟨WILLIAM SHAKESPEARE, BIRTHPLACE, STRATFORD-UPON-AVON⟩.
This task is usually modeled as a supervised learning problem, and distant supervision (Mintz et al., 2009) is utilized to acquire large-scale training data. The core idea is to obtain training data by automatically labeling a sentence with existing relational triples from a knowledge base (KB). For example, given a triple ⟨s, r, o⟩ and a sentence, if the sentence contains both s and o, distant supervision methods regard ⟨s, r, o⟩ as a valid sample for the sentence. If no relational triples are applicable, the sentence is labeled as "NA".
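As a minimal illustration, the labeling rule can be sketched in a few lines of Python. Function and variable names here are ours; real pipelines additionally perform mention detection and entity linking rather than exact string matching:

```python
# Minimal sketch of distant-supervision labeling, assuming exact string
# matching of entity names (real systems use mention detection and linking).
from typing import List, Tuple

Triple = Tuple[str, str, str]  # (subject, relation, object)

def distant_label(sentence: str, kb: List[Triple]) -> List[Triple]:
    """Return every KB triple whose subject and object both occur in the
    sentence; an empty list corresponds to the "NA" label."""
    return [t for t in kb if t[0] in sentence and t[2] in sentence]

kb = [("William Shakespeare", "BIRTHPLACE", "Stratford-upon-Avon")]
print(distant_label("William Shakespeare was born in Stratford-upon-Avon.", kb))
# [('William Shakespeare', 'BIRTHPLACE', 'Stratford-upon-Avon')]
```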
Despite the abundant training data obtained with distant supervision, non-negligible errors also occur in the labels. There are two types of errors. In the first type, the labeled relation does not conform with the original meaning of the sentence; this type of error is referred to as a false positive (FP). For example, in S2, the sentence "Shakespeare spent the last few years of his life in Stratford-upon-Avon." does not express the relation BIRTHPLACE, making it an FP. In the second type, large numbers of relations in sentences are missing due to the incompleteness of the KB, which is referred to as a false negative (FN). For instance, in S3, "Buffett was born in 1930 in Omaha, Nebraska." is wrongly labeled as NA because there is no relation (e.g., BIRTHPLACE) between BUFFETT and OMAHA, NEBRASKA in the KB. Many efforts have been devoted to solving the FP problem, including pattern-based methods (Jia et al., 2019), multi-instance learning methods (Lin et al., 2016; Zeng et al., 2018a), and reinforcement learning methods (Feng et al., 2018), and significant improvements have been made.
However, the FN problem has received much less attention (Min et al., 2013; Xu et al., 2013; Roller et al., 2015). To the best of our knowledge, no existing work uses deep neural networks to solve this problem. We argue that this problem is fatal in practice, since there are massive FN cases in datasets; for example, there exist at least 33% and 35% FNs in the NYT and SKE datasets, respectively. We analyze the problem in depth in Section 2.1.

Another huge problem in relation extraction is the overwhelming number of negative labels. As is widely acknowledged, information extraction tasks are highly imbalanced in class labels (Chowdhury and Lavelli, 2012; Lin et al., 2018). In particular, negative labels account for most of the labels in relation extraction under almost any problem formulation, which makes relation extraction a hard machine learning problem. We systematically analyze this in Section 2.2.
In this paper, we address these challenges caused by negative data. Our main contribution can be summarized as follows.
• We systematically compare the class distributions of different problem formulations and explain why extracting the relation first and then the entities, i.e., the third paradigm (P3) in Section 2.2, is superior to the others.
• Based on the first point, we adopt P3 and propose a novel two-staged pipeline model dubbed RERE. It first detects relations at the sentence level and then extracts entities for each specific relation. We model the false negatives in relation extraction as "unlabeled positives" and propose a multi-label collective loss function.
• Our empirical evaluations show that the proposed method consistently outperforms existing approaches and achieves excellent performance even when learned with a large quantity of false negative samples. We also provide two carefully annotated test sets aimed at reducing the false negatives of previous annotations, namely NYT21 and SKE21, with 370 and 1,150 samples, respectively.

Problem Analysis and Pilot Experiments
We use (c_i, T_i) to denote a training instance, where c_i is a sentence consisting of N tokens, viewed as the token-position pair set {(c_i1, 1), ..., (c_iN, N)} so that set operations can be applied, and T_i is the set of relational triples ⟨s, r, o⟩ labeled for c_i. We assume r ∈ R, where R is the finite set of all relations in D. Other model/task-specific notations are defined after each problem formulation.

We now clarify some terms used in the introduction and title without formal definition. A negative sample refers to a triple t ∉ T_i. A negative label refers to a negative class label (e.g., usually "0" for binary classification) used for supervision with respect to a task-specific model; under different task formulations, the negative labels can be different. Negative data is a general term that includes both negative labels and negative samples. There are two kinds of false negatives. A relation-level false negative (S3 in Figure 1) refers to the situation where there exists t' = ⟨s', r', o'⟩ ∉ T_i, but r' is actually expressed by c_i and does not appear in any other t ∈ T_i. Similarly, an entity-level false negative (S4 and S5 in Figure 1) is one where r' appears in other t ∈ T_i. An imbalanced class distribution means that the quantity of negative labels is much larger than that of positive ones.

Addressing the False Negatives
As shown in Table 1, the NYT (SKE) dataset contains 88,253 (409,767) triples labeled by Freebase (BaiduBaike), while re-labeling with Wikidata (CN-DBpedia) yields 58,135 (342,931) triples. In other words, there are massive FN matches if the corpus is labeled by only one KB, due to the incompleteness of KBs. Notably, we find that the FN rate is underestimated by previous research (Min et al., 2013; Xu et al., 2013), whose manual evaluations report 15%-35% FN matches. This discrepancy may be caused by human error; specifically, a volunteer may accidentally miss some triples. For example, as pointed out by Wei et al. (2020, Appendix C), the test set of NYT11 (Hoffmann et al., 2011) misses many triples despite being labeled by humans, especially when multiple relations occur in the same sentence. This also provides evidence that FNs are harder to discover than FPs.

Table 1: "Relabeled" means re-labeling the NYT and SKE datasets by aligning with Wikidata and CN-DBpedia, respectively. Specifically, we consider triples with the same subject and object to be candidate triples and use a relation mapping table to determine whether the triples match. The intersection for the SKE dataset has two values because the original relations have a one-to-many mapping with relations in CN-DBpedia. FNR stands for false negative rate, calculated as one minus the number of triples in Original (Relabeled) divided by that of the union.
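The FNR computation described in the caption reduces to set arithmetic over the two alignments. A minimal sketch, assuming the triples have already been normalized to a shared relation schema (the relation mapping table is abstracted away):

```python
# FNR estimate from two KB alignments of the same corpus (cf. Table 1):
# each side's FNR is 1 minus its share of the union of labeled triples.
def fn_rates(original: set, relabeled: set) -> tuple:
    union = original | relabeled
    return (1 - len(original) / len(union),   # FNR of the original labeling
            1 - len(relabeled) / len(union))  # FNR of the re-labeling

original = {("Buffett", "BIRTHPLACE", "Omaha"), ("Buffett", "BIRTHYEAR", "1930")}
relabeled = {("Buffett", "BIRTHPLACE", "Omaha"), ("Buffett", "RESIDENCE", "Omaha")}
print(fn_rates(original, relabeled))  # (0.333..., 0.333...)
```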

Addressing the Overwhelming Negative Labels
We point out that some of the previous paradigms designed for relation extraction aggravate the imbalance and lead to inefficient supervision. The mainstream approaches for relation extraction mainly fall into three paradigms depending on what to extract first.

P1 The first paradigm is a pipeline that begins with named entity recognition (NER) and then classifies each entity pair into different relations, i.e., [s, o then r]. It is adopted by many traditional approaches (Mintz et al., 2009; Chan and Roth, 2011; Zeng et al., 2014, 2015; Gormley et al., 2015; dos Santos et al., 2015; Lin et al., 2016).
P2 The second paradigm first detects all possible subjects in a sentence, then identifies objects with respect to each relation, i.e., [s then r, o]. Specific implementations include modeling relation extraction as multi-turn question answering (Li et al., 2019), span tagging, and cascaded binary tagging (Wei et al., 2020).
P3 The third paradigm first performs sentence-level relation detection (cf. P1, which operates at the entity-pair level) and then extracts subjects and objects, i.e., [r then s, o]. This paradigm is largely unexplored; based on our literature review, HRL (Takanobu et al., 2019) is hitherto the only work to apply it.
We provide a theoretical analysis of the output space and class prior of the three paradigms in Table 2, with statistical support from three datasets (see Section 5.1 for descriptions). The second step of P1 can be compared with the first step of P3: both find relations from a sentence (P1 with the target entity pair given). Suppose a sentence contains m entities; the classifier then has to decide relations for O(m^2) entity pairs, while in reality relations are often sparse, i.e., O(m). In other words, most entity pairs in P1 do not form a valid relation, resulting in a low class prior. The situation is even worse when the sentence contains more entities, as in NYT11-HRL. For P2, we demonstrate with the problem formulation of CasRel (Wei et al., 2020). The difference in the first-step class prior between P2 and P3 depends on the comparison between the number of relations and the average sentence length (i.e., |R| and N̄), which varies across scenarios/domains. However, π_2 of P2 is extremely low, as its classifier has to decide over a space of size |R| × N; in contrast, P3 only needs to decide over 4 × N under our task formulation (Section 3.1). A worked comparison follows the table caption below.

Other task formulations include jointly extracting the relation and entities (Yu and Lam, 2010; Li and Ji, 2014; Miwa and Sasaki, 2014; Gupta et al., 2016; Katiyar and Cardie, 2017; Ren et al., 2017), and more recently sequence tagging (Zheng et al., 2017) and sequence-to-sequence learning (Zeng et al., 2018b). In contrast to the aforementioned three paradigms, most of these methods actually provide an incomplete decision space that cannot handle all situations of relation extraction, for example, the overlapping one (Wei et al., 2020).

Table 2: Comparison of class priors under different relation extraction paradigms. |R| is the total number of relations and N̄ is the average sentence length. π_1 (π_2) refers to the class prior of the first (second) task in the pipeline. π_1 for the first paradigm is omitted because it is often considered a preceding step. ȳ is the summation of 1's in the labels, with which we intend to represent the information a positive sample conveys.
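As promised above, the contrast between the paradigms' decision spaces can be made concrete with simple arithmetic. The numbers below are hypothetical sentence statistics for illustration, not values from Table 2:

```python
# Decision-space sizes of the three paradigms for one sentence with
# m candidate entities, |R| relations, and N tokens (cf. Table 2).
def p1_pairs(m: int) -> int:
    return m * (m - 1)                # P1: classify every ordered entity pair, O(m^2)

def p2_tags(num_relations: int, n_tokens: int) -> int:
    return num_relations * n_tokens   # P2: relation-specific taggers over all tokens

def p3_tags(n_tokens: int) -> int:
    return 4 * n_tokens               # P3: 4 pointer sequences per detected relation

m, R, N = 6, 24, 40                   # hypothetical sentence statistics
print(p1_pairs(m), p2_tags(R, N), p3_tags(N))  # 30 960 160
```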

Framework of RERE
Given an instance (c_i, T_i) from D, the goal of training is to maximize the likelihood defined in Eq. (1). Applying the definition of conditional probability, it decomposes into two components, formulated in Eq. (2):

max_θ ∏_{(c_i, T_i) ∈ D} p(T_i | c_i; θ),    (1)

p(T_i | c_i; θ) = ∏_{r ∈ T_i} p(r | c_i; θ) × ∏_{r ∈ T_i} ∏_{⟨s,o⟩ ∈ T_i|r} p(⟨s, o⟩ | r, c_i; θ),    (2)

where we use r ∈ T_i as a shorthand for r ∈ {r | ⟨s, r, o⟩ ∈ T_i}, meaning that r occurs in the triple set w.r.t. c_i, and T_i|r denotes the subject-object pairs of the triples in T_i with relation r. 1[·] denotes an indicator function used below: 1[condition] = 1 when the condition holds. We denote by θ the model parameters. Under this decomposition, the relational triple extraction task is formulated as two subtasks: relation classification and entity extraction.
Relation Classification. As discussed above, building a relation classifier at the entity-pair level introduces excessive negative samples and forms a hard learning problem. We therefore model relation classification at the sentence level: intuitively speaking, we hope the model captures what relations a sentence is expressing. We formalize it as a multi-label classification task:

ŷ^j_rc = p(r_j | c; θ),    (3)

where ŷ^j_rc is the probability that c is expressing r_j, the j-th relation; y^j_rc is the ground truth from the labeled data, with y^j_rc = 1 equivalent to r_j ∈ T_i and y^j_rc = 0 meaning the opposite.
Entity Extraction. We then model the entity extraction task. We observe that, given the relation r and context c_i, it naturally forms a machine reading comprehension (MRC) task (Chen, 2018), where (r, c_i, s/o) fits the paradigm of (QUERY, CONTEXT, ANSWER). In particular, subjects and objects are contiguous spans of c_i, which falls into the category of span extraction. We adopt the boundary detection model with answer pointers (Wang and Jiang, 2017) as the output layer, which is widely used in MRC tasks. Formally, for a sentence of N tokens,

ŷ^{n,k}_ee = p(y^{n,k}_ee = 1 | r, c_i; θ), n = 1, ..., N, k ∈ K,    (4)

where K = {s_start, s_end, o_start, o_end} represents the identifiers of the four pointers, and ŷ^{n,k}_ee refers to the probability of the n-th token being the start/end of the subject/object. y^{n,k}_ee is the ground truth from the training data: if there exists a subject s of relation r in T_i occurring in c_i at positions n to n + l, then y^{n,s_start}_ee = 1 and y^{n+l,s_end}_ee = 1, and 0 otherwise; the same applies for objects.
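To make the pointer supervision concrete, the sketch below builds the N × 4 label matrix y_ee for one (relation, sentence) pair. It is a simplified illustration assuming word-level tokens and a single occurrence per span; the names are ours:

```python
import numpy as np

# Columns of y_ee: (s_start, s_end, o_start, o_end), cf. Eq. (4).
def pointer_labels(tokens, subject, obj):
    """Mark the start/end positions of the subject and object spans."""
    y = np.zeros((len(tokens), 4), dtype=int)

    def mark(span, start_col, end_col):
        for n in range(len(tokens) - len(span) + 1):
            if tokens[n:n + len(span)] == span:
                y[n, start_col], y[n + len(span) - 1, end_col] = 1, 1
                return

    mark(subject, 0, 1)
    mark(obj, 2, 3)
    return y

tokens = "Buffett was born in Omaha , Nebraska .".split()
print(pointer_labels(tokens, ["Buffett"], ["Omaha", ",", "Nebraska"]))
```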

Advantages
Our task formulation has several advantages. By adopting P3 as the paradigm, the first and foremost advantage of our solution is that it suffers less from imbalanced classes (Section 2.2). Secondly, relation-level false negatives are easy to recover: when modeled as a standard classification problem, many off-the-shelf methods for positive-unlabeled learning can be leveraged. Thirdly, entity-level false negatives do not affect relation classification. Taking S5 in Figure 1 as an example, even though the BIRTHPLACE relation between WILLIAM SWARTZ and SCRANTON is missing, the relation classifier can still capture the signal from another sample with the same relation, i.e., ⟨JOE BIDEN, BIRTHPLACE, SCRANTON⟩. Fourthly, this kind of modeling is easy to update with new relations, without retraining the model from scratch: only the relation classifier needs to be redesigned, while the entity extractor can be updated in an online manner without modifying the model structure. Last but not least, the relation classifier can be regarded as a pruning step in practical applications. Many existing methods treat relation extraction as question answering (Li et al., 2019; Zhao et al., 2020); however, without first identifying the relation, they must iterate over all possible relations and ask diverse questions, resulting in extremely low efficiency, where the prediction time for one sample may be up to |R| times that of our method.

Our Model
The relational triple extraction task decomposed in Eq. (2) inspires us to design a two-staged pipeline, in which we first detect relations at the sentence level and then extract subjects/objects for each relation. The overall architecture of RERE is shown in Figure 2.

Sentence Classifier with Relational Label
We first detect relations at the sentence level. The input is a sequence of tokens c, and we denote by ŷ_rc = [ŷ^1_rc, ŷ^2_rc, ..., ŷ^{|R|}_rc] the output vector of the model, which estimates ŷ^j_rc in Eq. (3). To encode the inputs, we use pre-trained language models with a multi-layer bidirectional Transformer structure (Vaswani et al., 2017): BERT (Devlin et al., 2019) for English and RoBERTa for Chinese. Specifically, the input sequence x_rc = [[CLS], c_i, [SEP]] is fed into BERT to generate a token representation matrix H_rc ∈ R^{N×d}, where d is the hidden dimension defined by the pre-trained Transformer. We take h^0_rc, the encoded vector of the first token [CLS], as the representation of the sentence. The final output of the relation classification module, ŷ_rc, is defined in Eq. (5):

ŷ_rc = σ(W_rc h^0_rc + b_rc),    (5)
where W_rc and b_rc are trainable model parameters, representing the weights and bias, respectively, and σ denotes the sigmoid activation function.
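A minimal PyTorch sketch of this module (Eqs. (3) and (5)) follows. Class and variable names are ours, not those of the released implementation:

```python
import torch
from torch import nn
from transformers import BertModel

class RelationClassifier(nn.Module):
    """Sentence-level multi-label relation classifier: a sigmoid over a
    linear layer applied to the [CLS] representation, cf. Eq. (5)."""
    def __init__(self, num_relations: int, encoder_name: str = "bert-base-cased"):
        super().__init__()
        self.encoder = BertModel.from_pretrained(encoder_name)
        d = self.encoder.config.hidden_size
        self.linear = nn.Linear(d, num_relations)   # W_rc, b_rc

    def forward(self, input_ids, attention_mask):
        h = self.encoder(input_ids, attention_mask=attention_mask).last_hidden_state
        h_cls = h[:, 0]                              # h^0_rc: the [CLS] vector
        return torch.sigmoid(self.linear(h_cls))    # y_hat_rc in [0, 1]^{|R|}
```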

Relation-specific Entity Extractor
After relations are detected at the sentence level, we extract the subjects and objects for each candidate relation. We aim to estimate ŷ_ee ∈ [0, 1]^{N×4}, each element of which corresponds to ŷ^{n,k}_ee in Eq. (4), using a deep neural model. We take ŷ_rc, the binarized output vector of the relation classifier, and generate query tokens q for each of the detected relations (i.e., the "1"s in ŷ_rc). We are aware that many recent works (Li et al., 2019; Zhao et al., 2020) have studied how to generate diverse queries for a given relation, which has the potential of achieving better performance; nevertheless, that is beyond the scope of this paper. To keep things simple, we use the surface text of the relation as the query.
Next, the input sequence is constructed as x_ee = [[CLS], q, [SEP], c_i, [SEP]]. As in Section 4.1, we obtain the token representation matrix H_ee ∈ R^{N×d} from BERT. The k-th output pointer of the entity extractor is defined by

ŷ^{n,k}_ee = σ(W^k_ee h^n_ee + b^k_ee),    (6)

where k ∈ {s_start, s_end, o_start, o_end} is in accordance with Eq. (4), h^n_ee is the n-th row of H_ee, and W^k_ee and b^k_ee are the corresponding parameters.
The final subject/object spans are generated by pairing each s_start/o_start with the nearest following s_end/o_end. Next, each object is paired with the nearest subject: if multiple objects occur before the next subject appears, all of them are paired with the current subject until the next subject occurs.
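This decoding heuristic can be sketched as follows. It is our simplified reading of the pairing rules, with a conventional 0.5 threshold assumed on the pointer probabilities:

```python
def decode_spans(starts, ends, threshold=0.5):
    """Pair each predicted start with the nearest end at or after it."""
    end_positions = [j for j, p in enumerate(ends) if p > threshold]
    spans = []
    for i, p in enumerate(starts):
        if p > threshold:
            following = [j for j in end_positions if j >= i]
            if following:
                spans.append((i, following[0]))
    return spans

def pair_subjects_objects(subjects, objects):
    """Attach each object to the nearest preceding subject; objects that
    appear before the next subject are all paired with the current one."""
    if not subjects:
        return []
    pairs = []
    for obj in sorted(objects):
        preceding = [s for s in sorted(subjects) if s[0] <= obj[0]]
        anchor = preceding[-1] if preceding else sorted(subjects)[0]
        pairs.append((anchor, obj))
    return pairs
```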

Figure 2: The overall architecture of RERE, illustrated on the sentence "The comic book character Aurakles was created by American artist Dick Dillin." In this example, two relations, NATIONALITY and CREATOR, are found by the Relation Classifier and sent to the Entity Extractor one by one along with the sentence. When the relation NATIONALITY is processed, the Entity Extractor finds the positions of the subject and object of NATIONALITY, i.e., the words AMERICAN and DICK DILLIN. The relation CREATOR is then handled similarly. The values of the grey blocks in ŷ_ee are zero.

Multi-label Collective Loss function
In normal cases, the log-likelihood is taken as the learning objective. However, as emphasized above, there exist many false negative samples in the training data. Intuitively speaking, the negative labels cannot simply be treated as negative; a small portion of them should instead be considered unlabeled positives, and their influence on the penalty should be removed. We therefore adopt cPU (Xie et al., 2020), a collective loss function designed for positive-unlabeled (PU) learning. Briefly, cPU takes as the learning objective the correctness under a surrogate function, ℓ(ŷ, y) = ln(c(ŷ, y)), where the correctness function c(ŷ, y) is redefined for PU learning to account for µ, the ratio of false negative data (i.e., the unlabeled positives in the original paper).
We extend it to the multi-label situation by taking the original expectation at the sample level. Because the class labels are highly imbalanced in our tasks, we introduce a class weight γ ∈ (0, 1) to down-weight the positive penalty; the same loss form is applied to both the relation classifier (over ŷ_rc) and the entity extractor (over ŷ_ee). In practice, we set µ = π(τ + 1), where τ ≈ 1 − (# labeled positives)/(# all positives) is the ratio of false negatives and π is the class prior. Note that µ is not difficult to estimate for both the relation classification and entity extraction tasks in practice: besides the various estimation methods in the PU learning literature (du Plessis et al., 2015; Bekker and Davis, 2018), an easy approximation is µ ≈ π when τ is small, which happens to be the case for our tasks.
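The exact correctness function is given in Xie et al. (2020); the sketch below only illustrates the general shape described in this section, i.e., a class-weighted binary cross-entropy whose negative side is treated collectively and pulled toward the unlabeled-positive ratio µ rather than toward zero. It is our paraphrase, not the paper's exact loss:

```python
import torch

def pu_multilabel_loss(y_hat, y, mu=0.05, gamma=0.1, eps=1e-8):
    """Illustrative multi-label PU loss (our paraphrase of the idea, not the
    exact cPU objective). Positives get the usual log-loss, down-weighted by
    gamma; negatives are penalized collectively: their mean prediction is
    encouraged to equal mu, the assumed ratio of unlabeled positives."""
    pos = -(y * torch.log(y_hat + eps)).sum()
    neg_mask = 1 - y
    neg_mean = (y_hat * neg_mask).sum() / (neg_mask.sum() + eps)
    neg = -torch.log(1 - torch.abs(neg_mean - mu) + eps)
    return gamma * pos + neg
```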

Datasets
Our experiments are conducted on four datasets; statistics are provided in Table 1 and Table 2. In relation extraction, datasets with the same name sometimes involve different preprocessing, which leads to unfair comparisons, so we briefly review all the datasets below and specify the operations to perform before applying each dataset. We do not use WebNLG (Gardent et al., 2017) or ACE04, because these datasets are not automatically labeled by distant supervision: WebNLG is constructed by natural language generation from triples, and ACE04 is manually labeled.
• NYT (Riedel et al., 2010). NYT is the very first version among all the NYT-related datasets, built from New York Times articles (https://www.nytimes.com/). We use its sentences to conduct the pilot experiment in Table 1. However, 1) it contains duplicate samples, e.g., 1,504 in the training set; and 2) it only labels the last word of each entity, which can mislead evaluation results.
• NYT10-HRL & NYT11-HRL. These two datasets are based on NYT; the difference is that both contain complete entity mentions. NYT10 (Riedel et al., 2010) is the original version, and NYT11 (Hoffmann et al., 2011) is a smaller version of NYT10 with 53,395 training samples and a manually labeled test set of 368 samples. We refer to them as NYT10-HRL and NYT11-HRL after the preprocessing by HRL (Takanobu et al., 2019), which removed 1) training relations not appearing in the test set and 2) "NA" sentences. These two steps are adopted by almost all the compared methods, so we use this version in our evaluations for fair comparison.
• NYT21. We provide a relabeled version of the test set of NYT11-HRL, which still suffers from the false negative problem: most of its samples have only one labeled relation. We manually added the missing triples back to the test set.
• SKE2019/SKE21. SKE2019 is a Chinese dataset published by Baidu. We adopt it because it is currently the largest dataset available for relation extraction, with 194,747 sentences in the training set and 21,639 in the validation set. We manually labeled 1,150 sentences from the test set with 2,765 annotated triples, which we refer to as SKE21. No preprocessing is needed for this dataset. We provide this data for future research.

Compared Methods and Metrics
We evaluate our model by comparing with several models on the same datasets: the graphical model MultiR (Hoffmann et al., 2011), the joint models SPTree (Miwa and Bansal, 2016) and NovelTagging (Zheng et al., 2017), and the recent strong state-of-the-art models CopyR (Zeng et al., 2018b), HRL (Takanobu et al., 2019), CasRel (Wei et al., 2020), and TPLinker. We also provide the result of automatically aligning Wikidata/CN-DBpedia with the corpus, namely Match, as a baseline; note that we only keep the intersected relations, since otherwise the false negatives in the original dataset would result in low precision. We report standard micro precision (Prec.), recall (Rec.), and F1 score for all experiments. Following previous works (Takanobu et al., 2019; Wei et al., 2020), we adopt partial match on these datasets for fair comparison. We also provide exact-match results for the methods we implemented, and only exact match on SKE2019.
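For reference, the micro-averaged metrics reduce to set arithmetic over extracted triples. A minimal sketch, where "partial match" follows one common convention of comparing only entity head words (an assumption for illustration, not necessarily the exact protocol of each cited work):

```python
def micro_prf(pred: set, gold: set):
    """Micro precision/recall/F1 over (subject, relation, object) triples."""
    tp = len(pred & gold)
    p = tp / len(pred) if pred else 0.0
    r = tp / len(gold) if gold else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f

def head_word(triple):
    """Reduce entities to their last word for partial matching."""
    s, rel, o = triple
    return (s.split()[-1], rel, o.split()[-1])

pred = {("Warren Buffett", "BIRTHPLACE", "Omaha")}
gold = {("Buffett", "BIRTHPLACE", "Omaha")}
print(micro_prf(pred, gold))                    # exact match: F1 = 0.0
print(micro_prf({head_word(t) for t in pred},
                {head_word(t) for t in gold}))  # partial match: F1 = 1.0
```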

Overall Comparison
We show the overall comparison results in Table 3. First, we observe that RERE consistently outperforms all compared models. Interestingly, purely aligning the database with the corpus (Match) already achieves a surprisingly good overall result (surpassing MultiR) and relatively high precision (comparable to CoType on NYT11-HRL). However, its recall is quite low, which is consistent with our discussion in Section 2.1 that distant supervision leads to many false negatives. We also provide an ablation result where BERT is replaced with a bidirectional LSTM encoder (Graves et al., 2013) with randomly initialized weights. Even without BERT, our framework achieves competitive results against previous approaches such as CoType and CopyR, which further proves the effectiveness of the RERE framework.

How Robust is RERE against False Negatives?
To further study how our model behaves when the training data includes different quantities of false negatives, we conduct experiments on synthetic datasets. We construct new training sets by randomly removing triples with probabilities of 0.1, 0.3, and 0.5, simulating different FN rates (a sketch of this corruption follows at the end of this subsection). Figure 3 shows the precision-recall curves of our method in comparison with CasRel (Wei et al., 2020), the best-performing competitor. 1) The overall performance of RERE is superior to competitor models even when trained on a dataset with a 0.5 FN rate. 2) The intervals between RERE's curves are smaller than those of CasRel, indicating a smaller performance decline under increasing FN rates.
3) The straight line before the curves of our model means that there is no data point where recall is very low, suggesting that our model is insensitive to the decision boundary and thus more robust.
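As referenced above, the synthetic corruption amounts to independently dropping each gold triple with probability p. A minimal sketch (names are ours):

```python
import random

def inject_false_negatives(dataset, p, seed=0):
    """Simulate FN noise: drop each labeled triple independently with
    probability p; a sentence may end up with an empty triple set ("NA")."""
    rng = random.Random(seed)
    return [(sentence, [t for t in triples if rng.random() >= p])
            for sentence, triples in dataset]

train = [("Buffett was born in 1930 in Omaha, Nebraska.",
          [("Buffett", "BIRTHPLACE", "Omaha, Nebraska"),
           ("Buffett", "BIRTHYEAR", "1930")])]
print(inject_false_negatives(train, p=0.5))
```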

Conclusion
In this paper, we revisit the negative data in the relation extraction task. We first show that the false negative rate is largely underestimated in previous research. We then systematically compare three commonly adopted paradigms and show that our chosen paradigm suffers less from the overwhelming negative labels. Based on this advantage, we propose RERE, a pipelined framework that first detects relations at the sentence level and then extracts entities for each specific relation, together with a multi-label PU learning loss that recovers false negatives. Empirical results show that RERE consistently outperforms the existing state of the art by a considerable margin, even when learned with large false negative rates.