SENT: Sentence-level Distant Relation Extraction via Negative Training

Distant supervision for relation extraction provides uniform bag labels for each sentence inside the bag, while accurate sentence labels are important for downstream applications that need the exact relation type. Directly using bag labels for sentence-level training introduces much noise, thus severely degrading performance. In this work, we propose the use of negative training (NT), in which a model is trained using complementary labels based on the premise that “the instance does not belong to these complementary labels”. Since the probability of selecting a true label as a complementary label is low, NT provides less noisy information. Furthermore, a model trained with NT is able to separate the noisy data from the training data. Based on NT, we propose a sentence-level framework, SENT, for distant relation extraction. SENT not only filters the noisy data to construct a cleaner dataset, but also performs a re-labeling process to transform the noisy data into useful training data, thus further benefiting the model’s performance. Experimental results show the significant improvement of the proposed method over previous methods in terms of both sentence-level evaluation and denoising effect.


Introduction
Relation extraction (RE), which aims to extract the relation between entity pairs from unstructured text, is a fundamental task in natural language processing. The extracted relation facts can benefit various downstream applications, e.g., knowledge graph completion (Bordes et al., 2013;Wang et al., 2014), information extraction (Wu and Weld, 2010) and question answering (Yao and Van Durme, 2014;Fader et al., 2014).
A significant challenge for relation extraction is the lack of large-scale labeled data. Thus, distant
Figure 1: Two types of noise exist in bag-level labels: 1) Multi-label noise: the exact label ("place of birth" or "employee of") for each sentence is unclear; 2) Wrong-label noise: the third sentence inside the bag actually expresses "live in" which is not included in the bag labels.
supervision (Mintz et al., 2009) is proposed to gather training data through automatic alignment between a database and plain text. Such an annotation paradigm results in an inevitable noise problem, which previous studies alleviate using multi-instance learning (MIL). In MIL, the training and testing processes are performed at the bag level, where a bag contains noisy sentences mentioning the same entity pair but possibly not describing the same relation. Studies using MIL can be broadly classified into two categories: 1) soft de-noise methods that leverage soft weights to differentiate the influence of each sentence (Lin et al., 2016; Han et al., 2018c; Li et al., 2020; Hu et al., 2019a; Ye and Ling, 2019; Yuan et al., 2019a,b); 2) hard de-noise methods that remove noisy sentences from the bag (Zeng et al., 2015; Qin et al., 2018; Han et al., 2018a; Shang, 2019). However, these bag-level approaches fail to assign an explicit label to each sentence inside a bag. This problem limits the application of RE in downstream tasks that require sentence-level relation types, e.g., Yao and Van Durme (2014) and Xu et al. (2016) use sentence-level relation extraction to identify the relation between the answer and the entity in the question. Therefore, several studies (Jia et al., 2019; Feng et al., 2018) have made efforts on sentence-level (or instance-level) distant RE, empirically verifying the deficiency of bag-level methods on sentence-level evaluation. However, the instance selection approaches of these methods depend on rewards (Feng et al., 2018) or frequent patterns (Jia et al., 2019) determined by bag-level labels, which contain much noise. For one thing, one bag might be assigned multiple bag labels, leading to difficulties in one-to-one mapping between sentences and labels. As shown in Fig.1, we have no access to the exact relation between "place of birth" and "employee of" for the sentence "Obama was born in the United States.".
For another, the sentences inside a bag might not express the bag relations. In Fig.1, the sentence "Obama was back to the United States yesterday" actually expresses the relation "live in", which is not included in the bag labels.
In this work, we propose the use of negative training (NT) (Kim et al., 2019) for distant RE. Different from positive training (PT), NT trains a model by selecting complementary labels of the given label, based on the premise that "the input sentence does not belong to this complementary label". Since the probability of selecting a true label as a complementary label is low, NT decreases the risk of providing noisy information and prevents the model from overfitting the noisy data. Moreover, a model trained with NT is able to separate the noisy data from the training data (a histogram in Fig.3 shows the separated data distribution during NT). Based on NT, we propose SENT, a sentence-level framework for distant RE. During SENT training, the noisy instances are not only filtered with a noise-filtering strategy, but also transformed into useful training data with a re-labeling method. We further design an iterative training algorithm to take full advantage of these data-refining processes, which significantly boosts performance. Our code is publicly available on GitHub.
To summarize the contributions of this work: • We propose the use of negative training for sentence-level distant RE, which greatly protects the model from noisy information.
• We present a sentence-level framework, SENT, which includes a noise-filtering and a re-labeling strategy for refining distant data.
• The proposed method achieves significant improvement over previous methods in terms of both RE performance and denoising effect.

Figure 2: Overview of the SENT framework: (1) Negative training for separating the noisy data from the training data; (2) Noise-filtering and re-labeling; (3) Iterative training to further boost the performance.

Related Work

Sentence-Level Distant Relation Extraction

In this work, we focus on sentence-level relation extraction. Several previous studies also perform distant RE at the sentence level. Feng et al. (2018) propose a reinforcement learning framework for sentence selection, where the reward is given by the classification scores on bag labels. Jia et al. (2019) build an initial training set and further select confident instances based on selected patterns. The difference between the proposed work and previous works is that we do not rely on bag-level labels for sentence selection. Furthermore, we leverage NT to dynamically separate the noisy data from the training data, and thus can make use of diversified clean data.

Learning with Noisy Data
Learning with noisy data is a widely discussed problem in deep learning, especially in the field of computer vision. Existing approaches include robust learning methods such as leveraging a robust loss function or regularization method (Lyu and Tsang, 2020; Zhang and Sabuncu, 2018; Hu et al., 2019b; Kim et al., 2019), re-weighting the loss of potential noisy samples (Ren et al., 2018; Jiang et al., 2018), and modeling the corruption probability with a transition matrix (Goldberger and Ben-Reuven, 2016; Xia et al.). Another line of research tries to recognize or even correct the noisy instances in the training data (Malach and Shalev-Shwartz, 2017; Arazo et al., 2019). In this paper, we focus on the noisy label problem in distant RE. We first leverage a robust negative loss (Kim et al., 2019) for model training. Then, we develop a new iterative training algorithm for noise selection and correction.

Methodology
In order to achieve sentence-level relation classification using bag-level labels in distant RE, we propose a framework, SENT, which contains three main steps (as shown in Fig.2): (1) Separating the noisy data from the training data with negative training (Sec.3.1); (2) Filtering the noisy data as well as re-labeling a part of confident instances (Sec.3.2); (3) Leveraging an effective training algorithm based on (1) and (2) to further boost the performance (Sec.3.3).
Specifically, we denote the input data in this task as S* = {(s_1, y*_1), . . . , (s_N, y*_N)}, where y*_i ∈ {1, . . . , C} is the bag-level label of the i-th input sentence s_i. Obviously, this is a noisy dataset drawn from a noisy distribution D*, because these bag-level labels y* come from the distant label of each entity bag. For each s_i containing a pair of entities <e_1, e_2>, y*_i is one of the relation facts that <e_1, e_2> participates in according to the database. Such an annotation method indicates that y*_i is a potentially noisy label for s_i. Here, we denote D as the real data distribution without noise, and the clean dataset drawn from D as S = {(s_1, y_1), . . . , (s_N, y_N)}. The ambition of this work is to find the best estimated parameters θ of the real mapping f : x → y, (x, y) ∈ D based on the noisy data S*. We design three steps for achieving this goal: (1) Recognizing the set of noisy data S*_n from S* using negative training; (2) Refining S* by noise-filtering and re-labeling; (3) Iterating (1) and (2) so that the refined dataset S*_refined approaches the real dataset S.
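As a concrete sketch of this setup (the class and field names below are ours, not from the paper), the noisy dataset S* can be represented as a list of distantly labeled instances:

```python
from dataclasses import dataclass

@dataclass
class DistantInstance:
    sentence: str      # s_i, mentioning an entity pair <e1, e2>
    head: str          # e1
    tail: str          # e2
    bag_label: int     # y*_i in {1, ..., C}: distant, possibly noisy

# S*: drawn from the noisy distribution D*; the goal is to fit
# f: sentence -> relation as if trained on the clean S ~ D.
noisy_data = [DistantInstance("Obama was born in the United States.",
                              "Obama", "United States", bag_label=2)]
```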

Negative Training on Distant Data
In order to perform robust training on the noisy distant data, we propose the use of negative training (NT), which trains based on the concept that "the input sentence does not belong to this complementary label". We find that NT not only provides less noisy information, but also separates the noisy and clean data during training.

Figure 3: Confidence distributions of the training data: during NT, the confidence of the noisy data is much lower than that of the clean data; (c) after training with the SENT method, the clean and noisy data are further separated; (d) PT after SENT helps improve the convergence of the clean data.

Positive Training
Positive training (PT) trains the model towards predicting the given label, based on the concept that "the input sentence belongs to this label".
Here, given any input s with a label y* ∈ R = {1, 2, . . . , C}, y ∈ {0, 1}^C is the C-dimensional one-hot vector of y*. We denote p = f(s) as the probability vector of a sentence given by a relation classifier f(·). With the cross entropy loss function, the loss defined in typical positive training is:

L_PT(f, y) = − Σ_{k=1}^{C} y_k log(p_k),   (1)

where p_k denotes the probability of the k-th label. Optimizing Eq.1 meets the requirement of PT, as the probability of the given label approaches 1 with the loss decreasing.
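A minimal NumPy sketch of the PT objective (our own illustration, not the authors' code): with a one-hot label, the cross entropy reduces to the negative log-probability of the given label.

```python
import numpy as np

def softmax(logits):
    """Convert unnormalized scores (batch, C) to probability vectors p."""
    z = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def pt_loss(logits, labels):
    """Positive training: -log p_{y*}, averaged over the batch."""
    p = softmax(np.asarray(logits, dtype=float))
    return float(-np.log(p[np.arange(len(labels)), labels]).mean())
```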

Negative Training
In negative training (NT), for each input s with a label y* ∈ R, we generate a complementary label ȳ by randomly sampling from the label space except y*, i.e., ȳ ∈ R\{y*}. With the cross entropy loss function, we define the loss in negative training as:

L_NT(f, ȳ) = − Σ_{k=1}^{C} ȳ_k log(1 − p_k),   (2)

where ȳ_k is the k-th entry of the one-hot vector of ȳ. Different from PT, Eq.2 aims to reduce the probability value of the complementary label, as p_ȳ → 0 with the loss decreasing.
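The complementary-label sampling and the NT objective can be sketched as follows (our own NumPy illustration; `sample_complementary` is a hypothetical helper name). A confident, correct classifier assigns near-zero probability to any complementary label, so its NT loss is near zero.

```python
import numpy as np

def _softmax(logits):
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def sample_complementary(labels, num_classes, rng):
    """Draw ybar uniformly from {0..C-1} \ {y*} for each instance."""
    shift = rng.integers(1, num_classes, size=len(labels))  # 1..C-1
    return (labels + shift) % num_classes                   # never == y*

def nt_loss(logits, labels, rng=None):
    """Negative training: -log(1 - p_{ybar}), driving p_{ybar} -> 0."""
    rng = rng or np.random.default_rng(0)
    p = _softmax(np.asarray(logits, dtype=float))
    ybar = sample_complementary(np.asarray(labels), p.shape[1], rng)
    return float(-np.log(1.0 - p[np.arange(len(labels)), ybar]).mean())
```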
To further illustrate the effect of NT, we train the classifier with PT and NT respectively on a constructed TACRED dataset with 30% noise (details in Sec.4.1). A histogram of the training data after PT and NT is shown in Figs. 3(a),(b), which reveals that, when training with PT, the confidence of clean data and noisy data increases without any difference, causing the model to overfit the noisy training data. On the contrary, when training with NT, the confidence of the noisy data is much lower than that of the clean data. This result confirms that a model trained with NT suffers less from overfitting noisy data, since less noisy information is provided. Moreover, as the confidence values of clean data and noisy data separate from each other, we are able to filter noisy data with a certain threshold. Fig.4 shows the details of the data-filtering effect. After the first iteration of NT, a modest threshold contributes to 97% precision noise-filtering with about 50% recall, which further verifies the effectiveness of NT on noisy data training.

Noise Filtering and Re-labeling
In Section 3.1, we illustrated the effectiveness of NT on training with noisy data, as well as its capability to recognize noisy instances. While filtering noisy data is important for training on distant data, the filtered data contain useful information that can boost performance if properly re-labeled. In this section, we describe the proposed noise-filtering and label-recovering strategies for refining distant data based on NT.

Filtering Noisy Data
As discussed before, it is intuitive to construct a filtering strategy based on a certain threshold after NT. However, in distant RE, the long-tail problem cannot be neglected. During training, the degree of convergence is disparate among different classes. Simply setting a uniform threshold might harm the data distribution, with instances of long-tail relations largely filtered out. Therefore, we leverage a dynamic threshold for filtering noisy data. Suppose the probability of class c of the i-th instance is p^c_i ∈ (0, p^c_h), where p^c_h is the maximum probability value in class c. Based on empirical experience, we assume the probability values follow a distribution where the noisy data are largely distributed in low-value areas and the clean data are generally distributed in middle- or high-value areas. Therefore, the filtering threshold of class c is set to:

T_c = T_h · p^c_h,   (3)

where T_h is a global threshold. In this way, the noise-filtering threshold not only relies on the degree of convergence in each class, but also dynamically changes during the training phase, thus making it more suitable for noise-filtering on long-tail data.
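Under one reading of this dynamic thresholding (a per-class cutoff scaled by the class's maximum observed confidence p^c_h), the filter might be sketched as follows; this is our own illustration, and the paper's exact formula may differ in detail.

```python
import numpy as np

def dynamic_filter(probs, labels, th=0.25):
    """Mark instance i as noisy if p_i^c < th * p_h^c, where c is its
    (distant) label and p_h^c is the max confidence observed in class c."""
    probs = np.asarray(probs, dtype=float)
    labels = np.asarray(labels)
    conf = probs[np.arange(len(labels)), labels]   # p_i^c for each instance
    keep = np.ones(len(labels), dtype=bool)
    for c in np.unique(labels):
        mask = labels == c
        p_h = conf[mask].max()                     # class-wise maximum
        keep[mask] = conf[mask] >= th * p_h        # dynamic per-class cutoff
    return keep                                    # True = kept as clean
```

Because the cutoff scales with each class's own convergence, a weakly converged long-tail class gets a proportionally lower threshold instead of being wiped out by a uniform one.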

Re-labeling Useful Data
After noise-filtering, the noisy instances are regarded as unlabeled data, which still contain useful information for training. Here, we design a simple strategy for re-labeling these unlabeled data. Given the set of filtered data D_u = {s_1, . . . , s_m}, we use the classifier trained in this iteration to predict the probability vectors {p_1, . . . , p_m}. Then, we re-label these instances by:

ỹ_i = argmax_k p^k_i,  if max_k p^k_i > Th_relabel,   (4)

where p^k_i is the probability of the i-th instance in class k, and Th_relabel is the re-label threshold.
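A minimal sketch of this confidence-based re-labeling rule (our own code; we use -1 to mark instances that stay unlabeled):

```python
import numpy as np

def relabel(probs, th_relabel=0.7):
    """Assign the argmax label to a filtered instance only when the
    classifier is confident enough; return -1 (still unlabeled) otherwise."""
    probs = np.asarray(probs, dtype=float)
    best = probs.argmax(axis=-1)                 # candidate label per row
    confident = probs.max(axis=-1) > th_relabel  # confidence gate
    return np.where(confident, best, -1)
```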

Iterative Training Algorithm
Although effective, simply performing a pipeline of NT, noise-filtering and re-labeling fails to take full advantage of each part; the model performance can be further boosted through iterative training.
As shown in Fig.2, for each iteration, we first train the classifier on the noisy data using NT: for each instance, we randomly sample K complementary labels and calculate the loss on these labels with Eq.(2). After M epochs of negative training, the noise-filtering and re-labeling processes are carried out to update the training data. Next, we perform a new iteration of training on the newly refined data. Here, we re-initialize the classifier in every iteration for two reasons: First, re-initialization ensures that in each iteration, the new classifier is trained on a dataset with higher quality. Second, re-initialization introduces randomness, thus contributing to more robust data-filtering. Finally, we stop the iteration after observing the best result on the dev set. We then perform a round of noise-filtering and re-labeling with the best model in the last iteration to obtain the final refined data. Fig.3(c) shows the data distribution after several iterations of SENT. As seen, the noisy and clean data are separated by a large margin. Most noisy data are successfully filtered out, with an acceptably small number of clean data mistaken. However, we can see that the model trained with NT still lacks convergence (with low-confidence predictions). Therefore, we train the classifier on the iteratively refined data with PT for better convergence. As shown in Fig.3(d), the model predictions on most of the clean data have high confidence after PT training.
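The overall loop can be summarized with the following skeleton (our own sketch: the callables are stand-ins for the components described above, and dev-set early stopping is reduced to a fixed iteration count):

```python
def sent_iterative_training(data, init_model, train_nt, filter_noise,
                            relabel, train_pt, num_iters=3):
    """SENT skeleton: NT -> noise-filtering -> re-labeling each iteration,
    re-initializing the classifier every time, then a final PT pass."""
    for _ in range(num_iters):
        model = init_model()                 # re-initialize every iteration
        model = train_nt(model, data)        # M epochs of negative training
        clean, noisy = filter_noise(model, data)
        recovered = relabel(model, noisy)    # confident noisy -> new labels
        data = clean + recovered             # refined training set
    final = train_pt(init_model(), data)     # PT for better convergence
    return final, data
```

The re-initialization step is what keeps later iterations from inheriting parameters overfitted to noise that has since been filtered out.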

Experiments
The experiments in this work are divided into two parts, respectively conducted on two datasets: the NYT-10 dataset (Riedel et al., 2010) and the TACRED dataset (Zhang et al., 2017).
The first part is the effectiveness study on sentence-level evaluation for distant RE. Different from bag-level evaluation, a sentence-level evaluation computes the Precision (Prec.), Recall (Rec.) and F1 metrics directly on all of the individual instances in the dataset. In this part, we adopt the NYT-10 dataset for sentence-level training, following the setting of Jia et al. (2019), who published a manually labeled sentence-level test set. Besides, they also published a test set for evaluating noise-filtering ability. Details of the adopted dataset are shown in Table 1.
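Sentence-level evaluation here amounts to micro precision/recall/F1 over individual instances; a minimal sketch under the common convention of excluding the "NA" class from positives (our own assumption about the metric details):

```python
def sentence_level_prf(preds, golds, na=0):
    """Micro P/R/F1 over individual sentences, ignoring the "NA" class
    (a prediction only counts as a true positive when it is not NA)."""
    tp = sum(p == g != na for p, g in zip(preds, golds))
    pred_pos = sum(p != na for p in preds)   # predicted positives
    gold_pos = sum(g != na for g in golds)   # gold positives
    prec = tp / pred_pos if pred_pos else 0.0
    rec = tp / gold_pos if gold_pos else 0.0
    f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return prec, rec, f1
```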
We construct the second part of experiments (Sec.4.4) to better understand SENT's behaviors. Since no labeled training data are available in the distant supervision setting, we construct a noisy dataset with 30% noise from a labeled dataset, TACRED (Zhang et al., 2017). The reason we choose this dataset is that 80% of the instances in the training data are "no relation". This "NA" rate is similar to the NYT data, which contains 70% "NA" relation type; thus, analysis on this dataset is more credible.

Table 1: Statistics of datasets. "Positive" means positive instances that are not labeled as "NA". Note that the positive instances of noisy-TACRED include false-positive noise, and the noise number in NYT-10 is unknown due to the inaccurate annotations.
When constructing noisy-TACRED, the noisy instances are uniformly selected with a 30% noise ratio. Then, each noisy label is created by sampling a label from the complementary classes with weights proportional to class frequency (in order to maintain the data distribution). Note that the original dataset consists of 80% "no relation" data, which means 80% of the noisy instances are "false-positive" instances, corresponding to the large amount of "false-positive" noise in NYT-10. Details of noisy-TACRED are also shown in Table 1.
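The noisy-TACRED construction described above might be sketched as follows (our own code; the exact sampling details in the paper may differ):

```python
import numpy as np

def corrupt_labels(labels, num_classes, noise_ratio=0.3, seed=0):
    """Build a noisy-TACRED-style dataset: flip a uniform noise_ratio of
    instances to a different class, sampled with probability proportional
    to class frequency (roughly preserving the label distribution)."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels).copy()
    freq = np.bincount(labels, minlength=num_classes).astype(float)
    noisy_idx = rng.choice(len(labels), size=int(noise_ratio * len(labels)),
                           replace=False)
    for i in noisy_idx:
        w = freq.copy()
        w[labels[i]] = 0.0                  # complementary classes only
        labels[i] = rng.choice(num_classes, p=w / w.sum())
    return labels, noisy_idx
```

Because class 0 ("no relation") dominates the original distribution, most flipped instances move a positive label onto an originally negative sentence or vice versa, mirroring the false-positive noise in NYT-10.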

Baselines
We compare our SENT method with several strong baselines in distant RE. These compared methods can be categorized as bag-level denoising methods, sentence-level denoising methods, and sentence-level non-denoising methods.
PCNN+SelATT (Lin et al., 2016): A bag-level RE model which leverages an attention mechanism to reduce the noise effect.
PCNN+RA_BAG_ATT (Ye and Ling, 2019): Short for PCNN+ATT_RA+BAG_ATT, a bag-level model containing both intra-bag and inter-bag attention to alleviate noise.
CNN+RL_1 (Qin et al., 2018): An RL-based bag-level method. Different from CNN+RL_2, it redistributes the filtered data into the negative examples.
CNN+RL_2 (Feng et al., 2018): A sentence-level RE model. It jointly trains an instance selector and a CNN classifier using reinforcement learning (RL). (The NYT-10 statistics in Table 1 are quoted from Jia et al. (2019).)
ARNOR (Jia et al., 2019): A sentence-level RE model which selects confident instances based on the attention score over selected patterns. It is the state-of-the-art method at the sentence level.
BiLSTM+ATT (Zhang et al., 2017): Leverages an attention mechanism based on BiLSTM to capture useful information.

Implementation Details
As SENT is a model-agnostic framework, we implement the classification model with two typical architectures: BiLSTM and BiLSTM+BERT. Since BiLSTM is also the base model of ARNOR, we can compare these two methods fairly. During SENT training, we use the 50-dimensional GloVe vectors as word embeddings. For PT after SENT, we randomly initialize the 50-dimensional word embeddings, the same as in ARNOR. In both training phases, we use 50-dimensional randomly-initialized position and entity-type embeddings. We train a single-layer BiLSTM with hidden size 256 using the Adam optimizer at a learning rate of 5e-4. When implemented with BiLSTM+BERT, the settings are the same as those with BiLSTM, except that we use a 768-dimensional fixed BERT representation as word embedding (we use the "bert-base-uncased" pre-trained model). We tune the hyperparameters on the development set via a grid search. Specifically, when training on the NYT dataset, we train the model for 10 epochs in each iteration, with the global data-filtering threshold T_h = 0.25, the re-labeling threshold Th_relabel = 0.7 and the number of negative samples K = 10. When training on noisy-TACRED, we train for 50 epochs in each iteration, with T_h = 0.15, Th_relabel = 0.85 and K = 50.
To deal with the multi-label problem, we utilize a simple method: randomly selecting one of the bag labels for each sentence. Such random selection turns the multi-label noise into wrong-label noise, which is easier to handle. According to Surdeanu et al. (2012), there is 31% wrong-label noise and 7.5% multi-label noise in NYT-10, and incorrect selection may result in 4% extra wrong-label noise, which can be filtered out through NT identically with wrong-label instances.

Sentence-Level Evaluation

Table 2 shows the results of SENT and other baselines on sentence-level evaluation, where the results of SENT are obtained by PT after SENT. We can observe that: 1) Bag-level methods fail to perform well on sentence-level evaluation, indicating that it is difficult for these bag-level approaches to benefit downstream tasks that need exact sentence labels. This result is consistent with the results in Feng et al. (2018). 2) When performing sentence-level training on the noisy distant data, all baseline models show poor results, including the preeminent pre-trained language model BERT. These results indicate the negative impact of directly using bag-level labels for sentence-level training regardless of noise. 3) The proposed SENT method achieves a significant improvement over previous sentence-level de-noising methods. When implemented with BiLSTM, the model obtains a 4.09% higher F1 score than ARNOR. Moreover, when implemented with BiLSTM+BERT, the F1 score is further improved by 8.52%. 4) The SENT method achieves much higher precision than the previous de-noising methods while maintaining comparable or higher recall, indicating the effectiveness of the noise-filtering and re-labeling approaches.

Noise-Filtering Effect on Distant Data
In order to prove the effectiveness of SENT in denoising distant data, we conduct a noise-filtering experiment following ARNOR. We use a test set published by ARNOR, which consists of 200 randomly selected sentences with an "is noise" annotation. We perform the noise-filtering process described in Sec.3.2.1 and calculate the denoising accuracy. As seen in Table 3, the SENT method achieves a remarkable 12% improvement over ARNOR in F1 score. While also improving in precision, SENT achieves a 20% gain over ARNOR in recall. As ARNOR initializes the training data with a small set of frequent patterns, these patterns might limit the model from generalizing to various correct data. Different from ARNOR, SENT leverages negative training to automatically learn the correct patterns, showing better diversity and generalization.

Analysing SENT on "Labeled Noise"
In this section, we analyze the effectiveness of the data-refining process with a self-constructed noisy dataset: noisy-TACRED (details in Table 1). Table 4 shows the results of training on TACRED and noisy-TACRED. As seen, the baseline models degrade dramatically on the noisy data, with the LSTM dropping by 20.2%. However, after training with SENT, the BiLSTM model can achieve results comparable to the model trained on the clean data. Note that the de-noising method is quite helpful in promoting the precision score, yet the recall is still lower than that on clean data.

Effects of Data-Refining
We also evaluate the noise-filtering and label-recovering ability on the noisy-TACRED training set, as shown in Fig.4. We can observe that: 1) SENT achieves about an 85% F1 score for noise-filtering on the noisy-TACRED data. This result is consistent with the noise-filtering results obtained on the NYT dataset (with 200 sampled instances), validating the denoising ability of SENT on different datasets. 2) As the training iterations progress, the precision of noise-filtering decreases while the recall increases. More noise-filtering contributes to a cleaner dataset, but it might bring more false-noise mistakes. Therefore, we stop the iteration when the model reaches the best score on the development set. 3) As for label-recovering, SENT achieves about 70% precision with about 25% recall. Here, the threshold setting is also a trade-off: we prefer to adopt a modest value for more accurate re-labeling.

Effects of Dynamically Filtering
As described in Sec.3.2, we design a dynamic filtering threshold for long-tail data. The effect of this strategy is shown in Fig.5. As seen, the degree of convergence of the long-tail relation "per:cause_of_death" is much lower than that of the head relation. Simply setting a uniform threshold would harm the data distribution, with instances of "per:cause_of_death" largely filtered out. With a dynamically determined threshold, in contrast, data from both the head and the long-tail relations are appropriately filtered.

Ablation Study
To better illustrate the contribution of each component in SENT, we conduct an ablation study by removing the following components: final PT, re-labeling, dynamic threshold, re-initialization, and NT. The test results are shown in Table 6. We can observe that: 1) Removing the final positive training has little effect on the performance. This is because the model trained with NT already reaches high accuracy, and the purpose of the final PT is only to achieve more confident predictions.
2) Removing the re-labeling process harms the performance, as the filtered instances are simply discarded regardless of their useful information for training. 3) Without the dynamic threshold, clean instances from the tail classes are incorrectly filtered out, which severely degrades the performance. 4) Re-initialization also contributes a lot to the performance. The model trained on the original noisy data inevitably fits the noisy distribution, while re-initialization helps wash out the overfitted parameters and eliminate the noise effects, thus contributing to better training and noise-filtering. 5) Training with PT instead of NT causes a dramatic decline in performance, especially in precision, which verifies the effectiveness of NT in preventing the model from overfitting noisy data.

Case Study
As discussed, SENT is able to refine the distant RE dataset. In fact, there exists much noise in the NYT data that is difficult to tackle with bag-level methods. In Table 5, we show some examples. (1) The first two rows are the sentences in a multi-label bag. We randomly choose one of the bag labels for each sentence, and the model is able to correct the bad choice (by correcting the second sentence with "place lived" and the first sentence with "NA").
(2) The following three rows show a bag with label "place of death", while this whole bag is actually a "NA" bag incorrectly labeled positive.
(3) SENT can also recognize the positive samples in "NA". As shown in the last three rows, each sentence labeled as "NA" actually expresses a positive label. In fact, such a false-negative problem is frequently seen in the NYT data, which contains 70% negative instances that were labeled "NA" only because the entity pairs do not participate in a relation in the database. We believe the capacity to recognize these false-negative samples can significantly boost the performance.

Conclusion
In this paper, we present SENT, a novel sentence-level framework based on negative training (NT) for sentence-level training on distant RE data. NT not only prevents the model from overfitting noisy data, but also separates the noisy data from the training data. By iteratively performing noise-filtering and re-labeling based on NT, SENT refines the noisy distant data and achieves remarkable performance. Experimental results verify the improvement of SENT over previous methods in terms of both sentence-level relation extraction and noise-filtering effect.