Knowing False Negatives: An Adversarial Training Method for Distantly Supervised Relation Extraction

Distantly supervised relation extraction (RE) automatically aligns unstructured text with relation instances in a knowledge base (KB). Due to the incompleteness of current KBs, sentences implying certain relations may be annotated as N/A instances, which causes the so-called false negative (FN) problem. Current RE methods usually overlook this problem, inducing improper biases in both training and testing procedures. To address this issue, we propose a two-stage approach. First, it identifies possible FN samples by heuristically leveraging the memory mechanism of deep neural networks. Then, it aligns those unlabeled data with the training data into a unified feature space by adversarial training, assigning pseudo labels so as to further utilize the information they contain. Experiments on two widely used benchmark datasets demonstrate the effectiveness of our approach.


Introduction
Relation extraction (RE), defined as the task of identifying relations from unstructured text given two entity mentions, is a key component in many NLP applications such as knowledge base (KB) population (Ji and Grishman, 2011; De Sa et al., 2016) and question answering (Yu et al., 2017). Supervised RE requires a large amount of human-labeled data, which is often labor-intensive and time-consuming. Distant supervision (DS), proposed by Mintz et al. (2009), deals with this problem by aligning a large corpus with a KB such as Freebase to provide weak supervision. It relies on the assumption that if two entities participate in a relation in a KB, then any sentence containing this entity pair expresses this relation.
However, this assumption is too strong and introduces noise, particularly when the training KB is an external information source and not primarily derived from the training corpus. To alleviate this problem, Riedel et al. (2010) relax the assumption to: if two entities participate in a relation, at least one sentence that mentions these two entities expresses this relation. Bunescu and Mooney (2007) propose multi-instance learning by organizing sentences with the same entity pair into one set, referred to as a bag. Following this setting, Riedel et al. (2010); Hoffmann et al. (2011); Surdeanu et al. (2012) propose diverse hand-crafted features. Zeng et al. (2014, 2015) leverage convolutional neural networks (CNN) to learn the representations of instances. Lin et al. (2016) apply an attention mechanism to select informative sentences from a bag. Recently, graph convolutional networks (GCN) have been effectively employed for capturing the syntactic information from dependency trees (Vashishth et al., 2018). Most of the above works focus on the false positive (FP) problem (which is caused by the strong DS assumption), but totally neglect the false negative (FN) problem, which is equally important and induces improper biases in both training and testing procedures. DS treats a sentence as a negative sample (marked as N/A) when its entity pair does not have a known relation in the KB. However, due to the incompleteness of current KBs, sentences implying predefined relations are likely to be mislabeled as N/A. For example, over 70% of people in Freebase do not have birth places (Dong et al., 2014). As shown in Table 1, we have annotated the ground-truth relations for two FN sentences. In fact, the missing facts in KBs yield plenty of FN sentences in the automatically annotated datasets.
Generative adversarial networks (GAN) (Goodfellow et al., 2014) were first introduced into RE by Qin et al. (2018) to learn a sentence-level FP filter. However, they only focus on the FP problem, and their model cannot be generalized to other scenarios. Li et al. (2019) attempt to solve the FN problem using entity descriptions from Wikipedia to filter FN samples from the N/A set and further utilize GAN in a semi-supervised manner. However, their method heavily depends on external resources and is not applicable to all situations. Moreover, their filtering heuristic, "two entities that mention each other on their Wikipedia pages may imply the predefined relations", is even stronger than that of Riedel et al. (2010). It may filter some noisy data but at the same time introduces even more noise. For example, in the sentence "... to the poetry from [India], China and [Korea] that stretches the book ..." excerpted from NYT10, India and Korea mention each other on Wikipedia, but this sentence implies no relation.
In this paper, we propose a novel two-stage approach for distantly supervised RE. In the first stage, called mining, we identify possible FN samples in the N/A set by heuristically leveraging the memory mechanism of deep neural networks. According to Arpit et al. (2017), while deep neural networks are capable of memorizing noisy data, they tend to preferentially learn simple patterns first; that is, deep neural networks tend to learn and memorize patterns from clean instances within a noisy dataset. We design a Transformer-based (Vaswani et al., 2017) deep filter to mine FN samples from the N/A set. In the second stage, called aligning, we formulate the problem into a domain adaptation (DA) paradigm. We exploit a gradient reversal layer (GRL) to align the mined unlabeled data from stage one with the training data into a unified feature space. After aligning, each sentence is assigned a pseudo label and a confidence score, which provide extra information and attenuate incorrect biases in both training and testing procedures.
In summary, our main contributions are fourfold: • We propose a simple yet effective method to filter noises in a DS dataset by leveraging the memory mechanism of deep neural networks, without any external resources. • We formulate distantly supervised RE as a DA paradigm and utilize adversarial training to align unlabeled data with training data into a unified space, and generate pseudo labels to provide additional supervision. • We achieve new state-of-the-art results on two popular benchmark datasets, NYT10 (Riedel et al., 2010) and GIDS (Jat et al., 2018).


Related Work

Distant supervision. Surdeanu et al. (2012) propose the multi-instance multi-label learning paradigm. Currently, DS is already a common practice in RE.
Neural relation extraction. The above works strongly rely on the quality of hand-engineered features. Zeng et al. (2014) first propose an end-to-end CNN-based neural network to automatically capture relevant lexical-level and sentence-level features. Zeng et al. (2015); Lin et al. (2016) further improve this through piecewise max pooling and selective attention (Bahdanau et al., 2015). Zhou et al. (2016) propose an attention-based LSTM to capture the most important semantic information in sentences. External knowledge like entity descriptions and type information has been used for RE (Yaghoobzadeh et al., 2017;Vashishth et al., 2018). Pre-trained language models contain a notable amount of semantic information and commonsense knowledge, and several works have applied them to RE (Alt et al., 2019;Xiao et al., 2020).
Adversarial training. Adversarial training is a machine learning technique that improves networks using an adversarial objective function or deceptive samples. Wu et al. (2017) introduce it into RE by adding adversarial noise to the training data. Qin et al. (2018) propose DSGAN to learn a generator that explicitly filters FP instances from the training dataset. Li et al. (2019) propose a semi-distant supervision method by first splitting a dataset through entity descriptions and then using GAN to make full use of unsupervised data. Luo et al. (2020) learn the distribution of true positive instances through adversarial training and select valid instances via a rank-based model.
Learning with noisy labels. As noisy labels degrade the generalization performance of deep neural networks, learning from noisy labels (a.k.a. robust training) has become an important task in modern deep learning (Song et al., 2021). Arpit et al. (2017) find that, although deep networks are capable of memorizing noisy data, they tend to learn simple patterns first. Co-teaching (Han et al., 2018) trains two deep neural networks simultaneously and lets them teach each other on every mini-batch. Yan et al. (2016); Nguyen et al. (2020); Li et al. (2020) transform the problem into a semi-supervised learning task by treating possibly false-labeled samples as unlabeled.

Methodology
In this section, we introduce our two-stage framework called FAN (False negative Adversarial Networks) in detail. First, we describe how we discover possibly wrong-labeled sentences from the N/A set. Then, we introduce our adversarial DA method, which assigns pseudo labels to unlabeled data with confidence scores.

Stage I: Mining
We define a distantly supervised dataset D = {s_1, s_2, ..., s_N}, where each sample s_i is a quadruple consisting of an input token sequence t_i = [t_i^1, ..., t_i^n], the head and tail entity positions head_i and tail_i in t_i, and the corresponding relation r_i assigned by DS. We split D into two parts: sentences with predefined relations form the positive set P, and sentences implying no relation form the negative set N, so that D = P ∪ N. In this work, we focus on the noise in N, where sentences may be mislabeled due to incomplete KBs and useful information has not yet been fully discovered.
According to Arpit et al. (2017), deep neural networks are prone to learn clean samples first and only gradually fit noisy samples. Following (Malach and Shalev-Shwartz, 2017; Jiang et al., 2018; Han et al., 2018; Nguyen et al., 2020), after proper training, the filter has mainly learned the semantic patterns that differentiate P from N. We therefore collect the samples in N whose logits are larger than a threshold θ into a set M of possible FN samples: they may imply predefined relations, with only their DS annotations being inaccurate. We argue that M can be used as unlabeled data in a semi-supervised way to provide supplementary information.
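The mining step itself reduces to a threshold filter over the trained model's outputs. A minimal NumPy sketch (the function name, the concrete threshold, and the use of sigmoid probabilities as "logits" are illustrative assumptions, not the paper's exact implementation):

```python
import numpy as np

def mine_false_negatives(probs, threshold=0.9):
    """Return indices of N/A samples whose predicted probability of
    belonging to P exceeds the threshold. These indices form the
    candidate false-negative set M."""
    probs = np.asarray(probs)
    return np.nonzero(probs > threshold)[0]

# Toy example: filter outputs for five N/A sentences.
# Samples 1 and 3 look like false negatives.
probs = [0.12, 0.95, 0.30, 0.97, 0.05]
mined = mine_false_negatives(probs, threshold=0.9)
```

In practice the filter would be stopped before it starts memorizing the noisy tail, so that high-confidence N/A samples reflect learned clean patterns rather than memorized noise.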

Sentence Encoder
For the mining step, we use a mapping function f(·) that maps each relation r_i to a binary label (1 for samples in P, 0 for samples in N).
We use the pre-trained language model BERT (Devlin et al., 2019) as our embedding module, as it contains a large amount of semantic information and commonsense knowledge. The token sequence t_i = [t_i^1, ..., t_i^n] is fed into the pre-trained model, and the last-layer hidden states are used as the token embeddings h_i.
CNN is a widely used architecture for capturing local contextual information (Zeng et al., 2014). The convolution operation takes the dot product of a convolutional filter W with each k-gram in the embedding sequence h_i to obtain a new representation p_i, i.e., p_i^j = W · h_i^{j:j+k−1}. To capture diverse features and extract local information at different levels, we use multiple filters with varied kernel sizes.
In RE, the representation p_i can be partitioned into three segments according to head_i and tail_i. To capture the structural information between the two entities and obtain fine-grained features, we apply piecewise max pooling as in (Zeng et al., 2015): for each convolutional filter, max pooling over the three segments yields a three-dimensional vector q_i. We concatenate the vectors from all convolutional filters to get the sentence representation v_i ∈ R^{3m}, where m is the number of convolutional filters with varied kernel sizes. The architecture of the sentence encoder is shown in Figure 1.
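The k-gram convolution and the piecewise max pooling can be sketched in NumPy as follows (shapes, the single filter, and the segment boundaries are illustrative assumptions; the paper uses BERT embeddings and multiple filters with varied kernel sizes):

```python
import numpy as np

def conv_kgram(h, W):
    """Slide a filter W (k x d) over token embeddings h (n x d),
    taking the dot product with each k-gram."""
    k, d = W.shape
    n = h.shape[0]
    return np.array([np.sum(h[j:j + k] * W) for j in range(n - k + 1)])

def piecewise_max_pool(p, head, tail):
    """Split the feature map p into three segments around the head
    and tail entity positions and max-pool each segment."""
    segs = [p[:head + 1], p[head + 1:tail + 1], p[tail + 1:]]
    return np.array([s.max() for s in segs if s.size > 0])

rng = np.random.default_rng(0)
h = rng.standard_normal((10, 4))           # 10 tokens, embedding dim 4
W = rng.standard_normal((3, 4))            # one filter, kernel size 3
p = conv_kgram(h, W)                       # feature map of length 8
q = piecewise_max_pool(p, head=2, tail=6)  # three-dimensional q_i
```

With m filters, concatenating the m three-dimensional vectors gives the 3m-dimensional sentence representation v_i described above.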

Noise Filter
Through sentence encoding, each sample s_i is mapped to a representation v_i ∈ R^{3m}. We feed it into a multilayer perceptron to obtain the probability o_i = σ(W_f v_i + b_f), where W_f ∈ R^{1×3m} is the transformation matrix, b_f is the bias, and σ(·) denotes the sigmoid activation. The output o_i is the probability that the current sentence belongs to P. We use the binary cross-entropy loss to train the deep noise filter in an end-to-end way.
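A minimal NumPy sketch of the filter head and its binary cross-entropy loss (the weights and inputs are zero-valued placeholders, not learned parameters):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def filter_prob(v, W_f, b_f):
    """o_i = sigmoid(W_f v_i + b_f): probability of belonging to P."""
    return sigmoid(W_f @ v + b_f)

def bce_loss(o, y):
    """Binary cross-entropy between prediction o and DS label y."""
    eps = 1e-12
    return -(y * np.log(o + eps) + (1 - y) * np.log(1 - o + eps))

v = np.zeros(6)                 # a sentence representation (3m = 6 here)
W_f, b_f = np.zeros(6), 0.0
o = filter_prob(v, W_f, b_f)    # sigmoid(0) = 0.5
loss = bce_loss(o, y=1.0)       # -log(0.5)
```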

Stage II: Aligning
Instances in M are mislabeled due to the incompleteness of KBs, and their distribution may differ from that of the original training dataset. A naive way is to simply drop these data and use the adjusted dataset D for training. Doing so would lose the useful information contained in M and is thus not optimal. Unlabeled data could be annotated by humans, but this is time-consuming and not applicable to large datasets. In fact, these unlabeled samples imply predefined relations and can be used together with D in a semi-supervised learning paradigm. We formulate this problem as a DA task, the objective of which is to align the distributions of M and P into a unified feature space. To achieve this objective, we propose a method inspired by GAN. The generator tries to fool the discriminator so that it cannot distinguish between samples in M and P; on the contrary, the discriminator tries its best to differentiate them. The training procedure forms a classic min-max game with adversarial objective functions. The overall architecture is shown in Figure 2.

Bag Encoder
The sentence encoding layer reuses the architecture in Section 3.1.1. Due to the noisy DS annotations, multi-instance learning is introduced and relation classification is applied at the bag level. A bag B = {s_1, ..., s_t} contains t sentences for the same entity pair. The bag representation g_i is derived as a weighted sum over the individual sentence representations, g_i = Σ_j α_j v_j, where α_j is the weight assigned to the corresponding sentence, computed through selective attention (Lin et al., 2016) as the softmax-normalized similarity between the learned relation query representation r_i ∈ R^{3m} and each sentence representation.
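Selective attention can be illustrated with a small NumPy sketch (dot-product similarity and a two-dimensional toy space are simplifying assumptions; the real model scores each sentence against a learned relation query in R^{3m}):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def bag_representation(V, r):
    """Selective attention: weight each sentence vector v_j by the
    softmax of its similarity to the relation query r, then sum."""
    scores = V @ r            # one similarity score per sentence
    alpha = softmax(scores)   # attention weights over the bag
    return alpha @ V, alpha   # bag representation g_i and weights

V = np.array([[1.0, 0.0],     # a bag of three sentence vectors
              [0.0, 1.0],
              [1.0, 1.0]])
r = np.array([1.0, 0.0])      # relation query
g, alpha = bag_representation(V, r)
```

Sentences that align with the relation query receive higher weights, so noisy sentences in the bag contribute less to g_i.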

Relation Classifier
For a bag in P, its DS label is known. To compute the probability distribution over relations, a linear layer followed by a softmax layer is applied to the bag representation g_i ∈ R^{3m}, i.e., P(r | g_i) = softmax(W_c g_i), where W_c ∈ R^{l×3m} and l is the number of predefined relations. During training, we optimize the cross-entropy loss between the predicted distribution and the DS labels.
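A toy NumPy sketch of the classifier head and its cross-entropy loss (the identity weight matrix and three-relation setup are placeholders):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def relation_probs(g, W_c):
    """Linear layer + softmax over l predefined relations."""
    return softmax(W_c @ g)

def cross_entropy(probs, label):
    """Negative log-likelihood of the DS label."""
    return -np.log(probs[label] + 1e-12)

g = np.array([1.0, -1.0, 0.5])   # bag representation (3m = 3 here)
W_c = np.eye(3)                  # l = 3 relations, toy weights
probs = relation_probs(g, W_c)
loss = cross_entropy(probs, label=0)
```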

Gradient Reversal Layer
Pre-trained language models have shown great power in many NLP tasks. They are huge in size and have a tremendous ability to fit distributions. Following (Ganin et al., 2016; Chen et al., 2018), we place a GRL after the bag encoder. In the forward pass it acts as an identity function, while in the backward pass it reverses the gradients with respect to the bag encoder parameters Θ to their opposite; that is, the gradients are multiplied by I(·), with I(·) = 1 in the forward pass and I(·) = −1 during back propagation.
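In deep learning frameworks a GRL is usually implemented as a custom autograd function; here is a framework-free NumPy sketch that writes the two passes out explicitly (function names and the optional scaling factor are ours, not the paper's):

```python
import numpy as np

def grl_forward(x):
    """Forward pass: identity, I(.) = 1."""
    return x

def grl_backward(grad_output, lam=1.0):
    """Backward pass: multiply the incoming gradient by -lam,
    i.e., I(.) = -1 (optionally scaled)."""
    return -lam * grad_output

x = np.array([0.5, -2.0, 3.0])
y = grl_forward(x)                 # activations pass through unchanged
g = grl_backward(np.ones_like(x))  # gradients are flipped in sign
```

Because the encoder receives reversed gradients from the discriminator, minimizing the discriminator loss simultaneously trains the encoder to produce domain-confusable features.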

Discriminator
Given a bag representation g_i ∈ R^{3m}, the discriminator first applies an affine transformation and then the sigmoid function σ(·) to obtain the probability d_i = σ(W_d g_i + b_d) that the bag comes from P.

Training Objective
We use adversarial training to generate a unified data distribution. To pull instances of the same class closer and push instances of different classes apart, a contrastive loss is designed for better feature representations.

Adversarial Loss
The bag encoder is optimized to give separable representations for instances in P, so that samples from different classes can be easily distinguished by the relation classifier. In the meantime, it forces the distribution of M to fit the distribution of P. The encoder thus plays two roles: representation learner and distribution adapter. The classification objective is Eq. (8). For the generator, the labels of samples in M are 1, giving the generator objective L_g = −Σ_{i∈M} log D(g_i). On the contrary, the discriminator attempts to distinguish samples in M from those in P: for the discriminator, the labels of samples in M are 0, while the labels of instances in P are always 1, giving the discriminator objective L_d = −Σ_{i∈P} log D(g_i) − Σ_{i∈M} log(1 − D(g_i)). Generator and discriminator improve each other in iterations.
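Under the label convention above (M labeled 1 for the generator and 0 for the discriminator; P always 1), the two adversarial losses can be sketched in NumPy (variable names and the toy discriminator outputs are illustrative):

```python
import numpy as np

def generator_loss(d_M):
    """Generator wants discriminator outputs on M pushed toward 1."""
    return -np.mean(np.log(d_M + 1e-12))

def discriminator_loss(d_P, d_M):
    """Discriminator wants P bags scored as 1 and M bags as 0."""
    return (-np.mean(np.log(d_P + 1e-12))
            - np.mean(np.log(1.0 - d_M + 1e-12)))

d_P = np.array([0.9, 0.8])   # discriminator outputs on P bags
d_M = np.array([0.3, 0.4])   # discriminator outputs on M bags
g_loss = generator_loss(d_M)
d_loss = discriminator_loss(d_P, d_M)
```

Because the encoder sits behind the GRL, a single backward pass through the discriminator loss serves both players of the min-max game.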

Contrastive Loss
Bag representations are expected to be able to cluster instances with the same relation. We aim to increase the distances between samples with different relations and reduce the variance of distances with the same relation.
Simply put, given two instances, their similarity score should be high if they belong to the same relation and low otherwise. We use the contrastive loss (Neculoiu et al., 2016) for this objective. Following (Zhou et al., 2020), given a bag representation g, we divide all the other instances into a set Q+ with the same relation type and a set Q− with different types. The contrastive loss can be formulated as L_s = −log( Σ_{g'∈Q+} d(g, g') / Σ_{g'∈Q+∪Q−} d(g, g') ), where the distance measure is defined as d(g, g') = exp(cos(g, g') / τ), τ is a hyperparameter for avoiding collapse of the bag representations, and cos(·) denotes the cosine similarity.
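One common instantiation of such a cosine-with-temperature contrastive loss can be sketched in NumPy (the exact normalization and the toy vectors are assumptions for illustration):

```python
import numpy as np

def cos_sim(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

def contrastive_loss(g, Q_pos, Q_neg, tau=0.5):
    """Pull same-relation bags (Q_pos) toward g and push
    different-relation bags (Q_neg) away, with temperature tau."""
    pos = sum(np.exp(cos_sim(g, q) / tau) for q in Q_pos)
    neg = sum(np.exp(cos_sim(g, q) / tau) for q in Q_neg)
    return -np.log(pos / (pos + neg))

g = np.array([1.0, 0.0])
Q_pos = [np.array([0.9, 0.1])]    # same relation: similar direction
Q_neg = [np.array([-1.0, 0.0])]   # different relation: opposite
loss = contrastive_loss(g, Q_pos, Q_neg)
```

When positives are close to g and negatives far away, the ratio approaches 1 and the loss approaches 0; swapping the two sets makes the loss large.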

Overall Loss
The adversarial training procedure is modeled as multi-task learning and trained in an end-to-end way. The overall objective function is a weighted combination of the relation classification loss, the generator loss, the discriminator loss, and the contrastive loss, where α, β, γ are hyperparameters weighting the latter three terms.
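With illustrative symbol names (ours, not necessarily the paper's) and assuming the classification loss carries unit weight, the overall objective can be written as:

```latex
\mathcal{L} \;=\; \mathcal{L}_{\mathrm{cls}}
  \;+\; \alpha\,\mathcal{L}_{\mathrm{gen}}
  \;+\; \beta\,\mathcal{L}_{\mathrm{dis}}
  \;+\; \gamma\,\mathcal{L}_{\mathrm{con}}
```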

Experiment Setup
We conduct the experiments on two widely used benchmark datasets: NYT10 and GIDS (Google-IISc Distant Supervision dataset). The source code is publicly available 1 .

Datasets
The statistics of the two datasets are listed in Table 2. We briefly describe them below: • NYT10 is developed by Riedel et al. (2010) by aligning the New York Times corpus with Freebase; entity mentions are recognized with the Stanford NER tool (Finkel et al., 2005) and linked to Freebase. The dataset has been broadly used for RE (Hoffmann et al., 2011; Surdeanu et al., 2012; Vashishth et al., 2018; Alt et al., 2019). • GIDS is built by extending the Google RE corpus with additional instances for each entity pair (Jat et al., 2018). It ensures that the at-least-one assumption of multi-instance learning holds, which makes automatic evaluation more accurate and reliable.

Comparative Models
To evaluate the proposed FAN, we compare it with the following seven representative models: • PCNN-ONE (Zeng et al., 2015): a CNN-based neural RE model using piecewise max pooling for better sentence representations. • PCNN-ATT (Lin et al., 2016): PCNN with selective attention over the sentences in a bag. • BGWA (Jat et al., 2018): a bi-GRU based model with word-level attention. • RESIDE (Vashishth et al., 2018): a GCN-based model that incorporates side information such as entity types and relation aliases. • DISTRE (Alt et al., 2019): a model that fine-tunes a pre-trained Transformer language model for distantly supervised RE. • DSGAN (Qin et al., 2018): a GAN-based method that filters FP instances from the training data. • DS-VAE.

Criteria
Following the conventions (Zeng et al., 2015; Lin et al., 2016; Vashishth et al., 2018; Alt et al., 2019), we use the held-out evaluation. For each model, we compute the precision-recall (PR) curves and report the area under curve (AUC) scores for straightforward comparison. We also report P@N, which measures the percentage of correct classifications among the top-N most confident predictions. Additionally, micro-F1 is measured at different points along the PR-curve and the best value is reported. For NYT10, we compare with all models listed above. We reuse the results reported in the original papers for BGWA, RESIDE, DISTRE and DS-VAE, and implement PCNN-ONE, PCNN-ATT and DSGAN ourselves. Because the source code of DS-VAE has not been released, we cannot obtain its PR-curve. For GIDS, we only compare with PCNN-ONE, PCNN-ATT, BGWA and RESIDE, as other models are not applicable to this dataset.

Overall Results
The PR-curves on NYT10 and GIDS are shown in Figure 3. On both datasets, FAN achieves the best results. On NYT10, we get visibly higher recall, especially when precision is above 75.0, which indicates that our model can find more informative samples along with correct labels. This makes sense because FAN gains extra information by digging informative samples out of the N/A set, which weakens improper biases in the training procedure. On NYT10, our AUC is 45.5, an improvement of 3.3 over the second-best model. On GIDS, our AUC is 90.3, an improvement of 1.2 over the second best. Please refer to Table 3 for details. Table 4 shows P@N for top-ranked samples and micro-F1 values on NYT10. Our model improves P@N by 3.8 on average, indicating that it does not reduce precision while improving recall. Table 5 shows the results on GIDS. From the table, we observe that FAN achieves the highest micro-F1 score, and its P@N scores are comparable to RESIDE and BGWA.

Mining Results
In the mining step, we filter FN samples from N with logits larger than threshold θ. As a result, we discover 4,556 FN samples from NYT10, which refer to 3,733 entity pairs; and 238 FN samples from GIDS, which refer to 225 entity pairs.
To evaluate the quality of M, we choose the five relations with the most samples assigned by FAN and, for each, select the 100 sentences with the highest confidence scores. Three well-trained NLP annotators are asked to judge each sentence in a binary way, i.e., whether the assigned pseudo label is correct. The results are shown in Table 6. The average precision is 87.0, around 17.0 higher than that of the original NYT10 dataset (Riedel et al., 2010). This verifies both the quality of the mined data and the effectiveness of the aligning step. The "nationality" relation gets relatively lower precision because sentences about sports usually mention more than one person and one country, and the model gets confused in this scenario.

Adversarial Domain Adaptation
The label distributions may shift between M and P. The generator aligns the two distributions into a unified space. We use the bag representations obtained from the bag encoder as the input of t-SNE to perform dimension reduction and obtain two-dimensional representations. As seen from Figure 4, the feature distributions before aligning overlap and the classification boundary is not clear. After aligning, the samples are better clustered.

Ablation Study
We conduct an ablation study to verify the effectiveness of submodules in FAN.

False Negatives in the Test Set
We also investigate the impact of FN samples in the test set during evaluation. As in the training phase, we train a deep filter to mine FN samples from the test N/A set, obtaining 6,468 sentences covering 4,951 entity pairs. By simply removing these data, we obtain huge improvements: AUC rises by around 20% over the original to 54.6, and micro-F1 by 12% to 54.5 (see Figure 5). Similar phenomena occur for different baseline models. This indicates that the FN samples in the N/A set greatly affect the evaluation procedure and introduce improper biases into model selection. For comparison, we randomly remove the same number of sentences from the test N/A set; as a result, AUC increases by only 0.4 and micro-F1 by 0.9. This result is rational because randomly removing 6,468 samples is negligible compared with the 166,004 N/A samples in the test set.

Case Study
In Table 8, we show several cases of mined data which are excerpted from NYT10.
1. The first sentence is correctly assigned the label "/people/person/place_lived" by FAN, but it is missed by DSGAN. Because a person is rarely mentioned on the Wikipedia page of a location, samples with relations between people and locations are largely omitted. 2. For less frequent relations such as "/sports/sports_team/location", FAN can still identify them, enlarging the training data and weakening the imbalance between relations. 3. In the third sentence, Queens contains Corona, but not the reverse. FAN incorrectly assigns "/location/location/contains" from Corona to Queens. In fact, differentiating relation directions is a hard task and needs further study.

Conclusion and Future Work
In this paper, we propose FAN, a two-stage method using adversarial DA to handle the FN problem in distantly supervised RE. We mine FN samples using the memory mechanism of deep neural networks. We use GRL to align unlabeled data with training data and generate pseudo labels to correct improper biases in both training and testing procedures. Our experiments show the superiority of FAN against many comparative models. In future work, we plan to use the teacher-student model to deal with FP and FN simultaneously.

A Experiment Setup
In this section, we provide more details of our experiments. We implement FAN with PyTorch 1.6 and train it on a server with an Intel Xeon Gold 5117 CPU, 120 GB memory, two NVIDIA Tesla V100 GPU cards, and Ubuntu 18.04 LTS. The parameters of FAN are initialized with Xavier initialization (Glorot and Bengio, 2010) using a fixed seed. We train FAN with the SGD optimizer and mini-batches of size 160. In the convolutional module, we set the kernel sizes to {2, 3, 4, 5} and the number of filters to 230. For details, please refer to

B Dataset Availability
The NYT10 dataset (Riedel et al., 2010) is available at http://iesl.cs.umass.edu/riedel/ecml/. We use the version adapted by OpenNRE (Han et al., 2019), which has removed the overlapped samples between the training set and the test set. The data is available at https://github.com/thunlp/OpenNRE. The GIDS dataset is available at https://github.com/SharmisthaJat/RE-DS-Word-Attention-Models.

C False Negatives in N/A
In this section, we give several examples of FN samples mined from the training N/A set. These samples are assigned pseudo labels and confidence scores by FAN. They are diverse and cover many relation types, which verifies that the noise in N/A is not negligible and deserves further study.