Distantly Supervised Relation Extraction using Multi-Layer Revision Network and Confidence-based Multi-Instance Learning

Distantly supervised relation extraction is widely used in the construction of knowledge bases due to its high efficiency. However, the automatically obtained instances are of low quality with numerous irrelevant words. In addition, the strong assumption of distant supervision leads to the existence of noisy sentences in the sentence bags. In this paper, we propose a novel Multi-Layer Revision Network (MLRN) which alleviates the effects of word-level noise by emphasizing inner-sentence correlations before extracting relevant information within sentences. Then, we devise a balanced and noise-resistant Confidence-based Multi-Instance Learning (CMIL) method to filter out noisy sentences as well as assign proper weights to relevant ones. Extensive experiments on two New York Times (NYT) datasets demonstrate that our approach achieves significant improvements over the baselines.


Introduction
Relation Extraction (RE), which aims to classify the relation between a pair of entities in a sentence, is crucial to various applications such as question answering and knowledge base construction. However, supervised relation extraction requires large amounts of manually labeled training data, which are hard to obtain. Therefore, Mintz et al. (2009) proposed Distantly Supervised Relation Extraction (DSRE) to automatically generate training data by aligning a knowledge base with a text corpus. However, DSRE rests on the strong assumption that, for an entity pair participating in a relation in the knowledge base, all sentences mentioning this entity pair in the corpus express that relation. This brings a large number of noisy sentences into the generated data. The worst case is the noisy bag problem, where all the sentences in a bag are mislabeled. On the other hand, low-quality sentences in the corpus contain a large proportion of irrelevant words, meaning that even correctly labeled sentences may be filled with inner-sentence noise.

To illustrate the impact of both sentence-level noise and word-level noise (inner-sentence noise), we select a sentence bag from the New York Times (NYT) corpus, shown in Figure 1. Among the three sentences, only S2 expresses the labeled relation, so S1 and S3 are both noisy sentences. What is worse, in S2 the relation is indicated by a single word, co-founders, and the rest of the words can be regarded as noise.

To tackle sentence-level noise, various multi-instance learning (Riedel et al., 2010) methods have been proposed to reduce the effects of noisy sentences. Some methods filter out noisy sentences and keep the relevant ones (Zeng et al., 2015; Qin et al., 2018; Feng et al., 2018), but they may filter out relevant sentences as well. Other methods apply soft labels or weights to limit the impact of noisy sentences (Lin et al., 2016; Liu et al., 2017; Yuan et al., 2019a), yet they remain at risk of being influenced by sentence-level noise precisely because the weighting is soft. Therefore, a more balanced multi-instance learning strategy should be designed to avoid noisy sentences while fully exploiting the information in relevant ones.
To address word-level noise, a robust encoder is needed to capture relevant information in a noisy context. Most previous work uses encoders based on Recurrent Neural Networks (RNNs) or Convolutional Neural Networks (CNNs) (Zeng et al., 2015; Zhou et al., 2016). However, both RNN and CNN models have shortcomings. In RNN encoders, irrelevant words can easily spread noise to their context since they are not isolated from relevant ones. As for CNN encoders, they may lose salient information because of their pooling layers. In sum, previous methods are not able to handle noisy words without information loss. Based on the idea that noisy words have weaker correlations with the other words, we fill this gap by modeling the correlations within sentences using the attention mechanism.
We propose a novel Multi-Layer Revision Network (MLRN) with Confidence-based Multi-Instance Learning (CMIL) to tackle both word-level and sentence-level noise. In MLRN, we employ novel revision layers, which alleviate noise by emphasizing inner-sentence correlations, to extract relevant information from each sentence. In each revision layer, we first model the correlations between words using self-attention, then emphasize these correlations by revising the attention weights, and finally apply a Translation Query (TRQ) for information extraction. By stacking multiple revision layers, implicit correlations between words are also captured. To alleviate sentence-level noise and tackle the noisy bag problem, we devise a confidence vector that measures the relevance of sentences to the relation classes and use it to guide sentence filtering and weighting. Our contributions can be summarized as follows:

• To the best of our knowledge, MLRN is the first model to utilize implicit correlations between words, and the first DSRE network based solely on the attention mechanism, without an RNN/CNN encoder layer or extra linguistic information.
• We propose a confidence-based multi-instance learning strategy that is able to (1) conduct sentence filtering independent of the DS label to address noisy sentences, and (2) assign proper weights to relevant sentences based on their relevance to the bag prediction.
• Extensive experiments show that our approach achieves significant improvements over the baselines.

Related Work
Distant supervision (DS) for relation extraction (Mintz et al., 2009) was proposed for efficient knowledge base construction. However, DS also brings about the wrong labeling problem. Riedel et al. (2010) propose multi-instance learning (MIL) for DSRE to address this issue. Most current work uses one of two MIL strategies: removing noisy sentences or applying soft weights. Following the at-least-one assumption, Zeng et al. (2015) select the instance with the highest probability within the bag. Qin et al. (2018) and Feng et al. (2018) employ reinforcement learning for instance selection. For better information utilization, Lin et al. (2016) apply selective attention at the sentence level to dynamically adjust the attention weights. Yuan et al. (2019a) calculate the similarity between instances and the best sentence. DS may also create noisy bags in which all the sentences are mislabeled; Yuan et al. (2019b) and Ye and Ling (2019) use bag-level attention to address this issue. CNN-based (Zeng et al., 2015) and RNN-based (Zhou et al., 2016) networks are widely used to capture information within sentences. In addition, Xu et al. (2015) and Liu et al. (2018) integrate extra linguistic information into the model to address word-level noise. Since the attention mechanism has proved effective for modeling long-range dependencies in sequences (Vaswani et al., 2017), attention-based models (Du et al., 2018; Huang and Du, 2019; Zhang et al., 2020) have also been introduced into DSRE.
In our work, we further make use of the attention mechanism to emphasize correlations between words and devise a balanced confidence-based strategy to address noisy sentences.

Methodology
The overall structure of our model is shown in Figure 2. Our model consists of three parts: the embedding layer, the revision network, and the multi-instance learning layer. In this section, we introduce them in turn.

Embedding Layer
Before being fed into the revision network, the input instances are transformed into distributed representations. The representation of each word token consists of two parts: word embedding and position embeddings.
Word Embeddings are distributed representations of word tokens. Formally, we denote the $j$-th word token in the $i$-th sentence as $w_{ij}$, which is mapped to a $d_w$-dimensional word vector $v_{ij} \in \mathbb{R}^{d_w}$. Following previous studies, we adopt the Skip-Gram method to obtain the pre-trained word embedding matrix.
Position Embeddings are distributed representations of the relative distances from each word to the two entities, represented as low-dimensional vectors $p^{e_1}_{ij}, p^{e_2}_{ij} \in \mathbb{R}^{d_p}$. Finally, the input embedding $x_{ij}$ is generated by concatenating the word embedding $v_{ij}$ with the position embeddings $p^{e_1}_{ij}$ and $p^{e_2}_{ij}$:
$$x_{ij} = [v_{ij}; p^{e_1}_{ij}; p^{e_2}_{ij}]$$
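For concreteness, the following PyTorch sketch shows one way such an embedding layer could be assembled; the class name, dimensions, and maximum relative distance are illustrative assumptions, not values taken from the paper.

```python
import torch
import torch.nn as nn

class EmbeddingLayer(nn.Module):
    """Concatenate a word embedding with two position embeddings (hypothetical sizes)."""
    def __init__(self, vocab_size, d_w=50, d_p=5, max_dist=100):
        super().__init__()
        self.word_emb = nn.Embedding(vocab_size, d_w)        # pre-trained Skip-Gram vectors would be loaded here
        self.pos_emb1 = nn.Embedding(2 * max_dist + 1, d_p)  # relative distance to entity 1
        self.pos_emb2 = nn.Embedding(2 * max_dist + 1, d_p)  # relative distance to entity 2

    def forward(self, words, dist1, dist2):
        # words, dist1, dist2: (batch, seq_len) integer tensors
        x = torch.cat([self.word_emb(words),
                       self.pos_emb1(dist1),
                       self.pos_emb2(dist2)], dim=-1)        # (batch, seq_len, d_w + 2*d_p)
        return x
```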

Revision Network
Formally, the revision network takes a sequence of word representations $\{x_{i1}, x_{i2}, \ldots, x_{il}\}$ of length $l$ as input and outputs a $d$-dimensional representation $U^S_i \in \mathbb{R}^d$ for the sentence. The revision layer for word-level noise reduction includes two types of attention sub-layers: a self-attention layer and a query-attention layer. By applying self-attention to the input, the correlations between each pair of tokens are calculated. In order to emphasize these correlations, the attention weights are revised in a query-attention layer before the representations are updated. Afterwards, we apply a Translation Query (TRQ) inspired by TransE (Bordes et al., 2013) to extract relevant information as the record of each layer. Finally, these records are concatenated to form the sentence representation used for multi-instance learning.
The components of the revision network are discussed in detail below.

Input Layer
The input layer serves as an encoding layer that computes feature representations from the input embeddings. The input is the embedding sequence of the $i$-th instance, denoted as $X_i$. For convenience, the subscript $i$ is omitted in the equations of this part.
Instead of using a CNN or RNN input layer as in most previous work, we apply an attention layer to model long-distance dependencies in the sentence. The attention mechanism used can be formulated as follows:
$$\mathrm{Att}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^\top}{\sqrt{d_k}}\right)V$$
where $Q$ is the query, $K$ is the key and $V$ is the value, as in Vaswani et al. (2017), and $d_k$ is the dimension of the key as well as a scaling factor. In order to explore various semantic spaces of the sentence, we use Multi-Head Self-Attention (MHSA) in the input layer:
$$\mathrm{head}_i = \mathrm{Att}(XW^Q_i, XW^K_i, XW^V_i), \quad \mathrm{MHSA}(X) = \sigma([\mathrm{head}_1; \ldots; \mathrm{head}_h]W^O)$$
where $\sigma$ is the sigmoid activation function. The input layer maps the input embedding onto the feature space and generates the feature of the instance:
$$S = \mathrm{MHSA}(X)$$
The feature $S$ of the instance is then passed to the first revision layer.
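A minimal sketch of such an input layer is shown below, assuming a standard scaled dot-product attention helper and a learned output projection followed by the sigmoid activation; the module names, head count, and dimensions are hypothetical.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

def attention(q, k, v):
    # Scaled dot-product attention: softmax(QK^T / sqrt(d_k)) V
    d_k = k.size(-1)
    scores = torch.matmul(q, k.transpose(-2, -1)) / math.sqrt(d_k)
    return torch.matmul(F.softmax(scores, dim=-1), v)

class InputLayer(nn.Module):
    """Multi-head self-attention over the input embeddings (a sketch, not the paper's exact layer)."""
    def __init__(self, d_in, d_model, n_heads=2):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads, self.d_head = n_heads, d_model // n_heads
        self.w_q = nn.Linear(d_in, d_model)
        self.w_k = nn.Linear(d_in, d_model)
        self.w_v = nn.Linear(d_in, d_model)
        self.w_o = nn.Linear(d_model, d_model)

    def forward(self, x):                                    # x: (batch, seq_len, d_in)
        b, l, _ = x.shape
        split = lambda t: t.view(b, l, self.n_heads, self.d_head).transpose(1, 2)
        q, k, v = split(self.w_q(x)), split(self.w_k(x)), split(self.w_v(x))
        heads = attention(q, k, v)                           # (batch, heads, seq_len, d_head)
        concat = heads.transpose(1, 2).reshape(b, l, -1)
        return torch.sigmoid(self.w_o(concat))               # sigmoid activation on the projected heads
```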

Revision Layer
There are three steps in each revision layer: Self-Attention, Revision, and Recording. Formally, the $i$-th input to the $k$-th revision layer is the output of the previous layer, denoted as $S^{k-1}_i$. The representations of the two entities in the sentence are denoted $E_{i1}$ and $E_{i2}$ respectively. The $i$-th output of the $k$-th revision layer is denoted as $S^k_i$ and the record of the $k$-th layer as $U^k$. For convenience, the superscript $k$ and subscript $i$ are omitted in the following equations except Eq. 10.
The Self-Attention step first calculates the inner-sentence correlations from the input $S$ using self-attention:
$$Q_R = \mathrm{Att}(S) = \mathrm{softmax}\left(\frac{SS^\top}{\sqrt{d_k}}\right)S$$
where we omit the repetition of $S$ since the query, key and value are identical. Note that, unlike the input layer, the self-attention layer does not introduce extra weight matrices, so the operation stays in the same feature space. Viewed from a word-level perspective, the self-attention layer updates the representation of a token as the weighted sum of the representations of all tokens in the sentence, where tokens similar to the inspected token receive larger weights. In the feature space, closely correlated words tend to have similar representations. Since noisy words have weaker correlations with the other words, the attention weights assigned to them by other tokens are small. Therefore, noisy words are marginalized throughout the process, making them unlikely to spread noise to the rest of the sentence. However, since each token always has the highest similarity with itself, the weights assigned to the other relevant words are relatively small, which limits the modeling of inner-sentence correlations. In other words, we need to assign larger weights to relevant words to obtain stronger inner-sentence correlations.
To address this issue, the Revision step is conducted using a query-attention layer, which takes $Q_R$, the output of the self-attention layer, as the query and the layer input $S$ as the key and value:
$$O = \mathrm{Att}(Q_R, S) = \mathrm{softmax}\left(\frac{Q_R S^\top}{\sqrt{d_k}}\right)S$$
where we again omit the repetition of $S$ since it serves as both key and value. We use the output of the self-attention layer as the query because it has already partially modeled the inner-sentence correlations, meaning that, in the feature space, relevant words have moved closer to each other. Therefore, the revision query-attention layer assigns larger weights to relevant words to emphasize the inner-sentence correlations, while noisy words are further marginalized in this process. However, not all words in the sentence are relevant to the relation, as shown in Figure 1. Hence we carry out the Recording step to extract relevant information from the sentence. In order to represent the relation feature, we employ the TRQ inspired by TransE (Bordes et al., 2013), which uses the difference between the two entities' representations as the relation feature; a similar method has proved effective in previous work. Here, we use multi-head attention to explore multiple semantic sub-spaces:
$$\mathrm{head}_i = \mathrm{Att}\big((E_1 - E_2)W^Q_i,\; O W^K_i,\; O W^V_i\big), \quad U = \sigma([\mathrm{head}_1; \ldots; \mathrm{head}_h])$$
where $\sigma$ is the activation function, $E_1$ and $E_2$ are the representations of the entity pair, $W^Q_i$, $W^K_i$, $W^V_i$ are the weight matrices of the $i$-th head, and $h$ is the number of heads. The translation query $E_1 - E_2 \in \mathbb{R}^{d_h}$ and the updated representations $O \in \mathbb{R}^{l \times d_h}$ are first mapped onto the same vector space; the record $U \in \mathbb{R}^{d_h}$ is then calculated as the weighted sum of the tokens' representations according to their relevance to the relation.
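The sketch below illustrates one possible reading of a single revision layer: parameter-free self-attention, a revision query-attention step, and TRQ-based recording with multi-head attention. The `attention` helper from the input-layer sketch is repeated here for self-containment; the projection shapes, activation, and handling of the entity representations are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def attention(q, k, v):
    # Scaled dot-product attention (same helper as in the input-layer sketch)
    scores = torch.matmul(q, k.transpose(-2, -1)) / (k.size(-1) ** 0.5)
    return torch.matmul(F.softmax(scores, dim=-1), v)

class RevisionLayer(nn.Module):
    """One revision layer: Self-Attention, Revision, and TRQ Recording
    (a sketch; projection shapes and the activation are our assumptions)."""
    def __init__(self, d_model, d_h, n_heads=2):
        super().__init__()
        self.n_heads, self.d_head = n_heads, d_h // n_heads
        self.w_q = nn.Linear(d_model, d_h)   # maps the translation query into the record space
        self.w_k = nn.Linear(d_model, d_h)
        self.w_v = nn.Linear(d_model, d_h)

    def forward(self, s, e1, e2):
        # Step 1: parameter-free self-attention (query = key = value = S)
        q_r = attention(s, s, s)                               # (batch, seq_len, d_model)
        # Step 2: revision -- the self-attention output queries the original input S
        o = attention(q_r, s, s)                               # (batch, seq_len, d_model)
        # Step 3: recording with a translation query E1 - E2 (TransE-style)
        b, l, _ = o.shape
        trq = (e1 - e2).unsqueeze(1)                           # (batch, 1, d_model)
        split = lambda t, n: t.view(b, n, self.n_heads, self.d_head).transpose(1, 2)
        q = split(self.w_q(trq), 1)
        k, v = split(self.w_k(o), l), split(self.w_v(o), l)
        u = attention(q, k, v).transpose(1, 2).reshape(b, -1)  # record U: (batch, d_h)
        return o, torch.sigmoid(u)                             # S^k passed on, record kept
```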
However, implicit correlations may not be considered within a single revision layer. As a simple example, given the sentence "[Joe] is the father of John, father of [Amy]", the model calculates the correlation between Joe and John as well as the correlation between John and Amy, but may be unable to observe the correlation between Joe and Amy in the first layer because they are not directly related. Therefore, we stack multiple revision layers to capture more implicit correlations between the words.
At the end of the revision network, all the extracted records are concatenated to form the final representation of the instance:
$$U^S = [U^1; U^2; \ldots; U^L]$$
where $L$ is the number of revision layers and $U^S \in \mathbb{R}^{L d_h}$ is the final sentence representation. For convenience, in the following sections we use $d = L d_h$ to denote the dimension of sentence representations.
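Building on the hypothetical `InputLayer` and `RevisionLayer` sketches above, the snippet below shows how $L$ revision layers might be stacked and their records concatenated into the sentence representation; the layer count is a placeholder.

```python
import torch
import torch.nn as nn

class MLRNEncoder(nn.Module):
    """Stack L revision layers and concatenate their records into U^S
    (a sketch building on the hypothetical InputLayer and RevisionLayer above)."""
    def __init__(self, input_layer, d_model, d_h, num_layers=3):
        super().__init__()
        self.input_layer = input_layer
        self.layers = nn.ModuleList(
            [RevisionLayer(d_model, d_h) for _ in range(num_layers)])

    def forward(self, x, e1, e2):
        s = self.input_layer(x)              # feature S from the input layer
        records = []
        for layer in self.layers:
            s, u = layer(s, e1, e2)          # feed S^{k-1}, obtain S^k and record U^k
            records.append(u)
        return torch.cat(records, dim=-1)    # U^S in R^{L * d_h}
```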

Confidence-based Multi-Instance Learning
After obtaining the representation of each sentence in the bag, we generate the bag representation using the CMIL strategy. First, we filter out noisy sentences according to the prediction of the bag. Afterwards, we emphasize the sentences with higher relevance according to the confidence vector. Formally, the input is a bag of sentence representations $\{U^S_1, U^S_2, \ldots, U^S_N\}$, where $N$ is the number of sentences in the bag. The output of the multi-instance learning layer is the bag representation $U^B$.
First, as shown in Figure 2, the score of the $i$-th sentence in the bag is calculated as
$$F_i = U^S_i W_r + b_r$$
where $W_r \in \mathbb{R}^{d \times c}$ and $b_r \in \mathbb{R}^c$ are weight matrices and $c$ is the number of relation classes. The confidence vector $C_i \in \mathbb{R}^c$, which measures the relevance of the sentence to each relation class, is calculated from the score together with a weight vector $W_c \in \mathbb{R}^c$ that represents the reliability of the DS labels. In other words, $W_c$ reflects the model's confidence in the DS labels: reliable DS labels are more likely to yield true positive sentences, so the model becomes more confident in these labels and they receive higher weights in $W_c$. Afterwards, we compute the adjusted score $C_{ij} + F_{ij}$, the sum of the original score and the confidence vector, which in effect applies a different threshold to each relation class. The adjusted scores are used to generate the bag prediction $j^*$ and to select relevant instances into the new bag. As a result, the filtering process is guided by $j^*$, the prediction made by the model, so instance selection follows the relation class actually expressed in the bag instead of being misled by the DS label, as in most previous methods. In this way, our model is able to alleviate the noisy bag problem.
Finally, in order to obtain the bag representation, we apply a weighted sum over the instances in the new bag according to their confidence values:
$$U^B = \sum_i \alpha_i U^S_i$$
where $\alpha_i$ is the normalized confidence value of the $i$-th instance with respect to the predicted class $j^*$, $U^S_i$ is the representation of the $i$-th instance in the new bag, and $U^B$ is the final bag representation.
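The sketch below captures our reading of the CMIL procedure: score each sentence, form a confidence vector with a class-wise weight, filter by the bag prediction, and pool the kept sentences by their confidence. The exact form of the confidence vector, the filtering rule, and the weight normalization are assumptions rather than the paper's equations.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CMIL(nn.Module):
    """Confidence-based multi-instance learning (a sketch; the confidence form,
    filtering rule and normalization are assumptions, not the paper's equations)."""
    def __init__(self, d, num_classes):
        super().__init__()
        self.w_r = nn.Linear(d, num_classes)              # W_r, b_r (shared with the final classifier)
        self.w_c = nn.Parameter(torch.ones(num_classes))  # per-class confidence toward DS labels

    def forward(self, bag):                               # bag: (N, d) sentence representations
        f = self.w_r(bag)                                 # scores F_i: (N, c)
        c = self.w_c * F.softmax(f, dim=-1)               # confidence vectors C_i (assumed form)
        adjusted = f + c                                  # adjusted scores C_ij + F_ij
        j_star = adjusted.max(dim=0).values.argmax()      # bag-level prediction j*
        keep = adjusted.argmax(dim=-1) == j_star          # keep sentences whose adjusted score supports j*
        if not keep.any():                                # fully filtered bag: fall back to average weights
            keep = torch.ones_like(keep)
        weights = F.softmax(c[keep, j_star], dim=0)       # weight kept sentences by their confidence
        return (weights.unsqueeze(-1) * bag[keep]).sum(dim=0), j_star
```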

Optimization
Our goal is to maximize the conditional probability of the target relation given the bag of sentences. The probability $p(y|U^B, \theta)$ is calculated from the bag representation as
$$p(y|U^B, \theta) = \mathrm{softmax}(U^B W_r + b_r)$$
where $W_r$ and $b_r$ are the same weight matrices as in Eq. 11. Then we employ a negative log-likelihood loss with $L_2$ regularization to train the model:
$$J(\theta) = -\sum_{i=1}^{M} \log p(y_i \mid U^B_i, \theta) + \beta \lVert \theta \rVert_2^2$$
where $M$ is the number of bags and $\beta$ is a hyper-parameter restricting the $L_2$ term. In our work, we use Adam (Kingma and Ba, 2014) to optimize the model.
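As an illustration, the snippet below computes the classification logits and the NLL + $L_2$ objective described above; `beta`, the learning rate, and the variable names are placeholders.

```python
import torch
import torch.nn.functional as F

def bag_loss(classifier, u_b, labels, beta=1e-5):
    """Negative log-likelihood with L2 regularization (a sketch; beta is a placeholder).
    classifier: the shared W_r / b_r linear layer; u_b: (B, d) bag representations."""
    logits = classifier(u_b)                                # U^B W_r + b_r
    nll = F.cross_entropy(logits, labels)                   # -log p(y | U^B, theta)
    l2 = sum((p ** 2).sum() for p in classifier.parameters())
    return nll + beta * l2

# Usage sketch: optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
# loss = bag_loss(model.classifier, u_b, labels); loss.backward(); optimizer.step()
```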

Experiments
In this section, we first introduce the datasets and evaluation metrics used in the experiments. Then, we provide our experimental settings. Afterwards, we compare our model with baselines using the evaluation metrics. Finally, we discuss the effects of the revision layer and the CMIL strategy.

Datasets and Evaluation Metrics
In order to evaluate the performance of our model, we conduct experiments on the widely used NYT-10 dataset (Riedel et al., 2010) and the more complex NYT-18 dataset (Zhang et al., 2020). NYT-10 is a standard dataset constructed by aligning relation facts in Freebase (Bollacker et al., 2008) with the New York Times corpus. Following previous work (Mintz et al., 2009), we evaluate our model with held-out evaluation, in which the extracted relations are automatically compared with those in Freebase. Precision-recall (PR) curves, the area under the curve (AUC), and precision at the top 100 predictions (P@100) are adopted as the evaluation metrics in our experiments. We employ three test settings: One, Two, and All.
• One: For each entity pair, we randomly select one instance to express the relation.
• Two: For each entity pair, we randomly select two instances to express the relation.
• All: In the testing process, all instances mentioning the entity pair are selected.
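These held-out metrics can be computed from ranked prediction scores, for example with scikit-learn as in the sketch below; the function `held_out_metrics` and its arguments are illustrative and not part of the paper's evaluation code.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve, auc

def held_out_metrics(y_true, y_score):
    """PR-curve AUC and Precision@100 from binary relevance labels and prediction scores."""
    precision, recall, _ = precision_recall_curve(y_true, y_score)
    pr_auc = auc(recall, precision)                         # area under the PR curve
    order = np.argsort(-np.asarray(y_score))                # rank predictions by score
    p_at_100 = np.mean(np.asarray(y_true)[order[:100]])     # precision among the top 100
    return pr_auc, p_at_100
```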

Experimental Settings
In the experiments, the word embeddings are pre-trained using word2vec (Mikolov et al., 2013). In Table 2, we list the parameters of our best model. We use two attention heads because each sentence mentions a pair of entities. The number of revision layers depends on the number of entity pairs and sentences in the dataset and on the difficulty of understanding them. In the CMIL process, if all sentences in the bag are filtered out, the model assigns average weights to them. To evaluate our approach, we select the following methods as our baselines:

Evaluation on NYT-10
PCNN+MAX (Zeng et al., 2015) proposes a piecewise CNN model which selects the instance with the largest logit value.
PCNN+RL (Feng et al., 2018) presents a reinforcement learning method for instance selection.
PCNN+BATT (Ye and Ling, 2019) employs both sentence-level and bag-level attention to emphasize correctly labeled sentences and bags.
QARE+MMIL (Zhang et al., 2020) presents a QA-based relation extractor with transfer learning.

As shown in Figure 3, MLRN+CMIL outperforms all the baselines significantly without any additional information (e.g., entity types in BGRU+SET). The P@100 values and AUC are shown in Table 3 and Table 4, respectively.

Evaluation on NYT-18
As presented in Table 4 and Figure 4, our model also significantly outperforms all the baselines on the more complex NYT-18 dataset. BGRU+SET fails on NYT-18 because its complex instances are difficult to parse precisely with a conventional parser. These results demonstrate the robustness of MLRN+CMIL in handling complex instances with inner-sentence noise.

Ablation Study
In order to evaluate the effects of our approach, we devise the following variations for comparison:

• PCNN+CMIL: PCNN network with CMIL.
• OneLayer: Using only one revision layer.
• MLRN+MAX: MLRN that selects the instance with the largest logit value.

• SelfAtt+CMIL: MLRN with the revision mechanism removed, i.e., using only self-attention layers, combined with CMIL.
As shown in Table 3, MLRN models achieve significant improvements over PCNN models when using the same multi-instance learning strategies (MAX, ATT, and CMIL). The complete model also outperforms SelfAtt+CMIL significantly, showing that the revision mechanism is crucial for modeling inner-sentence correlations. OneLayer suffers a dramatic drop in performance because of its inability to model implicit correlations between words. These results demonstrate that by strengthening correlations in the revision process and modeling implicit correlations with multiple revision layers, MLRN becomes more robust and effective in DSRE.
Even without the bag-level operations of PCNN+BATT, PCNN+CMIL still achieves the highest mean P@100 value among all the PCNN-based models, showing that CMIL can effectively alleviate sentence-level noise and exploit the information in the sentence bag. We also test multiple multi-instance learning strategies on the MLRN model (MAX, ATT, and NIID), and the results in Table 3 and Figure 5 show that CMIL outperforms all of them.

Case Study
In Figure 6, we select a bag of instances from the test set and present the weights assigned to them by different models. Among the four sentences, S1, S2 and S4 are all relevant to the relation, but in S4 the relation is indicated implicitly by the word "israeli". S3 is an irrelevant sentence that does not mention the nationality of the entity david_bengurion.
As the example shows, all three methods correctly handle the irrelevant sentence S3. Although SelfAtt+CMIL works well when S1 uses the phrase "israel 's first leader", it wrongly filters out S4 when encountering the phrase "israeli leaders". This is because SelfAtt+CMIL is unable to detect the relevance between "israel" and "israeli" in S4. The selective attention method is designed to exploit relevant sentences, but in an attempt to avoid sentence-level noise, it may also down-weight the relevant sentences it is less confident in, such as S1 and S4. Our complete model (MLRN+CMIL) successfully detects the relevance between "israel" and "israeli" and therefore regards S4 as a relevant sentence. Moreover, MLRN+CMIL assigns more balanced weights to the relevant sentences compared with the other methods.
This example verifies MLRN's ability to capture implicit correlations within sentences. It also shows that CMIL not only alleviates sentence-level noise but also makes further progress in information utilization.

Conclusion and Future Work
In this paper, we propose a novel MLRN+CMIL model for distantly supervised relation extraction. The MLRN structure alleviates noise by modeling inner-sentence correlations and extracting relevant information. The CMIL strategy is a balanced and robust way to avoid noisy sentences while assigning proper weights to relevant ones. The experimental results show that our approach achieves significant improvements over the baselines and is effective in handling both word-level and sentence-level noise.
In the future, we will try to extend our confidence-based method to the bag level and experiment with the revision network on other tasks to further demonstrate its effectiveness.