How Knowledge Graph and Attention Help? A Quantitative Analysis into Bag-level Relation Extraction

Knowledge Graph (KG) and attention mechanisms have been demonstrated effective in introducing and selecting useful information for weakly supervised methods. However, only qualitative analysis and ablation studies have been provided as evidence. In this paper, we contribute a dataset and propose a paradigm to quantitatively evaluate the effect of attention and KG on bag-level relation extraction (RE). We find that (1) higher attention accuracy may lead to worse performance, as it may harm the model's ability to extract entity mention features; (2) the performance of attention is largely influenced by various noise distribution patterns, which are closely related to real-world datasets; (3) KG-enhanced attention indeed improves RE performance, not through enhanced attention but by incorporating entity priors; and (4) the attention mechanism may exacerbate the issue of insufficient training data. Based on these findings, we show that a straightforward variant of the RE model can achieve significant improvements (6% AUC on average) on two real-world datasets as compared with three state-of-the-art baselines. Our code and datasets are available at https://github.com/zig-kwin-hu/how-KG-ATT-help.


Introduction
Relation Extraction (RE) is crucial for Knowledge Graph (KG) construction and population. Most recent efforts rely on neural networks to learn expressive features from large-scale annotated data and thus correctly extract the relationship between entities. To save manual annotation cost and alleviate the issue of data scarcity, distant supervision relation extraction (DSRE) (Mintz et al., 2009) was proposed and has become increasingly popular, as it can automatically generate large-scale labeled data. DSRE is based on a simple yet effective principle: if there is a relation between two entities in a KG, then all sentences containing mentions of both entities are assumed to express this relation and together form a sentence bag as its annotations. Although effective, distant supervision may introduce noise into a sentence bag when the assumption fails, i.e., some sentences do not describe the target relation (Zeng et al., 2015) (a.k.a. noisy annotation). To alleviate the negative impact of noise, recent studies (Lin et al., 2016; Ji et al., 2017; Du et al., 2018) leveraged attention to select informative instances from a bag. Furthermore, researchers introduced KG embeddings to enhance the attention mechanism (Hu et al., 2019; Han et al., 2018a). The basic idea is to utilize entity embeddings as the query to compute attention scores, so that sentences with high attention weights are more likely to be valid annotations. Previous studies have shown performance gains on DSRE with the attention module and KG embeddings; however, it is still not clear how these mechanisms work, and whether there are limitations to applying them.
In this paper, we aim to provide a thorough and quantitative analysis of the impact of both the attention mechanism and KG on DSRE. By analyzing several public benchmarks including NYT-FB60K (Han et al., 2018a), we observe many disturbing bags, i.e., bags in which all sentences are valid annotations or all are noisy, which shall lead to the failure of attention. As shown in Figure 1, all annotations in the first disturbing bag are valid, yet the learned attention assigns the second annotation a very low weight, which suggests inefficient utilization of annotations and exacerbates the data sparsity issue. In the second bag, all sentences are noisy: can attention and KG still improve the performance? If so, how do they work, and to what extent can they tolerate these disturbing bags? Answering these questions is crucial, since this type of noise is common in practice, and unveiling the working mechanism shall shed light on future research directions, not limited to DSRE.
To achieve this, we propose a paradigm based on a newly curated DSRE benchmark, BagRel-Wiki73K, extracted from FewRel (Han et al., 2018b) and Wikidata¹, for quantitative analysis of attention and KG. With extensive experiments, we draw the following innovative and inspiring findings: (1) The accuracy of attention is inversely proportional to the total noise ratio and disturbing bag ratio of the training data; (2) attention effectively selects valid annotations by comparing their contexts with the semantics of relations, and thus tends to rely more on the context to make predictions. However, it somewhat lowers the model's robustness to noisy sentences that do not express the relation; (3) KG-enhanced attention indeed improves RE performance, surprisingly not via enhanced attention accuracy, but by incorporating entity features to reduce the reliance on contexts when facing noise; (4) attention could hurt the performance especially when there is insufficient training data.
Based on the above observations, we propose a straightforward yet effective model based on pre-trained BERT (Devlin et al., 2018) for RE with Concatenated KG Embedding, namely BRE+CE. Instead of in-bag attention, it breaks the bag apart and ensembles the results of all sentences belonging to the bag. For each sentence, we directly incorporate entity embeddings into BERT, rather than using them to enhance attention, to improve the robustness of extracting both context and mention features. BRE+CE significantly outperforms existing state-of-the-art methods on two publicly available datasets, NYT-FB60K (Han et al., 2018a) and GIDS-FB8K (Jat et al., 2018), by 6% AUC on average. We summarize our contributions as follows: • To the best of our knowledge, ours is the first work to quantitatively analyze the working mechanism of Knowledge Graphs and attention for bag-level RE. • We have conducted extensive experiments to inspire and support the above findings.

¹ dumps.wikimedia.org/wikidatawiki/entities/20201109/
• We demonstrate that a straightforward method based on these findings achieves significant improvements on public datasets.

Related Work
To address the issue of insufficient annotations, Mintz et al. (2009) proposed distant supervision to generate training data automatically, which also introduces much noise. Since then, DSRE has become a standard solution that relies on multi-instance learning from a bag of sentences instead of a single sentence (Riedel et al., 2010; Hoffmann et al., 2011). The attention mechanism (Lin et al., 2016) accelerated this trend through its strong ability to handle noisy instances within a bag (Du et al., 2018). Aside from intra-bag attention, Ye and Ling (2019) also designed inter-bag attention to simultaneously handle bags with the same relation. To deal with only-one-instance bags, a selective gate (SeG) framework was proposed to independently assign weights to each sentence. External KGs have also been incorporated to enhance the attention module (Han et al., 2018a; Hu et al., 2019). However, due to the lack of sentence-level ground truth, it is difficult to quantitatively evaluate the performance of the attention module; previous researchers tend to provide examples as case studies.² Therefore, we aim to fill this research gap by constructing a dataset and providing a framework for thorough analysis.

Preliminary
Knowledge Graph (KG) is a directed graph G = {E, R, T}, where E denotes the set of entities, R denotes the set of relation types in G, and T = {(h, r, t)} ⊆ E × R × E denotes the set of triples. KG embedding models, e.g., RotatE, can preserve the structural information of G in the learned vectors e_h, e_t and e_r. We adopt TransE (Bordes et al., 2013) in our experiments.
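As a quick illustration of the TransE assumption mentioned above (a minimal sketch; the function name and toy vectors are hypothetical, not from the paper), a valid triple (h, r, t) should satisfy e_h + e_r ≈ e_t, so a smaller translation distance means a more plausible triple:

```python
import numpy as np

def transe_score(e_h, e_r, e_t, norm=1):
    """TransE plausibility: valid triples satisfy e_h + e_r ≈ e_t,
    so a smaller distance ||e_h + e_r - e_t|| indicates a more
    plausible triple."""
    return np.linalg.norm(e_h + e_r - e_t, ord=norm)

# Toy 2-d embeddings: the "good" tail fits the translation exactly.
e_h, e_r = np.array([1.0, 0.0]), np.array([0.0, 1.0])
e_t_good, e_t_bad = np.array([1.0, 1.0]), np.array([-2.0, 0.5])
assert transe_score(e_h, e_r, e_t_good) < transe_score(e_h, e_r, e_t_bad)
```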
Bag-level relation extraction (RE) takes a bag of sentences B = {s_1, s_2, ..., s_m} as input. Each sentence s_i in the bag contains the same entity pair (h, t), where h, t ∈ E. The goal is to predict the relation y ∈ R between (h, t).
Attention-based Bag-level RE uses attention to assign a weight to each sentence within a bag.
Given a bag B from the dataset D, an encoder first encodes all sentences of B into vectors {s_1, s_2, ..., s_m} separately. Then an attention module computes an attention weight α_i for each sentence and outputs the weighted sum of {s_i} as the bag representation s:

ω_i = s_i · v_y,  α_i = exp(ω_i) / Σ_j exp(ω_j),  s = Σ_i α_i s_i,

where v_y is the label embedding of relation y in the classification layer. We denote this attention module as ATT in the rest of the paper.

KG-enhanced attention aims to improve the query with the entity embeddings e_h and e_t (Han et al., 2018a):

r_ht = e_t − e_h,  ω_i = r_ht · tanh(W_s s_i + b_s),

where r_ht is regarded as the latent relation embedding (under TransE, e_r ≈ e_t − e_h), and W_s and b_s are learnable parameters. We mark this way of computing ω_i as KA.

Given a bag representation s, the classification layer further predicts a confidence for each relation:

o = W_b s + b_b,  p(y | B) = softmax(o),

where o is a logit vector and W_b, b_b are learnable parameters. During training, the loss is computed by:

L = − (1/n) Σ_{i=1}^{n} log p(y_i | B_i),

where n is the number of training bags in D. Since the classification layer is linear, we can rewrite the bag's logit vector o as a weighted sum of each sentence's logit vector o_i:

o = W_b (Σ_i α_i s_i) + b_b = Σ_i α_i (W_b s_i + b_b) = Σ_i α_i o_i.  (10)

From Equation 10, we can see that the model's output on the whole bag depends on three aspects: (1) the model's output on valid sentences within the bag; (2) the model's output on noisy sentences within the bag; and (3) the attention weights assigned to valid and noisy sentences.
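The ATT computation and the logit decomposition of Equation 10 can be sketched numerically as follows (a toy NumPy sketch with hypothetical shapes; in the paper the sentence vectors come from a BERT encoder):

```python
import numpy as np

def softmax(x):
    x = x - x.max()          # numerical stability
    e = np.exp(x)
    return e / e.sum()

def att_bag_logits(S, v_y, W_b, b_b):
    """Selective attention over a bag.
    S: (m, d) sentence vectors; v_y: (d,) relation/label embedding.
    Returns the attention weights and the bag logits o = W_b s + b_b."""
    alpha = softmax(S @ v_y)          # ω_i = s_i · v_y, then softmax
    s_bag = alpha @ S                 # s = Σ_i α_i s_i
    return alpha, W_b @ s_bag + b_b

rng = np.random.default_rng(0)
S = rng.normal(size=(3, 4)); v_y = rng.normal(size=4)
W_b, b_b = rng.normal(size=(5, 4)), rng.normal(size=5)
alpha, o = att_bag_logits(S, v_y, W_b, b_b)

# Because the classifier is linear and Σ α_i = 1, the bag logit equals
# the attention-weighted sum of per-sentence logits (Equation 10).
o_per_sentence = S @ W_b.T + b_b
assert np.allclose(o, alpha @ o_per_sentence)
```

The final assertion verifies the decomposition the section relies on: the model's bag-level output is exactly a mixture of its sentence-level outputs, weighted by attention.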

Benchmark
To quantitatively evaluate the effect of attention and KG on Bag-level RE, we first define two metrics to measure the noise pattern (Section 4.1). Then, we construct a KG and a Bag-level RE dataset (Section 4.2). Finally, we introduce a general evaluation framework to assess attention, KG and the entire RE model (Section 4.3).

Metrics Describing Noise Pattern
To analyze how attention module functions on different noise patterns, we first design 2 metrics to describe the noise pattern: Noise Ratio (NR) and Disturbing Bag Ratio (DR).
Noise Ratio (NR) represents the proportion of noisy sentences in the dataset. Given a bag B_i and its relation label y_i, a sentence s_ij ∈ B_i is noisy if its context does not express y_i. Let Isn(s_ij, y_i) be an indicator function telling whether s_ij is noisy. Then NR is defined as:

NR = Σ_{i=1}^{n} Σ_{j=1}^{|B_i|} Isn(s_ij, y_i) / Σ_{i=1}^{n} |B_i|,

where |B_i| is the size of B_i and n is the total number of bags.
Disturbing Bag Ratio (DR) means the proportion of disturbing bags in the dataset. A bag is disturbing if all sentences in it are valid or all sentences are noisy. Formally, we use the function Isd(B_i) to indicate whether a bag is disturbing:

Isd(B_i) = Π_{j=1}^{|B_i|} Isn(s_ij, y_i) + Π_{j=1}^{|B_i|} (1 − Isn(s_ij, y_i)).  (12)

Then we define DR as follows:

DR = (1/n) Σ_{i=1}^{n} Isd(B_i).
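The two metrics can be computed directly from their definitions. Below is a minimal sketch (function names and the toy `is_noise` oracle are illustrative; in BagRel the noise labels come from the synthesis procedure of Section 4.2):

```python
def noise_ratio(bags, is_noise):
    """NR: fraction of noisy sentences over all sentences.
    bags: list of (sentences, label); is_noise(s, y) -> 0/1."""
    noisy = sum(is_noise(s, y) for sents, y in bags for s in sents)
    total = sum(len(sents) for sents, _ in bags)
    return noisy / total

def disturbing_ratio(bags, is_noise):
    """DR: fraction of bags whose sentences are all noisy or all valid."""
    def disturbing(sents, y):
        flags = [is_noise(s, y) for s in sents]
        return all(flags) or not any(flags)
    return sum(disturbing(sents, y) for sents, y in bags) / len(bags)

# Toy example: a "sentence" is just its context relation, and it is
# noisy iff that relation differs from the bag label.
is_noise = lambda s, y: int(s != y)
bags = [(["crosses", "isA"], "crosses"),   # mixed bag -> not disturbing
        (["isA", "isA"], "crosses")]       # all noisy  -> disturbing
assert noise_ratio(bags, is_noise) == 3 / 4
assert disturbing_ratio(bags, is_noise) == 1 / 2
```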

Dataset Construction
Based on FewRel and Wikidata, we construct a Bag-level RE dataset containing multiple training sets with different noise patterns, a test set and a development set. For each sentence in the bags, there is a ground truth attention label indicating whether it is a valid sentence or noise. We also construct a KG containing all entities in the RE dataset by retrieving one-hop triples from Wikidata.

Figure 2: An illustration of the sentence synthesis process. Valid sentences: "This road begins at the end of the toll bridge over the Wabash River." (s_1, crosses), "Guillemard Bridge is a railway bridge across Sungai Kelantan." (s_2, crosses), and "Oru Kai Osai was a Tamil soap opera that aired on Zee Tamil." (s_3, isA). Synthesized valid sentence: "This road begins at the end of the Guillemard Bridge over the Sungai Kelantan." (s_4). Synthesized noisy sentence: "Guillemard Bridge was a Sungai Kelantan soap opera that aired on Zee Tamil." (s_5).

Synthesizing Sentences FewRel is a sentence-level RE dataset covering 80 relations, each with 700 valid sentences. Each sentence has a unique entity pair, and every sentence together with its entities and relation label forms a tuple (s, h, t, y). We synthesize valid and noisy sentences for the same entity pair for data augmentation.
The first step is to divide the sentences of each relation into 3 sets: train_FewRel, test_FewRel and dev_FewRel, with 500, 100 and 100 sentences, respectively. Then, for each tuple (s, h, t, y), we aim to augment it into a bag B whose sentences all contain (h, t). The sentences in B are either the original s, a synthesized valid sentence, or a synthesized noisy sentence. We synthesize sentences in the form of (s', h, t, y, z), where z denotes the attention label (1 for valid, 0 for noisy). Specifically, to synthesize a sentence, we randomly replace the source pair of entity mentions with another target entity pair while keeping the context unchanged. Thus, depending on whether the context expresses the same relation as the entity pair's label, we can automatically assign an attention label.
We illustrate the synthesizing process in Figure 2.
To generate a valid sentence, we randomly select another sentence (s_1, h_1, t_1, crosses) from train_FewRel that is labeled with the same relation as s_2. Then we replace its entity mentions h_1 and t_1 with h_2 and t_2. The output is (s_4, h_2, t_2, crosses, 1); since its context correctly describes crosses, we regard s_4 as valid. For the noisy sentence, we randomly select a sentence (s_3, h_3, t_3, isA) under another relation. With a similar process, we get a synthesized sentence (s_5, h_2, t_2, crosses, 0); because the context of s_5 does not express the target relation, we label it as noise.
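The synthesis step above can be sketched as a simple string replacement (a simplified sketch; the function name is hypothetical, and real mention replacement would operate on annotated spans rather than raw `str.replace`):

```python
def synthesize(template, target_h, target_t, target_rel):
    """Swap the template's entity mentions for the target pair while
    keeping the context. The attention label z is 1 (valid) iff the
    template's context expresses the target relation, else 0 (noisy)."""
    src_sent, src_h, src_t, src_rel = template
    new_sent = src_sent.replace(src_h, target_h).replace(src_t, target_t)
    z = 1 if src_rel == target_rel else 0
    return (new_sent, target_h, target_t, target_rel, z)

s1 = ("This road begins at the end of the toll bridge over the Wabash River.",
      "toll bridge", "Wabash River", "crosses")
s3 = ("Oru Kai Osai was a Tamil soap opera that aired on Zee Tamil.",
      "Oru Kai Osai", "Zee Tamil", "isA")

# Valid synthesis (same relation) vs. noisy synthesis (different relation)
s4 = synthesize(s1, "Guillemard Bridge", "Sungai Kelantan", "crosses")
s5 = synthesize(s3, "Guillemard Bridge", "Sungai Kelantan", "crosses")
assert s4[-1] == 1 and s5[-1] == 0
```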
Training Sets with Different Noise Patterns As defined in Section 4.1, we use NR and DR to measure the noise pattern of a bag-level RE dataset. By controlling the number of synthesized noisy sentences in each bag and the total ratio of noise among all sentences, we construct several training sets with different patterns. In the following sections, we denote a training set with NR x and DR y as train_{x,y}; higher x and y mean that noisy sentences and disturbing bags account for larger proportions.
For example, in Figure 2, assume there are 4 sentences in train_FewRel. In the first set, for each sentence we synthesize two noisy sentences that form a bag together with the original sentence; each bag thus contains 3 sentences (1 valid and 2 noisy), giving NR = 2/3 and DR = 0. For the other 3 sets, the number of synthesized noisy sentences equals the sum of original and synthesized valid sentences, so they all have NR = 1/2. Since we define bags containing no valid sentences or no noisy sentences as disturbing bags, the third and fourth sets have 2 and 4 disturbing bags, with DR = 1/2 and 1, respectively.
Test Set and Development Set We also construct a test set and a development set. Similar to the second set in Figure 2, each bag in the test/dev sets contains two sentences; the NR of both sets is 1/2 and the DR is 0, i.e., every bag contains one valid sentence and one noisy sentence. Instead of multiple test sets with different noise patterns, we use a single test set so that the evaluation of different models is consistent. To avoid information leakage, when constructing the train_{x,y}, test and development sets, the contexts of synthesized sentences come only from train_FewRel, test_FewRel and dev_FewRel, respectively.
The final BagRel contains 9 training sets, 1 test set and 1 development set, as listed in Table 1. The NR of a training set is 1/3, 1/2 or 2/3, and its DR is 0, 1/2 or 1. The NR of both the test and development sets is 1/2, and their DR is 0. All sets contain 80 relations. Every bag in the training sets with NR 1/3, 1/2 and 2/3 contains 3, 2 and 3 sentences, respectively.

KG Construction
To evaluate the impact of KG on the attention mechanism, we construct a KG based on Wikidata. Denoting the set of entities appearing in FewRel as E, we link each entity in E to Wikidata by its Freebase ID, and then extract all one-hop triples whose head and tail entities are both in E, forming the triple set T (KG-73K). To evaluate the effect of structural information from the KG, we also construct a random KG whose triple set is T̃. Specifically, for each triple (h, r, t) in T, we corrupt it into (h, r̃, t) by replacing r with a random relation r̃ ≠ r, destroying the prior knowledge within the KG. KG-73K and KG-73K-random have the same scale: 72,954 entities, 552 relations and 407,821 triples. Finally, we obtain BagRel-Wiki73K, comprising the bag-level RE sets and KG-73K.
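The corruption step for the random KG can be sketched as follows (a minimal sketch; the function name and toy triples are illustrative):

```python
import random

def corrupt_kg(triples, relations, seed=0):
    """Build the random KG: replace each triple's relation with a
    uniformly sampled *different* relation, destroying the KG's prior
    knowledge while keeping the same entities and scale."""
    rng = random.Random(seed)
    corrupted = []
    for h, r, t in triples:
        r_bad = rng.choice([x for x in relations if x != r])
        corrupted.append((h, r_bad, t))
    return corrupted

triples = [("Guillemard Bridge", "crosses", "Sungai Kelantan")]
bad = corrupt_kg(triples, ["crosses", "isA", "locatedIn"])
assert len(bad) == len(triples) and bad[0][1] != "crosses"
```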

Evaluation Framework
We first define several measurements to evaluate the effect of the attention mechanism and KG: Attention Accuracy (AAcc), Area Under the precision-recall Curve (AUC), AUC on Valid sentences (AUCV) and AUC on Noisy sentences (AUCN).
AAcc measures the attention module's ability to assign higher weights to valid sentences than to noisy ones. Given a non-disturbing bag (a bag containing both valid and noisy sentences) B_i = {(s_j, h_i, t_i, y_i, z_j)} and its attention weights {α_j}, the AAcc of the bag is the fraction of (valid, noisy) sentence pairs in which the valid sentence receives the larger weight:

AAcc_i = Σ_{j: z_j=1} Σ_{k: z_k=0} I(α_j > α_k) / (m_v · m_n),

where I(·) is the indicator function and m_v, m_n are the numbers of valid and noisy sentences in B_i; the overall AAcc averages over bags. AAcc is designed specifically for non-disturbing bags: on a disturbing bag, where all sentences are noisy or all are valid, it is meaningless to evaluate the attention module's performance. Therefore, all bags in the test/dev sets of our BagRel-Wiki73K are non-disturbing, so the evaluation results can better present how the attention module works without distraction.
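The per-bag AAcc above can be computed with a few lines (a sketch of one plausible reading of the metric, since the paper's exact formula is not reproduced here; the function name is hypothetical):

```python
def bag_aacc(weights, labels):
    """AAcc for one non-disturbing bag: over all (valid, noisy)
    sentence pairs, the fraction where the valid sentence gets the
    larger attention weight. labels: 1 = valid, 0 = noisy."""
    valid = [w for w, z in zip(weights, labels) if z == 1]
    noisy = [w for w, z in zip(weights, labels) if z == 0]
    pairs = [(v, n) for v in valid for n in noisy]
    return sum(v > n for v, n in pairs) / len(pairs)

# One valid, one noisy sentence; attention ranks the valid one higher.
assert bag_aacc([0.8, 0.2], [1, 0]) == 1.0
# Valid sentence beats one noisy sentence but not the other.
assert bag_aacc([0.3, 0.5, 0.2], [1, 0, 0]) == 0.5
```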
AUC is a standard metric to evaluate a DSRE model's performance on the bag-level test set. As mentioned in Section 3, an attention-based model's performance on non-disturbing bags relies on three aspects: (1) AAcc, (2) the model's performance on valid sentences, and (3) its performance on noisy sentences. We use AUCV and AUCN to measure the second and third aspects, respectively. The difference between AUC and AUCV is that AUC is computed on the original test set, while AUCV is computed on the Valid-only test set D_v = {B_i^v}, where B_i^v has the same label as B_i but removes all noisy sentences within it. Since there is no noisy context feature in D_v, models can utilize both entity mentions and contexts to achieve a high AUCV. Conversely, AUCN is the AUC computed on the Noise-only test set D_n = {B_i^n}, where B_i^n removes all valid sentences in B_i. Since all context features in D_n are noisy, to achieve a high AUCN models have to ignore context and rely more on mention features to make predictions.
AUC, AUCV and AUCN range from 0 to 1; higher values indicate that a model makes better predictions on the whole bag, on valid sentences, and on noisy sentences, respectively.
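Constructing the Valid-only and Noise-only test sets from labeled bags is a simple filter (a sketch with hypothetical names; sentences here are (text, z) pairs carrying the attention label from Section 4.2):

```python
def split_test_sets(bags):
    """Build the Valid-only (D_v) and Noise-only (D_n) test sets by
    keeping, in each bag, only sentences whose attention label z is
    1 (valid) or 0 (noisy), respectively. Bag labels are preserved."""
    def keep(bag, z):
        sentences, label = bag
        return ([s for s in sentences if s[-1] == z], label)
    D_v = [keep(b, 1) for b in bags]
    D_n = [keep(b, 0) for b in bags]
    return D_v, D_n

# Each test bag has exactly one valid and one noisy sentence.
bags = [([("context expressing crosses", 1),
          ("context expressing isA", 0)], "crosses")]
D_v, D_n = split_test_sets(bags)
assert len(D_v[0][0]) == 1 and D_v[0][0][0][1] == 1
assert len(D_n[0][0]) == 1 and D_n[0][0][0][1] == 0
```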
To evaluate the effects of attention and KG, we design two straightforward bag-level RE models without the attention module, BRE and BRE+CE. By comparing their performance with BRE+ATT (BRE with the attention module) and BRE+KA (BRE with the KG-enhanced attention module), we can better understand the roles of ATT and KG-enhanced ATT.
BRE uses BERT (Devlin et al., 2018) as the encoder. Specifically, we follow the approach of (Peng et al., 2020; Soares et al., 2019): entity mentions in sentences are highlighted with special markers before and after each mention, and the concatenation of the head and tail entity representations is used as the sentence representation s. Since BRE has no attention mechanism, it breaks the bags apart and computes the loss on each sentence:

L = − Σ_{i=1}^{n} Σ_{j=1}^{|B_i|} log p(y_i | s_ij).

BRE can be viewed as a special case of BRE+ATT whose attention module assigns every sentence in every bag the same weight 1. During inference, given a bag, BRE uses the mean of the sentence-level predictions as the whole bag's prediction:

p(y | B) = (1/m) Σ_{j=1}^{m} p(y | s_j).

BRE+CE concatenates an additional feature vector r_ht, defined from the entity embeddings of h and t, with the BERT output. The concatenated vector is used as the representation of the sentence and fed into the classification layer.
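BRE's bag-breaking inference rule is just an average over sentence-level distributions, which can be sketched as (toy probabilities; in practice each row would come from BERT followed by the softmax classifier):

```python
import numpy as np

def bre_bag_prediction(sentence_probs):
    """BRE inference: average the per-sentence predicted distributions
    to obtain the bag-level prediction (every sentence weighted 1/m)."""
    return np.mean(sentence_probs, axis=0)

# Two sentences scored over three relations; the bag prediction is
# their element-wise mean and remains a valid distribution.
probs = np.array([[0.7, 0.2, 0.1],
                  [0.5, 0.4, 0.1]])
bag = bre_bag_prediction(probs)
assert np.allclose(bag, [0.6, 0.3, 0.1])
```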

Experiment
We apply our proposed framework to BagRel-Wiki73K and two real-world datasets to explore the following questions: • How does the noise pattern affect the attention module?
• Does the attention mechanism promote the RE model's performance?
• How does KG affect the attention mechanism?

Experimental Setup
For fair comparison, all baselines share the same encoding structure as BRE. The attention-based models include BRE+ATT, BRE+KA and BRE+SeG, where SeG is an advanced attention mechanism that achieves state-of-the-art performance on NYT-FB60K; briefly, SeG uses sigmoid instead of softmax to compute the attention weight of each instance in a bag. The models without attention are BRE and BRE+CE. To check the effect of the noise pattern, we train models on the different training sets. As a reminder, train_{x,y} is a training set whose NR and DR are x and y, respectively.

Noise Pattern v.s. Attention Accuracy
We train BRE+ATT on the 9 training sets with different noise patterns. As shown in Figure 3: (1) a higher noise ratio (NR) makes it harder for the model to highlight valid sentences, leading to lower attention accuracy (AAcc); (2) a higher disturbing bag ratio (DR) also results in lower AAcc, indicating that disturbing bags challenge the attention module. Based on these results, we claim that the noise pattern of the training set largely affects the attention module's effectiveness.

Attention v.s. RE Performance
To quantitatively analyze the effect of the attention mechanism, we compare the performance of BRE and BRE+ATT in Table 2.

Table 2: Test results of models trained on different training sets. In the Model column, X-Y means model X trained on training set Y. Among the 3 training sets, train_{1/2,1} has the most disturbing bags, while train_{1/2,0} has none.
A higher AUCV indicates a stronger ability of the model itself in an ideal setting without any noise, and a higher AUCN indicates higher robustness of the model to noise. Surprisingly, when using the same training set train_{1/2,0}, the AUC of the attention-enhanced model is lower than that of the model without attention (0.878 vs. 0.910). Moreover, BRE+ATT has its lowest AUC on train_{1/2,0}, which has no disturbing bags, even though its highest AAcc there (0.881) suggests that the attention module does effectively select valid sentences. Why does the most effective attention module lead to the worst performance? The reason is that BRE+ATT-train_{1/2,0} has a much lower AUCN, which indicates that it is less robust to noisy sentences.
Does an effective attention module necessarily hurt the model's robustness to noise? This runs against intuition. To answer the question, we draw Figure 4 by assigning fixed attention weights to sentences during training. Specifically, each bag in train_{1/2,0} has one valid sentence and one noisy sentence; we assign a fixed weight α to the valid one and 1 − α to the noisy one, instead of computing α with the attention module, and then test the resulting model's AUCN and AUCV. When the valid sentences receive higher attention weights, the AUCV curve rises slightly, indicating that the model's performance on valid sentences indeed improves; meanwhile, the AUCN curve drops sharply. This demonstrates that effective attention weakens the model's robustness to noise. The reason is that a model with a high-performance attention module prefers to utilize context information instead of entity mention features, so it usually fails when most contexts are noisy. This also explains the results in Table 2: BRE+ATT-train_{1/2,0} has the highest AAcc, i.e., it assigns very low weights to noisy sentences, so the gain in AUCV cannot make up for the loss in AUCN, resulting in a worse AUC.
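The fixed-weight probe described above amounts to replacing the learned attention with a constant mixing coefficient (a toy sketch; the function name and 2-d vectors are illustrative, and in the experiment the weighted representation feeds a trainable classifier):

```python
import numpy as np

def fixed_weight_bag_repr(s_valid, s_noisy, alpha):
    """Fixed-attention probe: weight the valid sentence by alpha and
    the noisy one by 1 - alpha instead of computing attention."""
    return alpha * s_valid + (1 - alpha) * s_noisy

s_valid, s_noisy = np.array([1.0, 0.0]), np.array([0.0, 1.0])
# alpha = 0.5 reduces to BRE-style equal weighting; alpha -> 1 trains
# almost exclusively on valid contexts, which Figure 4 shows raises
# AUCV slightly but sharply lowers AUCN.
assert np.allclose(fixed_weight_bag_repr(s_valid, s_noisy, 0.5), [0.5, 0.5])
assert np.allclose(fixed_weight_bag_repr(s_valid, s_noisy, 1.0), s_valid)
```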
In conclusion, the attention module can effectively select valid sentences during training and testing, but it has an underlying drawback: it can hurt the model's ability to predict based on entity mention features, which are important in RE tasks (Peng et al., 2020), leading to worse overall performance.

To measure KG's effect when combined with the attention mechanism, we compare the results of KA with ATT while keeping the other parts of the model unchanged. As shown in Table 3, when trained on train_{1/2,0}, the KG-enhanced model (KA-train_{1/2,0}) has a lower AAcc than the model without KG (ATT-train_{1/2,0}) (0.857 vs. 0.881), while its AUC is higher (0.932 vs. 0.878). This is because the KA version has a higher AUCN (0.560) along with comparable AUCV and AAcc. Thus, the KG-enhanced model performs better on noisy bags, leading to better overall RE performance.
In addition, comparing Table 2 and Table 3, KA shows lower AAcc and higher AUCN than ATT on all three training sets. This further demonstrates that KG does not promote the model's performance by improving the attention module's accuracy, but by enhancing the robustness of the encoder and classification layer to noisy sentences. This makes sense because the information from KG focuses on entities rather than contexts; by incorporating KG, the model relies more on entity mention features instead of noisy context features, and thus becomes better at classifying noisy sentences.
Moreover, comparing BRE+KA_rand with BRE+KA on train_{1/2,0}, we observe that after incorporating entity embeddings learned from a random KG, BRE+KA_rand has a much lower attention accuracy. This indicates that misleading knowledge hurts the attention mechanism.

Attention v.s. Data Sparsity
The attention module assigns low weights to part of the training sentences. When training data is insufficient, not making full use of all training examples could aggravate the data sparsity issue. We therefore compare the performance of models trained on subsets of train_{1/2,1/2}. From Figure 5, we can see that as the size of the training data decreases, the performance gap between BRE+ATT and BRE+CE grows, because the latter fully utilizes every example by assigning the same weight 1 to all sentences. We also inspect each model's attention weights: BRE+SeG assigns all sentences weights > 0.9, so its performance drop is similar to that of the model without attention. Thus, we claim that the traditional attention mechanism can exacerbate the model's sensitivity to insufficient data. This motivates a better attention mechanism for few-shot settings, which we leave for future work.

Stability of Attention v.s. Noise Pattern
From the results in Tables 2 and 3, and based on the previous observations, we find that BRE and BRE+CE avoid the latent drawbacks of the attention mechanism and perform stably on datasets with different noise patterns, making them competitive compared with prior baselines. To examine whether they work on real-world bag-level DSRE datasets, we compare our method with 3 previous baselines on NYT-FB60K (Han et al., 2018a) and GIDS-FB8K (Jat et al., 2018). We select JointE (Han et al., 2018a), RELE (Hu et al., 2019) and SeG as baselines because they achieve state-of-the-art performance on bag-level RE. To collect AUC results, we carefully re-ran their published code using the hyperparameters suggested in the original papers. We also draw precision-recall curves following prior work. As shown in Table 4 and Figure 6, our method BRE+CE largely outperforms existing methods on NYT-FB60K and has comparable performance on GIDS-FB8K. This result demonstrates that we avoid the attention mechanism's latent drawback of hurting the model's robustness. Furthermore, the improvement on NYT-FB60K is promising (around 13% AUC) for two reasons: (1) NYT-FB60K is a noisy dataset with prevalent disturbing bags, similar to our synthesized datasets; (2) NYT-FB60K is highly imbalanced and most relation types have only limited training data, while all relation types in our balanced datasets have the same number of training examples; thus BRE+CE and BRE achieve a much higher improvement on NYT-FB60K than on the synthesized datasets. In conclusion, the high performance not only validates our claim that the attention module may not perform well on noisy and insufficient training data, but also verifies that our thorough analysis of attention and KG has practical significance.

From the results in Table 5, we provide a direct comparison between models with KG (BRE+KA, BRE+CE) and models without KG (BRE+ATT, BRE).
Both ways of utilizing KG (combined with attention, or concatenated as additional features) outperform the methods not using KG. This demonstrates that the prior knowledge from the KG is beneficial for the relation extraction task. Beyond our naive BRE+CE, we expect that a carefully designed mechanism for incorporating KG can lead to further improvement, which we leave for future work.

Conclusion
In conclusion, we construct a set of datasets and propose a framework to quantitatively evaluate how the attention module and KG work in bag-level RE. Based on the findings, we demonstrate the effectiveness of a straightforward solution on this task. Experimental results support our claims that the accuracy of the attention mechanism depends on the noise pattern of the training set, and that, although it effectively selects valid sentences, the attention mechanism can harm the model's robustness to noisy sentences and aggravate the data sparsity issue. As for KG's effect on attention, we observe that it promotes the model's performance by enhancing its robustness with external entity information, rather than by improving attention accuracy.
In the future, we are interested in developing a more general evaluation framework for other tasks such as question answering, improving the attention mechanism to be robust to noise and insufficient data, and designing an effective approach to incorporate KG knowledge to guide model training.