Two Training Strategies for Improving Relation Extraction over Universal Graph

This paper explores how the Distantly Supervised Relation Extraction (DS-RE) can benefit from the use of a Universal Graph (UG), the combination of a Knowledge Graph (KG) and a large-scale text collection. A straightforward extension of a current state-of-the-art neural model for DS-RE with a UG may lead to degradation in performance. We first report that this degradation is associated with the difficulty in learning a UG and then propose two training strategies: (1) Path Type Adaptive Pretraining, which sequentially trains the model with different types of UG paths so as to prevent the reliance on a single type of UG path; and (2) Complexity Ranking Guided Attention mechanism, which restricts the attention span according to the complexity of a UG path so as to force the model to extract features not only from simple UG paths but also from complex ones. Experimental results on both biomedical and NYT10 datasets prove the robustness of our methods and achieve a new state-of-the-art result on the NYT10 dataset. The code and datasets used in this paper are available at https://github.com/baodaiqin/UGDSRE.


Introduction
Relation Extraction (RE) is an important task in Natural Language Processing (NLP). RE aims to turn unstructured text into a structured Knowledge Graph (KG), which is typically stored as (e1, r, e2) triplets, where e1 is a head entity, r is a relation and e2 is a tail entity, such as (aspirin, may treat, pain) and (Guy Maddin, place lived, Winnipeg). RE can be formulated as a classification task that predicts a predefined relation r from evidences annotated with the entity pair (e1, e2).
One obstacle encountered when building an RE system is the need for a large amount of manually annotated training instances, which are expensive and time-consuming to create. To cope with this difficulty, Mintz et al. (2009) propose Distant Supervision (DS) to automatically generate training samples by linking KGs to texts. They assume that if (e1, r, e2) is in a KG, then all sentences that contain (e1, e2) (hereafter, sentence evidences) express the relation r. It is well known that the DS assumption is too strong and inevitably introduces the wrong labeling problem, as in sentence evidences (1b) and (2) below, which fail to express the may treat and place lived relations respectively.
(1) a. [Aspirin]_e1 is widely used for short-term treatment of [pain]_e2, fever or colds.
b. The tumor was remarkably large in size, and [pain]_e2 unrelieved by [aspirin]_e1.
(2) He is now finishing a documentary about [Winnipeg]_e2, the final installment of a personal trilogy that began with "Cowards Bend the Knee" (a 2003 film that also featured a hapless hero named [Guy Maddin]_e1).
Recently, neural network models with attention mechanisms have been proposed to alleviate the wrong labeling problem and attend to informative sentence evidences such as (1a) (Lin et al., 2016; Ji et al., 2017; Du et al., 2018; Jat et al., 2018; Han et al., 2018a,b). However, a large portion of entity pairs may lack informative sentence evidences that explicitly express their relation, which makes Distantly Supervised Relation Extraction (DS-RE) even more challenging (Sun et al., 2019).
To compensate for the lack of informative sentence evidences, Dai et al. (2019) utilize multi-hop paths connecting a target entity pair (hereafter, paths) over a KG as extra evidences for DS-RE. An example of such a multi-hop KG path can be seen in Figure 1. The model of Dai et al. (2019) uses such multi-hop paths as additional features for predicting the relation between a given target entity pair (e1, e2), which is reported to be effective for performance improvement. However, KGs are often highly incomplete (Min et al., 2013) and may be too sparse to provide enough informative paths in practice, which may hamper the effectiveness of multi-hop paths. Given this background, in this study we take one step further, aiming to induce maximal distant supervision signals from both a KG and a large text collection (hereafter, Text). For this purpose, we consider using multi-hop paths over a Universal Graph (UG) as extra features for DS-RE. Here, we define a UG as a joint graph representation of both KG and Text, where each node represents an entity from the KG or Text, and each edge indicates a KG relation or a textual relation, as shown in Figure 1. The path p2 in the figure is an example of a UG path, comprising a textual edge TR1, a KG edge may treat, and another textual edge TR2. By augmenting the original KG with textual edges, one can expect far more chances to find informative path evidences between any given target entity pair, because the number of textual edges is likely to be much larger than the number of KG edges (note that one can collect as many textual edges as needed from a raw text corpus with an entity linker). Extending a KG to a UG, therefore, may allow a DS-RE model to learn richer distant supervision signals.
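As a concrete illustration, a UG can be thought of as a labeled multigraph whose edges carry either a KG relation or the sentence linking two entities. The sketch below shows one minimal way to hold such a structure in Python; the entity names, relations, and sentence are illustrative examples, not taken from the datasets used in this paper.

```python
# Minimal sketch of a Universal Graph (UG): nodes are entities, and each
# edge carries either a KG relation or a textual relation (the sentence
# that mentions both entities). All names below are illustrative.
from collections import defaultdict

class UniversalGraph:
    def __init__(self):
        # entity -> list of (neighbor, label, kind) edges
        self.edges = defaultdict(list)

    def add_kg_edge(self, e1, relation, e2):
        self.edges[e1].append((e2, relation, "KG"))

    def add_text_edge(self, e1, sentence, e2):
        # A textual edge keeps the raw sentence as its label.
        self.edges[e1].append((e2, sentence, "TEXT"))

ug = UniversalGraph()
ug.add_kg_edge("aspirin", "may_treat", "pain")
ug.add_text_edge("aspirin", "Aspirin irreversibly inhibits COX-1 ...", "COX-1")
```

A path over this structure alternates entities and edge labels; a path containing only KG edges is a KG path, one containing only textual edges is a Textual path, and a mixture yields a Hybrid path.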
The idea of using multi-hop paths over a UG is not new on its own. For example, Toutanova et al. (2015) propose to use a UG for knowledge graph completion, and Das et al. (2017b) propose a model trained to reason over a UG for question answering. However, no prior study has explored an effective way to use a UG for the task of DS-RE from text. In fact, finding an effective way of using a UG for DS-RE is not as simple as it may seem: as we report in this paper, a straightforward extension of the Dai et al. (2019) model to the UG setting may result in performance degradation.
Motivated by this, in this paper we address how one can make effective use of a UG for DS-RE. We first report our observation that a straightforward extension of the Dai et al. (2019) model to the UG setting tends to allocate the majority of attention to only a limited set of UG paths, such as short KG paths, and misses out on learning from a wide range of UG paths (§4.1), which hinders performance gains. To alleviate the negative effect of this attention bias and realize the potential of UG paths, we propose two training (or debiasing) strategies: (1) Path Type Adaptive Pretraining (§4.2), which aims to improve the adaptability of the model to various UG paths; and (2) a Complexity Ranking Guided Attention mechanism (§4.3), which enables the model to learn from both simple and complex UG paths. Experimental results on both the biomedical and NYT10 (Riedel et al., 2010) datasets show that: (1) UG paths have the potential to bring performance gains for DS-RE compared with KG paths; and (2) the proposed training methods are effective in fully exploiting this potential, as they significantly and consistently outperform several baselines on both datasets and achieve a new state-of-the-art result on the NYT10 dataset. Our work differs from the studies mentioned above in two ways: (i) we utilize UG paths as extra evidences for the task of DS-RE from text; and (ii) we take into account the factor of attention bias while encoding UG paths and propose two effective debiasing methods to exploit the potential of UG paths for DS-RE.

Base Model
We select the DS-RE model proposed by Dai et al. (2019) as our base model and extend it to our UG setting. Given a target entity pair (e1, e2), a bag of corresponding sentence evidences S_r = {s1, ..., sn} and a bag of UG paths P_r = {p1, ..., pm}, the base model aims to measure the probability of (e1, e2) having a predefined relation r (including the empty relation NA). The base model consists of four main modules: KG Encoder, Sentence Evidence Encoder, Path Evidence Encoder and Relation Classification Layer, as shown in Figure 2.

KG Encoder
Suppose we have a KG containing a set of fact triplets O = {(e1, r, e2), ...}, where each fact triplet consists of two entities e1, e2 ∈ E and their relation r ∈ R. Here, E and R stand for the sets of entities and relations respectively.
The KG Encoder then encodes e1, e2 ∈ E and their relation r ∈ R into low-dimensional vectors h, t ∈ R^d and r ∈ R^d respectively, where d is the dimensionality of the embedding space. The KG Encoder adopts TransE (Bordes et al., 2013) to score a given triplet. Specifically, given a triplet (e1, r, e2), TransE evaluates its plausibility via Equation 1:

f_r(e1, e2) = b - ||h + r - t||,    (1)

where b is a bias constant. A latent relation embedding r_ht for (e1, e2) is defined via Equation 2:

r_ht = t - h.    (2)

The conditional probability can be formalized over all fact triplets O as follows:

P((e1, r, e2) | O, θE, θR) = exp(f_r(e1, e2)) / Σ_{r'∈R} exp(f_{r'}(e1, e2)),

where θE and θR are the parameters for entities and relations respectively.

[Figure 2: Overview of the base model and our proposed Complexity Ranking Guided Attention mechanism (§4.3). The base model takes the sentence evidences (e.g., s1, ...) containing a target entity pair and the UG paths (e.g., p1, ...) connecting the entity pair as input for predicting their relation. The KG embeddings of the entity pair (i.e., h and t) are used for calculating the attention over these sentences and paths. The Complexity Ranking Guided Attention mechanism forces the model to attend to both simple UG paths (e.g., p1 ∼ pj) and complex ones.]
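The TransE scoring of Equation 1 and the softmax over candidate relations can be sketched numerically as follows; the embeddings, relation names, and the bias value b are toy values chosen for illustration, not learned parameters.

```python
import math

def transe_score(h, r, t, b=7.0):
    """Plausibility of (e1, r, e2) as in Equation 1: b - ||h + r - t||."""
    dist = math.sqrt(sum((hi + ri - ti) ** 2 for hi, ri, ti in zip(h, r, t)))
    return b - dist

def relation_probs(h, t, relation_embs, b=7.0):
    """Softmax of the TransE scores over all candidate relations."""
    scores = {name: transe_score(h, r, t, b) for name, r in relation_embs.items()}
    z = sum(math.exp(s) for s in scores.values())
    return {name: math.exp(s) / z for name, s in scores.items()}

h = [0.2, 0.1]
t = [0.7, 0.4]
rels = {"may_treat": [0.5, 0.3], "unrelated": [-0.6, 0.8]}
probs = relation_probs(h, t, rels)
# h + may_treat lands exactly on t, so "may_treat" gets the higher probability
```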

Sentence Evidence Encoder
Given the bag of sentence evidences S_r = {s1, ..., sn} for a target entity pair (e1, e2), the Sentence Evidence Encoder encodes it into a bag-level vector representation s_all. Each sentence si is first encoded into a vector s_i with a CNN-Max encoder (see Appendix §A.1). The bag-level representation is then calculated with selective attention over the sentences, via Equations 3 and 4:

a_i = exp(r_ht · tanh(W s_i + b)) / Σ_j exp(r_ht · tanh(W s_j + b)),    (3)

s_all = Σ_i a_i s_i,    (4)

where r_ht is from Equation 2, W is the weight matrix, b is the bias vector, and a_i is the weight for the i-th sentence in S_r.
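A minimal sketch of this selective attention over pre-computed sentence vectors follows; the exact parameterization a_i ∝ exp(r_ht · tanh(W s_i + b)) is the common form used in selective-attention DS-RE models, and the vectors below are toy values.

```python
import math

def tanh_vec(v):
    return [math.tanh(x) for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def matvec(W, v):
    return [dot(row, v) for row in W]

def selective_attention(r_ht, sent_vecs, W, b):
    """Bag-level representation s_all = sum_i a_i * s_i, where each a_i is a
    softmax over the alignment r_ht . tanh(W s_i + b)."""
    logits = [dot(r_ht, tanh_vec([wv + bv for wv, bv in zip(matvec(W, s), b)]))
              for s in sent_vecs]
    z = sum(math.exp(l) for l in logits)
    weights = [math.exp(l) / z for l in logits]
    dim = len(sent_vecs[0])
    s_all = [sum(w * s[i] for w, s in zip(weights, sent_vecs)) for i in range(dim)]
    return s_all, weights
```

Sentences whose encoding aligns with the latent relation r_ht receive higher weights, which is how the model is expected to downweight wrongly labeled sentence evidences.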

Path Evidence Encoder
Given a bag of UG paths P_r = {p1, ..., pm} connecting an entity pair of interest (e1, e2), the Path Evidence Encoder encodes it into a bag-level vector representation p_all. Since we represent a path as a sequence of words (or a long sentence), as shown in Figure 1, analogously to the Sentence Evidence Encoder, we apply a CNN-Max (see Appendix §A.1) to encode each path into a vector p_i. The bag-level path representation p_all for P_r is then calculated via Equation 5:

p_all = Σ_i a_i p_i,    (5)

where a_i is the weight for the i-th path in P_r, computed with the same attention form as Equation 3.
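The CNN-Max encoding step shared by the two encoders can be illustrated as follows. This is a deliberately simplified sketch (a single channel, no padding, and no nonlinearity on the filter responses), not the exact encoder detailed in Appendix §A.1.

```python
def cnn_max_encode(token_vecs, filters, window=3):
    """CNN-Max sketch: slide each filter over `window`-grams of the token
    vectors, then max-pool each filter's responses over all positions."""
    feats = []
    for f in filters:  # one filter = flat weights over a window of token vectors
        responses = []
        for i in range(len(token_vecs) - window + 1):
            flat = [x for tok in token_vecs[i:i + window] for x in tok]
            responses.append(sum(w * x for w, x in zip(f, flat)))
        feats.append(max(responses))
    return feats  # one pooled value per filter
```

Because max-pooling is position-invariant, the same encoder applies equally to a short sentence evidence and to a long word sequence representing a multi-hop path.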

Relation Classification Layer
The conditional probability of (e1, e2) having a relation r is formulated via Equation 6:

o = M [s_all ; p_all] + d,
P(r | (e1, e2), S_r, P_r, θS, θP) = exp([o]_r) / Σ_{c=1}^{n_r} exp([o]_c),    (6)

where θS and θP are the parameters of the Sentence Evidence Encoder and Path Evidence Encoder, M is the representation matrix of relations, d is a bias vector, o is the output vector containing the prediction scores of all predefined relations, [o]_c is the prediction score for the relation c, and n_r is the total number of relations.

Problem of Attention Bias
While extending the base model to our UG setting, we observe that it tends to allocate more attention to KG paths or linguistically simple paths than to Textual paths (i.e., paths that come only from Text), Hybrid paths (i.e., paths that come from both Text and KG), or linguistically complex ones, as shown in Figures 3a and 3b. We conjecture that this is because paths including textual relations (i.e., Textual and Hybrid paths) and complex paths are comparatively noisier than KG paths or simple paths; however, this does not necessarily mean the former are not useful. For instance, in Figure 1, the complex Hybrid path p2 is useful for predicting (Colesevelam HCl, may treat, Type 2 Diabetes), because p2 implies a plausible line of reasoning connecting Colesevelam HCl to Type 2 Diabetes. However, due to the attention bias mentioned above, the base model allocates very low attention (a2 ≈ 8.0 × 10^-36) to this informative path, and thus fails to learn from such complex but useful evidences.
To reduce the negative effect of the attention biases and make full use of the UG path, we propose the following two training (or debiasing) strategies: Path Type Adaptive Pretraining ( §4.2) and Complexity Ranking Guided Attention ( §4.3).

Path Type Adaptive Pretraining
As shown in Figure 3a, the base model tends to be biased toward KG paths. This indicates that the base model relies mainly on KG paths and is thus incapable of capturing informative features from Textual and Hybrid paths. This bias decreases the flexibility and adaptability of the base model to different types of paths.
To address this issue, we propose a debiasing strategy called Path Type Adaptive Pretraining. In this strategy, we pretrain the base model sequentially using Textual, Hybrid, and KG paths as path evidences, and then finetune it with all types of paths, as illustrated in Figure 4. We hypothesize that this strategy can prevent reliance on a single type of UG path, improve the capacity to extract features from all types of UG paths, and thereby increase performance.
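The staged schedule can be sketched as a training loop over path-type slices. The `train_epoch` callback, the bag dictionary format, and the `type` tags below are hypothetical stand-ins for the real training step and data structures, shown only to make the schedule concrete.

```python
def path_type_adaptive_pretraining(model, bags, train_epoch, epochs_per_stage=1):
    """Pretrain on Textual, then Hybrid, then KG paths, then finetune on all.
    `bags` is a list of dicts with keys "sentences" and "paths"; each path is a
    dict carrying a "type" tag in {"textual", "hybrid", "kg"}."""
    stages = ["textual", "hybrid", "kg", "all"]
    completed = []
    for stage in stages:
        for _ in range(epochs_per_stage):
            for bag in bags:
                if stage == "all":
                    paths = bag["paths"]
                else:
                    paths = [p for p in bag["paths"] if p["type"] == stage]
                if paths:  # skip bags with no path of this type
                    train_epoch(model, bag["sentences"], paths)
        completed.append(stage)
    return completed
```

The key design point is that the same model weights are carried through every stage, so features learned from the noisier Textual and Hybrid paths are not overwritten before the model ever sees them, as happens when all path types compete for attention from the start.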

Complexity Ranking Guided Attention
Similar to the bias towards KG paths, the base model also focuses on linguistically simple paths, as shown in Figure 3b, even when complex ones are informative (e.g., p2 in Figure 1). We hypothesize that restricting the attention span to the complex (simple) paths can force the model to pay attention to the complex (simple) paths and thereby utilize them effectively. Under this hypothesis, we propose a Complexity Ranking Guided Attention mechanism, as illustrated in Figure 2.
Specifically, given a bag of paths P_r = {p1, ..., pm}, we rank them according to their complexity scores κ, calculated via κ = λ1·τ1 + λ2·τ2 + ..., where each τi denotes a feature for capturing linguistic complexity (e.g., path length) and λi is a corresponding weight, set as a hyperparameter. Sentence length (i.e., the number of tokens in a sentence) and lexical richness (i.e., the number of token types) are commonly used features for evaluating sentence complexity (Brunato et al., 2018); this work therefore adopts them to calculate the complexity of a given path.
Then, we group the top-j most and least complex paths into a set of complex paths and a set of simple paths respectively, where j is a hyperparameter.[1] The set-level representations are calculated via Equation 8:

p_simple = Σ_{p_i ∈ simple set} a_i p_i,   p_complex = Σ_{p_i ∈ complex set} a_i p_i,    (8)

where the weights a_i are computed with the same attention form as in the Path Evidence Encoder.
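The complexity scoring and the top-j split can be sketched as follows, using the two features named above (path length and number of token types); the unit weights λ are an illustrative choice, since the actual weights are tuned as hyperparameters.

```python
def complexity_score(path_tokens, lambdas=(1.0, 1.0)):
    """kappa = lambda1 * path length + lambda2 * lexical richness,
    where lexical richness is the number of distinct token types."""
    length = len(path_tokens)
    richness = len(set(path_tokens))
    return lambdas[0] * length + lambdas[1] * richness

def split_by_complexity(paths, j):
    """Return (simple set, complex set): the j least and j most complex paths."""
    ranked = sorted(paths, key=complexity_score)
    return ranked[:j], ranked[-j:]
```

Restricting attention to each set separately means the simple paths can no longer absorb all of the attention mass, so the complex set contributes a nonzero signal by construction.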
Finally, we concatenate the resulting representations s_final, p_final, p_simple and p_complex as the input to the relation classification layer. The conditional probability P((e1, r, e2) | S_r, P_r, θS, θP) is formulated via Equation 9, which is analogous to Equation 6 but with the concatenation [s_final ; p_final ; p_simple ; p_complex] in place of [s_all ; p_all]:

o = M [s_final ; p_final ; p_simple ; p_complex] + d,
P((e1, r, e2) | S_r, P_r, θS, θP) = exp([o]_r) / Σ_{c=1}^{n_r} exp([o]_c).    (9)

[Figure 4: Path Type Adaptive Pretraining strategy, where "Textual/Hybrid/KG P." represent Textual, Hybrid, and KG paths respectively, and "Sent." represents the sentence evidences. In this strategy (Pretrain), instead of using all types of paths to train the base model in all iterations, we sequentially train the model with Textual, Hybrid and KG paths, and then finetune it with all types of paths.]

Data
We evaluate our proposed framework on a biomedical dataset and the NYT10 dataset (Riedel et al., 2010). The statistics of both datasets are summarized in Table 1. We detail both datasets below.
Biomedical Dataset. This dataset is created by linking a biomedical KG with biomedical Text. We choose UMLS[2] and the Medline corpus as the biomedical KG and Text respectively. UMLS is a frequently used biomedical knowledge base, while the Medline corpus is a large collection of biomedical abstracts; both are developed and maintained by the U.S. National Library of Medicine.[3] For identifying UMLS entity mentions in the Medline corpus, we use a state-of-the-art UMLS Named Entity Recognizer (NER), ScispaCy (Neumann et al., 2019). The NER identifies UMLS concepts and annotates them with their corresponding UMLS Concept Unique Identifier (CUI) and entity types.
From the UMLS KG and the entity-linked Medline corpus, we extract fact triplets (i.e., (e1, r, e2)) and corresponding sentence evidences containing (e1, e2) under the restrictions that: (1) each entity pair is connected by an RO relationship (RO stands for "has Relationship Other than synonymous, narrower, or broader"); and (2) each entity belongs to one of the following entity types: Protein, Gene, Disease or Syndrome, Enzyme, Chemical, Sign or Symptom, and Pharmacologic Substance. We then divide the collected triplets and sentence evidences into training and testing sets according to the year in which the source abstract of the sentence evidence was published: the former covers the years up to 2008 and the latter the years 2009-2018, ensuring that the testing set contains only unobserved triplets.
To simulate the noise in the real world, besides the "related" triplets, we also extract "unrelated" triplets and sentence evidences based on a closed-world assumption: pairs of entities not listed in a KG are regarded as having the NA relation, and sentences containing them are considered NA sentence evidences. We divide the NA triplets and NA sentence evidences in the same way as mentioned above. We use a subset of UMLS (see Appendix §A.3) and the Medline abstracts published until 2008 as the KG and Text respectively to create the UG for path retrieval. In addition, we use the same subset of UMLS triplets mentioned above to train the KG Encoder introduced in §3.
NYT10. This dataset is created by aligning Freebase relational facts with the New York Times Corpus. Sentence evidences from the years 2005-2006 are used for training and the evidences from 2007 are used for testing. The NYT10 dataset has been widely used (Lin et al., 2016; Ji et al., 2017; Du et al., 2018; Jat et al., 2018; Han et al., 2018a,b; Vashishth et al., 2018; Ye and Ling, 2019; Alt et al., 2019). We use Freebase[4] and ClueWeb12 with Freebase entity mention annotations (Gabrilovich et al., 2013) as the KG and Text to create the UG for path searching. In addition, following Han et al. (2018a), we use FB60K for training the KG Encoder.
UG path search. Given an entity pair (e1, e2), the UG path set P_r is obtained by performing random walks over the UG from e1 to e2 with a bounded number of steps.[5]
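This retrieval can be sketched as a bounded random walk over the UG adjacency structure; the adjacency-list format, walk budget, and seeding below are illustrative assumptions rather than the exact implementation.

```python
import random

def random_walk_paths(ug_edges, e1, e2, n_walks=100, max_steps=3, seed=0):
    """Collect UG paths from e1 to e2 by repeated bounded random walks.
    `ug_edges` maps an entity to a list of (neighbor, edge_label) pairs;
    each returned path is the tuple of edge labels traversed."""
    rng = random.Random(seed)
    paths = set()
    for _ in range(n_walks):
        node, path = e1, []
        for _ in range(max_steps):
            nbrs = ug_edges.get(node, [])
            if not nbrs:
                break  # dead end: abandon this walk
            node, label = nbrs[rng.randrange(len(nbrs))]
            path.append(label)
            if node == e2:
                paths.add(tuple(path))
                break
    return paths
```

Because the walk is capped at a small number of steps, only short multi-hop paths are collected, which keeps the path bag P_r tractable but, as noted in §8, can also admit redundant or noisy paths.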

Settings
We follow Lin et al. (2016) and conduct the held-out evaluation, in which the model for DS-RE is evaluated by comparing the fact triplets identified from evidences (i.e., the bag of sentence evidences S_r and the bag of UG path evidences P_r) with those in the KG. Following the evaluation of previous works, we draw Precision-Recall curves and report the Area Under the Curve (AUC) and Precision@N (P@N) metrics, where P@N gives the percentage of correct triplets among the top-N ranked candidates. The parameter settings of our experiments are detailed in Appendix §A.2.

[4] From the entire Freebase, we only collect the triplets with relations that are mentioned in the NYT10 dataset for UG creation, ensuring no overlap with the testing set.
[5] We manually set the maximum number of steps to 3.
To demonstrate the effectiveness of our framework, we choose the model proposed by Dai et al. (2019) as the baseline model, because this is the closest model in terms of incorporating multiple paths for DS-RE. Henceforth, "Sent+KG" is the baseline model, which uses both sentence evidences and KG paths. "Sent+UG" represents the base model in §3 which takes UG paths instead of KG paths as path evidences. "Sent+UG+Pretrain" and "Sent+UG+Ranking" denote the base model trained with Path Type Adaptive Pretraining strategy and the base model with Complexity Ranking Guided Attention mechanism, respectively. "Sent+UG+Ranking+Pretrain" means the base model trained with both strategies.
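The held-out evaluation ranks predicted triplets by model confidence and reads P@N and the PR-curve area off that ranking. A minimal sketch of both metrics, operating on a ranked list of binary labels (1 = the predicted triplet is in the KG), assuming at least one positive label:

```python
def precision_at_n(ranked_labels, n):
    """Fraction of correct triplets among the top-n ranked predictions."""
    top = ranked_labels[:n]
    return sum(top) / len(top)

def pr_auc(ranked_labels):
    """Area under the precision-recall curve, accumulated point by point
    while traversing the confidence-ranked prediction list."""
    total_pos = sum(ranked_labels)
    tp, auc, prev_recall = 0, 0.0, 0.0
    for i, label in enumerate(ranked_labels, start=1):
        tp += label
        precision = tp / i
        recall = tp / total_pos
        auc += precision * (recall - prev_recall)
        prev_recall = recall
    return auc
```

This step-wise accumulation is equivalent to average precision; plotting (recall, precision) pairs from the same loop yields the PR curves of Figures 5 and 6.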

Results and Discussion
Precision-Recall Curves. The Precision-Recall (PR) curves of each model on the biomedical and NYT10 datasets are shown in Figure 5 and Figure 6, respectively. The results show that: (1) "Sent+UG" does not have an obvious advantage over "Sent+KG", illustrating that, due to the biases discussed in §4.1, simply applying UG paths to the base model has limited effect on improving the performance of DS-RE. (2) "Sent+UG+Pretrain" and "Sent+UG+Ranking" achieve better overall performance than "Sent+KG" on both datasets, especially when the recall is greater than 0.3, demonstrating that the UG has the potential to enhance performance and that the two proposed debiasing strategies are effective in exploiting this potential for DS-RE. (3) "Sent+UG+Ranking+Pretrain" achieves the highest precision over (almost) the entire recall range on both datasets, showing that the two proposed strategies are mutually complementary in exploiting the UG for DS-RE. This is understandable because the two strategies deal with different types of biases; in addition, "Pretrain" helps the base model adapt to UG paths by effectively tuning its weights, while "Ranking" enhances the base model by adjusting its attention mechanism. (4) The consistent improvement on both datasets, which differ substantially in domain, indicates the robustness of the proposed methods.
AUC and P@N Evaluation. Table 2 further presents the results in terms of AUC and P@N, from which we make observations similar to those from the PR curves. We also observe that the effectiveness of UG paths is more pronounced on the biomedical dataset than on the NYT10 dataset. We speculate that, compared to the generic NYT10 dataset, additional Background Knowledge (BK) is needed to identify relations in the biomedical dataset, and UG paths can serve as this BK to facilitate scientific DS-RE.
Case Study. Table 4

Conclusion and Future Work
We have introduced UG paths as extra evidences for the task of DS-RE from text. In order to fully take advantage of the rich UG paths, we have proposed two training (or debiasing) strategies: Path Type Adaptive Pretraining and Complexity Ranking Guided Attention mechanism. We have conducted experiments on both biomedical and NYT10 datasets. The results show that the two proposed methods are effective for exploiting the potential of UG paths for improving the performance of DS-RE.
In the future, we plan to: (1) further investigate, via manual analysis, how the proposed training methods influence performance, so as to improve their efficiency; and (2) collect UG paths with more sophisticated mechanisms than random walk, such as training a path-searching agent via reinforcement learning, to avoid redundant and noisy paths.

A.3 Subset of UMLS
Besides the 7 entity types mentioned above, we also use 22 other entity types, listed in Table 6, to collect the UMLS triplets connected by the RO relationship, ensuring that all testing triplets are removed. The main reasons for manually restricting the entity types are that: (1) we observe that most Medline abstracts discuss relationships among these entity types; and (2) these concrete entities help prevent semantic drift while searching UG paths.