GMH: A General Multi-hop Reasoning Model for KG Completion

Knowledge graphs are essential for numerous downstream natural language processing applications, but are typically incomplete, with many facts missing. This has triggered research efforts on the multi-hop reasoning task, which can be formulated as a search process; current models typically perform short distance reasoning. However, long distance reasoning is also vital, as it can connect superficially unrelated entities. To the best of our knowledge, there is no general framework that approaches multi-hop reasoning in mixed long-short distance reasoning scenarios. We argue that a general multi-hop reasoning model must resolve two key issues: i) where to go, and ii) when to stop. Therefore, we propose a general model which resolves these issues with three modules: 1) the local-global knowledge fusion module to estimate the possible paths, 2) the differentiated action dropout module to explore a diverse set of paths, and 3) the adaptive stopping search module to avoid over-searching. Comprehensive results on three datasets demonstrate the superiority of our model, with significant improvements over baselines in both short and long distance reasoning scenarios.


Introduction
Knowledge graphs (KGs) have become the preferred technology for representing, sharing and adding factual knowledge to many natural language processing applications like recommendation (Wang et al., 2019; Lei et al., 2020) and question answering (Huang et al., 2019; Zhang et al., 2018). KGs store triple facts (head entity, relation, tail entity) in the form of graphs, where entities are represented as nodes and relations are represented as labeled edges between entities (e.g., Figure 1 (a)). Although popular KGs already contain millions of facts, e.g., YAGO (Suchanek et al., 2007) and Freebase (Bollacker et al., 2008), they are far from being complete considering the amount of existing facts and the scope of continuously appearing new knowledge. This has become the performance bottleneck of many KG-related applications, triggering research efforts on the multi-hop reasoning task.
The multi-hop reasoning task can be formulated as a search process, in which a search agent traverses a logical multi-hop path to find the missing tail entity of an incomplete triple in the KG, as shown in Figure 1. Several models (Xiong et al., 2017; Das et al., 2018) have been proposed to cast the search process as a sequential decision problem in the reinforcement learning (RL) framework, and (Lin et al., 2018) further optimized the reward function of the RL framework of (Das et al., 2018). However, these works have only scratched the surface of multi-hop reasoning, as they focus only on short distance reasoning scenarios (e.g., the two-hop case in Figure 1 (b)).
We observe that long distance reasoning scenarios are vital to the development of multi-hop reasoning and KG-related applications, because two superficially unrelated entities may in fact be deeply connected over a long distance. With the significant expansion of KGs, the incompleteness of a KG becomes more prominent, and long distance scenarios are rapidly increasing. As shown in Figure 1 (c), the missing entity James Harden in the incomplete triple (Stephen Curry, opponent, ?) is inferred by a long reasoning process, i.e., a four-hop path. Moreover, in practice, long and short distance reasoning scenarios are mixed, and an ideal multi-hop reasoning model should be competent on both.

Figure 1: Examples of (a) an incomplete knowledge graph, (b) a short distance scenario (two-hop) about the reasoning of (Stephen Curry, teammate, ?), and (c) a long distance scenario (four-hop) about the reasoning of (Stephen Curry, opponent, ?). The dotted lines refer to the relations of incomplete triples and solid lines refer to existing relations. The green, blue and black boxes represent the entities of the incomplete triples, the entities in the reasoning paths and the unrelated entities, respectively. As can be seen, long distance reasoning is needed and is more complex than short distance reasoning. Best viewed in color.

Specifically, we argue that there are two key issues to resolve in the traversal of a KG: i) Where to go? The search agent needs to decide where to go at the next search step, i.e., to select an edge connected with the current node. Selecting a positive edge moves the agent towards the target node; otherwise, it moves away from the target. As the search distance increases, this issue becomes more challenging because the agent needs to make more decisions. ii) When to stop? The search agent needs to decide when to stop the search, because the exact number of search steps cannot be known in advance. An ideal search agent needs to stop at a suitable time to avoid over-searching and to adapt to realistic reasoning scenarios with mixed short and long distances.
To this end, we propose a General Multi-Hop reasoning model, termed GMH, which solves the two above-listed issues in three steps: 1) the local-global knowledge fusion module fuses the local knowledge learnt from the history path with the global knowledge learnt from the graph structure; 2) the differentiated action dropout module forces the search agent to explore a diverse set of paths from a global perspective; and 3) the adaptive stopping search module uses a self-loop controller to avoid over-searching and resource waste. We train the policy network with RL and optimize the reward to find the target entity effectively. In summary, the main contributions of this work are as follows:
• We observe that long distance reasoning scenarios are vital to the development of multi-hop reasoning, and argue that an ideal multi-hop reasoning model should be competent in mixed long-short distance reasoning scenarios.
• We propose a general multi-hop reasoning model, GMH, which solves two key issues in mixed long-short distance reasoning scenarios: i) where to go and ii) when to stop.
• We evaluate GMH on three benchmarks: FC17, UMLS and WN18RR. The results demonstrate the superiority of GMH, with significant improvements over baselines in mixed long-short distance reasoning scenarios and competitive performance in short distance reasoning scenarios.

Related Work
In this section, we summarize the related work and discuss its connections to our model. First, we introduce the two lines of work on the KG completion task: multi-hop reasoning and KG embedding. The multi-hop reasoning task focuses on learning logical multi-hop paths reasoned over the KG. Multi-hop reasoning models distill deep information from paths, thereby generating directly interpretable results. (Lao et al., 2011; Das et al., 2017; Jiang et al., 2017; Yin et al., 2018) predicted the missing relations of incomplete triples based on pre-computed paths. (Xiong et al., 2017) first adopted the RL framework to improve reasoning performance. The task of finding a missing entity is complementary to that of predicting a missing relation. (Das et al., 2018) used the history path to help the search agent find the missing entity, and (Lin et al., 2018) optimized the reward function of the RL framework of (Das et al., 2018). (Lv et al., 2019) adopted the meta-learning framework for multi-hop reasoning over few-shot relations. These works are conditioned on short distance scenarios and tend to rapidly lose effectiveness as the distance increases. In contrast, we propose a general model that can be utilized in both short and long distance reasoning scenarios.
The KG embedding task is another line of work that alleviates the incompleteness of KGs. Embedding-based models project KGs into an embedding space and estimate the likelihood of each triple using scoring functions. (Bordes et al., 2013; Wang et al., 2014; Lin et al., 2015; Ji et al., 2016) defined additive scoring functions based on the translation assumption. (Trouillon et al., 2016) defined multiplicative scoring functions based on the linear map assumption. Moreover, recent models introduce special neural networks, such as the neural tensor network (Socher et al., 2013), the convolutional neural network (Dettmers et al., 2018) and the graph convolutional network (Nathani et al., 2019). Because they neglect the deep information within multi-hop paths, the results of embedding-based models lack interpretability, which is critical for KG-related applications. However, embedding-based models are less sensitive to the reasoning distance because they learn the KG structure from a global perspective. Thus, we take advantage of this strength to learn global knowledge from the graph structure, while retaining interpretability by reasoning over the history paths.
Second, we discuss community research on long distance reasoning scenarios. (Tuan et al., 2019) formed a transition matrix for reasoning over six-hop paths in a KG for the conversational reasoning task. It is, however, not suitable for large-scale KGs, because the matrix multiplication requires a large computation space. (Wang et al., 2019) proposed a long-term sequential pattern to encode long distance paths for the recommendation task. Because there is no real reasoning process over the long distance paths, it is not suitable for KG completion. In summary, we are the first to study long distance reasoning scenarios in KG completion.
We propose a general model that tackles both short and long distance reasoning scenarios and works effectively on large-scale KGs.

Methodology

Figure 2 illustrates the entire process of the GMH model. The input involves the head entity and relation of the incomplete triple together with the background KG; the output is the missing tail entity. We systematize the model in three steps: 1) the local-global knowledge fusion module integrates knowledge from history paths and the graph structure; 2) the differentiated action dropout module diversifies the reasoning paths; and 3) the adaptive stopping search module determines the optimal number of search steps. The local-global knowledge fusion and differentiated action dropout modules help the agent address the issue of where to go; the adaptive stopping search module controls the search steps to resolve the issue of when to stop.

Preliminary
We formally represent a KG as a collection of triples T = {(e_h, r, e_t)} ⊆ E × R × E, where e_h, r and e_t denote the head entity, relation and tail entity of one triple, and E and R are the entity and relation sets, respectively. Each directed link in the KG represents a valid triple (i.e., e_h and e_t are represented as nodes and r as the labeled edge between them). For an incomplete triple, multi-hop reasoning can be perceived as searching for the target tail entity e_t through a limited number of steps in the KG, starting from the head entity e_h and guided by the relation r ∈ R. We use the query q to represent (e_h, r) in the following sections. At step s, the search agent transfers to the entity e_s, updating the history path trajectory H_s = {e_h, r_1, e_1, ..., r_s, e_s} and the available action set A_s = {(r_s^i, e_s^i) | (e_s, r_s^i, e_s^i) ∈ T}. A_s consists of all outgoing relations of e_s and the associated entities. The agent selects one action from A_s to transfer to the next entity e_{s+1} through the correlated relation r_{s+1} at the next step.
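The preliminary setup can be sketched in a few lines of code. This is a minimal illustration, not the authors' implementation; the class and method names are our own:

```python
# Minimal sketch of the preliminaries: a KG stored as a set of triples,
# and the available action set A_s, i.e., all outgoing (relation, entity)
# pairs of the current entity e_s.
from collections import defaultdict

class KG:
    def __init__(self, triples):
        self.out_edges = defaultdict(list)  # entity -> [(relation, entity'), ...]
        for e_h, r, e_t in triples:
            self.out_edges[e_h].append((r, e_t))

    def actions(self, e_s):
        """Available action set A_s for the current entity e_s."""
        return list(self.out_edges[e_s])

kg = KG([("curry", "teammate", "green"), ("green", "opponent", "harden")])
# A_s at "curry" contains its single outgoing edge: [("teammate", "green")]
```

At each step the agent picks one pair from `actions(e_s)` and moves to the associated entity.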

Local-Global Knowledge Fusion
In this module, we learn the local knowledge lk_s and the global knowledge gk_s to resolve the "where to go" issue, as shown in Figure 3. The local knowledge captures how the agent makes decisions based on the history path trajectory H_s at step s, from a local perspective. The global knowledge is calculated through a pre-trained embedding-based model, from a global perspective. We use an aggregation (abbr. AGG) block to aggregate lk_s and gk_s, which has two types: summation (lk_s + gk_s) and scalar product (lk_s * gk_s). The distribution p(A_s) ∈ R^{|A_s|} is calculated through the AGG block and represents the confidence score of each available entity in A_s. The agent selects one action from A_s according to the distribution p(A_s) to transfer to the next entity.
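The AGG block can be sketched as follows. The shapes are our assumptions (both knowledge vectors are |A_s|-dimensional score vectors), and the softmax that turns the fused scores into p(A_s) is implied by the paper's description of p(A_s) as a distribution:

```python
# Sketch of the AGG block: fuse local and global knowledge by summation
# or scalar (element-wise) product, then normalize into p(A_s).
import torch

def aggregate(lk_s, gk_s, mode="product"):
    fused = lk_s + gk_s if mode == "sum" else lk_s * gk_s
    return torch.softmax(fused, dim=-1)  # p(A_s): one confidence per action

lk = torch.tensor([1.0, 2.0, 0.5])   # local knowledge over 3 available actions
gk = torch.tensor([0.5, 1.5, 1.0])   # global knowledge over the same actions
p = aggregate(lk, gk)                # distribution over A_s
action = torch.argmax(p).item()      # greedy pick; training samples from p
```

During training the agent samples from p(A_s) rather than taking the argmax, which is what makes diverse path exploration possible.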

Local Knowledge Learning
The local knowledge lk_s reflects, from a local perspective, the decisions the agent makes based on the history path trajectory H_s at step s. We adopt a long short-term memory (LSTM) network and an attention mechanism to encode the history path trajectory and yield the local knowledge.
The history path trajectory H_s = (e_h, r_1, e_1, ..., r_s, e_s) consists of the sequence of entities and relations that the agent has selected over the last s steps. We adopt an embedding layer to generate the embeddings of entities and relations. The embedding of the query is q = [e_h; r] ∈ R^{2·dim}, i.e., the concatenation of the embeddings of the head entity e_h ∈ R^{dim} and relation r ∈ R^{dim}, where dim is the embedding dimension. We use an LSTM to encode the embedding of H_s and yield the hidden state embedding sequence (h_0, h_1, ..., h_s), where h_s is the hidden state at step s, e_s is the current entity and r_s is the relation that connects e_{s-1} and e_s.
Prior works (Das et al., 2018; Lin et al., 2018) use only the current hidden state embedding (i.e., h_s) to yield the next action, neglecting the differentiated importance of the hidden states over the last s steps. Therefore, we introduce attention weights computed between the hidden state embedding sequence and the query embedding to optimize the local knowledge lk_s. Each weight is derived by comparing the query q with each hidden state h_i:

α_i = exp(f(q, h_i)) / Σ_j exp(f(q, h_j)),  (1)

where i and j stand for the i-th and j-th hidden state candidates, respectively. Here, f(·) is a query-based scoring function, e.g., f(q, h_i) = q^T W_1 h_i. Ultimately, the local knowledge lk_s ∈ R^{|A_s|}, which reflects the influence of the history path trajectory on each element in A_s, can be obtained:

lk_s = δ_1(W_2 Σ_i α_i h_i),  (2)

where W_1 and W_2 are learnable weights, and δ_1 is the activation function.
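Local knowledge learning can be sketched as below. This is a hedged reconstruction, not the released code: the bilinear form of f, the fixed `max_actions` output size, and all layer names are our assumptions:

```python
# Sketch of local knowledge learning: an LSTM encodes the history path
# trajectory, attention compares the query against every hidden state,
# and the attended context is projected to one score per available action.
import torch
import torch.nn as nn

class LocalKnowledge(nn.Module):
    def __init__(self, dim, max_actions):
        super().__init__()
        self.lstm = nn.LSTM(dim, dim, batch_first=True)
        self.W1 = nn.Linear(2 * dim, dim, bias=False)  # f(q, h_i) = q^T W1 h_i
        self.W2 = nn.Linear(dim, max_actions)          # project to action scores

    def forward(self, history, q):
        # history: (1, s, dim) path embeddings; q: (1, 2*dim) query embedding
        h, _ = self.lstm(history)                        # (1, s, dim)
        scores = torch.einsum("bd,bsd->bs", self.W1(q), h)
        alpha = torch.softmax(scores, dim=-1)            # attention weights
        context = torch.einsum("bs,bsd->bd", alpha, h)   # weighted sum of states
        return self.W2(context)                          # lk_s in R^{|A_s|}

lk = LocalKnowledge(dim=8, max_actions=5)(torch.randn(1, 3, 8), torch.randn(1, 16))
```

In practice |A_s| varies per entity, so the projection would be computed against the embeddings of the actual available actions rather than a fixed-size layer; the fixed size here only keeps the sketch short.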

Global Knowledge Learning
Prior works (Das et al., 2018; Lin et al., 2018) use only the local knowledge and neglect the long distance cases, which require higher decision accuracy from the agent. We introduce the global knowledge gk_s, learnt from the graph structure by a pre-trained embedding-based model.
Embedding-based models map the graph structure into a continuous vector space using a scoring function ψ(e_h, r, e_t). We generate the new triple (e_h, r, e_s^i) by combining the head entity and relation with each available entity e_s^i ∈ E_{A_s}, where E_{A_s} ∈ R^{|A_s|×dim} contains all available entities in A_s. Since a positive available entity is closer to the target tail entity in the vector space, combining it with the query yields a higher score than combining a negative available entity. Formally, we adopt a pre-trained embedding-based model to score these new triples and obtain the global knowledge gk_s:

gk_s = [ψ(e_h, r, e_s^1); ...; ψ(e_h, r, e_s^{|A_s|})].  (3)

Concatenating the scores of the new triples gives the global knowledge gk_s ∈ R^{|A_s|}. The selection of the scoring function ψ(·) is discussed in Section 4.3.
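Equation (3) can be sketched as follows. The paper's chosen pre-trained model is ConvE; for brevity this sketch stands in a DistMult-style ψ, which is only an assumption for illustration:

```python
# Sketch of global knowledge learning: score (e_h, r, e_i) for every
# available entity e_i with an embedding-based scoring function psi.
import torch

def psi(e_h, r, e_t):
    # DistMult-style trilinear score <e_h, r, e_t> (stand-in for ConvE)
    return (e_h * r * e_t).sum(dim=-1)

def global_knowledge(e_h, r, available_entities):
    # available_entities: (|A_s|, dim) embeddings of the candidate entities
    return psi(e_h.unsqueeze(0), r.unsqueeze(0), available_entities)  # gk_s

e_h, r = torch.randn(8), torch.randn(8)
cands = torch.randn(4, 8)                 # 4 available actions at step s
gk = global_knowledge(e_h, r, cands)      # one score per available entity
```

Broadcasting scores all |A_s| candidates in one pass, which matters because long distance search visits many candidate sets.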

Differentiated Action Dropout
In the multi-hop reasoning task, it is important to enforce effective exploration of a diverse set of paths and to dilute the impact of negative paths. (Lin et al., 2018) forced the agent to explore a diverse set of paths using the action dropout technique, which randomly masks some available actions in A_s, i.e., blocks some outgoing edges of the agent. However, when reasoning over long distances, the number of paths is much greater than in short distance scenarios because the search space grows exponentially. The random action dropout technique is inefficient because it cannot discriminate paths of different qualities. We therefore propose the differentiated action dropout (DAD) technique, which masks available actions based on the global knowledge gk_s, since higher-scoring actions are more likely to lie on a high-quality path. In particular, the mask matrix M_s ∈ R^{|A_s|} is sampled from the Bernoulli distribution:

M_s ∼ Bernoulli(sigmoid(gk_s)).  (4)

Each element in M_s is binary, where 1 indicates that the action is kept and 0 that it is abandoned. Together, the local-global knowledge fusion and differentiated action dropout modules help the agent tackle the key issue of where to go.
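Equation (4) translates directly into code. A minimal sketch (function name is ours):

```python
# Sketch of differentiated action dropout: the keep-probability of each
# action is sigmoid(gk_s), so higher-scoring actions are more likely to
# survive the mask, unlike uniform random action dropout.
import torch

def differentiated_action_dropout(gk_s):
    keep_prob = torch.sigmoid(gk_s)        # per-action keep probability
    return torch.bernoulli(keep_prob)      # 1 = action kept, 0 = dropped

gk = torch.tensor([5.0, -5.0, 0.0, 2.0])
mask = differentiated_action_dropout(gk)   # action 0 is almost surely kept
```

The mask is resampled each training step, so low-scoring edges are still occasionally explored rather than permanently pruned.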

Adaptive Stopping Search
For the second key issue of when to stop, we devise the adaptive stopping search (ASS) module, inspired by the early stopping strategy (Prechelt, 1997) used to avoid overfitting when training a learner with an iterative method. We add a self-loop action (self-loop, e_s) to give the agent the option of not expanding from e_s. When the agent chooses the self-loop action several consecutive times, we take this as a signal that it has found the target tail entity, and the search can end early.
In this module, we devise a self-loop controller to avoid over-searching and resource waste. The self-loop controller has a dual judgment mechanism based on the maximum search step S and the maximum loop number N: when the search step reaches the maximum S, or the agent selects the self-loop action N consecutive times, the search process is stopped. The ASS strategy improves our model's scalability over both short and long distances and effectively avoids the waste of resources caused by over-searching.
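The dual judgment mechanism can be sketched as a small controller (class and method names are ours):

```python
# Sketch of the ASS self-loop controller: stop when the step count reaches
# the maximum S, or when the agent self-loops N consecutive times.
class SelfLoopController:
    def __init__(self, max_steps=5, max_loops=2):
        self.S, self.N = max_steps, max_loops
        self.steps = 0
        self.loops = 0   # consecutive self-loop counter

    def update(self, action_is_self_loop):
        self.steps += 1
        self.loops = self.loops + 1 if action_is_self_loop else 0

    def should_stop(self):
        return self.steps >= self.S or self.loops >= self.N

ctrl = SelfLoopController(max_steps=5, max_loops=2)
for a in [False, True, True]:   # one move, then two consecutive self-loops
    ctrl.update(a)
# ctrl.should_stop() is now True after only 3 of the 5 allowed steps
```

Resetting the counter on any non-self-loop action is what distinguishes "N consecutive times" from "N times in total".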

Training
Following (Das et al., 2018), we frame the search process as a Markov Decision Process (MDP) on the KG and adopt the on-policy RL method to train the agent.
We design a randomized, history-dependent policy network π = (p(A_1), ..., p(A_s), ..., p(A_S)). The policy network is trained by maximizing the expected reward over all training samples D_train:

J(θ) = E_{(e_h, r, e_t) ∼ D_train} E_{A_1, ..., A_S ∼ π} [R(ê_t | e_h, r)],

where θ denotes the set of parameters in GMH, R(·) is the reward function and ê_t is the final entity chosen by the agent. If ê_t = e_t, the terminal reward is +1, and 0 otherwise.
The optimization is conducted using the REINFORCE algorithm (Williams, 1992), which iterates through all (e_h, r, e_t) triples in D_train and updates θ with the following stochastic gradient:

∇_θ J(θ) ≈ R(ê_t | e_h, r) ∇_θ Σ_s log π_θ(a_s | H_{s-1}),

where a_s is the action selected at step s. The training process is detailed in Algorithm 1. During a search process, at each search step, the agent takes three operations: local-global knowledge fusion (lines 5-6), differentiated action dropout (line 7) and adaptive stopping search (lines 8-10). After finding the tail entity, the reward is calculated and the parameters are updated (line 13). Finally, the optimized parameters are output.
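The REINFORCE update can be sketched as follows. The tiny linear policy is a stand-in for GMH's policy network; everything else (optimizer choice aside, the paper also uses Adam) is standard REINFORCE:

```python
# Sketch of one REINFORCE update: accumulate the log-probabilities of the
# sampled actions along a rollout, then scale their gradient by the
# terminal 0/1 reward.
import torch
import torch.nn as nn

policy = nn.Linear(4, 3)                 # toy policy: state -> action logits
opt = torch.optim.Adam(policy.parameters(), lr=1e-2)

def rollout_and_update(states, reward):
    log_probs = []
    for s in states:                     # one search episode, step by step
        dist = torch.distributions.Categorical(logits=policy(s))
        a = dist.sample()                # sample an action from p(A_s)
        log_probs.append(dist.log_prob(a))
    loss = -reward * torch.stack(log_probs).sum()   # REINFORCE objective
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

loss = rollout_and_update([torch.randn(4) for _ in range(3)], reward=1.0)
```

With the 0/1 terminal reward of the paper, episodes that fail to reach e_t contribute a zero gradient, which is why diverse exploration (DAD) matters.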

Setup
Dataset Existing popular benchmarks, such as UMLS (Stanley and Pedro, 2007) and WN18RR (Dettmers et al., 2018), focus on multi-hop reasoning in short distance scenarios. Thus, they are unsuitable for evaluating complex cases requiring both long and short distance reasoning. To this end, we adopt the large-scale dataset FC17 (Neelakantan et al., 2015), which contains triples based on Freebase (Bollacker et al., 2008) enriched with information fetched from ClueWeb (Orr et al., 2013). Because the data with distance type larger than five is relatively scarce, we keep the data with distance types between 2 and 5. The sample numbers of the distance types 2-5 are 63k, 53k, 11k and 5k, respectively. Note that there are extra relations serving in the background KG in addition to the 46 relation types in the train/valid/test sets of FC17. We also evaluate our model on the other, short distance datasets, i.e., UMLS and WN18RR. Table 1 summarizes the basic statistics of the datasets.

Baselines We compare GMH with 1) embedding-based models, involving TransE (Bordes et al., 2013), ComplEx (Trouillon et al., 2016) and ConvE (Dettmers et al., 2018); as well as 2) multi-hop reasoning models, involving MINERVA (Das et al., 2018) and MultiHop (Lin et al., 2018).

Implementation Details GMH is implemented in PyTorch and runs on a single TITAN XP. Following (Das et al., 2018), we augment the KG with the reversed link (e_t, r^{-1}, e_h) for each triple. We exclude triples from the training set if they occur in the validation or testing sets. For the baselines and GMH, we set the maximum search step S to five, because the entity pair distance is up to five in FC17. For the short distance datasets, UMLS and WN18RR, S is set to three. The maximum loop number N is set to two for all datasets. We employ the softmax function as the activation function. All hyper-parameters are tuned on the validation set and can be found in the supplementary materials. The pre-trained embedding-based model that we adopt is ConvE.
We optimize all models with Adam (Kingma and Ba, 2015).
Metrics We follow the evaluation protocol of (Lin et al., 2018), which ranks the available entities at the final step in decreasing order of confidence score for each query, and adopts mean reciprocal rank (MRR) and HITS@N to evaluate the results. All results in our experiments are the mean and standard deviation over three training repetitions.

We observe that multi-hop reasoning models outperform most embedding-based models, but their performance declines as the distance increases. We attribute this to the significantly increasing difficulty of building long paths when predicting long distance relations. The embedding-based models appear to be less sensitive to distance variations, but they neglect the deep information in multi-hop paths, which limits the interpretability of their predictions. We further evaluate short distance reasoning performance on UMLS and WN18RR. The results of the baselines are cited from (Lin et al., 2018). GMH performs comparably well in short distance reasoning scenarios, yet its effectiveness in mixed long-short or long distance reasoning scenarios is more obvious. On the WN18RR dataset, GMH performs worse than MultiHop. We speculate that this is because the number of relations in WN18RR is much smaller than the number of entities, which makes it difficult to accurately learn the relation embeddings. Choosing a superior pre-trained embedding-based model is thus critical for our model.
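For concreteness, the MRR and HITS@N metrics used above can be computed from the rank of the gold tail entity for each query (a minimal sketch; function names are ours):

```python
# Sketch of the evaluation metrics: given the rank of the true tail
# entity for each test query, compute MRR and HITS@N.
def mrr(ranks):
    return sum(1.0 / r for r in ranks) / len(ranks)

def hits_at(n, ranks):
    return sum(1 for r in ranks if r <= n) / len(ranks)

ranks = [1, 3, 2, 10]          # rank of the gold entity for 4 queries
print(round(mrr(ranks), 3))    # -> 0.483
print(hits_at(1, ranks))       # -> 0.25
```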

Multi-Hop Reasoning in Long Distance Scenarios

As noted in Table 2, GMH achieves new state-of-the-art results on the FC17 dataset, which contains both short and long distance types. We further evaluate its performance when reasoning over relations at longer distances, which has rarely been examined by existing works. We extract the relations from FC17 whose distances span from 4 to 7, and in this way construct a sub-dataset, called FC17-8, which contains eight query relation types. Table 3 displays the results of reasoning over the four distance types under the MRR metric. Compared with GMH and the multi-hop reasoning models, the embedding-based model is less sensitive to distance variations, but its reasoning performance is inferior to the compared models on all distance types. GMH consistently yields the best performance in the long distance reasoning scenarios. We observe that all models perform better on the even distance types (4 and 6) than on the odd distance types (5 and 7). There are two possible reasons: 1) there is an imbalance between the difficulty and the number of samples of different distance types; 2) the models are better at reasoning over symmetric paths, such as the four-hop path from Stephen Curry in Figure 1 (c). In addition to the superior reasoning capability of GMH demonstrated in Table 2 and Table 3, other promising potentials pave the way for GMH in advanced applications. First, GMH is explainable, because it considers path information, which is beyond the scope of existing embedding-based models. Second, GMH incorporates the global knowledge learnt from the graph structure, which has been overlooked by existing multi-hop reasoning models.

Analysis of GMH
In this section, we conduct an extensive analysis of GMH from three aspects: 1) modules (c.f. Table 4); 2) hyper-parameters (c.f. Figure 4); and 3) scoring functions and aggregators (c.f. Figure 5).

Local Knowledge vs. Global Knowledge We fuse two components (i.e., the local knowledge lk_s and the global knowledge gk_s) to enable the search agent to find the target tail entity. We therefore conduct an extensive experiment to test the contributions of lk_s and gk_s to the multi-hop reasoning task. The top three lines of Table 4 reveal that fusing lk_s and gk_s achieves the best results under all evaluation metrics. Removing either kind of knowledge yields a significant performance drop: removing the local knowledge causes a 9.10% MRR degradation, and removing the global knowledge a 4.05% MRR degradation. This suggests that the local knowledge is more beneficial to the search agent than the global knowledge, and that using only the global knowledge to find a path in the KG can be ineffective in the training process. Still, we argue that the importance of the global knowledge should not be neglected, especially when it is combined with the local knowledge to handle the "where to go" issue.

Performance w.r.t. Differentiated Action Dropout The differentiated action dropout module is adopted to increase the diversity of search paths in the training stage. The fourth line of Table 4 shows the validity of this module. We also test the effect of random action dropout (22.15% MRR), which leaves a gap to our model. This illustrates that differentiated action dropout performs well because the mask operation is based on the global knowledge rather than on a random strategy.

Performance w.r.t. Adaptive Stopping Search As mentioned before, we devise the adaptive stopping search module to avoid the waste of resources caused by over-searching, i.e., the "when to stop" issue.
As can be seen from the bottom two rows of Table 4, ASS also has a slight effect on the performance. This is because the module can partially prevent the search agent from continuing to search when the target tail entity has been found.

Maximum Search Step As shown in Figure 4, GMH achieves the best performance at S = 5. A large S wastes resources, while a small S hurts the performance on long distance reasoning samples. Meanwhile, the running time rises sharply as S increases. Therefore, the introduction of the adaptive stopping search module is necessary and rational.

Maximum Loop Number We divide the self-loop action into two types: positive and negative. A positive self-loop action means the agent has arrived at the target tail entity, while a negative self-loop action means the current entity is not the target. As shown in Figure 4, a small N may cause the agent to misrecognize negative actions as positive ones, while a large N forfeits the advantage of reduced time consumption. Compared with not using the adaptive stopping search module (i.e., N = 1), using it results in a significant improvement, with the optimal number being 2.

Scoring Function Types The pre-trained embedding-based model that we adopt is ConvE. For a more extensive ablation analysis, we conduct experiments incorporating several effective embedding-based models (i.e., TransE, DistMult, ComplEx and ConvE). As shown in Figure 5 (a), ConvE learns better global semantic representations than the other embedding-based models.

Aggregator Types We next investigate the performance of our model w.r.t. different aggregator types. We adopt two types of aggregators, summation and scalar product, to fuse the local knowledge lk_s and the global knowledge gk_s. As Figure 5 (b) shows, the scalar product outperforms the summation. The advantage of the scalar product aggregator is that the multiplication operation increases the discrimination between available actions.

Conclusions
We have studied the multi-hop reasoning task in long distance scenarios and proposed a general model that tackles both short and long distance reasoning scenarios. Extensive experiments show the effectiveness of our model on three benchmarks. We will further consider the feasibility of applying our model to complex real-world datasets with more long distance reasoning scenarios and more relation types. Besides, we have noticed other sources of interference in long distance reasoning, e.g., noise from the KG itself, i.e., facts that are not valid. Such noise can gradually accumulate during long distance reasoning and affect the confidence of the results. We leave this investigation to future work.