Rule-Aware Reinforcement Learning for Knowledge Graph Reasoning

Multi-hop reasoning is an effective and explainable approach to predicting missing facts in Knowledge Graphs (KGs). It usually adopts the Reinforcement Learning (RL) framework and searches over the KG to find an evidential path. However, due to the large exploration space, RL-based models suffer from a serious sparse reward problem and require many trials. Moreover, their exploration can be biased towards spurious paths that coincidentally lead to correct answers. To solve both problems, we propose a simple but effective RL-based method called RARL (Rule-Aware RL). It injects high-quality symbolic rules into the model's reasoning process and employs partially random beam search, which not only increases the probability of paths receiving rewards, but also alleviates the impact of spurious paths. Experimental results show that it outperforms existing multi-hop methods in terms of Hits@1 and MRR.


Introduction
Knowledge Graphs (KGs), which store facts as triples of the form (subject entity, relation, object entity), benefit various NLP applications (Lan and Jiang, 2020; Wang et al., 2019b; He et al., 2017). However, existing KGs suffer from serious incompleteness despite their large scale. Therefore, KG completion, which aims to infer missing facts from existing triples, has become an important research area.
The past decade has witnessed the rise of embedding-based reasoning methods on KGs (Bordes et al., 2013; Yang et al., 2014; Balažević et al., 2019). However, due to their black-box nature, these methods cannot provide interpretations for a specific prediction (Ji et al., 2020; Sadeghian et al., 2019). Recently, there has been growing interest in using multi-hop reasoning to improve interpretability (Gardner et al., 2013; Rocktäschel and Riedel, 2017). This approach usually adopts Reinforcement Learning (RL) to find a reasoning path (Xiong et al., 2017; Das et al., 2018; Hildebrandt et al., 2020). Starting from the query entity, the RL-based model sequentially selects an outgoing edge and transits to a new entity until it arrives at the target.
However, due to the complexity of the KG, the number of paths grows exponentially as the number of reasoning hops increases. Most paths cannot arrive at correct answers and thus never receive a non-zero reward, which is known as the "sparse reward problem" (Nair et al., 2018). Moreover, since gold paths are not available during training, the RL-based model may coincidentally reach the target via a meaningless path (i.e., a spurious path). Take the query (Captain America, director, ?) as an instance: although the path (Captain America, country, US, lives_in⁻¹, Peter Farrelly) arrives at the target, it is semantically inconsistent with the query relation director and succeeds only by accident. One trouble is that the RL-based model relies heavily on rewards and reinforces past actions that received high rewards regardless of their path quality. In addition, in large-scale KGs, there are more spurious paths than correct ones (Lin et al., 2018), so it is easier for the model to discover spurious paths first rather than the true and meaningful ones. If the model finds spurious paths first, they will bias its exploration and negatively influence the reasoning process (Guu et al., 2017; Lin et al., 2018).

Lin et al. (2018) address the above two challenges with shaped rewards calculated by a pre-trained embedding-based model and an action dropout mechanism, respectively. However, the performance of this approach largely depends on the embedding-based model used; moreover, the embedding-based model increases the opacity of the reasoning process. Motivated by this, we focus on the action selection strategy and propose RARL (Rule-Aware RL), a simple but effective model that addresses both challenges. RARL introduces high-quality rules as prior information about actions and explores K paths in one episode. It selects actions from three sources: actions matching rules, actions with high scores, and actions randomly sampled. The former two increase the probability of reasoning paths arriving at targets; the latter allows the model to explore a more diverse set of paths and thus avoids adhering to past actions that received high rewards, which naturally mitigates the impact of spurious paths.
We evaluate RARL on three benchmark datasets, and experimental results show the effectiveness of RARL when compared with existing multi-hop methods.

Preliminaries
Let $E$ be the set of entities and $R$ be the set of relations. A knowledge graph can then be represented as $G = \{(e_s, r, e_t)\} \subseteq E \times R \times E$. In this paper, we focus on the standard link prediction task: given a query of the form $(e_s, r_q, ?)$, the reasoning model is expected to predict the correct answer $e_t$ after traversing the graph.
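For concreteness, the following is a minimal sketch of how such a graph and a link prediction query can be represented, assuming a simple adjacency-list layout; the entity and relation names are illustrative, not taken from the benchmark datasets:

```python
from collections import defaultdict

# A toy KG stored as adjacency lists: entity -> list of outgoing (relation, object) edges.
kg = defaultdict(list)
triples = [
    ("CaptainAmerica", "country", "US"),
    ("US", "lives_in_inv", "PeterFarrelly"),  # inverse edge of (PeterFarrelly, lives_in, US)
]
for s, r, o in triples:
    kg[s].append((r, o))

def outgoing_edges(entity):
    """The possible actions from an entity: its outgoing (relation, object) edges."""
    return kg[entity]

# A link prediction query (e_s, r_q, ?) is a pair of source entity and query relation.
query = ("CaptainAmerica", "director")
print(outgoing_edges(query[0]))  # [('country', 'US')]
```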

The RL-based Knowledge Reasoning Framework
Following Das et al. (2018), given a query $(e_s, r_q, ?)$, the RL-based model can be viewed as an agent that interacts with the KG environment and aims to find a reasoning path $p = (e_s, r_1, e_1, \ldots)$ that explicitly shows how the reasoning is conducted. At each time step $t$, the agent selects an action $a_t$, i.e., an outgoing edge of the current position $e_t$, to extend the path according to a policy. Here, we define $A_t = \{(r', e') \mid (e_t, r', e') \in G\}$ as the set of possible actions at time $t$. The model first uses a Long Short-Term Memory network (LSTM) to encode the path history into a vector $\mathbf{h}_t$. Then, the policy network $\pi_\theta$ (a two-layer feed-forward network) computes a distribution over all possible actions in $A_t$:
$$\pi_\theta(a_t \mid s_t) = \sigma\big(\mathbf{A}_t \times W_2\,\mathrm{ReLU}(W_1[\mathbf{h}_t; \mathbf{e}_t; \mathbf{r}_q])\big), \quad (1)$$

where $\mathbf{e}_t \in \mathbb{R}^d$ and $\mathbf{r}_q \in \mathbb{R}^d$ are the embeddings of $e_t$ and $r_q$, respectively, $\mathbf{A}_t \in \mathbb{R}^{|A_t| \times 2d}$ is the stack of all action embeddings in $A_t$, and $\sigma$ denotes the softmax operator. After this, the next edge is selected via an $\epsilon$-greedy action selection strategy.
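The following is a minimal PyTorch sketch of such a policy network, assuming the two-layer feed-forward architecture and LSTM history encoder described above; the dimensions and layer shapes are illustrative, not the paper's reported configuration:

```python
import torch
import torch.nn as nn

class PolicyNetwork(nn.Module):
    """Sketch of the policy: LSTM over the path history, then a two-layer MLP
    scoring the stacked candidate-action embeddings."""
    def __init__(self, d, hidden):
        super().__init__()
        # Encodes the sequence of [relation; entity] embeddings along the path.
        self.lstm = nn.LSTM(input_size=2 * d, hidden_size=hidden, batch_first=True)
        # Maps [h_t; e_t; r_q] to a query vector over action embeddings.
        self.mlp = nn.Sequential(
            nn.Linear(hidden + 2 * d, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 2 * d),
        )

    def forward(self, path_emb, e_t, r_q, A_t):
        # path_emb: (1, T, 2d) path history; e_t, r_q: (d,); A_t: (|A_t|, 2d)
        h, _ = self.lstm(path_emb)
        h_t = h[0, -1]                            # last hidden state = history encoding
        q = self.mlp(torch.cat([h_t, e_t, r_q]))  # (2d,)
        scores = A_t @ q                          # one score per candidate action
        return torch.softmax(scores, dim=0)       # distribution over A_t

# Toy usage with random embeddings.
d, hidden = 50, 100
net = PolicyNetwork(d, hidden)
probs = net(torch.randn(1, 3, 2 * d), torch.randn(d), torch.randn(d), torch.randn(7, 2 * d))
print(probs.shape)  # torch.Size([7])
```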
A binary reward $R(p)$ is observed after the maximum time step $T$: $R(p) = 1$ if the path ends at the correct answer and $0$ otherwise.
The objective of the model is to maximize the expected reward:

$$J(\theta) = \mathbb{E}_{(e_s, r_q, e_t) \sim G}\Big[\textstyle\sum_{p \in P(e_s, r_q)} \pi_\theta(p)\, R(p)\Big], \quad (2)$$

where $P(e_s, r_q)$ is the set of all reasoning paths related to the given query $(e_s, r_q, ?)$. The optimization is then performed using the REINFORCE algorithm (Williams, 1992).
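As a sketch, the per-path REINFORCE update can be implemented by scaling the negative log-likelihood of the sampled actions by the observed reward; this is the standard estimator, not a detail specific to RARL:

```python
import torch

def reinforce_loss(log_probs, reward):
    """One-path REINFORCE estimator: maximizing E[R(p)] corresponds to
    minimizing -R(p) * sum_t log pi_theta(a_t | s_t)."""
    return -reward * torch.stack(log_probs).sum()

# Toy usage: log-probabilities of three actions along one sampled path, binary reward 1.
log_probs = [torch.tensor(-0.2, requires_grad=True),
             torch.tensor(-1.1, requires_grad=True),
             torch.tensor(-0.4, requires_grad=True)]
loss = reinforce_loss(log_probs, reward=1.0)
loss.backward()  # gradients flow into the policy parameters in a real model
```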

Beam Search
In the RL context, Beam Search (BS) (Sutskever et al., 2014) stores the top-$K$ scoring partially constructed paths at each time step, where $K$ is known as the beam size. At each time $t$, BS extends itself via the following process. Let $B_t$ denote the set of paths held by BS at the end of time $t$. For each path $p = (e_s, r_1, \ldots, e_t) \in B_t$, we first generate its candidate set

$$\mathrm{cand}(p) = \{(e_s, r_1, \ldots, e_t, r', e') \mid (e_t, r', e') \in G\}. \quad (3)$$

Each candidate path $p' \in \mathrm{cand}(p)$ is associated with a score $s(p')$ calculated by the policy network; here, $s(p') = \pi_\theta((r', e') \mid e_t)$. We then take the union of these candidate sets, $\tilde{B}_t = \bigcup_{p \in B_t} \mathrm{cand}(p)$, and a new beam $B_{t+1}$ is generated by picking the $K$ top-scoring elements of $\tilde{B}_t$.
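A compact sketch of this procedure is shown below; `score_fn` stands in for the policy network and is assumed to return an additive log-score for extending a path with an edge $(r', e')$:

```python
import heapq

def beam_search(kg, e_s, score_fn, K, T):
    """Standard beam search over a KG. A path is stored as [e_s, r_1, e_1, ...];
    score_fn(path, r, e) returns an additive log-score for one extension."""
    beam = [(0.0, [e_s])]
    for _ in range(T):
        candidates = []
        for score, path in beam:
            e_t = path[-1]
            for r, e in kg.get(e_t, []):  # cand(p): one candidate per outgoing edge
                candidates.append((score + score_fn(path, r, e), path + [r, e]))
        if not candidates:
            break
        # Keep the K top-scoring candidates as the new beam B_{t+1}.
        beam = heapq.nlargest(K, candidates, key=lambda c: c[0])
    return beam

# Toy usage with a uniform scorer on a tiny graph.
toy_kg = {"A": [("r1", "B"), ("r2", "C")], "B": [("r3", "D")]}
print(beam_search(toy_kg, "A", lambda path, r, e: 0.0, K=2, T=2))
```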

The RARL Model
As illustrated in Figure 1, the RARL model consists of two parts: the KG environment and the agent. By interacting with the environment, the agent employs a beam search based action selection strategy and picks K actions to extend the beam in one episode. The action selection strategy selects actions from three sources: actions matching rules, actions with high scores, and actions randomly sampled. After the maximum time step, the agent receives a binary reward.

Rule Based Action Selection
In a typical KG, as the path length increases, finding a non-zero reward becomes exponentially more difficult. Learning from such sparse rewards requires a great deal of effective exploration. However, at the beginning of training, due to randomly initialized parameters, the model chooses actions almost at random and can hardly arrive at targets, which makes the sparse reward problem even worse (Xiong et al., 2017; Hare, 2019). Considering that rules precisely characterize a mapping from query relations to semantically composed paths (Zhang et al., 2019), RARL utilizes rules as prior information about actions to increase the probability of paths receiving rewards, which also helps facilitate effective exploration.
The rules mined from KGs are of the form head ← body, where the head is an atom $r(a, b)$ and the body is of the form $r_1(x_0, x_1) \wedge \ldots \wedge r_{n+1}(x_n, x_{n+1})$. Note that $r(x_i, x_j)$ is equivalent to the fact triple $(x_i, r, x_j)$.
Given the query relation $r_q$, RARL first selects from the rule pool the rules $R_{r_q}$ whose heads are identical to $r_q$. At each time step $t$, it maintains a beam $B_t$ of $K$ paths. For each path $p = (e_s, r_1, e_1, \ldots, r_t, e_t) \in B_t$, RARL expands its candidate paths based on the outgoing edges of $e_t$. Next, among all candidate paths in $\tilde{B}_t$, only those whose relation sequences match related rules from left to right are selected. For instance, suppose $R_{r_q}$ contains only one rule $r_q \leftarrow r_1 \wedge r_2$; given two candidate paths $(e_s, r_1, e_1, r_2, e_2)$ and $(e_s, r_2, e_3, r_3, e_4)$, only the relation sequence of the former matches the rule, so the former is selected to generate $B_{t+1}$. If the number of candidate paths matching rules exceeds the beam size, RARL selects the top-$K$ of them according to their scores calculated by the policy network. If not, the remaining paths not matching rules are selected as a complement. To balance actions generated by free exploration against actions matching rules, RARL randomly masks some related rules to shrink the number of paths matching rules.
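A sketch of this selection step is given below, assuming rules are stored as tuples of body relations (e.g., `("r1", "r2")` for $r_q \leftarrow r_1 \wedge r_2$); the random rule-masking step is omitted for brevity:

```python
def relation_sequence(path):
    """Relations along a path [e_s, r_1, e_1, ..., r_t, e_t]."""
    return tuple(path[1::2])

def matches_some_rule(path, rule_bodies):
    """True if the path's relation sequence matches some rule body left to right,
    i.e. it is a prefix of that body."""
    seq = relation_sequence(path)
    return any(body[:len(seq)] == seq for body in rule_bodies)

def rule_aware_select(candidates, rule_bodies, K, score_fn):
    """Prefer candidates matching rules; if they exceed the beam size, keep the
    top-K of them, otherwise fill the beam with top-scoring unmatched paths."""
    matched = [p for p in candidates if matches_some_rule(p, rule_bodies)]
    unmatched = [p for p in candidates if not matches_some_rule(p, rule_bodies)]
    matched.sort(key=score_fn, reverse=True)
    unmatched.sort(key=score_fn, reverse=True)
    return (matched + unmatched)[:K]
```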

Partially Random Beam Search
To ease the impact of spurious paths, we try to prevent the RL-based model from fixating on spurious paths and to induce diversity during BS. Inspired by Guu et al. (2017), we introduce partial randomness into standard BS to fight against the impact of spurious paths.
Like regular beam search, at time $t$, RARL computes the set of all candidate paths $\tilde{B}_t$ and sorts them by the scores computed by the policy network $\pi_\theta$. However, instead of selecting the $K$ highest-scoring candidate paths, RARL randomly chooses $\lambda K$ candidate paths from $\tilde{B}_t$ and chooses the remaining paths according to their scores. In this way, low-scoring paths that would be discarded in standard BS also have a chance to be explored. Besides, the randomness prevents the model from sticking to the paths that happened to receive rewards. In the experiments, RARL selects paths with replacement when the number of available actions is smaller than $\lambda K$.
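A minimal sketch of this selection rule, including sampling with replacement when the candidate pool is smaller than $\lambda K$:

```python
import random

def partially_random_select(candidates, K, lam, score_fn):
    """Fill lambda*K beam slots by uniform random sampling and the remaining
    slots with the highest-scoring candidates."""
    n_random = int(lam * K)
    pool = list(candidates)
    if len(pool) < n_random:
        # Sample with replacement when the pool is smaller than lambda*K.
        random_part = [random.choice(pool) for _ in range(n_random)]
    else:
        random_part = random.sample(pool, n_random)
    rest = [p for p in pool if p not in random_part]
    rest.sort(key=score_fn, reverse=True)
    return random_part + rest[: K - n_random]
```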

Experimental Setup
Datasets and Rules We adopt three datasets to evaluate the performance of RARL on link prediction: UMLS (Kok and Domingos, 2007), WN18RR (Dettmers et al., 2017), and FB15K-237 (Toutanova et al., 2015). For FB15K-237, 20 relations in the film domain are selected. Following Niu et al. (2020), we use AMIE+ (Galárraga et al., 2015) to automatically extract rules, and we limit the maximum length of rules to 2. Table 2 lists the statistics of rules mined from the three datasets under various confidence thresholds.

Hyperparameters We set the dimensions of entity and relation embeddings within (50, 200). A three-layer LSTM is used as the path encoder, and its hidden dimension is set within (100, 200). The λ is set to 0.9, 0.4, and 0.7 for UMLS, WN18RR, and FB15K-237, respectively, according to the average degree of nodes and the average number of relation rules on each dataset.

Results Table 1 summarizes the experimental results of our proposed approach and the baselines. As shown in Table 1, RARL achieves competitive results compared with multi-hop reasoning methods. On FB15K-237, RARL outperforms all baselines in terms of Hits@1, Hits@10, and MRR. On WN18RR and UMLS, RARL achieves the best results in terms of Hits@1 and MRR. The Hits@1 results emphasize the superiority of our approach in high-precision link prediction and confirm the effectiveness of high-quality rules. We also notice that the embedding-based methods perform better on UMLS and FB15K-237 than the multi-hop reasoning methods. One reason is that multi-hop reasoning methods are more sensitive to the sparsity and incompleteness of the graph: it is hard for them to find evidential paths reaching targets by strictly searching over the KG. In contrast, the embedding-based methods (Lin et al., 2018; Fu et al., 2019) map entities and relations into a unified semantic space to capture their inner connections, which relaxes this restriction.

Ablation Study
We perform an ablation study to look deeper into the framework of RARL, deactivating the rule information and the random mechanism, respectively. The MRR results are summarized in Table 3. It can be observed that removing each reasoning component of RARL results in a significant performance drop on UMLS and WN18RR. On FB15K-237, removing the rule information appears to have no influence on the validation set. We therefore conducted a further analysis under the w/o rule setting and found lower results on the test set. This performance gap may be caused by a difference in data distribution between the test set and the validation set.
Besides, our ablation study shows that removing the partial randomness has a greater negative impact on reasoning performance. This suggests that increasing exploration diversity to obtain more valid path patterns is important in the training stage.

Table 3: Ablation study of the proposed method.

Conclusions
In this paper, we introduced RARL, a new RL-based method for knowledge graph reasoning. RARL jointly makes use of high-quality symbolic rules and partially random beam search to efficiently fight against the sparse reward and spurious path problems. Experimental results demonstrate that RARL achieves better performance than existing multi-hop methods in terms of both Hits@1 and MRR.