UniKER: A Unified Framework for Combining Embedding and Definite Horn Rule Reasoning for Knowledge Graph Inference

Knowledge graph inference has been studied extensively due to its wide applications. It has been addressed by two lines of research, i.e., the more traditional logical rule reasoning and the more recent knowledge graph embedding (KGE). Several attempts have been made to combine KGE and logical rules for better knowledge graph inference. Unfortunately, they either simply treat logical rules as additional constraints in the KGE loss or use probabilistic models to approximate the exact logical inference (i.e., MAX-SAT). Even worse, both approaches need to sample ground rules to tackle the scalability issue, as the total number of ground rules is intractable in practice, making them less effective in handling logical rules. In this paper, we propose a novel framework, UniKER, that addresses these challenges by restricting logical rules to be definite Horn rules. UniKER can fully exploit the knowledge in logical rules and enable the mutual enhancement of logical rule-based reasoning and KGE in an extremely efficient way. Extensive experiments demonstrate that our approach is superior to existing state-of-the-art algorithms in terms of both efficiency and effectiveness.


Introduction
Knowledge Graphs (KGs) have grown rapidly in recent years and provide remarkably valuable resources for many real-world applications (Auer et al., 2007; Bollacker et al., 2008; Suchanek et al., 2007). KG reasoning, which aims to infer missing knowledge from the existing facts, is key to the success of many downstream tasks and has received wide attention.
Knowledge Graph Embedding (KGE) methods currently hold the state of the art in KG reasoning (Bordes et al., 2013; Wang et al., 2014; Yang et al., 2014; Sun et al., 2019). They capture the similarity of entities by exploiting rich structural information in KGs to predict unseen triples. Despite their excellent performance, KGE methods ignore the high-order constraints specified by logical rules, which limits their application in more complex reasoning tasks. In Fig. 1, for example, we can infer that Stilwell lives in the USA based on KGE, due to her similarity with Miller in terms of embeddings, as both can be reached via the same relation transformation (i.e., isMarriedTo) from the same entity (Edison). But without the capability of leveraging logical rules, we cannot infer that Stilwell and Miller speak English.
An alternative solution is to infer missing facts via logical rules, which have been extensively explored by traditional logical rule-based methods (Richardson and Domingos, 2006; De Raedt and Kersting, 2008). As shown in Fig. 1, two entities that are not directly connected in a KG (e.g., Miller and English) can participate in the same ground logical rule, e.g., speakLanguage(Miller, English) ← liveIn(Miller, USA) ∧ officialLanguage(USA, English), and a relation between them can be inferred if all predicates in the rule body are true. Different from KGE, logical inference treats triples as independent units and ignores the correlations among them. As a result, its performance highly depends on the completeness of the KG, which is often severely lacking in reality. For example, due to the absence of the triple liveIn(Mary Stilwell, USA), the triple speakLanguage(Mary Stilwell, English) cannot be inferred in Fig. 1. Besides its sensitivity to KG quality, logical inference is also known for its high computational complexity, as it requires instantiating universally quantified rules into ground rules, which is extremely time-consuming.
Although both embedding-based methods and logical rule-based methods have their limitations, they are complementary and can jointly deliver better reasoning. As shown in Fig. 1, on one hand, logical rules provide additional information by exploiting the higher-order dependency of KG relations (Fig. 1(d)). On the other hand, high-quality embeddings can supply the missing facts that logical inference depends on (Fig. 1(c)). Despite several attempts to combine KGE and logical rules for KG reasoning, existing methods either simply treat logical rules as additional constraints in the KGE loss (Guo et al., 2016; Rocktäschel et al., 2015; Demeester et al., 2016) or use probabilistic models to approximate the exact logical inference (i.e., MAX-SAT) (Qu and Tang, 2019; Zhang et al., 2019; Harsha Vardhan et al., 2020). Moreover, these methods rely on ground rules, the total number of which is intractable in practice. To tackle this scalability issue, only a small portion of ground predicates/rules is sampled to approximate the inference process, which causes further information loss on the logic side.
To overcome the above issues, we propose UniKER, a novel Unified framework for combining Knowledge graph Embedding with logical Rules for better KG reasoning, which handles a special type of first-order logic, i.e., definite Horn rules. First, we combine logical rule reasoning and KG embedding in an iterative manner, so that the knowledge inferred by each technique benefits the other, as shown in Fig. 1. Second, we propose an iterative grounding algorithm that extends the classic forward chaining algorithm designed for definite Horn rule reasoning in an extremely efficient way. Consequently, UniKER can fully exploit the knowledge contained in logical rules and enrich the KG for better embedding; meanwhile, KGE enhances forward chaining by including more potentially useful hidden facts (see Fig. 1(c)). In this way, the two procedures mutually enhance each other. The main contributions of this paper are summarized as follows:
• We investigate the problem of combining embedding and definite Horn rules, a much simpler yet popular form of logical rules, for KG inference.
• A unified framework, UniKER, is proposed, which provides a simple yet effective iterative mechanism to let logical inference and KGE mutually enhance each other in an efficient way.
• We theoretically and experimentally show that UniKER is superior to existing SOTA methods in terms of efficiency and effectiveness.

Knowledge Graphs in the Language of Symbolic Logic
A KG, denoted by G = {E, R, O}, consists of a set of entities E, a set of relations R, and a set of observed facts O. Each fact in O is represented by a triple (e_i, r_k, e_j), where e_i, e_j ∈ E and r_k ∈ R. In the language of logic, entities can also be considered as constants and relations are called predicates. Each predicate in a KG is a binary logical function defined over two constants, denoted r(·, ·). A ground predicate is a predicate whose arguments are all instantiated by constants. For example, given a predicate liveIn(·, ·), by assigning the constants Miller and USA to it, we get the ground predicate liveIn(Miller, USA). A triple (e_i, r_k, e_j) is essentially a ground predicate, denoted r_k(e_i, e_j) in the language of logic. In the reasoning task, a ground predicate can be regarded as a binary random variable: r_k(e_i, e_j) = 1 when the triple (e_i, r_k, e_j) holds true, and r_k(e_i, e_j) = 0 otherwise. Given the observed facts v_O = {r_k(e_i, e_j) | (e_i, r_k, e_j) ∈ O}, the task of knowledge graph inference is to predict the truth values of all hidden (i.e., unobserved) triples v_H.
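As a toy sketch of this notation (the entity and relation names below are assumptions drawn from the running example, not real data), a KG can be held as a set of observed triples O, with each triple acting as a binary ground predicate:

```python
# A KG as a set of observed triples O; each triple (e_i, r_k, e_j) is a
# ground predicate r_k(e_i, e_j) treated as a binary variable.
O = {
    ("Edison", "isMarriedTo", "Stilwell"),
    ("Miller", "liveIn", "USA"),
    ("USA", "officialLanguage", "English"),
}

def truth_value(triple, observed):
    """v = 1 for an observed triple; unobserved triples are hidden variables."""
    return 1 if triple in observed else 0
```

Inference then amounts to assigning truth values to the triples not in `O`.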

First Order Logic and Definite Horn Rules
First-order logic (FOL) rules are constructed over predicates using logical connectives and quantifiers; they usually require extensive human supervision to create and validate, which severely limits their applications. In contrast, definite Horn rules, a special case of FOL rules, can be extracted automatically, efficiently, and with high quality by modern rule mining systems such as WARMR (Dehaspe and Toivonen, 1999) and AMIE (Galárraga et al., 2015), and are widely used in practice. A definite Horn rule is composed of a body of conjunctive predicates and a single positive head predicate. It is usually written in the form of an implication:

r_0(x, y) ← r_1(x, z_1) ∧ r_2(z_1, z_2) ∧ r_3(z_2, y),    (1)

where r_0(x, y) is called the head of the rule while r_1(x, z_1) ∧ r_2(z_1, z_2) ∧ r_3(z_2, y) is the body of the rule. By substituting the variables x, z_1, z_2, y with concrete entities e_i, e_p, e_q, e_j, we get a ground definite clause as follows:

r_0(e_i, e_j) ← r_1(e_i, e_p) ∧ r_2(e_p, e_q) ∧ r_3(e_q, e_j).    (2)
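Grounding can be sketched as a substitution of variables by constants. The sketch below uses placeholder relation and entity names mirroring the rule above (all names are assumptions for illustration):

```python
# A chain-like definite Horn rule: body atoms and a head atom, each written
# as (relation, first_variable, second_variable).
rule_body = [("r1", "x", "z1"), ("r2", "z1", "z2"), ("r3", "z2", "y")]
rule_head = ("r0", "x", "y")

def ground(atom, substitution):
    """Instantiate one atom's variables with concrete entities."""
    rel, a, b = atom
    return (substitution[a], rel, substitution[b])

# Substituting x, z1, z2, y with e_i, e_p, e_q, e_j yields a ground clause.
theta = {"x": "e_i", "z1": "e_p", "z2": "e_q", "y": "e_j"}
ground_body = [ground(atom, theta) for atom in rule_body]
ground_head = ground(rule_head, theta)
```

If every triple in `ground_body` is true, the `ground_head` triple can be inferred.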

Logical Reasoning
Traditional logical inference aims to find an assignment of truth values to all hidden ground predicates that maximizes the number of satisfied ground rules. It can thus be modeled mathematically as a MAX-SAT problem, which is NP-hard (Shimony, 1994).

Knowledge Graph Embedding
KGE aims to capture the similarity of entities by embedding entities and relations into low-dimensional vectors. Scoring functions, which measure the plausibility of triples in KGs, are the crux of KGE models. We denote the score of a triple (e_i, r_k, e_j) computed by the scoring function as f_{r_k}(e_i, e_j). Representative KGE algorithms include TransE (Bordes et al., 2013), TransH (Wang et al., 2014), TransR (Lin et al., 2015), DistMult (Yang et al., 2014), ComplEx (Trouillon et al., 2016), and RotatE (Sun et al., 2019), which differ in their scoring functions.
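As a minimal illustration of how such scoring functions differ, here are simplified versions of the TransE and DistMult scores (the toy vectors are assumptions; real models learn embeddings by gradient descent):

```python
import numpy as np

# TransE scores a triple by -||e_i + r_k - e_j||_1: a good triple has the
# relation acting as a translation. DistMult scores by a trilinear product.
def transe_score(ei, rk, ej):
    return -np.linalg.norm(ei + rk - ej, ord=1)

def distmult_score(ei, rk, ej):
    return float(np.sum(ei * rk * ej))

# Toy embeddings chosen so that e_i + r_k == e_j (a plausible triple).
ei = np.array([0.1, 0.2])
rk = np.array([0.3, 0.1])
ej = np.array([0.4, 0.3])
```

A near-zero TransE score (its maximum) marks the triple as plausible; implausible triples score far below zero.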

Related Work on Integrating Embedding and Logical Rules

The MAX-SAT problem is defined over Boolean logic, while the scoring function of a KGE model assigns soft truth values to triples in the KG. Probabilistic logic, which extends Boolean logic to enable uncertain inference, is widely used to integrate the two worlds into the same framework. Existing approaches can be divided into two categories: (1) adding a Probabilistic Soft Logic (PSL)-based regularization to embedding models, and (2) performing embedding-based variational inference for Markov Logic Networks (MLNs).

PSL-based Regularization in Embedding Loss.
The first category treats logical rules as additional regularization for embedding models, where the satisfaction loss of ground rules is integrated into the original embedding loss. Probabilistic Soft Logic (PSL) (Bach et al., 2015) is used to compute the satisfaction loss, where the probability of each predicate is determined by the embedding. KALE (Guo et al., 2016), RUGE (Guo et al., 2017), and Rocktäschel et al. (2015) are representative methods; a summary can be found in Appendix E. All approaches in this category must instantiate universally quantified rules into ground rules before model learning. When all ground rules are included in the calculation of the satisfaction loss, the additional regularization becomes a convex program that reasons over the analogous MAX-SAT problem defined in Lukasiewicz logic (Klir and Yuan, 1996), whose solution only approximates the MAX-SAT problem (Bach et al., 2015); a detailed proof is given in Appendix C. As the total number of ground rules is intractable in practice, only a small portion of ground rules is sampled to tackle the scalability issue, which further loses logical information. Moreover, most methods in this category make only a one-time injection of logical rules to enhance embedding, ignoring the interactive nature between embedding and logical inference (Guo et al., 2016; Rocktäschel et al., 2015).

Exact Logical Inference via Forward Chaining. Forward chaining repeatedly triggers rules whose premises are satisfied (e.g., liveIn(Mina Miller, USA) ∧ officialLanguage(USA, English)) and adds their conclusions (e.g., speakLanguage(Mina Miller, English)) to the known facts until no new facts can be added. As illustrated in Fig. 1, unlike other logical inference algorithms, which require all ground predicates (both observed and unobserved) in the computation, forward chaining adopts "lazy inference" instead: it involves only a small subset of "active" ground predicates/rules and activates more as the inference proceeds.
This mechanism dramatically improves inference efficiency by avoiding computation over the massive number of ground predicates/rules that are never used. Moreover, the definite Horn rules that can be extracted efficiently by modern rule mining systems are usually chain-like, in the form shown in Eq. (1). The conjunctive body of a ground chain-like Horn rule is essentially a path in the KG, which can be extracted efficiently using sparse matrix multiplication. A more general implementation of the forward chaining algorithm can be found in Appendix G.
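The path-extraction step can be sketched as follows. For readability this sketch uses dense NumPy arrays as a stand-in for the sparse matrices used in practice, and a two-hop rule body as an assumed example:

```python
import numpy as np

# The body of a ground chain rule r0(x, y) <- r1(x, z) ^ r2(z, y) is a
# path, so its groundings are found by multiplying adjacency matrices.
n = 4  # number of entities
M1 = np.zeros((n, n), dtype=int)
M1[0, 2] = 1  # r1(e0, e2)
M1[1, 3] = 1  # r1(e1, e3)
M2 = np.zeros((n, n), dtype=int)
M2[2, 1] = 1  # r2(e2, e1)

paths = M1 @ M2  # paths[i, j] > 0 iff some z with r1(i, z) and r2(z, j) exists
inferred = list(zip(*np.nonzero(paths)))
# the single path e0 -r1-> e2 -r2-> e1 entails the head r0(e0, e1)
```

Replacing the dense arrays with `scipy.sparse` matrices keeps the same code but scales to large KGs, since real adjacency matrices are extremely sparse.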
Update Embedding with Enhanced KG. Given the enhanced KG, embeddings are learned by minimizing a margin-based ranking loss over all triples in O ∪ v^T_{H*}, in which L(e_i, r_k, e_j) is defined as

L(e_i, r_k, e_j) = Σ_{(e_i', r_k, e_j') ∈ N(e_i, r_k, e_j)} max(0, γ − f_{r_k}(e_i, e_j) + f_{r_k}(e_i', e_j')),

where (e_i', r_k, e_j') denotes a negative sample and γ is a margin separating positive and negative triples. The score f_{r_k}(e_i, e_j) of a triple (e_i, r_k, e_j) can be computed with any KGE scoring function. To reduce the effect of randomness, we sample multiple negative triples for each positive sample, denoted N(e_i, r_k, e_j). To ensure that true but unseen triples are not sampled as negatives, the selection of N(e_i, r_k, e_j) is restricted to v^F_{H*}.
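The margin-based ranking loss described above can be sketched as follows; TransE is used as an assumed scoring function, and the toy embeddings are illustrative only:

```python
import numpy as np

# TransE score: higher (closer to 0) means more plausible.
def f(ei, rk, ej):
    return -np.linalg.norm(ei + rk - ej, ord=1)

def margin_loss(pos, negs, gamma=1.0):
    """Sum over negatives of max(0, gamma - f(positive) + f(negative))."""
    ei, rk, ej = pos
    return sum(max(0.0, gamma - f(ei, rk, ej) + f(ni, rk, nj))
               for ni, _, nj in negs)

rk = np.array([1.0, 0.0])
pos = (np.array([0.0, 0.0]), rk, np.array([1.0, 0.0]))   # perfect triple, f = 0
negs = [
    (np.array([0.0, 0.0]), rk, np.array([3.0, 0.0])),    # f = -2: outside margin
    (np.array([0.0, 0.0]), rk, np.array([1.5, 0.0])),    # f = -0.5: inside margin
]
loss = margin_loss(pos, negs)
```

Only the second negative, whose score falls within the margin γ of the positive, contributes to the loss.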
Update KG with KGE-based Inference. Although forward chaining can efficiently find a satisfying truth assignment for all hidden triples, its reasoning ability is severely limited by the coverage of the rules, the incompleteness of the KG, and the errors/noise the KG contains. Given their strong reasoning ability and robustness, KGE models are useful both to (1) prepare a more complete KG by adding useful hidden triples and to (2) eliminate incorrect triples from the KG and the inferred results.
(1) Including Potentially Useful Hidden Triples (∆+). Since the body of a definite Horn rule is a conjunction of predicates, a ground rule can fire and contribute to logical inference only if all predicates in its body are observed. Due to the sparsity of real-world KGs, only a small portion of ground rules can participate in logical inference, which severely limits the reasoning ability of definite Horn rules. A straightforward solution would be to compute a score for every hidden triple and add the most promising ones to the KG. Unfortunately, the number of hidden triples is quadratic in the number of entities (i.e., O(|R||E|^2)), so computing scores for all of them is too expensive. Instead, we adopt a "lazy inference" strategy that selects only a small subset of "potentially useful" triples. Take the ground rule in Eq. (2) as an example: if r_1(e_i, e_p) ∈ v_O, r_3(e_q, e_j) ∈ v_O, and r_2(e_p, e_q) ∈ v_H, we cannot infer the head r_0(e_i, e_j), because whether r_2(e_p, e_q) is true is unknown. Thus r_2(e_p, e_q) is the crux for determining the truth value of the head, and we call it "potentially useful". In general, given a ground rule whose body includes exactly one unobserved ground predicate, that unobserved predicate is a "potentially useful" triple. We denote the set of all such triples by ∆+. According to their positions, "potentially useful" triples fall into two categories: (1) triples that are the first or last predicate in a ground rule; and (2) triples that are neither. We propose algorithms to identify both types, as illustrated in Fig. 2 using Eq. (2) as an example; more details are given in Appendix A. The score f_{r_k}(e_i, e_j) computed by the KGE model is then used to predict whether a "potentially useful" triple is true.
If f_{r_k}(e_i, e_j) is larger than a given threshold Ψ, the triple is classified as true; otherwise, it is classified as false. We experimentally analyse the effect of Ψ in Appendix J.2. Note that a dynamic programming algorithm can be used to reduce the computational cost for long rules; the detailed algorithm can be found in Appendix B.
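The selection-and-thresholding step can be sketched as follows. The KGE score is faked with a lookup table here, an assumption standing in for a trained model:

```python
# A hidden triple is "potentially useful" when it is the ONLY unobserved
# predicate in a ground rule body; its KGE score is then compared to Psi.
O = {("a", "r1", "b"), ("c", "r3", "d")}
body = [("a", "r1", "b"), ("b", "r2", "c"), ("c", "r3", "d")]

def potentially_useful(body, observed):
    hidden = [t for t in body if t not in observed]
    return hidden[0] if len(hidden) == 1 else None

fake_scores = {("b", "r2", "c"): 0.8}  # stand-in for f_{r_k}(e_i, e_j)
PSI = 0.5                              # classification threshold

candidate = potentially_useful(body, O)
accepted = candidate is not None and fake_scores.get(candidate, 0.0) > PSI
```

If `accepted`, the candidate is added to the KG and the rule's head can be derived in the next forward-chaining pass.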
(2) Excluding Potentially Incorrect Triples (∆−). In addition, due to their symbolic nature, logical rules cannot handle noisy data either. If the KG contains errors, forward chaining will draw incorrect inferences from the incorrect observations; even worse, it may propagate errors by adding incorrectly inferred triples to the KG. A clean KG is therefore critical for logical inference. Since KGE models are powerful in capturing the network structure of KGs, incorrect triples usually cause contradictions and receive lower prediction scores than correct ones. The score f_{r_k}(e_i, e_j) computed by the KGE model can thus measure the reliability of each triple (e_i, r_k, e_j) in O ∪ v^T_{H*}. We denote the bottom θ% of triples with the lowest prediction scores as ∆−, and exclude them from O ∪ v^T_{H*} to alleviate the impact of noise.
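The ∆− selection reduces to ranking by score and dropping the tail. A minimal sketch (the triple names and scores are made-up illustrations):

```python
# Flag the bottom theta% of triples, ranked by KGE score, as potentially
# incorrect; these form Delta- and are excluded from the KG.
def bottom_theta(scored_triples, theta):
    """scored_triples: list of (triple, score). Returns the lowest-scored theta%."""
    k = int(len(scored_triples) * theta / 100)
    ranked = sorted(scored_triples, key=lambda ts: ts[1])
    return [t for t, _ in ranked[:k]]

scored = [("t1", 0.9), ("t2", 0.1), ("t3", 0.7), ("t4", 0.4), ("t5", 0.8)]
delta_minus = bottom_theta(scored, 40)  # bottom 40% of 5 triples -> 2 triples
```

With θ = 40, the two lowest-scoring triples are flagged; the paper's ablation (Table 3) studies this θ directly.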

Integrating Embedding and Logical Rules in an Iterative Manner
Since logical rules and KGE can mutually enhance each other, we propose a unified framework, known as UniKER, to integrate KGE and definite Horn rule-based inference iteratively. The pseudo-code of UniKER can be found in Algorithm 1. MAX_ITER is the user-specified maximum number of iterations, which highly depends on the KG dataset; according to the results in Fig. 4, MAX_ITER is usually set to 2 to 4. Each iteration of UniKER comprises two steps. First, we perform logical reasoning to update the KG: following the forward chaining algorithm, by triggering all rules whose premises are satisfied, we derive the entailed triple set v^T_{H*} and add it to the KG. Second, we train the KGE model on the enhanced KG and use it to further update the KG, by including potentially useful hidden triples (∆+) and excluding potentially incorrect triples (∆−).
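The overall loop can be sketched as follows. This is a minimal runnable caricature, not Algorithm 1 itself: rules are restricted to two-atom chain rules, and the KGE training plus ∆+/∆− steps are reduced to a placeholder comment, since a real embedding model is out of scope here:

```python
def forward_chain(kg, rules):
    """One pass of forward chaining: fire every rule whose body is observed.
    Each rule is (head_rel, body_rel_1, body_rel_2)."""
    inferred = set()
    for head, r1, r2 in rules:
        mids = {}
        for a, r, b in kg:          # index r1 edges by their endpoint
            if r == r1:
                mids.setdefault(b, []).append(a)
        for b, r, c in kg:          # join with r2 edges
            if r == r2:
                for a in mids.get(b, []):
                    inferred.add((a, head, c))
    return inferred - kg

def uniker(kg, rules, max_iter=3):
    kg = set(kg)
    for _ in range(max_iter):
        kg |= forward_chain(kg, rules)   # step 1: exact logical inference
        # step 2 (placeholder): train KGE on kg, add Delta+, remove Delta-
    return kg

kg = {("Miller", "liveIn", "USA"), ("USA", "officialLanguage", "English")}
rules = [("speakLanguage", "liveIn", "officialLanguage")]
```

Running `uniker(kg, rules)` derives speakLanguage(Miller, English) from the two observed facts, mirroring the running example of Fig. 1.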

Connection to Existing Approaches
Connection to PSL-based Regularization Approaches. The general objective of PSL-based regularization approaches can be written as

L = L_KGE + L_PSL,

where L_KGE denotes the loss of the base KGE model and L_PSL corresponds to the satisfaction loss of the sampled ground rules. When all ground rules are included in the calculation of L_PSL, L_PSL becomes a convex program that reasons over the analogous MAX-SAT problem defined in Lukasiewicz logic, which only approximates the exact logical inference; a detailed proof is given in Appendix C. Instead of guiding the embedding learning approximately, UniKER directly takes the optimum of the MAX-SAT problem as targets for optimizing the embedding model, and can thus better exploit the knowledge contained in definite Horn rules. Moreover, L_PSL makes only a one-time injection of logical rules to enhance embedding: logical reasoning is not further enhanced even after the KGE improves. On the contrary, UniKER captures the interactive nature between embedding and logical inference.
Connection to Embedding-based Variational Inference for MLN. The general objective of embedding-based variational inference for MLN can be written as

L = L_KGE(Q_θ) − L_ELBO(Q_θ, P_w),

where the variational distribution Q_θ is defined by a KGE model, P_w is the true posterior defined over the MLN, and L_KGE(Q_θ) denotes the loss of the base KGE model. By maximizing L_ELBO(Q_θ, P_w), the KL divergence between Q_θ and P_w is minimized; in this way, the knowledge contained in the rules is transferred into the embeddings. Due to the approximate nature of variational inference and the information loss caused by the sampling procedure, Q_θ can only approximate the optimum of the MAX-SAT problem, and no guarantees are provided on the quality of the obtained solutions. Instead of guiding the learning of the embedding model via variational inference, we directly solve the MAX-SAT problem and use the derived knowledge v^T_{H*} to train the embedding model, which leads to superior reasoning.
Advantages of UniKER over SOTA Methods. We categorize all existing methods along two dimensions: (1) whether they capture the mutual interaction between KGE and logical inference; and (2) whether they conduct exact logical inference. The summary is given in Table 1. On the first dimension, most PSL-based regularization approaches make only a one-time injection of logical rules to enhance embedding, while embedding-based variational inference for MLN and UniKER allow interaction between embedding and logical inference. On the second dimension, both PSL-based regularization approaches and embedding-based variational inference for MLN follow the framework of probabilistic logic to combine logical rules and KGE, and can only approximate the optimal solution of the MAX-SAT problem. UniKER is the first to use forward chaining to conduct exact inference, which provides an optimal solution to the original MAX-SAT problem.

KG Completion
To compare different algorithms on the KG inference task, we mask the head or tail entity of each test triple and require each method to predict the masked entity. Table 2 shows the comparison results, from which we find that: (1) UniKER consistently outperforms KGE models in most cases with significant performance gains, which can be ascribed to the additional knowledge obtained from logical rules; (2) UniKER also outperforms both classes of approaches that combine embedding models with logical rules, as it provides an exact optimal solution to the satisfiability problem defined over all ground rules rather than employing sampling strategies for approximation.
Impact of Iterative Algorithm on KG Completion. To investigate how the iterative process improves the reasoning ability of UniKER, we conduct experiments on the Family dataset and record the performance of UniKER on test data in terms of Hit@1, Hit@10, and MRR at every iteration. In particular, the KGE model in iteration 0 is trained on the original data without any inferred triples. As presented in Fig. 3, we observe that (1) as iterations increase, the performance first improves rapidly and then gradually levels off; (2) UniKER has a bigger impact on Hit@k than on MRR.

Robustness Analysis. To investigate the robustness of UniKER, we compare the reasoning ability of UniKER with TransE on the Family dataset with injected noise; complete details of the noise injection are summarized in Appendix J.1. We vary θ among {10, 20, 30, 40, 50} to study the effect of the threshold used to eliminate noisy triples. The comparison results are presented in Table 3 (ablation study on the noise threshold θ% on the Family dataset whose training set is injected with noise). We observe that (1) UniKER outperforms TransE on the noisy KG with a significant performance gain; (2) as θ increases, the performance first increases and then decreases, with the best performance achieved at θ = 40%.
Effect of Threshold Ψ Used to Include Potentially Useful Hidden Triples. To investigate the effect of the threshold used to include useful hidden triples, we also compare the reasoning ability of UniKER with TransE on the Family dataset under different thresholds Ψ. As the threshold can vary a lot across datasets, to determine Ψ in a unified way, we set Ψ to the score f_{r_k}(e_i, e_j) of the triple ranked at the top ψ% of the test dataset. We vary ψ among {10, 20, 30, 40, 50}. The comparison results are presented in Table 9. We observe that the reasoning ability of UniKER does not vary much with different thresholds; in other words, the performance is insensitive to ψ, which is appealing in practice.

Efficiency Analysis
Besides the promising results on KG reasoning, UniKER is also superior in terms of efficiency. Though we theoretically analyze the computational complexity of UniKER in Appendix H, the efficiency of forward chaining depends heavily on the KG dataset. Since forward chaining computes the optimal truth assignment for the satisfiability problem iteratively, the number of iterations required to reach the optimal solution may affect its scalability. We first conduct two experiments on six datasets (details are introduced in Appendix I): (1) as presented in Fig. 4, we record the proportion of inferred triples accumulated at each iteration over all inferred triples; the results show that forward chaining reaches the optimal solution within 12 iterations and infers most correct triples within only 4 iterations; (2)

Mutual Enhancement between KGE and Logical Inference
In this section, we aim to show that the synergy of KGE and logical inference via UniKER is more powerful than a plain union of the two.

Enhancement of Logical Inference via KGE. On one hand, high-quality embeddings learned by KGE models help prepare a more complete KG by including useful hidden triples, on which the performance of logical inference highly depends. To show the benefit KGE brings to logical inference, we evaluate UniKER-TransE against forward chaining on the Family dataset with the triple classification task, which aims to predict correct facts in the testing data. To create a testing set for classification, we randomly corrupt the relations of correct testing triples to construct negative triples, resulting in 2 × #Test triples with equal numbers of positive and negative examples. We adopt three evaluation metrics: precision, recall, and F1.

Enhancement of KGE via Logical Inference. Since some triples in the test dataset can be derived directly from logical rules, to ensure that the improvement comes from the enhanced reasoning ability of the KGE model, we exclude the triples derived directly from rules from the test data. As presented in Table 6, we observe that UniKER-TransE outperforms the TransE model with a huge performance gain, especially in terms of Hit@1, which can be ascribed to the added value brought by logical rules to KGE.

Conclusion
In this paper, we proposed a novel framework, known as UniKER, to integrate embedding and definite Horn rules in an iterative manner for better KG inference. We have shown that UniKER can fully leverage the knowledge in definite Horn rules and completely transfer them into the embeddings in an extremely efficient way.

A Illustration of Potential Useful Hidden Triples
Let every relation r_k in the KG be associated with an |E| × |E| matrix M^{(k)}, in which the element M^{(k)}_{ij} = 1 if the triple (e_i, r_k, e_j) ∈ O, and 0 otherwise. The algorithms to identify both types of "potentially useful" triples, as illustrated in Fig. 2 of the main text, are given as follows.
• When the "potentially useful" triple is the first or the last predicate in a ground rule, the other observed triples in the chain-like definite Horn rule still constitute a complete path, which can be extracted efficiently by sparse matrix multiplication. Take Fig. 2(c) in the main text as an example: to identify the "potentially useful" triple r_1(e_i, e_p), we first extract all connected paths r_2(e_p, e_q) ∧ r_3(e_q, e_j) by calculating M = M^{(2)} M^{(3)}, where M^{(2)} and M^{(3)} are the adjacency matrices corresponding to relations r_2 and r_3. Each nonzero element M_{pj} indicates a connected path between e_p and e_j. We denote the indexes of the nonzero rows of M as δ = {p | Σ_j M_{pj} ≠ 0}, indicating that some connected path starts at e_p. For each p ∈ δ, ∆_p = {(e_i, r_1, e_p) | e_i ∈ E} defines a set of "potentially useful" triples. If (e_i, r_1, e_p) ∈ ∆_p is predicted to be true via KGE, the head predicate r_0(e_i, e_j) can be inferred.
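This first case can be sketched concretely. Dense NumPy arrays stand in for the sparse matrices, and the tiny adjacency data is an assumed toy example:

```python
import numpy as np

# Appendix A, case 1: extract the observed sub-path r2 o r3, then every
# entity e_p starting such a path anchors candidate triples (e_i, r1, e_p).
n = 3
M2 = np.zeros((n, n), dtype=int)
M2[1, 2] = 1                      # r2(e1, e2) observed
M3 = np.zeros((n, n), dtype=int)
M3[2, 0] = 1                      # r3(e2, e0) observed

M = M2 @ M3                       # connected paths r2(e_p, .) ^ r3(., e_j)
delta = {p for p in range(n) if M[p].sum() != 0}   # nonzero rows of M

# Candidate "potentially useful" triples (e_i, r1, e_p) for each p in delta:
candidates = {(i, "r1", p) for p in delta for i in range(n)}
```

Here only e1 starts a complete r2∘r3 path, so the KGE model is asked to score just the |E| candidates (e_i, r1, e1) rather than all |E|² hidden r1 triples.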
• Otherwise, the path corresponding to the conjunctive body of the ground rule is broken into two paths by the "potentially useful" triple, which we extract separately. As shown in Fig. 2(d) of the main text, when identifying "potentially useful" triples r_2(e_p, e_q) ∈ v_H, the two paths to be extracted are essentially the two single relations r_1 and r_3, whose corresponding matrices are M^{(1)} and M^{(3)}, respectively. We denote the indexes of the nonzero columns of M^{(1)} as δ_1 = {p | Σ_i M^{(1)}_{ip} ≠ 0} and the indexes of the nonzero rows of M^{(3)} as δ_2 = {q | Σ_j M^{(3)}_{qj} ≠ 0}. Then ∆_12 = {(e_p, r_2, e_q) | p ∈ δ_1, q ∈ δ_2} defines a set of "potentially useful" triples. If (e_p, r_2, e_q) ∈ ∆_12 is predicted to be true via KGE, the corresponding head predicates r_0(e_i, e_j) can be inferred.

C Approximation of MAX-SAT via Lukasiewicz Logic

Consider a set of weighted clauses C = {C_1, . . . , C_m}, where each C_j ∈ C is a disjunction of variables r_k(e_i, e_j) or their negations ¬r_k(e_i, e_j), which can be written as

C_j = (∨_{ijk ∈ I_j^+} r_k(e_i, e_j)) ∨ (∨_{ijk ∈ I_j^−} ¬r_k(e_i, e_j)),

where I_j^+ (resp. I_j^−) is the set of variables that are not negated (resp. negated). Instead of interpreting the clauses C in Boolean logic, Lukasiewicz logic allows each variable r_k(e_i, e_j) to take a soft truth value I(r_k(e_i, e_j)) in the interval [0, 1]. Given two variables x_i and x_j, the formulas for the relaxation of logical conjunction (∧), disjunction (∨), and negation (¬) are as follows:

x_i ∧ x_j = max(0, x_i + x_j − 1),
x_i ∨ x_j = min(1, x_i + x_j),
¬x_i = 1 − x_i.

Therefore, by associating each C_j ∈ C with a weight w_j, the analogous MAX-SAT problem defined over C in Lukasiewicz logic can be written as

max_I Σ_{C_j ∈ C} w_j min{1, Σ_{ijk ∈ I_j^+} I(r_k(e_i, e_j)) + Σ_{ijk ∈ I_j^−} (1 − I(r_k(e_i, e_j)))},   (8)

where w_j is the weight of C_j. It is equivalent to the relaxation of the MAX-SAT problem. The proof is as follows.
The MAX-SAT problem defined over the weighted clauses C can be formulated as the integer linear program

max_y Σ_{C_j ∈ C} w_j min{1, Σ_{ijk ∈ I_j^+} y_{ijk} + Σ_{ijk ∈ I_j^−} (1 − y_{ijk})},  s.t. y_{ijk} ∈ {0, 1},   (9)

where w_j is the weight of C_j and y_{ijk} is the Boolean truth value of r_k(e_i, e_j). Finding a most probable assignment to the variables r_k(e_i, e_j) is NP-hard (Shimony, 1994). Using relaxation techniques developed in the randomized algorithms community, we can independently round each Boolean variable r_k(e_i, e_j) to true with probability p_{ijk}. The expected satisfaction score Ŵ of the clauses C is then

Ŵ = Σ_{C_j ∈ C} w_j (1 − Π_{ijk ∈ I_j^+} (1 − p_{ijk}) Π_{ijk ∈ I_j^−} p_{ijk}).

The optimal Ŵ would give the exact MAX-SAT solution. Following (Bach et al., 2015), to approximately optimize Ŵ, we can relax Eq. (9) by allowing y_{ijk} ∈ [0, 1] and identifying y_{ijk} with p_{ijk}. This results in the equivalence of Eq. (8) and the MAX-SAT relaxation. Therefore, the optimum of Eq. (8) can only approximate the optimum of the MAX-SAT problem.
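As a quick sanity check on the Lukasiewicz relaxations used in this argument, note that on Boolean inputs {0, 1} they reduce exactly to classical logic:

```python
# Lukasiewicz relaxations of conjunction, disjunction, and negation over [0, 1].
def l_and(x, y):
    return max(0.0, x + y - 1.0)

def l_or(x, y):
    return min(1.0, x + y)

def l_not(x):
    return 1.0 - x

# On the Boolean corners {0, 1}, they coincide with classical AND/OR/NOT.
for x in (0.0, 1.0):
    for y in (0.0, 1.0):
        assert l_and(x, y) == float(bool(x) and bool(y))
        assert l_or(x, y) == float(bool(x) or bool(y))
```

It is only in the interior of [0, 1] that the relaxation deviates from Boolean logic, which is the source of the approximation gap discussed above.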

D Satisfiability of KG Inference under Restriction of Definite Horn Rules
Given a set of logical rules F and their ground rules F g , if there exists at least one truth assignment that satisfies all ground rules F g , we call it satisfiable. We will show there exists a truth assignment to all hidden triples in a KG such that all ground rules are satisfied when restricting logical rules to be definite Horn rules.
Theorem 1. Knowledge graph inference is satisfiable when restricting logical rules to be definite Horn rules.
Proof. A set of ground rules is unsatisfiable only if a pair of opposite ground predicates (i.e., r_0(e_i, e_j) and ¬r_0(e_i, e_j)) can be derived from them. Since a definite Horn rule has a single positive head predicate, it cannot derive negative triples; a negative ground predicate ¬r_0(e_i, e_j) could therefore arise only if it were explicitly defined in the KG. However, a typical KG does not explicitly include negative triples. Thus we can never derive such a pair of opposite ground predicates, which confirms that KG inference is satisfiable when logical rules are restricted to definite Horn rules.

E Summary of PSL-based Regularization Approaches.
PSL-based regularization methods treat logical rules as additional regularization, where the satisfaction of rules is integrated into the original embedding loss. A typical integration proceeds as follows: (1) sample ground logical rules given the template logical rules; (2) map each related triple (i.e., predicate) to a confidence score (i.e., a soft truth value); (3) compute a satisfaction score for each ground rule based on its predicates' scores; and (4) define a proper loss based on the satisfaction scores of all ground rules. We use KALE (Guo et al., 2016) as an example to illustrate this procedure. First, a set of positive and negative ground rules (i.e., f^+ and f^−) is sampled given the template logical rules; together with atomic formulas (i.e., positive and sampled negative triples), the whole set of formulas is denoted F_g. Second, each predicate r_k(e_i, e_j) is assigned a soft truth value, which is a transformation of the TransE scoring function: I(r_k(e_i, e_j)) = 1 − (1 / (3√d)) ‖e_i + r_k − e_j‖_1, where e_i, r_k, and e_j are the embedding vectors of the corresponding entities and relation and d is the embedding dimensionality. Third, the soft truth value of a ground rule is computed according to Lukasiewicz logic, whose basic operations are summarized in Appendix C; given these operations, the truth value of any ground formula can be calculated recursively. Finally, KALE defines a loss function over the formulas in F_g, which contain both triples and ground rules. Similar to a margin-based ranking loss, embeddings are learned by maximizing the difference between the soft truth values of positive formulas I(f^+) and their negative samples I(f^−). Note that, by removing ground rules from F_g, the loss degenerates to the regular TransE embedding loss. Rocktäschel et al. (2015) devised a model similar to KALE.
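KALE's soft truth value, as written above, can be sketched directly (the toy embedding vectors are assumptions chosen so that the translation is exact):

```python
import numpy as np

# KALE's soft truth value for a triple under TransE embeddings:
# I = 1 - (1 / (3 * sqrt(d))) * ||e_i + r_k - e_j||_1, which lies near 1
# when the relation acts as a clean translation from e_i to e_j.
def soft_truth(ei, rk, ej):
    d = len(ei)
    return 1.0 - np.linalg.norm(ei + rk - ej, ord=1) / (3.0 * np.sqrt(d))

ei = np.array([0.2, 0.1])
rk = np.array([0.3, 0.4])
ej = np.array([0.5, 0.5])  # exactly e_i + r_k, so the triple is "fully true"
```

The 1/(3√d) factor keeps the value in a sensible range for bounded embeddings; a perfect translation yields a soft truth value of 1.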
However, instead of learning an embedding for each individual entity, they use matrix factorization to learn joint embeddings v_{e_i,e_j} of entity pairs as well as relation embeddings v_{r_k}. A logistic loss is used to maximize the soft truth value of positive formulas I(f^+). Different from the above methods, RUGE (Guo et al., 2017) defines its loss over triples. It employs the ComplEx scoring function (Trouillon et al., 2016), σ(Re(⟨e_i, r_k, e_j⟩)), to model triples. Triples are divided into two categories: observed triples (i.e., v_O) and hidden triples (i.e., v_H). Observed triples have labels y_{r_k}(e_i, e_j) = 1, whereas sampled negative triples have labels y_{r_k}(e_i, e_j) = 0. The soft labels of hidden triples, s_{r_k}(e_i, e_j) ∈ [0, 1], are predicted following t-norm fuzzy logic. With y_{r_k}(e_i, e_j) and s_{r_k}(e_i, e_j), RUGE learns embeddings by enforcing triples to be consistent with their labels. A summary of all logical rule-based regularization approaches can be found in Table 7.
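For reference, a ComplEx-style triple score of the kind RUGE builds on can be sketched as follows (our own minimal sketch; following the original ComplEx formulation we conjugate the object embedding, which the shorthand above leaves implicit):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def complex_score(e_i, r_k, e_j):
    """ComplEx-style triple score sigma(Re(<e_i, r_k, conj(e_j)>)),
    where e_i, r_k, e_j are complex-valued embedding vectors."""
    return sigmoid(np.real(np.sum(e_i * r_k * np.conj(e_j))))
```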

F Summary of Embedding-based Variational Inference for MLN.
To specify probability distributions over complex relational domains compactly, Markov Logic Network (MLN) (Richardson and Domingos, 2006) provides a probabilistic extension of FOL via probabilistic graphical models. Given a set of FOL formulas F and their corresponding weight vector w, it defines a Markov network with one node per ground predicate and one feature per ground rule.
The weight of a feature is the weight of its originating FOL rule. Under the MLN model, the joint probability of all triples is defined as $p_w(v_O, v_H) = \frac{1}{Z(w)} \exp\left(\sum_i w_i n_i(v_O, v_H)\right)$, where $n_i(v_O, v_H)$ is the number of true groundings of F_i given the values of v_O and v_H, and Z(w) is a normalization constant ensuring that the probabilities of all possible worlds sum to one. Since MLN inference subsumes probabilistic inference, which is #P-complete, and logical inference, which is NP-complete even in finite domains (Richardson and Domingos, 2006), it is computationally very challenging. Several methods, including pGAT (Harsha Vardhan et al., 2020), ExpressGNN (Zhang et al., 2019) and pLogicNet (Qu and Tang, 2019), conduct variational inference of MLN to alleviate the time complexity. We use pLogicNet (Qu and Tang, 2019) as an example to illustrate the procedure. pLogicNet trains the MLN model by optimizing the evidence lower bound (ELBO) of the likelihood of the observed triples v_O: $\log p_w(v_O) \geq \mathbb{E}_{q_\theta(v_H)}\left[\log p_w(v_O, v_H) - \log q_\theta(v_H)\right]$, where the variational distribution $q_\theta(v_H)$ is defined by a knowledge graph embedding model, assuming each triple independently follows a Bernoulli distribution whose parameter is given by the embedding scoring function: $q_\theta(v_H) = \prod_{(e_i, r_k, e_j) \in H} \mathrm{Ber}\left(v_{(e_i, r_k, e_j)} \mid f_{r_k}(e_i, e_j)\right)$, where Ber denotes the Bernoulli distribution and $f_{r_k}(e_i, e_j)$ is an embedding scoring function giving the probability that the triple (e_i, r_k, e_j) is true. For example, in DistMult, $f_{r_k}(e_i, e_j)$ can be defined as $\sigma(\mathbf{e}_i^T \mathrm{diag}(\mathbf{r}_k)\mathbf{e}_j)$. This lower bound can be effectively optimized with the variational EM algorithm (Neal and Hinton, 1998). In the variational E-step, p_w is fixed and q_θ is updated to minimize the KL divergence between $q_\theta(v_H)$ and $p_w(v_H | v_O)$. In the M-step, q_θ is fixed and the rule weights w are updated to maximize the joint probability of both observed and hidden triples.
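A mean-field Monte Carlo estimate of this ELBO can be sketched as follows (our own minimal sketch, not pLogicNet's implementation: `hidden` maps each hidden triple to its Bernoulli parameter from the embedding model, and `log_joint` stands in for the MLN log-probability, which may be unnormalized since the constant log Z(w) does not affect the optimization of q):

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def distmult_prob(e_i, r_k, e_j):
    """Bernoulli parameter of q for one triple: sigma(e_i^T diag(r_k) e_j)."""
    return sigmoid(np.sum(e_i * r_k * e_j))

def elbo_estimate(hidden, log_joint, n_samples=100):
    """Monte Carlo estimate of E_q[log p_w(v_O, v_H) - log q_theta(v_H)]
    under a mean-field product of Bernoullis.
    `hidden`: dict mapping each hidden triple to its Bernoulli parameter;
    `log_joint`: callable mapping a truth assignment to log p_w(v_O, v_H)
    (up to an additive constant)."""
    total = 0.0
    for _ in range(n_samples):
        assignment, log_q = {}, 0.0
        for triple, p in hidden.items():
            value = rng.random() < p          # sample v ~ Ber(p)
            assignment[triple] = value
            log_q += np.log(p if value else 1.0 - p)
        total += log_joint(assignment) - log_q
    return total / n_samples
```

In the E-step this estimate is maximized over the embedding parameters behind the Bernoulli probabilities; in the M-step it is maximized over the rule weights inside `log_joint`.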
However, due to the high computational cost of MLN inference, the efficiency issue remains substantial even for the variational inference algorithms developed to alleviate it.

G Implementation of Forward Chaining.
We have discussed the implementation of the forward chaining algorithm for chain-like definite Horn rules in Section 4.1. Next, we discuss a more general implementation that handles definite Horn rules of any form. As shown in Fig. 5, unlike traditional logical inference methods, which instantiate the rules with all potential triples in the KG (both observed and unobserved), only observed triples are considered when applying forward chaining. Take the definite Horn rule in Fig. 5 as an example. To infer new facts via forward chaining, we first consider all observed triples that match the first body predicate, liveIn(person, country). Consequently, "country" is limited to a small set of concrete entities, {USA, Denmark}. Restricting "country" to this set, we then ground officialLanguage(country, language) with observed triples, which further limits the candidate entities for "country" to {USA}. Finally, only the triple speakLanguage(Mina Miller, English) can be inferred as a new fact and added to the KG.
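One pass of this grounding procedure can be sketched in Python (a minimal sketch assuming all rule arguments are variables; `forward_chain_step` is our own illustrative name, not UniKER's actual implementation):

```python
def forward_chain_step(kg, rules):
    """One forward-chaining pass over a KG given as a set of (head, rel, tail)
    triples. Each rule is (body, head_atom), where atoms are
    (var, rel, var) tuples and all arguments are variables. The body is
    grounded left to right against observed triples only, so each atom
    progressively narrows the candidate bindings."""
    new = set()
    for body, head in rules:
        bindings = [{}]                       # partial variable bindings
        for (x, rel, y) in body:
            next_bindings = []
            for b in bindings:
                for (h, r, t) in kg:          # only observed triples
                    if r != rel:
                        continue
                    # keep a triple only if it is consistent with b
                    if b.get(x, h) != h or b.get(y, t) != t:
                        continue
                    nb = dict(b)
                    nb[x], nb[y] = h, t
                    next_bindings.append(nb)
            bindings = next_bindings
        for b in bindings:                    # emit newly inferred head triples
            hx, rel, hy = head
            triple = (b[hx], rel, b[hy])
            if triple not in kg:
                new.add(triple)
    return new
```

Running it on the Fig. 5 example reproduces the narrowing described above: grounding liveIn first binds "country" to {USA, Denmark}, grounding officialLanguage prunes it to {USA}, and only speakLanguage(Mina Miller, English) is inferred.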

H Theoretical Computational Complexity Analysis of UniKER.
To theoretically demonstrate the superiority of our proposed UniKER in terms of efficiency, we compare the space and time complexity of UniKER with other methods that combine KG embedding and logical rules. More precisely, we only include logical rule-based regularization approaches, because embedding-based variational inference for MLN is essentially a #P problem and thus obviously has a much higher computational cost than our method. As both logical rule-based regularization approaches and UniKER consist of two parts, materialization (i.e., sampling ground logical rules, or inferring U using forward chaining) and KG embedding learning, we include the complexity of both parts in Table 11. Note that materialization only contributes to the time complexity without affecting the space complexity. We denote by n_e / n_r / n_t / l / n_l / θ / a / d the number of entities / relations / observed triples / length of the rule body / number of rules / sampling ratio / average degree of entities / dimensionality of the embedding space. We observe that: (1) in space complexity, UniKER matches the other logical rule-based regularization approaches; (2) in time complexity, considering that $a \ll n_e$, UniKER is much cheaper than the other logical rule-based regularization approaches unless the sampling ratio is made very small.

I Data Statistics
The detailed statistics of three large-scale real-world KGs (i.e., Family, FB15k-237 and WN18RR) are provided in Table 12. FB15k-237 and WN18RR are the most widely used benchmark datasets for KGE models and do not suffer from test-triple leakage into the training set. The Family dataset is selected for its interpretability and intuitiveness. In addition, three small-scale datasets (i.e., RC1000, sub-YAGO3-10 and sub-Family) are included in our experiments to evaluate the scalability of forward chaining against a number of SOTA inference algorithms for MLN, as shown in Appendix J.3, given the poor scalability of MLN.
• RC1000 is a typical benchmark dataset for inference in MLN. It involves a relational classification task with hand-coded rules provided.
• sub-YAGO3-10 is a subset of the well-known YAGO3-10 benchmark dataset.

For the large-scale knowledge graphs, we adopt three commonly used benchmark datasets: Family, FB15k-237 and WN18RR.
• Family contains family relationships among members of a family (Denham, 1973). We extract a subset from the Family dataset and call it sub-Family.
• FB15k-237 is the most commonly used benchmark knowledge graph dataset, derived from FB15k introduced in (Bordes et al., 2013). Its triples come from Freebase, an online collection of structured data harvested from many sources, including individual, user-submitted wiki contributions.
• WN18RR is another widely used benchmark knowledge graph dataset, derived from WN18 introduced in (Bordes et al., 2013). It is based on WordNet, which is designed to produce an intuitively usable dictionary and thesaurus and to support automatic text analysis. Its entities correspond to word senses, and its relations define lexical relations between them.
J Experimental Details.

J.1 Setting for Knowledge Graph Completion
To compare the reasoning ability of UniKER with the aforementioned baseline algorithms, we mask the head or tail entity of each test triple and require each method to predict the masked entity. We use three large-scale datasets: Family, FB15k-237 and WN18RR. During evaluation, we adopt the filtered setting (Bordes et al., 2013) and three evaluation metrics, i.e., Hit@1, Hit@10 and MRR. We randomly split the data into training and test sets with a ratio of 8:2 and do not exclude triples derivable from rules from the test data. To ensure a fair comparison, we apply this same setting consistently to all baseline methods. As their code is unavailable, we take the results of BLP from the corresponding paper (Qu and Tang, 2019) and the results of pGAT from the corresponding paper (Harsha Vardhan et al., 2020). As only results on FB15k-237 and WN18RR are reported there, we compare with them only on these two datasets.
Hyperparameter Settings. Adam (Kingma and Ba, 2014) is adopted as the optimizer. We set the parameters for all methods by grid search over the following ranges: embedding dimension k ∈ {250, 500, 1000}, batch size b ∈ {256, 512, 1024}, and fixed margin γ ∈ {6, 9, 12, 24}. We then compare the best results of the different methods. Both the entity and relation embeddings are uniformly initialized, and no regularization is imposed on them. The detailed hyperparameter settings can be found in Table 8.

Robustness Analysis. As various kinds of noise may be introduced during KG construction, we inject noise by substituting the true head or tail entity of a triple with a randomly selected entity. Following this approach, we construct a noisy Family dataset whose noisy triples amount to 40% of the original data. The generated noisy triples are fused only into the original training set, while the validation and test sets remain unchanged.
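The noise-injection procedure for the robustness analysis can be sketched as follows (our own minimal sketch of the described protocol; function and variable names are illustrative):

```python
import random

def add_noise(train_triples, entities, noise_ratio=0.4, seed=0):
    """Corrupt a KG training set: repeatedly sample a true triple and replace
    its head or tail with a uniformly random entity, until the noisy triples
    amount to `noise_ratio` of the original training data. Validation and
    test sets are left untouched by construction."""
    rng = random.Random(seed)
    train = list(train_triples)
    existing = set(train)
    noisy = []
    n_target = int(noise_ratio * len(train))
    attempts = 0
    while len(noisy) < n_target and attempts < 100 * n_target:
        attempts += 1
        h, r, t = rng.choice(train)
        if rng.random() < 0.5:
            h = rng.choice(entities)   # corrupt the head entity
        else:
            t = rng.choice(entities)   # corrupt the tail entity
        if (h, r, t) not in existing:  # avoid accidental true/duplicate triples
            noisy.append((h, r, t))
            existing.add((h, r, t))
    return train + noisy
```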

J.2 Effect of Threshold Ψ Used to Include Potential Useful Hidden Triples
To investigate the effect of the threshold used to include useful hidden triples, we compare the reasoning ability of UniKER with TransE on the Family dataset under different thresholds Ψ. As a suitable threshold can vary greatly across datasets, we determine Ψ in a unified way: we take as Ψ the score f_{r_k}(e_i, e_j) of the triple ranked at the top ψ% of the test dataset. We vary ψ over {10, 20, 30, 40, 50}. The comparison results are presented in Table 9. We observe that the reasoning ability of UniKER does not vary much across thresholds; in other words, the performance is insensitive to the parameter ψ, which is appealing in practice.
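This top-ψ% threshold selection can be sketched as follows (a minimal sketch with our own names; `scored_triples` is assumed to be a list of (triple, embedding score) pairs):

```python
import numpy as np

def threshold_from_top_percent(scores, psi=20):
    """Pick threshold Psi as the score of the triple ranked at the top psi%
    of the candidate scores, so that roughly psi% of triples pass it."""
    ordered = np.sort(np.asarray(scores))[::-1]          # descending order
    k = max(1, int(np.ceil(len(ordered) * psi / 100.0))) # rank of the cutoff
    return ordered[k - 1]

def select_hidden(scored_triples, psi=20):
    """Return the triples whose score reaches the top-psi% threshold Psi."""
    psi_threshold = threshold_from_top_percent([s for _, s in scored_triples], psi)
    return [t for t, s in scored_triples if s >= psi_threshold]
```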

J.3 Efficiency Analysis
We evaluate the scalability of forward chaining against a number of state-of-the-art inference algorithms for MLN; their inference times are given in Table 13. Additionally, we compare the overall efficiency of our proposed UniKER with other methods. As shown in Table 10, UniKER is efficient in terms of time cost per epoch. Even including the inference stage, UniKER is empirically faster than other methods that combine embedding with logical rules.

J.4 Impact of Coverage of Logical Rules on Family Dataset
To further analyze the impact of the coverage of logical rules on KG inference, we measure coverage as the total number of triples that can be inferred from the given set of definite Horn rules. Due to space limitations, we only show results on the Family dataset, as we have similar observations on the remaining datasets. To ensure sufficient coverage of logical rules, we take the whole Family dataset as training data, while the triples that can be inferred from the training data using all 41 logical rules are regarded as test data. To investigate the effect of coverage, we vary the number of definite Horn rules over {10, 20, 30, 35, 36, 38, 41}; the number of triples inferable from each of these rule sets is provided in Table 14. As shown in Figure 6, we compare UniKER-DistMult against its base model DistMult as well as the forward chaining algorithm.
In particular, we regard link prediction in the KG as binary classification and evaluate all methods by triple true/false classification accuracy. We make the following observations: (1) when the coverage of logical rules is insufficient, traditional rule-based methods perform poorly; (2) even without the incorporation of logical rules, DistMult already shows good reasoning ability; (3) the performance of UniKER increases steadily with the coverage of logical rules; (4) with only 30 rules, UniKER already achieves accuracy close to 1, much higher than forward chaining. Requiring only a small number of logical rules is very appealing in practice, as obtaining high-quality logical rules is costly and labor-intensive.