RelWalk - A Latent Variable Model Approach to Knowledge Graph Embedding

Embedding entities and relations of a knowledge graph in a low-dimensional space has shown impressive performance in predicting missing links between entities. Although progress has been made, existing methods are heuristically motivated, and the theoretical understanding of such embeddings is comparatively underdeveloped. This paper extends the random walk model of word embeddings to Knowledge Graph Embeddings (KGEs) to derive a scoring function that evaluates the strength of a relation R between two entities h (head) and t (tail). Moreover, we show that marginal loss minimisation, a popular objective used in much prior work on KGE, follows naturally from log-likelihood ratio maximisation under the probabilities estimated from the KGEs according to our theoretical relationship. We propose a learning objective motivated by this theoretical analysis to learn KGEs from a given knowledge graph. Using the derived objective, accurate KGEs are learnt from the FB15K237 and WN18RR benchmark datasets, providing empirical evidence in support of the theory.


Introduction
Knowledge graphs (KGs) such as Freebase (Bollacker et al., 2008) organise information in the form of graphs, where entities are represented by the vertices and the relations between two entities are represented by the edges that connect the corresponding vertices. Despite the best efforts to create complete and large-scale KGs, most KGs remain incomplete and do not represent all the relations that exist between entities (Min et al., 2013). In particular, new entities are constantly being generated, and new relations are formed between new as well as existing entities. Therefore, it is unrealistic to assume that a real-world KG would be complete at any given time point. Developing approaches for KG completion is an important research field associated with KGs.

* Danushka Bollegala holds concurrent appointments as a Professor at University of Liverpool and as an Amazon Scholar. This paper describes work performed at the University of Liverpool and is not associated with Amazon.
KG components can be embedded into numerical formats by learning representations (a.k.a embeddings) for the entities and relations in a given KG. The learnt KGEs can be used for link prediction, which is the task of predicting whether a particular relation exists between two given entities in the KG. Specifically, given KGEs for entities and relations, in link prediction, we predict R that is most likely to exist between h and t according to some scoring formula. Thus, by embedding entities and relations that exist in a KG in some (possibly lower-dimensional and latent) space, we can infer previously unseen relations between entities, thereby expanding a given KG.
KGE can be seen as a two-step process. Given a KG represented by a set of relational triples (h, R, t), where a semantic relation R holds between a head entity h and a tail entity t, first a scoring function is defined that measures the relational strength of a triple (h, R, t). Second, the entity and relation embeddings that optimise the defined scoring function are learnt using some optimisation method. Despite the wide applications of entity and relation embeddings created via KGE methods, the existing scoring functions are heuristically motivated to capture some geometric requirements of the embedding space. For example, TransE (Bordes et al., 2013) assumes that the entity and relation embeddings co-exist in the same (possibly lower dimensional) vector space and that translating (shifting) the head entity embedding by the relation embedding must bring it closer to the tail entity embedding, whereas ComplEx (Trouillon et al., 2016) models the asymmetry in relations using the component-wise multi-linear inner product among entity and relation embeddings.
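As a concrete illustration of these two design philosophies, the toy sketch below scores triples under a TransE-style translation and a ComplEx-style trilinear product. The embeddings are random stand-ins, not learnt values.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8
h, t = rng.normal(size=d), rng.normal(size=d)

# TransE: a triple is plausible when h + r lands close to t,
# scored by the negative L1 (or L2) distance.
def transe_score(h, r, t, norm=1):
    return -np.linalg.norm(h + r - t, ord=norm)

# ComplEx: entities and relations are complex vectors; the score is the
# real part of the trilinear product <h, r, conj(t)>, which is asymmetric
# in h and t whenever r has a non-zero imaginary part.
def complex_score(h, r, t):
    return np.real(np.sum(h * r * np.conj(t)))

hc = rng.normal(size=d) + 1j * rng.normal(size=d)
tc = rng.normal(size=d) + 1j * rng.normal(size=d)
rc = rng.normal(size=d) + 1j * rng.normal(size=d)

# A perfect translation h + r = t attains the maximal TransE score of 0.
assert np.isclose(transe_score(h, t - h, t), 0.0)
# ComplEx can model asymmetric relations: swapping h and t changes the score.
assert not np.isclose(complex_score(hc, rc, tc), complex_score(tc, rc, hc))
```

The asymmetry check above is exactly why ComplEx, unlike DistMult, can represent directed relations such as hypernym.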
Theoretical understanding of KGE methods is underdeveloped. For example, it is not clear how the heuristically defined KGE objectives relate to the generative process of a KG. Providing such a theoretical understanding of the KGE process will enable us to develop KGE methods that address the weaknesses in existing KGE methods. For this purpose, we propose Relational Walk (RelWalk), a theoretically motivated generative approach for learning KGEs. We are particularly interested in the semantic relationships that exist between entities, such as the is-CEO-of relation between a person such as Jeff Bezos and a company such as Amazon Inc.
We model KGE as a random walk over the KG. Specifically, a random walker at the vertex corresponding to the (head) entity h will uniformly at random select one of the outgoing edges corresponding to the semantic relation R, which will lead it to the vertex corresponding to the (tail) entity t. Continuing this random walk will result in a traversal over a path in the KG. Based on this random walk model we derive a relationship between the probability of R holding between h and t, p(h, t | R), and their KGEs R, h and t. Interestingly, the derived relationship is not covered by any of the previously proposed heuristically-motivated scoring functions, providing the first-ever KGE method with a provable generative explanation.
We show that the margin loss, a popular training objective in prior work on KGE, naturally emerges as the log-likelihood ratio computed from the derived p(h, t | R). Based on this result, we derive a training objective that is optimised for learning KGEs that satisfy our theoretical relationship. This enables us to empirically verify the theoretical relationships that we derived from the proposed random walk process.
Using FB15K237 and WN18RR benchmarks, we evaluate the learnt KGEs on link prediction and triple classification. Although we do not obtain state-of-the-art (SoTA) performance on these benchmark datasets, KGEs learnt using RelWalk perform consistently well on both tasks, providing empirical support to the theoretical analysis conducted in this paper. We re-emphasise that our main objective in this paper is to study KGEs from an interpretable theoretical perspective and not necessarily improving SoTA. To this end, we study the relationship between the concentration of the partition function as predicted by our theoretical analysis and the performance of the learnt KGEs. We observe that when the partition function is narrowly distributed, we are able to learn accurate KGEs. Moreover, we empirically verify that the learnt relation embedding matrices satisfy the orthogonality property as expected by the theoretical analysis.

Related Work
At a high level of abstraction, KGE methods can be seen as differing in their design choices for the following two main problems: (a) how to represent entities and relations, and (b) how to model the interaction between two entities and a relation that holds between them. Next, we briefly discuss prior proposals for those two problems (refer to Wang et al. (2017); Nguyen (2017); Nickel et al. (2015) for an extended survey on KGE).
Given entity and relation embeddings, a scoring function evaluates the strength of a triple (h, R, t). Scoring functions encoding various intuitions have been proposed, such as the ℓ1 or ℓ2 norm of the vector formed by translating the head entity embedding by the relation embedding towards the tail entity embedding, possibly after first projecting from the entity embedding space to the relation embedding space (Yoon et al., 2016). As an alternative to using vector norms as scoring functions, DistMult (Yang et al., 2015) and ComplEx (Trouillon et al., 2016) use the component-wise multi-linear dot product. Lacroix et al. (2018) proposed the use of nuclear 3-norm regularisers instead of the popular Frobenius norm for canonical tensor decomposition. Table 1 shows the scoring functions, along with the algebraic structures for entities and relations, proposed in selected prior work on KGE learning. Given a scoring function, KGEs are learnt such that they assign higher scores to relational triples in existing KGs than to triples where the relation does not hold (negative triples), by minimising a loss function such as the logistic loss (RESCAL, DistMult, ComplEx) or the marginal loss (TransE).
As an alternative to directly learning embeddings from a graph, several methods (Grover and Leskovec, 2016; Perozzi et al., 2014; Ristoski et al., 2018) have treated the vertices visited during truncated random walks over the graph as pseudo-sentences, and have applied popular word embedding learning algorithms such as the continuous bag-of-words model (Mikolov et al., 2013) to learn vertex embeddings. However, pseudo-sentences generated in this manner are syntactically very different from sentences in natural languages.
On the other hand, our work extends the random walk analysis of Arora et al. (2016a), which derives a useful connection between the joint co-occurrence probability of two words and the ℓ2 norm of the sum of the corresponding word embeddings. Specifically, they proposed a latent variable model where the words in a corpus are generated by a probabilistic model parametrised by a time-dependent discourse vector that performs a random walk. In contrast to Arora's model, which treats co-occurrence as a generic relation, our work includes relations as labels for the edges in the graph. Bollegala et al. (2018) extended the model proposed by Arora et al. (2016a) to capture co-occurrences involving more than two words. Specifically, they defined the co-occurrence of k unique words in a given context as a k-way co-occurrence, of which the result of Arora et al. (2016a) can be seen as a special case corresponding to k = 2. Moreover, it has been shown that word embeddings capturing some types of semantic relations, such as antonymy and collocation, can be learnt more accurately using 3-way co-occurrences than using 2-way co-occurrences. However, that model does not explicitly consider the relations between words/entities and uses only a corpus for learning the word embeddings.

Relational Walk
Let us consider a KG, D, where the knowledge is represented by relational triples (h, R, t) ∈ D.
Here, R is a relational predicate with two arguments, where the h (head) and t (tail) entities fill the first and second arguments, respectively. In this work, we assume relations to be asymmetric in general (if (h, R, t) ∈ D then it does not necessarily follow that (t, R, h) ∈ D).

Table 1: Score functions f(h, R, t) and relation parameters of selected prior KGE methods.

Method | Score function f(h, R, t) | Relation parameters
Unstructured (Bordes et al., 2012) | ||h − t||_{1/2} | none
Structured (Bordes et al., 2011) | ||R₁h − R₂t||_{1,2} | R₁, R₂ ∈ R^{d×d}
TransE (Bordes et al., 2013) | ||h + r − t||_{1/2} | r ∈ R^d
DistMult (Yang et al., 2015) | ⟨h, r, t⟩ | r ∈ R^d
RESCAL (Nickel et al., 2011) | h⊤Rt | R ∈ R^{d×d}
ComplEx (Trouillon et al., 2016) | Re(⟨h, r, t̄⟩) | r ∈ C^d

The goal of KGE is to learn embeddings for the relations and entities in the KG such that entities that participate in similar relations are embedded closely to each other in the entity embedding space, while at the same time relations that hold between similar entities are embedded closely to each other in the relation embedding space. We refer to the learnt entity and relation embeddings collectively as KGEs.
We assume that entities and relations are embedded in the same vector space, allowing us to perform linear algebraic operations using the embeddings in the same vector space. Following our aforementioned modelling of a knowledge base as a graph, let us consider a random walker who is at a vertex corresponding to some entity h. This entity will have one or more semantic relations with other entities in the KG. The random walker will uniformly at random pick one of the outgoing edges corresponding to a particular semantic relation R, and follow it to land on the entity t. This one-step of the random walk thus generates a tuple (h, R, t) in the KG. The random walker proceeds by using t as the new starting point. Multiple steps of this random walk trace a single path in the KG.
To illustrate a random walk over a KG, let us assume that we are currently at the vertex corresponding to the company entity Amazon Inc. Possible outgoing edges at Amazon Inc. would correspond to semantic relations such as has-ceo, is-headquartered-at, founded-in etc., where Amazon Inc. is the head entity. If there are only three such outgoing relations at Amazon Inc., then the random walker will pick any one of those relations with probability 1/3. For example, by selecting has-ceo, is-headquartered-at or founded-in, the random walker would arrive respectively at the entities Jeff Bezos, Seattle or 1994. Let us assume that the random walker selected the has-ceo relation and landed at Jeff Bezos. The random walker might subsequently continue its random walk from Jeff Bezos, following the relation born-in and moving to New Mexico, US. Prior work studying inference in KGs has successfully used random walk models similar to what we describe here (Gardner et al., 2013; Lao et al., 2011, 2012; Lao and Cohen, 2010).
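The walk described above can be sketched on a toy KG; the adjacency structure and names below are illustrative stand-ins.

```python
import random

# Toy KG as adjacency: head -> list of (relation, tail) outgoing edges.
kg = {
    "Amazon Inc.": [("has-ceo", "Jeff Bezos"),
                    ("is-headquartered-at", "Seattle"),
                    ("founded-in", "1994")],
    "Jeff Bezos": [("born-in", "New Mexico, US")],
}

def random_walk(kg, start, steps, seed=0):
    """Uniformly pick an outgoing edge at each step, emitting (h, R, t) triples."""
    rng = random.Random(seed)
    path, h = [], start
    for _ in range(steps):
        edges = kg.get(h)
        if not edges:                 # dead end: no outgoing relations
            break
        rel, t = rng.choice(edges)    # each edge chosen with probability 1/|edges|
        path.append((h, rel, t))
        h = t                         # continue the walk from the tail entity
    return path

walk = random_walk(kg, "Amazon Inc.", steps=3)
```

Each emitted triple corresponds to one step of the walk, and consecutive triples share the tail/head entity, tracing a single path in the KG.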
Let us consider a random walk characterised by a time-dependent knowledge vector c k , where k is the current time step. The knowledge vector represents the knowledge we have about a particular group of entities and relations that express some facts about the world. For example, when we are talking about Amazon Inc., we will use the knowledge associated with Amazon Inc. such as its CEO, location of the headquarters, when it was founded etc. Therefore, it is intuitive to assume that the entities associated with Amazon Inc. with some set of semantic relations can be generated from this knowledge vector. Each entity and relation has time-independent latent representations that capture their correlations with c k . For entities h and t, we denote their representations by d-dimensional vectors respectively h, t ∈ R d .
We assume the task of generating a relational triple (h, R, t) in a given KG to be a two-step process, as described next. First, given the current knowledge vector c = c_k at time k and the relation R, we assume the probability of an entity h satisfying the first argument of R to be given by the log-linear entity production model in (1):

p(h | R, c) = exp(h⊤R₁c) / Z_c.   (1)
Here, R₁ ∈ R^{d×d} is a relation-specific orthogonal matrix that evaluates the appropriateness of h for the first argument of R. For example, if R is the is-ceo-of relation, we would require a person as the first argument and a company as the second argument of R. However, note that the role of R₁ extends beyond simply checking the types of the entities that can fill the first argument of a relation. In our example above, not all people are CEOs, and R₁ evaluates the likelihood of a person being selected as the first argument of the is-ceo-of relation. Z_c is a normalisation coefficient such that Σ_{h∈V} p(h | R, c) = 1, where the vocabulary V is the set of all entities in the KG.
After generating h, the state of our random walker changes to c′ = c_{k+1}, and we next generate the second argument of R with the probability given by (2):

p(t | R, c′) = exp(t⊤R₂c′) / Z_{c′}.   (2)
Here, R₂ ∈ R^{d×d} is a relation-specific orthogonal matrix that evaluates the appropriateness of t as the second argument of R, and Z_{c′} is a normalisation coefficient such that Σ_{t∈V} p(t | R, c′) = 1. Following our previous example of the is-ceo-of relation, R₂ evaluates the likelihood of an organisation being a company with a CEO position. Importantly, R₁ and R₂ are representations of the relation R and are independent of the entities. Therefore, we consider (R₁, R₂) to collectively represent the embedding of R. Orthogonality of R₁ and R₂ is a requirement for the mathematical proof and also acts as a regularisation constraint that prevents overfitting by restricting the relational embedding space.

The knowledge vector c_k performs a slow random walk (meaning c_{k+1} is obtained from c_k by adding a small random displacement vector), such that the head and tail entities of a relation are generated under similar knowledge vectors. More specifically, we assume that ||c_k − c_{k+1}|| ≤ ε₂ for some small ε₂ > 0. This is a realistic assumption for generating the two entity arguments in the same relational triple because, if the knowledge vectors were significantly different in the two generation steps, then it is likely that the corresponding relations would also be different, which would not be coherent with the above-described generative process. Moreover, we assume that the knowledge vectors are distributed uniformly on the unit sphere, and denote their distribution by C.
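The log-linear entity production model above can be simulated numerically. The sketch below uses random stand-in entity embeddings and a random orthogonal matrix playing the role of R₁, and checks that the model defines a proper distribution over the entity vocabulary.

```python
import numpy as np

rng = np.random.default_rng(1)
d, n = 16, 50                       # embedding dimension, number of entities
E = rng.normal(size=(n, d))         # entity embeddings, one per row (stand-ins)

# A random orthogonal matrix standing in for the relation parameter R1.
R1, _ = np.linalg.qr(rng.normal(size=(d, d)))

c = rng.normal(size=d)
c /= np.linalg.norm(c)              # knowledge vector on the unit sphere

# p(h | R, c) = exp(h^T R1 c) / Z_c  with  Z_c = sum_h exp(h^T R1 c)
logits = E @ (R1 @ c)
Z_c = np.sum(np.exp(logits))
p = np.exp(logits) / Z_c

assert np.isclose(p.sum(), 1.0)     # normalised over the entity vocabulary
```

Sampling a head entity from p and repeating with R₂ and a slightly perturbed knowledge vector c′ would complete one generative step of the model.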
To relate KGEs with the connections in the graph, we must estimate the probability that h and t satisfy the relation R, p(h, t | R), which can be obtained by taking the expectation of p(h, t | R, c, c′) with respect to c, c′ ∼ C, as given by (3):

p(h, t | R) = E_{c,c′∼C}[p(h, t | R, c, c′)].   (3)
Here, the partition functions Z_c and Z_{c′} are given by (6) and (7), and (4) follows from our two-step generative process, where the generation of h and t in each step is independent given the relation and the corresponding knowledge vectors.
Computing the expectation in (5) is generally difficult because of the two partition functions Z_c and Z_{c′}. However, Lemma 1 shows that the partition functions are narrowly distributed around a constant value for all c (or c′) values with high probability.
Lemma 1 (Concentration Lemma). If the entity embedding vectors satisfy the Bayesian prior v = s·v̂, where v̂ is drawn from the spherical Gaussian distribution and s is a scalar random variable that is always bounded by a constant κ, then the entire ensemble of entity embeddings satisfies

Pr_{c∼C}[(1 − ε_z)Z ≤ Z_c ≤ (1 + ε_z)Z] ≥ 1 − δ,

for ε_z = O(1/√n) and δ = exp(−Ω(log² n)), where n ≥ d is the number of entities in the given KG and Z_c is the partition function for c given by (6).

Refer to Appendix A for the proof of the concentration lemma. We empirically investigate the relationship between the performance of the KGEs and the degree to which Lemma 1 is satisfied in subsection 5.1. Under the conditions required to satisfy Lemma 1, the following main theorem of this paper holds.

Theorem 1. Suppose that the entity embeddings satisfy (1). Then, we have

log p(h, t | R) = ||R₁⊤h + R₂⊤t||₂² / (2d) − 2 log Z ± ε.   (9)

Proof sketch: Let F be the event that both Z_c and Z_{c′} are within (1 ± ε_z)Z. Then, from Lemma 1 and the union bound, event F happens with probability at least 1 − 2 exp(−Ω(log² n)). The R.H.S. of (5) can be split into two parts, T₁ and T₂, according to whether F happens or not.
T 1 can be approximated as given by (12).
On the other hand, T 2 can be shown to be a constant, independent of d, given by (13).
The vocabulary size n of real-world KGs is typically over 10⁵, for which T₂ becomes negligibly small. Therefore, it suffices to consider only T₁. Because of the slowness of the random walk we have c ≈ c′. Using the law of total expectation, we can write T₁ as in (14), where A(c) := E_{c′|c}[exp(t⊤R₂c′)]. Some further evaluation yields (51), and plugging (51) back into (14) provides the claim of the theorem. The detailed proof is given in Appendix B.
The relationship given by (9) indicates that the head and tail entity embeddings are first transformed respectively by R₁ and R₂, and that the squared ℓ2 norm of the sum of the transformed vectors determines the probability p(h, t | R).
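A minimal sketch of this scoring function follows. Whether the head embedding is transformed by R₁ or its transpose is a notational convention; since R₁ is orthogonal, either choice is a rotation. The matrices and vectors below are random stand-ins, not learnt RelWalk parameters.

```python
import numpy as np

def relwalk_score(h, t, R1, R2, d):
    """log p(h, t | R) up to additive constants: ||R1^T h + R2^T t||^2 / (2d)."""
    v = R1.T @ h + R2.T @ t
    return float(v @ v) / (2 * d)

rng = np.random.default_rng(0)
d = 16
h, t = rng.normal(size=d), rng.normal(size=d)
R1, _ = np.linalg.qr(rng.normal(size=(d, d)))   # random orthogonal stand-ins
R2, _ = np.linalg.qr(rng.normal(size=(d, d)))

# With identity relation matrices the score reduces to ||h + t||^2 / (2d).
I = np.eye(d)
assert np.isclose(relwalk_score(h, t, I, I, d),
                  float((h + t) @ (h + t)) / (2 * d))
```

Note the contrast with SE mentioned later: SE minimises ||R₁h − R₂t||, whereas here a large norm of the summed, rotated embeddings corresponds to a high triple probability.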

Learning KG Embeddings
In this section, we derive a training objective from Theorem 1 that we can then optimise to learn KGEs. The goal is to empirically validate the theoretical result by evaluating the learnt KGEs. KGs represent information about relations between two entities in the form of relational triples. The joint probability p(h, R, t) given by Theorem 1 is useful for determining whether a relation R exists between two given entities h and t. For example, if we know with high probability that R holds between h and t, then we can append (h, R, t) to the KG. The task of expanding KGs by predicting missing links between entities or relations is known as the link prediction problem (Trouillon et al., 2016). In particular, if we can automatically append such previously unknown knowledge to the KG, we can expand the KG and address the knowledge acquisition bottleneck.
To derive a criterion for determining whether a link must be predicted among entities and relations, let us consider a relational triple (h, R, t) ∈ D that exists in a given KG D. We call such relational triples positive triples because, by assumption, it is known that R holds between h and t. On the other hand, consider a negative relational triple (h′, R, t′) ∉ D formed by, for example, randomly perturbing a positive triple. A popular technique for generating such (pseudo) negative triples is to replace h or t with a randomly selected different instance of the same entity type. As an alternative to random perturbation, Cai and Wang (2018) proposed a method for generating negative instances using adversarial learning. Here, we are not concerned with the actual method used for generating the negative triples but assume that a set of negative triples, D̄, generated using some method, is given.
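Random perturbation of positive triples can be sketched as follows; the mini KG and entity names are hypothetical, and the rejection loop simply avoids accidentally regenerating a known positive.

```python
import random

entities = ["Amazon Inc.", "Jeff Bezos", "Seattle", "1994", "Microsoft"]
positives = {("Amazon Inc.", "has-ceo", "Jeff Bezos")}

def corrupt(triple, entities, rng):
    """Replace the head or tail with a random entity, rejecting known positives."""
    h, R, t = triple
    while True:
        e = rng.choice(entities)
        cand = (e, R, t) if rng.random() < 0.5 else (h, R, e)
        if cand not in positives and cand != triple:
            return cand          # a pseudo-negative triple

rng = random.Random(0)
neg = corrupt(("Amazon Inc.", "has-ceo", "Jeff Bezos"), entities, rng)
```

In practice the rejection step checks the full training set, since a corrupted triple may coincide with another true triple in the KG (the "filtered" convention used later in evaluation).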
Given a positive triple (h, R, t) ∈ D and a negative triple (h′, R, t′) ∈ D̄, we would like to learn KGEs such that a higher probability is assigned to (h, R, t) than to (h′, R, t′). We can formalise this requirement using the likelihood ratio given by (16).
Here, η > 1 is a threshold that determines how much higher we would like the probabilities of the positive triples to be compared to those of the negative triples.
By taking the logarithm of both sides of (16) we obtain (17). If a positive triple (h, R, t) is correctly assigned a higher probability than a negative triple (h′, R, t′), then the left-hand side of (17) will be negative, indicating that no loss is incurred in this classification task. Therefore, we can rewrite (17) to obtain the marginal loss (Bordes et al., 2011, 2013), L(D, D̄), a popular choice as a learning objective in prior work on KGE, as shown in (18).
We can regard 2d log η as the margin for the constraint violation. Theorem 1 requires R₁ and R₂ to be orthogonal. To reflect this requirement, we add two ℓ2 regularisation terms, ||R₁⊤R₁ − I||₂² and ||R₂⊤R₂ − I||₂², with regularisation coefficients λ₁ and λ₂ respectively, to the objective function given by (18). In our experiments, we compute the gradients of (18) with respect to each of the parameters h, t, R₁ and R₂ and use stochastic gradient descent (SGD) for optimisation. Considering that negative triples are generated via random perturbation, it is important to consider multiple negative triples during training to better estimate the classification boundary. This approach can be easily extended to learn from multiple negative triples, as shown in Appendix C.
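The resulting objective, a marginal loss plus orthogonality regularisers, can be sketched as follows. The scores and the margin below are toy placeholders, not tuned values.

```python
import numpy as np

def margin_loss(pos_scores, neg_scores, margin):
    """Hinge-style marginal loss: penalise negatives scored within `margin` of positives."""
    return np.maximum(0.0, margin - (pos_scores - neg_scores)).sum()

def orth_penalty(R, lam):
    """lam * ||R^T R - I||^2, pushing a relation matrix towards orthogonality."""
    d = R.shape[0]
    D = R.T @ R - np.eye(d)
    return lam * np.sum(D ** 2)

rng = np.random.default_rng(0)
Q, _ = np.linalg.qr(rng.normal(size=(4, 4)))    # exactly orthogonal matrix

assert np.isclose(orth_penalty(Q, 1.0), 0.0)    # orthogonal => zero penalty
assert margin_loss(np.array([5.0]), np.array([1.0]), margin=2.0) == 0.0
assert margin_loss(np.array([1.0]), np.array([5.0]), margin=2.0) == 6.0
```

The total objective would be the margin loss plus `orth_penalty(R1, lam1) + orth_penalty(R2, lam2)`, minimised with SGD over all entity vectors and relation matrices.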

Empirical validation
To empirically evaluate the theoretical result stated in Theorem 1, we learn KGEs (denoted by RelWalk) by minimising the marginal loss objective derived in section 4. We use the FB15K237 and FB13 (subsets of Freebase) and WN18RR (a subset of WordNet) datasets, which are standard benchmarks for KGE, with the standard training, validation and test splits. Statistics about the datasets and training details are in Appendix D. RelWalk is implemented in the open-source toolkit OpenKE (Han et al., 2018), and the code and learnt KGEs are publicly available.
We conduct two evaluation tasks: link prediction (predicting the missing head or tail entity in a given triple (h, R, ?) or (?, R, t)) (Bordes et al., 2011) and triple classification (predicting whether a relation R holds between h and t in a given triple (h, R, t)) (Socher et al., 2013). We evaluate performance in the link prediction task using the mean reciprocal rank (MRR), the mean rank (MR; the average rank assigned to the original head or tail entity in a corrupted triple) and hits at ranks 1, 3 and 10 (H@1, 3, 10), whereas in the triple classification task we use accuracy (the percentage of correctly classified test triples). We report scores only under the filtered setting (Bordes et al., 2013), which removes all triples that appear in the training, validation and test sets from the candidate triples before obtaining the rank of the ground-truth triple. In link prediction, we consider all entities that appear in the corresponding argument in the entire knowledge graph as candidates.

Triple classification accuracy on FB13: (Lin et al., 2015) 82.5, TransG (Xiao et al., 2016) 87.3, NTN (Socher et al., 2013) 87.2, RelWalk 88.6.
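The ranking metrics used here can be computed directly from the (filtered) ranks of the ground-truth entities; the ranks in the example below are made up for illustration.

```python
import numpy as np

def ranking_metrics(ranks):
    """MR, MRR and H@k from the ranks assigned to the ground-truth entities."""
    ranks = np.asarray(ranks, dtype=float)
    return {
        "MR": ranks.mean(),                      # mean rank (lower is better)
        "MRR": (1.0 / ranks).mean(),             # mean reciprocal rank
        **{f"H@{k}": (ranks <= k).mean() for k in (1, 3, 10)},
    }

m = ranking_metrics([1, 2, 5, 20])
assert m["MR"] == 7.0
assert m["H@1"] == 0.25 and m["H@3"] == 0.5 and m["H@10"] == 0.75
assert np.isclose(m["MRR"], (1 + 0.5 + 0.2 + 0.05) / 4)
```

Under the filtered setting, each rank is computed after removing all other true triples from the candidate list, so a correct entity is never pushed down by another correct answer.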
In Table 2 we compare the KGEs learnt by RelWalk against prior work using published results. For triple classification, RelWalk reports the best performance on FB13, outperforming all compared methods. For the link prediction results shown in Table 2, we see that RelWalk obtains competitive performance on both WN18RR and FB15K237 under all evaluation measures. It is, however, outperformed by the KGE method proposed by Lacroix et al. (2018) (CP-N3), which uses nuclear 3-norm regularisers with canonical tensor decomposition. The consistent improvement over structured embeddings (SE) is interesting because the scoring function of SE closely resembles that of RelWalk; we can recover one from the other by redefining R₂ with a negative sign. However, SE learns KGEs that minimise the ℓ_{1,2} norm, whereas according to (9) we must maximise the probability of the relational triples in a knowledge graph. WN18RR excludes triples from WN18 that are simply inverted between the train and test partitions (Toutanova and Chen, 2015; Dettmers et al., 2017b), making it a difficult dataset for link prediction using simple memorisation heuristics. RelWalk's consistently good performance on both versions of this dataset shows that it considers the global structure of the KG when learning KGEs.
We note that our goal in this paper is not to claim SoTA for KGE but to provide a theoretical understanding with empirical validation. To this end, the experimental results support our theoretical claim and emphasise the importance of theoretically motivating the KGE scoring function design process.

Orthogonality and Concentration
Our theoretical analysis depends on two main assumptions: (a) concentration of the partition function Z c (Lemma 1), and (b) the orthogonality of the relation embedding matrices R 1 , R 2 . In this section, we empirically study the relationship between these assumptions and the performance of RelWalk.
Given R₁ and R₂ learnt by RelWalk for a particular relation R, we can measure the degree to which orthogonality is satisfied by ν_R, the sum of the non-diagonal elements in (19). If a matrix A is orthogonal, then the non-diagonal elements of the product A⊤A are zero. Therefore, the smaller the ν_R values, the more orthogonal the relation embeddings are. We measure ν_R values for the 11 relation types in the WN18RR dataset, as shown in Table 3. From Table 3 we see that the ν_R values are indeed small for the different relation types, indicating that the orthogonality requirement is satisfied as expected. Interestingly, a moderately high negative Pearson correlation (−0.515) between H@10 and ν_R shows that orthogonality correlates with better performance.
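A minimal sketch of such an orthogonality measure follows. The paper leaves the exact aggregation over the non-diagonal elements implicit, so the sketch assumes the sum of their absolute values.

```python
import numpy as np

def nu(R):
    """Deviation from orthogonality: sum of |off-diagonal| entries of R^T R."""
    G = R.T @ R
    return np.abs(G - np.diag(np.diag(G))).sum()

rng = np.random.default_rng(0)
Q, _ = np.linalg.qr(rng.normal(size=(5, 5)))   # an orthogonal matrix

# Orthogonal matrices give (numerically) zero; generic matrices do not.
assert np.isclose(nu(Q), 0.0, atol=1e-10)
assert nu(rng.normal(size=(5, 5))) > 0.1
```

Computing `nu(R1)` and `nu(R2)` per relation and correlating them with per-relation H@10 reproduces the analysis summarised in Table 3.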
To visualise how orthogonality affects different relation types, we plot the elements of R₁⊤R₁ and R₂⊤R₂ for four relations in the WN18RR dataset in Figure 1, for 100 × 100 dimensional relation embeddings. For the two relations also see and similar to, we see that the corresponding inner products are sparse except on the main diagonal, compared to those of the hypernym and member meronym relations. Correspondingly, according to Table 3, the H@10 values for also see and similar to are higher than those for hypernym and member meronym, as implied by the negative correlation.
To test for the concentration of the partition function, for a relation R we compute the Z_c and Z_{c′} values using respectively (6) and (7), over a set of 10,000 randomly sampled head or tail entities. We compute the standard deviations σ_c and σ_{c′} of the distributions of Z_c and Z_{c′} respectively, and their geometric means, as shown in Table 3. We observed Gaussian-like distributions of the partition functions for the different relations, for which smaller standard deviations indicate stronger concentration around the mean. Interestingly, from Table 3 we see a negative correlation between H@10 and the standard deviations, indicating that the performance of RelWalk depends on the validity of the concentration assumption.
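This concentration test can be reproduced in miniature with random stand-in embeddings; the dimensions, sample sizes and scaling below are illustrative choices, not the paper's settings.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 16, 10000
E = rng.normal(size=(n, d)) / np.sqrt(d)       # stand-in entity embeddings
R1, _ = np.linalg.qr(rng.normal(size=(d, d)))  # orthogonal relation matrix

def partition(c):
    """Z_c = sum_h exp(h^T R1 c) over the entity vocabulary."""
    return np.sum(np.exp(E @ (R1 @ c)))

# Sample knowledge vectors uniformly on the unit sphere and inspect the spread of Z_c.
C = rng.normal(size=(200, d))
C /= np.linalg.norm(C, axis=1, keepdims=True)
Z = np.array([partition(c) for c in C])

spread = Z.std() / Z.mean()        # relative spread; small => Z_c is concentrated
assert spread < 0.05
```

With isotropically distributed embeddings the relative spread is tiny, mirroring the Lemma 1 prediction that Z_c stays within a narrow multiplicative band of a constant Z.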

Compression of Embeddings
To reduce the amount of memory required for KGEs, especially with a large KG, compressing KGEs has been studied recently (Sachan, 2020). RelWalk uses (orthogonal) matrices to represent relations, which require more parameters than a vector representation of the same dimensionality. Prior work studying lower-rank decompositions of KGEs has shown that, although linear embeddings of graphs can require prohibitively large dimensionality to model certain types of relations (Nickel et al., 2014) (e.g. sameAs), nonlinear embeddings can mitigate this problem (Bouchard et al., 2015). In this section, we propose memory-efficient low-rank approximations to the RelWalk embeddings.
From the definition of orthogonality it follows that the relation embeddings R₁, R₂ ∈ R^{d×d} learnt by RelWalk for a particular relation R are both full-rank and cannot be factorised as the product of two lower-rank matrices. This prevents us from directly applying matrix decomposition methods such as non-negative matrix factorisation to the learnt relation embeddings to obtain low-rank approximations. Therefore, we subtract the identity matrix I ∈ R^{d×d} from a relation embedding R ∈ {R₁, R₂} and factorise the remainder R − I ∈ R^{d×d} as the product of two low-rank matrices using its eigendecomposition, as given by (20).
Here, U is the matrix formed by arranging the eigenvectors of R − I as columns, and D is a diagonal matrix containing its eigenvalues in descending order. We can then use the largest K ≤ d eigenvalues and the corresponding eigenvectors to obtain a rank-K approximation, in the sense of minimum Frobenius distance between R − I and its rank-K approximation. If we use K factors in the approximation, we must store dK real numbers, corresponding to the d-dimensional eigenvectors of the K components, as opposed to the d² real numbers in R. The compression ratio in this case becomes dK/d² = K/d. When K ≪ d, this results in a significant compression.
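A sketch of this rank-K approximation under the stated assumptions: a random orthogonal matrix stands in for a learnt relation embedding, and since an orthogonal matrix is normal, truncating the eigendecomposition of R − I by eigenvalue magnitude is a sound low-rank approximation (R − I is generally non-symmetric, so the eigenpairs are complex and we take the real part at the end).

```python
import numpy as np

rng = np.random.default_rng(0)
d, K = 100, 60
R, _ = np.linalg.qr(rng.normal(size=(d, d)))   # orthogonal "relation embedding"

# Factorise the remainder R - I and keep the K components with the
# largest-magnitude eigenvalues.
Rp = R - np.eye(d)
w, U = np.linalg.eig(Rp)                       # complex eigenpairs in general
order = np.argsort(-np.abs(w))[:K]
Rp_K = (U[:, order] * w[order]) @ np.linalg.inv(U)[order, :]
R_K = np.real(Rp_K) + np.eye(d)                # rank-K reconstruction of R

err_K = np.linalg.norm(R - R_K)                # error with K components kept
err_0 = np.linalg.norm(R - np.eye(d))          # error with no components kept
assert err_K < err_0
```

Storing only the kept eigenpairs instead of the full d × d matrix realises the K/d compression ratio discussed above (complex pairs can be stored via their real and imaginary parts).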
To empirically evaluate the trade-off between the number of eigenvectors used in the compression and the accuracy of the learnt relation embeddings, we use the approximated relation embeddings for link prediction on WN18RR, as shown in Figure 2 (a similar trend was observed for FB15K237). We use d = 100 dimensional relation embeddings learnt by RelWalk and approximate them using the top-K eigenvectors. From Figure 2 we see that for K > 60 components the performance saturates on both datasets. On the other hand, we need at least K = 30 components to obtain any meaningful accuracy for link prediction on these two datasets. With K = 60 and d = 100, this approximation results in a 60% compression ratio.

Conclusion
We proposed RelWalk, a generative model of KGE, and derived a theoretical relationship between the probability of a triple, consisting of the head and tail entities and the relation that exists between them, and the embedding vectors of the two entities and the embedding matrices of the relation. In RelWalk, we represent entities by vectors and relations by matrices. We then proposed a learning objective, based on the theoretical relationship we derived, to learn entity and relation embeddings from a given knowledge graph. Experimental results on link prediction and triple classification tasks show that RelWalk outperforms several previously proposed KGE learning methods. The key assumptions of RelWalk are validated by empirically analysing the relationship between those assumptions and the performance of the embeddings learnt from a KG. Moreover, we studied the compressibility of the learnt relation embeddings and found that using only 60% of the components, we can approximate the relation embeddings without any significant loss in performance.

A Proof of the Concentration Lemma
To prove the concentration lemma, we show that the mean $\mathbb{E}_h[Z_c]$ of $Z_c$ is concentrated around a constant for all knowledge vectors $c$ and that its variance is bounded. If $P$ is an orthogonal matrix and $x$ is a vector, then $\|Px\|_2^2 = (Px)^\top (Px) = x^\top P^\top P x = \|x\|_2^2$, because $P^\top P = I$. Therefore, from (6) and the orthogonality of the relational embeddings, we see that $R_1 c$ is a simple rotation of $c$ and does not alter the length of $c$. We represent $h = s_h \hat{h}$, where $s_h = \|h\|$ and $\hat{h}$ is a unit vector (i.e. $\|\hat{h}\|_2 = 1$) distributed on the spherical Gaussian with zero mean and unit covariance matrix $I_d \in \mathbb{R}^{d \times d}$. Let $s$ be a random variable with the same distribution as $s_h$. Moreover, let us assume that $s$ is upper bounded by a constant $\kappa$, such that $s \leq \kappa$. By the assumption on the knowledge vector $c$, it lies on the unit sphere as well, and is then rotated by $R_1$.
We can write the partition function using the inner-product between the two vectors $h$ and $R_1 c$ as $Z_c = \sum_{h \in \mathcal{V}} \exp\left(h^\top R_1 c\right)$. Arora et al. (2016a) showed (Lemma 2.1 in their paper) that the expectation of a partition function of this form can be approximated as follows: where $n = |\mathcal{V}|$ is the number of entities in the vocabulary. (21) follows from the expectation of a sum and the independence of $h$ and $R_1$ from $c$. The inequality in (22) is obtained by applying the Taylor expansion of the exponential series, and the final equality is due to the symmetry of the spherical Gaussian. From the law of total expectation, we can write where $x = h^\top R_1 c$. Note that, conditioned on $s_h$, $h$ is a Gaussian random variable with variance $\sigma^2 = s_h^2$. Therefore, conditioned on $s_h$, $x$ is a random variable with variance $\sigma^2 = \sigma_h^2$. Using this distribution, we can evaluate $\mathbb{E}_{x \mid s_h}\left[\exp\left(h^\top R_1 c\right)\right]$ as follows: Therefore, it follows that where $s$ is the variance of the $\ell_2$ norms of the entity embeddings. Because the set of entities is given and fixed, both $n$ and $\sigma$ are constants, proving that $\mathbb{E}_h[Z_c]$ does not depend on $c$. Next, we calculate the variance $\mathbb{V}_h[Z_c]$ as follows: Because $2h^\top R_1 c$ is a Gaussian random variable with variance $4\sigma^2 = 4s_h^2$, from a calculation similar to (24) we obtain, By substituting (27) in (26), we have that for $\Lambda = \exp(8\kappa^2)$, a constant bounding $s \leq \kappa$ as stated. From the above, we have bounded both the mean and the variance of the partition function by constants that are independent of the knowledge vector. Note that neither $\exp\left(h^\top R_1 c\right)$ nor $\exp\left(t^\top R_2 c\right)$ is sub-Gaussian or sub-exponential. Therefore, standard concentration bounds derived for sub-Gaussian or sub-exponential random variables cannot be used in our analysis. However, the argument given in Appendix A.1 of Arora et al. (2016b) for a partition function with bounded mean and variance can be directly applied to $Z_c$ in our case, which completes the proof of the concentration lemma.
From the symmetry between $h$ and $t$, the concentration lemma also applies to the partition function $Z'_c = \sum_{t \in \mathcal{V}} \exp\left(t^\top R_2 c\right)$.
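As a quick numerical illustration of the lemma (a sketch under the model's assumptions, not part of the proof), one can sample spherical-Gaussian entity vectors with bounded norms and check that $Z_c$ varies little across random unit knowledge vectors $c$; the dimensions and sample sizes below are arbitrary choices:

```python
import numpy as np

# Monte Carlo illustration: for entity vectors drawn from a spherical
# Gaussian and an orthogonal relation matrix R1, the partition function
# Z_c = sum_h exp(h^T R1 c) is nearly the same for every unit vector c.
rng = np.random.default_rng(0)
d, n = 50, 5000

H = rng.standard_normal((n, d)) / np.sqrt(d)    # entity embeddings, norms ~ 1
Q, _ = np.linalg.qr(rng.standard_normal((d, d)))
R1 = Q                                           # orthogonal relation matrix

def partition(c):
    return np.exp(H @ (R1 @ c)).sum()

# Evaluate Z_c for several random knowledge vectors on the unit sphere.
Z = []
for _ in range(20):
    c = rng.standard_normal(d)
    c /= np.linalg.norm(c)
    Z.append(partition(c))

Z = np.array(Z)
rel_spread = (Z.max() - Z.min()) / Z.mean()
print(rel_spread)   # small relative spread: Z_c concentrates
```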

B Proof of RelWalk Theorem
Let $F_c$ be the probabilistic event that $(1 - \epsilon_z)Z \leq Z_c \leq (1 + \epsilon_z)Z$. Then from the union bound we have, where $\bar{F}_c$ is the complement of the event $F_c$. Moreover, let $F$ be the probabilistic event that both $F_c$ and $F_{c'}$ are True. Then, from $\Pr[F] = 1 - \Pr[\bar{F}_c \cup \bar{F}_{c'}]$, we have $\Pr[F] \geq 1 - 2\exp\left(-\Omega(\log^2 n)\right)$. The R.H.S. of (5) can be split into two parts, $T_1$ and $T_2$, according to whether $F$ happens or not.
Here, $\mathbf{1}_F$ and $\mathbf{1}_{\bar{F}}$ are the indicator functions of the events $F$ and $\bar{F}$, given as follows: Let us first show that $T_2$ is negligibly small. For two real integrable functions $\psi_1(x)$ and $\psi_2(x)$ on $[a, b]$, the Cauchy-Schwarz inequality states that Applying (33) to $T_2$ in (30), we have: Note that $Z_c \geq 1$ because $Z_c$ is a sum of positive numbers, and if $h^\top R_1 c > 0$ for at least one $h \in \mathcal{V}$, then the total sum will be greater than 1. Therefore, by dropping the $Z_c$ term from the denominator, we can further increase the first term in (34), as given by (35).
Let us split the expectation on the R.H.S. of (35) into two cases depending on whether $h^\top R_1 c > 0$ or otherwise, indicated respectively by $\mathbf{1}_{(h^\top R_1 c > 0)}$ and $\mathbf{1}_{(h^\top R_1 c \leq 0)}$.
The second term of (36) is upper bounded by The first term of (36) can be bounded as follows: where $\alpha > 1$. Therefore, it is sufficient to bound $\mathbb{E}_c\left[\exp\left(\alpha h^\top R_1 c\right)^2 \mathbb{E}_{c' \mid c}[\mathbf{1}_{\bar{F}}]\right]$ when $\|h\| = \Omega(\sqrt{d})$. Let us denote by $z$ the random variable $2h^\top R_1 c$. Moreover, let $r(z) = \mathbb{E}_{c' \mid z}[\mathbf{1}_{\bar{F}}]$, which is a function of $z$ taking values in $[0, 1]$. We wish to upper bound $\mathbb{E}_c[\exp(z)r(z)]$. The worst-case $r(z)$ can be quantified using a continuous version of Abel's inequality (proved as Lemma A.4 in Arora et al. (2015)), with which we can upper bound $\mathbb{E}_c[\exp(z)r(z)]$ as follows: Here, $\mathbf{1}_{[t, +\infty)}(z)$ is the function that takes the value 1 when $z \geq t$ and zero elsewhere. Then, we claim that $\Pr_c[z \geq t] \leq \exp\left(-\Omega(\log^2 n)\right)$ implies $t \geq \Omega(\log^{0.9} n)$.
If $c$ were distributed as $\mathcal{N}(0, \frac{1}{d}I)$, this would be a simple tail bound. However, because $c$ is distributed uniformly on the sphere, this requires special care, and the claim follows by applying the tail bound for the spherical distribution given by Lemma A.1 in Arora et al. (2015) instead. Finally, applying Corollary A.3 in Arora et al. (2015), we have: where $|\mathcal{D}|$ is the number of relational tuples $(h, r, t)$ in the KB, and $\delta_0 = |\mathcal{D}|\exp\left(-\Omega(\log^{1.8} n)\right) \leq \exp\left(-\Omega(\log^{1.8} n)\right)$ by the fact that $Z \leq \exp(2\kappa)n = \mathcal{O}(n)$, where $\kappa$ is the upper bound on $h^\top R_1 c$ and $t^\top R_2 c$, which is regarded as a constant.
On the other hand, we can lower bound p(h, t | r) as given by (43).
Taking the logarithm of both sides, from (42) and (43), the multiplicative error translates to an additive error given by (44).
where $A(c) := \mathbb{E}_{c' \mid c}\left[\exp\left(t^\top R_2 c'\right)\right]$. We assumed that $c$ and $c'$ are on the unit sphere and $R_1$ and $R_2$ to be orthogonal matrices. Therefore, $R_1 c$ and $R_2 c'$ are also on the unit sphere. Moreover, if we let the upper bound of the $\ell_2$ norm of the entity embeddings be $\kappa\sqrt{d}$, then we have $\|h\| \leq \kappa\sqrt{d}$ and $\|t\| \leq \kappa\sqrt{d}$. Therefore, we have Then, we can upper bound $A(c)$ as follows: for some $\epsilon_2 > 0$. The last inequality holds because To obtain a lower bound on $A(c)$, from the first-order Taylor approximation $\exp(x) \geq 1 + x$, we observe that: Therefore, from our model assumptions we have Hence, Therefore, from (47) and (50) we have Plugging $A(c)$ back into (44) results in $\log p(h, t \mid r)$ equal to:
\begin{align*}
&\log \mathbb{E}_c\left[\exp\left(h^\top R_1 c\right) A(c)\right] \pm \delta_0 - 2\log Z + 2\log(1 \pm \epsilon_z) \\
&= \log \mathbb{E}_c\left[\exp\left(h^\top R_1 c\right)(1 \pm \epsilon_2)\exp\left(t^\top R_2 c\right)\right] \pm \delta_0 - 2\log Z + 2\log(1 \pm \epsilon_z) \\
&= \log \mathbb{E}_c\left[\exp\left(h^\top R_1 c\right)\exp\left(t^\top R_2 c\right)\right] \pm \delta_0 - 2\log Z + 2\log(1 \pm \epsilon_z) + \log(1 \pm \epsilon_2) \\
&= \log \mathbb{E}_c\left[\exp\left(h^\top R_1 c + t^\top R_2 c\right)\right] \pm \delta_0 - 2\log Z + 2\log(1 \pm \epsilon_z) + \log(1 \pm \epsilon_2) \\
&= \log \mathbb{E}_c\left[\exp\left(\left(R_1^\top h + R_2^\top t\right)^\top c\right)\right] \pm \delta_0 - 2\log Z + 2\log(1 \pm \epsilon_z) + \log(1 \pm \epsilon_2)
\end{align*}
Note that $c$ has a uniform distribution over the unit sphere. In this case, from Lemma A.5 in Arora et al. (2015), (53) holds approximately.

C Learning with Multiple Negative Triples
This approach can easily be extended to learn from multiple negative triples as follows. Let us consider that we are given a positive triple $(h, R, t)$ and a set of $K$ negative triples $\{(h'_k, R, t'_k)\}_{k=1}^{K}$. We would like our model to assign a probability $p(h, t \mid R)$ to the positive triple that is higher than that assigned to any of the negative triples. This requirement can be written as (56).
$p(h, t \mid R) \geq \max_{k=1,\ldots,K} p(h'_k, t'_k \mid R)$ We could further require the ratio between the probability of the positive triple and the maximum probability over all negative triples to be greater than a threshold $\eta \geq 1$, making the requirement in (56) tighter.
By taking the logarithm of (57), we obtain $\log p(h, t \mid R) - \log \max_{k=1,\ldots,K} p(h'_k, t'_k \mid R) \geq \log(\eta)$. Therefore, we can define the marginal loss for a misclassification as follows: $L\left((h, R, t), \{(h'_k, R, t'_k)\}_{k=1}^{K}\right) = \max\left(0, \log \max_{k=1,\ldots,K} p(h'_k, t'_k \mid R) + \log(\eta) - \log p(h, t \mid R)\right)$. However, from the monotonicity of the logarithm, we have that for all $x_1, x_2 > 0$, if $\log(x_1) \geq \log(x_2)$ then $x_1 \geq x_2$. Therefore, the logarithm of the maximum can be replaced by the maximum of the logarithms in (59), as shown in (60).
By substituting (9) for the probabilities in (60) we obtain the rank-based loss given by (61).
In practice, we can use $p(h'_k, t'_k \mid R)$ to select the negative triple with the highest probability for training with the positive triple.
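The multi-negative marginal loss above can be sketched in a few lines; the log-probability values below are hypothetical stand-ins for the RelWalk scores $\log p(h, t \mid R)$, used only to illustrate the hinge behaviour:

```python
import numpy as np

def marginal_loss(pos_logp, neg_logps, log_eta=0.0):
    """Marginal loss with multiple negatives (sketch of Eq. 60).

    pos_logp : log p(h, t | R) for the positive triple
    neg_logps: log p(h'_k, t'_k | R) for the K negative triples
    log_eta  : log of the margin threshold eta >= 1
    """
    hardest = max(neg_logps)          # max of the logarithms = log of the max
    return max(0.0, hardest + log_eta - pos_logp)

# Hypothetical log-probabilities: one positive, three negatives.
loss = marginal_loss(-2.0, [-2.2, -2.5, -4.0], log_eta=np.log(2.0))
# hardest negative is -2.2, so loss = max(0, -2.2 + log 2 + 2.0) ~= 0.49
print(loss)
```

Taking the hardest (highest-probability) negative inside the hinge mirrors the selection strategy mentioned above.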

D Training Details
The statistics of the benchmark datasets are shown in Table 4. We selected the initial learning rate ($\alpha$) for SGD from $\{0.01, 0.001\}$, and the regularisation coefficients ($\lambda_1, \lambda_2$) for the orthogonality constraints of the relation matrices from $\{0, 1, 10, 100\}$. The number of randomly generated negative triples $n_{neg}$ for each positive example is varied in $\{1, 10, 20, 50, 100\}$, and $d \in \{50, 100\}$. The optimal hyperparameter settings were: $\lambda_1 = \lambda_2 = 10$ and $n_{neg} = 100$ for all datasets; $\alpha = 0.001$ for FB15K237 and FB13, and $\alpha = 0.01$ for WN18RR. For FB15K237 and WN18RR, $d = 100$ was best, whereas for FB13, $d = 50$ performed best. Negative triples are generated by replacing the head or the tail entity in a positive triple with a randomly selected entity. We train the model until convergence, or for at most 1000 epochs over the training data, where each epoch is divided into 100 mini-batches. The best model is selected by early stopping based on the performance of the learnt embeddings on the validation set (evaluated after every 20 epochs).
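The negative-sampling scheme described above (corrupting either the head or the tail with a uniformly random entity) can be sketched as follows; the entity identifiers, triple format, and function name are illustrative assumptions, not the actual implementation:

```python
import random

def generate_negatives(triple, entities, n_neg, rng=random):
    """Corrupt a positive triple (h, R, t) by replacing its head or its
    tail with a randomly selected entity, as described above."""
    h, r, t = triple
    negatives = []
    while len(negatives) < n_neg:
        e = rng.choice(entities)
        if rng.random() < 0.5:
            corrupted = (e, r, t)   # replace the head entity
        else:
            corrupted = (h, r, e)   # replace the tail entity
        if corrupted != triple:     # skip accidental copies of the positive
            negatives.append(corrupted)
    return negatives

# Hypothetical entity vocabulary and positive triple.
entities = [f"e{i}" for i in range(1000)]
negs = generate_negatives(("e1", "born_in", "e2"), entities, n_neg=100)
print(len(negs))   # 100 negatives per positive, matching n_neg = 100
```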