Improving Knowledge Graph Embedding Using Affine Transformations of Entities Corresponding to Each Relation

Finding a suitable embedding for a knowledge graph remains a big challenge. In previous knowledge graph embedding methods, every entity in a knowledge graph is usually represented as a k-dimensional vector. As we know, an affine transformation can be expressed as a matrix multiplication followed by a translation vector. In this paper, we first utilize a set of affine transformations related to each relation to operate on entity vectors, and then these transformed vectors are used for performing embedding with previous methods. The main advantage of using affine transformations is their good geometric properties and interpretability. Our experimental results demonstrate that the proposed intuitive design with affine transformations provides a statistically significant increase in performance, while adding only a few extra processing steps or a limited number of additional variables. Taking TransE as an example, we employ the scale transformation (a special case of an affine transformation) and introduce only k additional variables for each relation. Surprisingly, it even outperforms RotatE to some extent on various data sets. We also introduce affine transformations into RotatE, DistMult and ComplEx, respectively, and each one outperforms its original method.

Although millions of entities and billions of facts exist in large-scale knowledge graphs, they still suffer from the incompleteness problem. Therefore, knowledge graph completion, also known as link prediction, which aims to predict missing links among entities based on the known triples, has gradually attracted much attention. Recently, extensive studies have been done concerning knowledge graph embedding (Bordes et al., 2013; Dettmers et al., 2018). These methods represent entities and relations as low-dimensional vectors (or matrices, tensors, etc.), which not only preserve the semantic information of the knowledge graph, but also represent entities and relations in a fixed structure that is easier for further machine processing. Therefore, apart from the link prediction task, knowledge graph embedding can also be used in various downstream tasks, such as triple classification (Nguyen et al., 2020), search personalization (Lu et al., 2020) and so on.
The success of existing knowledge graph embedding models heavily relies on their ability to model different types of relation patterns, such as symmetry/antisymmetry and composition. For example, TransE (Bordes et al., 2013), which represents relations as translations, can model the composition pattern. DistMult (Yang et al., 2015), which forces all relation embeddings to be diagonal matrices in the bilinear model, can model the symmetry pattern. However, most models ignore the difference between single-relational and multi-relational triples.
Multi-relational triples are ubiquitous in knowledge graphs. For instance, WordNet (Miller, 1995) contains the entity {department_of_justice} with relations {_hypernym, _synset_domain_topic_of, _has_part}. Freebase (Bollacker et al., 2008) contains the entity {Bryan Singer} with relations {/film/director/film, /people/person/profession, /people/person/nationality, /people/person/place_of_birth and so on}. Different relations lead entities to different identities or concerns. Figure 1 briefly shows that multiple relations may affect the optimization of knowledge graph embedding models. We apply spatial transformations so that the entities carry the corresponding relation information, which helps distinguish the scenes of different relations. Although there exist similar works that project entities with each relation (Lin et al., 2015; Nguyen et al., 2016), they often require complex projection matrices, which lead to a large amount of calculation and are difficult to apply to other models. In addition, beyond the TransE series of models, we also apply this transformation method to bilinear models similar to RESCAL (Nickel et al., 2011), and noticeably improve their performance.
In this paper, we first utilize a set of affine transformations related to each relation to operate on entity vectors, and then these transformed vectors are used for performing embedding with previous methods like TransE (Bordes et al., 2013), RotatE (Sun et al., 2019), DistMult (Yang et al., 2015) and ComplEx (Trouillon et al., 2016). All of these applications are correspondingly simplified based on the different model structures. Our experimental results demonstrate that the proposed intuitive design with affine transformations provides a statistically significant increase in performance, while adding only a few extra processing steps or a limited number of additional variables. Taking TransE as an example, we employ the scale transformation (a special case of an affine transformation), and only introduce k additional variables for each relation. Surprisingly, it even outperforms RotatE to some extent on various data sets. The applications to the other models also show better results than their original models. Especially for DistMult and ComplEx, experiments on three benchmark data sets show that the proposed affine-transformation-based algorithms outperform several other state-of-the-art algorithms.
Notations. Throughout this paper, we use lowercase letters e, h, r, and t to represent entities, head entities, relations, and tail entities, respectively. The triplet (h, r, t) denotes a fact in knowledge graphs. The corresponding boldface lowercase letters h, r and t denote the embeddings (vectors) of head entities, relations, and tail entities. k and d are the dimensionality of the entity and relation embedding spaces, respectively (usually k = d).

Related Work
In this section, we briefly review the related work. Roughly speaking, the existing knowledge graph embedding models are mainly divided into three categories: translational models, bilinear models and deep learning models. Table 1 summarizes the score functions f_r(h, t) of previous state-of-the-art methods.
Translational models. TransE is the first link prediction model based on a translation distance constraint: it supposes that entities and relations satisfy h + r ≈ t, where h, r, t ∈ ℝ^k, and defines the score function as f_r(h, t) = −‖h + r − t‖_{1/2}. TransH (Wang et al., 2014) is proposed to compensate for the shortcomings of TransE, which cannot handle 1-N, N-1 and N-N relations well. TransH projects entities onto relation-specific hyperplanes with h⊥ = h − (w_r^T h) w_r and t⊥ = t − (w_r^T t) w_r, and defines the score function as f_r(h, t) = −‖h⊥ + r − t⊥‖₂². Moreover, RotatE (Sun et al., 2019) defines each relation as a rotation from the source entity to the target entity in a complex vector space, which is able to represent various relation patterns including symmetry/antisymmetry, inversion and composition. QuatE (Zhang et al., 2019) then represents entities and relations with quaternions; HAKE (Zhang et al., 2020b) considers the hierarchy of relations. Both achieve impressive results.
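As a concrete illustration, here is a minimal NumPy sketch of the two translational score functions above (the function names and toy vectors are ours, not from the paper):

```python
import numpy as np

def transe_score(h, r, t, p=1):
    """TransE: f_r(h, t) = -||h + r - t||_p (Bordes et al., 2013)."""
    return -np.linalg.norm(h + r - t, ord=p)

def transh_score(h, r, t, w_r):
    """TransH: project h and t onto the relation-specific hyperplane
    with unit normal vector w_r, then apply the translation r."""
    h_perp = h - np.dot(w_r, h) * w_r
    t_perp = t - np.dot(w_r, t) * w_r
    return -np.linalg.norm(h_perp + r - t_perp, ord=2)

# A triple that satisfies h + r = t exactly gets the maximal score 0.
h = np.array([0.1, 0.2, 0.3])
r = np.array([0.5, -0.1, 0.0])
t = h + r
assert transe_score(h, r, t) == 0.0
```

Training then pushes scores of observed triples toward 0 and scores of corrupted triples down, which is independent of the particular score function chosen.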
Bilinear models. RESCAL (Nickel et al., 2011) represents each relation as a full-rank matrix and defines a bilinear score function f_r(h, t) = h^T M_r t. Although the embedded relations have a large number of parameters, RESCAL can still obtain good results with some of the latest training methods (Ruffinelli et al., 2019). Subsequently, DistMult (Yang et al., 2015) forces all relation matrices M_r to be diagonal, which reduces the parameter space and yields a model that is easier to train. However, DistMult assumes that all relations are symmetric, and does not suit other types of relations, such as antisymmetry and composition. To solve this problem, ComplEx (Trouillon et al., 2016) extends DistMult to complex space, h, r, t ∈ ℂ^k, and uses the conjugate transpose of t to model asymmetric relations.
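The symmetry issue discussed above is easy to see numerically. A minimal sketch of the three bilinear score functions (names and toy embeddings are ours):

```python
import numpy as np

def rescal_score(h, M_r, t):
    """RESCAL: f_r(h, t) = h^T M_r t with a full relation matrix M_r."""
    return h @ M_r @ t

def distmult_score(h, r, t):
    """DistMult restricts M_r to diag(r), so the score is sum(h * r * t)."""
    return np.sum(h * r * t)

def complex_score(h, r, t):
    """ComplEx: Re(h^T diag(r) conj(t)); conjugation breaks symmetry."""
    return np.real(np.sum(h * r * np.conj(t)))

# DistMult is necessarily symmetric in h and t ...
h = np.array([1.0, 2.0])
r = np.array([0.5, -1.0])
t = np.array([2.0, 1.0])
assert distmult_score(h, r, t) == distmult_score(t, r, h)

# ... while ComplEx with complex embeddings can score (h, r, t) and
# (t, r, h) differently, so it can model antisymmetric relations.
hc = np.array([1 + 1j, 0.5j])
rc = np.array([1j, 1.0])
tc = np.array([2.0, 1 - 1j])
assert complex_score(hc, rc, tc) != complex_score(tc, rc, hc)
```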

Embedding with Affine Transformation
In this section, we first briefly introduce the affine transformation. Then we present our proposed method, which utilizes affine transformations in TransE, RotatE, DistMult and ComplEx, respectively.

Affine Transformation
Consider a data set of k-dimensional points {x_i}. We wish to learn a k × k linear transformation matrix A and a translation vector b which will help to find a better embedding of the original data points. In general, an affine transformation is composed of linear transformations (dilation, reflection, rotation, scaling or shear) and a translation (or "shift"). In addition, an affine transformation preserves collinearity and ratios of distances. In this regard, we perform affine transformations on the head entities and tail entities according to the corresponding relations:

h′ = A_r h + b_r,  t′ = C_r t + d_r,  (1)

where A_r, C_r ∈ ℝ^{k×k} and b_r, d_r ∈ ℝ^k are the affine transformation parameters for the head entity and tail entity, respectively.
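A minimal sketch of these per-relation maps h ↦ A_r h + b_r and t ↦ C_r t + d_r, including a check of the collinearity property mentioned above (the random parameter values are stand-ins, not learned embeddings):

```python
import numpy as np

k = 3
rng = np.random.default_rng(0)

# Per-relation affine parameters: A_r, C_r are k x k matrices,
# b_r, d_r are k-dimensional translation vectors.
A_r = rng.normal(size=(k, k))
b_r = rng.normal(size=k)
C_r = rng.normal(size=(k, k))
d_r = rng.normal(size=k)

def transform_head(h):
    return A_r @ h + b_r

def transform_tail(t):
    return C_r @ t + d_r

# Affine maps preserve collinearity and ratios of distances: the
# midpoint of two points maps to the midpoint of their images.
p, q = rng.normal(size=k), rng.normal(size=k)
mid = transform_head((p + q) / 2)
assert np.allclose(mid, (transform_head(p) + transform_head(q)) / 2)
```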

Improving TransE with AT
For TransE + AT (affine transformation), the expected distance relationship after the affine transformation can be expressed as

h′ + r ≈ t′.  (2)

Substituting Equation (1) into Equation (2), we can obtain

A_r h + b_r + r ≈ C_r t + d_r.  (3)

We further simplify Equation (3) as

C_r^{-1} A_r h + C_r^{-1}(b_r + r − d_r) ≈ t.  (4)

Since C_r^{-1} and A_r are also transformations determined by r, and the effect of C_r^{-1} on the product with h can be absorbed by A_r, we denote A′_r as C_r^{-1} A_r. Similarly, we denote r′ as C_r^{-1}(b_r + r − d_r). In fact, the symbols A′_r, r′ are only used to distinguish the changes from A_r, r, and we still use A_r and r to represent A′_r and r′ in the following equations.

Table 1: Score functions f_r(h, t) of previous state-of-the-art methods.

Model | Score Function f_r(h, t) | Parameters
TransE (Bordes et al., 2013) | −‖h + r − t‖_{1/2} | h, r, t ∈ ℝ^k
TransH (Wang et al., 2014) | −‖h⊥ + r − t⊥‖₂² | h, r, t, w_r ∈ ℝ^k
TransR (Lin et al., 2015) | −‖M_r h + r − M_r t‖₂² | h, t ∈ ℝ^k, r ∈ ℝ^d, M_r ∈ ℝ^{d×k}
DistMult (Yang et al., 2015) | h^T diag(r) t | h, r, t ∈ ℝ^k
ComplEx (Trouillon et al., 2016) | Re(h^T diag(r) conj(t)) | h, r, t ∈ ℂ^k
ConvE (Dettmers et al., 2018) | f(vec(f([h; r] ∗ ω)) W) t | h, r, t ∈ ℝ^k

In experiments, using the full matrices A_r may cause parameter redundancy and overfitting. Therefore, following DistMult, we take only the diagonal parameters of the full matrix, denoted diag(a_r), and obtain a simplified equation:

diag(a_r) h + r ≈ t.  (5)

Then the corresponding score function of TransE + AT can be expressed as

f_r(h, t) = −‖diag(a_r) h + r − t‖,  (6)

where h, r, t, a_r ∈ ℝ^k. The simplified model of TransE + AT accidentally obtains a score function similar to that of MuRE (Balazevic et al., 2019). The scoring function of MuRE is

f_r(h, t) = −d(R h, t + r)² + b_s + b_o,  (7)

where d is a distance function, R is a diagonal relation matrix, and b_s and b_o are constant biases. Internally, MuRE (R h − r − t) (Balazevic et al., 2019) and TransE + AT (diag(a_r) h + r − t) are very similar, but MuRE computes the square of the distance and has two bias terms, so the two are not totally the same.
The scoring function of the unsimplified version of TransE + AT can be expressed as

f_r(h, t) = −‖A_r h + b_r + r − C_r t − d_r‖.  (8)

We compare it with other models that improve TransE through relation-based transformations, such as TransH (Wang et al., 2014) and TransR (Lin et al., 2015) (refer to Table 1 for their scoring functions). TransH projects the entity onto the relation-specific hyperplane with normal vector w_r ∈ ℝ^k, while TransR transforms the entity with the relation-specific matrix M_r ∈ ℝ^{d×k}. The entity of TransR has a larger transformation range than that of TransH, so TransH can be understood as a special case of TransR. When A_r = C_r and b_r = d_r = 0, the original TransE + AT is equivalent to TransR. That is, TransR is a special case of TransE + AT, and we simplify TransE + AT on this basis.
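A minimal NumPy sketch of the simplified TransE + AT score derived above; with the scale vector a_r set to all ones it reduces exactly to TransE (the function name and toy values are ours):

```python
import numpy as np

def transe_at_score(h, r, t, a_r, p=1):
    """Simplified TransE + AT: f_r(h, t) = -||diag(a_r) h + r - t||.
    a_r is the per-relation scale vector, i.e. the only k extra
    parameters this variant adds for each relation."""
    return -np.linalg.norm(a_r * h + r - t, ord=p)

# With a_r = 1 the model falls back to plain TransE.
h = np.array([0.3, -0.2, 0.5])
r = np.array([0.1, 0.1, 0.1])
t = np.array([0.4, 0.0, 0.6])
plain_transe = -np.linalg.norm(h + r - t, ord=1)
assert np.isclose(transe_at_score(h, r, t, np.ones(3)), plain_transe)
```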

Improving RotatE with AT
For RotatE + AT, the expected rotation relationship after the affine transformation can be expressed as

h′ ∘ r ≈ t′.  (10)

Substituting Equation (1) into Equation (10), we can obtain

(A_r h + b_r) ∘ r ≈ C_r t + d_r.  (11)

We further simplify Equation (11) as

C_r^{-1}((A_r h) ∘ r) + C_r^{-1}(b_r ∘ r − d_r) ≈ t.  (12)

We simplify C_r^{-1}((A_r h) ∘ r) as diag(a_r) h ∘ r to represent a scale transformation, and denote b′_r as C_r^{-1}(b_r ∘ r − d_r). Again, we use a_r, b_r to represent a′_r, b′_r in the following equations, and we can obtain

diag(a_r) h ∘ r + b_r ≈ t.  (13)

Then the corresponding score function of RotatE + AT can be expressed as

f_r(h, t) = −‖diag(a_r) h ∘ r + b_r − t‖,  (14)

where h, r, t, b_r ∈ ℂ^k and a_r ∈ ℝ^k.
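A minimal sketch of this RotatE + AT score in complex space; with a_r = 1 and b_r = 0 it reduces to plain RotatE, where each relation coordinate is a unit-modulus rotation (names and toy values are ours):

```python
import numpy as np

def rotate_at_score(h, r_phase, t, a_r, b_r):
    """RotatE + AT: f_r(h, t) = -||diag(a_r) h o r + b_r - t||,
    with h, t, b_r complex, a_r real, and the relation r given by
    its rotation phases so that |r_i| = 1 as in RotatE."""
    r = np.exp(1j * r_phase)
    return -np.linalg.norm(a_r * h * r + b_r - t)

# With a_r = 1, b_r = 0 and t equal to h rotated by r, the score is 0.
h = np.array([1 + 1j, 0.5 - 0.5j])
phase = np.array([np.pi / 2, np.pi])
t = np.exp(1j * phase) * h
score = rotate_at_score(h, phase, t, np.ones(2), np.zeros(2))
assert np.isclose(score, 0.0)
```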

Improving DistMult and ComplEx with AT
Since the score functions of RESCAL, DistMult and ComplEx have similar structures, we use RESCAL + AT to show the application process. For RESCAL + AT, the expected score function after the affine transformation can be expressed as

f_r(h, t) = h′^T M_r t′.  (15)

Substituting Equation (1) into Equation (15), we can obtain

f_r(h, t) = (A_r h + b_r)^T M_r (C_r t + d_r).  (16)

We further simplify Equation (16) as

f_r(h, t) = (h + A_r^{-1} b_r)^T A_r^T M_r C_r (t + C_r^{-1} d_r).  (17)

Here, we denote b′_r as A_r^{-1} b_r, d′_r as C_r^{-1} d_r and M′_r as A_r^T M_r C_r. Also, we use b_r, d_r and M_r to represent b′_r, d′_r and M′_r in the following equations. Correspondingly, the score function of RESCAL + AT can be expressed as

f_r(h, t) = (h + b_r)^T M_r (t + d_r),  (18)

where h, t, b_r, d_r ∈ ℝ^k and M_r ∈ ℝ^{k×k}. Similarly, for DistMult, the corresponding score function of DistMult + AT is

f_r(h, t) = (h + b_r)^T diag(r) (t + d_r),  (19)

where h, r, t, b_r, d_r ∈ ℝ^k.
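A minimal sketch of the DistMult + AT score, which adds the translation vectors b_r and d_r to the bilinear form; with both set to zero it reduces to plain DistMult (names and toy values are ours):

```python
import numpy as np

def distmult_at_score(h, r, t, b_r, d_r):
    """DistMult + AT: f_r(h, t) = (h + b_r)^T diag(r) (t + d_r)."""
    return np.sum((h + b_r) * r * (t + d_r))

# With b_r = d_r = 0 the model falls back to plain DistMult.
h = np.array([1.0, 2.0])
r = np.array([0.5, -1.0])
t = np.array([2.0, 1.0])
zeros = np.zeros(2)
assert np.isclose(distmult_at_score(h, r, t, zeros, zeros),
                  np.sum(h * r * t))
```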
For ComplEx, the corresponding score function of ComplEx + AT is

f_r(h, t) = Re((h + b_r)^T diag(r) conj(t + d_r)),  (20)

where h, r, t, b_r, d_r ∈ ℂ^k.
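The same translation idea carries over to complex space. A minimal sketch, reducing to plain ComplEx when b_r = d_r = 0 (names and toy values are ours):

```python
import numpy as np

def complex_at_score(h, r, t, b_r, d_r):
    """ComplEx + AT: f_r(h, t) = Re((h + b_r)^T diag(r) conj(t + d_r))."""
    return np.real(np.sum((h + b_r) * r * np.conj(t + d_r)))

# With b_r = d_r = 0 the model falls back to plain ComplEx.
h = np.array([1 + 1j, 0.5j])
r = np.array([1j, 1.0])
t = np.array([2.0, 1 - 1j])
z = np.zeros(2, dtype=complex)
plain_complex = np.real(np.sum(h * r * np.conj(t)))
assert np.isclose(complex_at_score(h, r, t, z, z), plain_complex)
```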

Experiments
This section is organized as follows. Firstly, we introduce the experimental settings in detail. Secondly, we show the effectiveness of our proposed models on three benchmark datasets. Finally, we analyze the embeddings generated by TransE + AT, RotatE + AT, DistMult + AT and ComplEx + AT, show the results of ablation studies, and visualize some parameters of the models.

Experimental Settings
We evaluate our proposed models on three commonly used knowledge graphs, which are statistically summarized in Table 2. Among them, YAGO3-10 is a subset of YAGO3 (Mahdisoltani et al., 2013) in which each entity has a minimum of 10 relations. Most of its triples deal with descriptive attributes of people, such as citizenship, gender, and profession.
As pointed out by Toutanova and Chen (2015) and Dettmers et al. (2018), FB15k, WN18 and YAGO3 suffer from test leakage. This issue is primarily due to the presence of relations that are nearly identical or the inverse of one another, so one can achieve state-of-the-art results even with a simple rule-based model. Therefore, we use WN18RR, FB15k-237 and YAGO3-10 as the benchmark datasets.

Evaluation Protocol. For each triple (h, r, t) in the test set, we replace either the head entity h or the tail entity t with every entity in the embedding vocabulary, and then use the score function to rank the candidate entities in descending order. The filtered setting is used to remove correct candidates that appear in the training set or validation set but not in the test set. We choose Mean Reciprocal Rank (MRR) and Hits at N (H@N) as the evaluation metrics; higher MRR or H@N indicates better performance.

Table 3 shows the number of parameters that different models need to learn on the WN18RR, FB15k-237 and YAGO3-10 data sets. Compared with the original models (TransE, RotatE, DistMult and ComplEx), our proposed TransE + AT, RotatE + AT, DistMult + AT and ComplEx + AT models add only a small number of parameters. Especially for the WN18RR and YAGO3-10 data sets, the number of added parameters is almost negligible, but the final experimental results are significantly improved. TransH has the same number of parameters as TransE + AT, but needs more computing resources because it projects both head and tail entities onto relation-specific hyperplanes, and TransR needs more parameters and calculations for the matrix multiplication with M_r. Compared with recent state-of-the-art methods, i.e., QuatE and HAKE, TransE + AT and DistMult + AT have fewer parameters than both, while ComplEx + AT and RotatE + AT are close to HAKE but smaller than QuatE, and our method also exceeds their results in some quality indexes.
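The filtered ranking protocol described above can be sketched as follows (a simplified illustration with toy scores; the function name and data are ours, not the paper's evaluation code):

```python
import numpy as np

def filtered_rank(scores, true_idx, known_idx):
    """Rank of the true entity among all candidates, after filtering
    out other known-correct entities (from train/valid triples)."""
    mask = np.ones_like(scores, dtype=bool)
    mask[list(known_idx)] = False
    mask[true_idx] = True                 # never filter the target itself
    # rank = 1 + number of surviving candidates scoring strictly higher
    return 1 + int(np.sum(scores[mask] > scores[true_idx]))

scores = np.array([0.9, 0.8, 0.95, 0.1])  # higher score = better candidate
# Entity 1 is the answer; entity 2 forms a known-correct triple, so it
# is filtered out and no longer counts against the target's rank.
rank = filtered_rank(scores, true_idx=1, known_idx={2})
assert rank == 2                          # only entity 0 still outranks it
mrr_contribution, hits_at_1 = 1.0 / rank, float(rank <= 1)
assert mrr_contribution == 0.5 and hits_at_1 == 0.0
```

MRR and H@N are then the averages of these per-triple quantities over both head and tail replacement for every test triple.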
Compared with the retrained RotatE, our results of RotatE + AT show an average MRR increase of 2.5% on the three data sets. Especially for the YAGO3-10 data set, RotatE + AT exceeds the retrained TransE and is close to HAKE. For DistMult + AT and ComplEx + AT, we creatively introduce the translation component into the bilinear model. Interestingly, this kind of application works and yields improvements over the original models. Compared with the retrained DistMult, our results of DistMult + AT on the three data sets show an average MRR increase of 1.8%. Similarly, compared with the retrained ComplEx, our results of ComplEx + AT on the three data sets show an average MRR increase of 0.7%. On the three data sets, DistMult + AT and ComplEx + AT exceed the other affine transformation variants, and mostly outperform MuRE, QuatE, InteractE and HAKE, reaching state-of-the-art results.

Ablation Studies
In this section, we conduct ablation studies on different models. Based on the structural differences among the models, we split the affine transformation into different combinations: applying the affine transformation only to head entities (AT_h) or only to tail entities (AT_t); and keeping only the scale parameters (AT_scale) or only the translation parameters (AT_trans) of the affine transformation. For DistMult + AT and ComplEx + AT, we choose the first combination, since it easily splits the affine transformation between head and tail entities, and we choose the second combination for RotatE + AT.
From Table 5, we can see that for most models, better results are obtained by using a complete affine transformation. In some cases H@10 is higher for the ablated variants than for the full models: for example, RotatE + AT_scale gains a 0.2% higher H@10 than RotatE + AT on the FB15k-237 data set, and DistMult + AT_t gains a 0.1% higher H@10 than DistMult + AT on the YAGO3-10 data set. We infer that the complete affine transformation imposes stronger constraints, which makes the accurate prediction H@1 higher while the rough prediction H@10 decreases; conversely, under weak constraints, the accurate prediction H@1 is lower while the rough prediction H@10 increases.

Visualize Embedded Parameters
In this part, we visualize some instances of the proposed models.

Figure 2: Visualization of some instances of TransE + AT, RotatE + AT, DistMult + AT and ComplEx + AT on the WN18RR, FB15k-237 and YAGO3-10 data sets.

For the relations {/film/film/executive_produced_by, /film/film/film_crew_role, /film/film/written_by}, they show a certain difference, and the last three relations, which are related to people, are more similar. In DistMult + AT and ComplEx + AT, we choose two similar relations to form different groups. The results show that different relations have large differences in their histograms, while similar relations have smaller differences; a similar phenomenon also appears in RotatE + AT.

Conclusion
We propose a novel knowledge graph embedding approach which first introduces a parametric mapping that projects entity vectors into a new space by an affine transformation corresponding to each relation, and then employs previous embedding methods that map the entities and relations into the embedding space. This algorithm enforces the embedding to be approximately uniformly distributed around the original entity vectors by adjusting the scaling and translation parameters of the affine transformation, which requires considerably less additional computational effort. Extensive experimental results show that the affine-transformation-based algorithms outperform the original TransE, RotatE, DistMult and ComplEx, respectively. Experiments on three benchmark data sets also show that the proposed affine-transformation-based algorithms outperform several other state-of-the-art algorithms in some quality indexes. We believe that knowledge graph embedding based on affine transformations is very promising and has the potential to be used in many applications. However, more comparisons with other embedding methods are needed to fully understand its advantages and disadvantages.