Hyperbolic Geometry is Not Necessary: Lightweight Euclidean-Based Models for Low-Dimensional Knowledge Graph Embeddings

Recent knowledge graph embedding (KGE) models based on hyperbolic geometry have shown great potential in low-dimensional embedding spaces. However, the necessity of hyperbolic space in KGE is still questionable, because calculations based on hyperbolic geometry are much more complicated than Euclidean operations. In this paper, based on the state-of-the-art hyperbolic model RotH, we develop two lightweight Euclidean-based models, called RotL and Rot2L. The RotL model simplifies the hyperbolic operations while keeping the flexible normalization effect. Building on RotL with a novel two-layer stacked transformation, the Rot2L model obtains improved representation capability, yet requires fewer parameters and less computation than RotH. Experiments on link prediction show that Rot2L achieves state-of-the-art performance on two widely-used datasets for low-dimensional knowledge graph embeddings. Furthermore, RotL achieves performance similar to RotH but requires only half of the training time.


Introduction
To represent entities and relations of knowledge graphs (KGs) in a semantic vector space, researchers have proposed various knowledge graph embedding (KGE) models, which have shown great potential in knowledge graph completion and knowledge-driven applications (Wang et al., 2017; Broscheit et al., 2020). To achieve higher prediction accuracy, recent KGE models usually use high-dimensional embedding vectors of up to 200 or even 500 dimensions (Sun et al., 2019; Zhang et al., 2019). However, when facing large-scale KGs with millions of entities, high embedding dimensions would require prohibitive training costs and storage space (Sachan, 2020; Xie et al., 2020). This hinders the practical application of KGE models, especially on mobile smart devices.
Recently, low-dimensional KGE models based on hyperbolic vector spaces have drawn attention (Sun et al., 2020). MuRP, the first such model, indicates that hyperbolic embeddings can capture hierarchical patterns in KGs and generate high-fidelity and parsimonious representations (Balazevic et al., 2019b). To capture logical patterns in KGs, Chami et al. propose a series of hyperbolic KGE models, including RotH, RefH, and AttH (Chami et al., 2020). Similar to the classic TransE model (Bordes et al., 2013), which treats the relation as a translation between the head and tail entity vectors, the state-of-the-art RotH model adjusts the head vector with rotation and translation transformations to approach the tail vector in hyperbolic space.
Although the above hyperbolic models outperform previous Euclidean-based models in low-dimensional conditions, the necessity of hyperbolic space for this task is still questionable. Comparing a hyperbolic model with its Euclidean-based variant, it is unclear which parts of the modification are vital. Besides, despite their theoretical support, the Möbius matrix-vector multiplication and Möbius addition operations in hyperbolic embeddings are far more complicated than Euclidean multiplication and addition. As shown in Fig. 1, RotH requires threefold more training time than its Euclidean-based variant RotE on two datasets. On large-scale knowledge graphs in particular, the additional cost of the complicated hyperbolic operations would make the problem much more severe.
Facing these problems, we analyze the effective components in hyperbolic KGE models and propose two lightweight "RotH-like" models, RotL and Rot2L, for low-dimensional knowledge graph embeddings. Dispensing with hyperbolic geometry, RotL eliminates the Möbius matrix-vector multiplication and introduces a new flexible addition operation to replace the Möbius addition. To further improve RotL's representation capability, the Rot2L model utilizes two stacked rotation-translation transformations in Euclidean space. Benefiting from a specific parameterization strategy, Rot2L requires fewer parameters and calculations than RotH.
We conduct extensive experiments on two widely-used datasets. The results show that RotL outperforms existing Euclidean-based models in the 32-dimensional condition and requires only half of the training time of RotH. Rot2L obtains state-of-the-art performance on both datasets and outperforms RotH in both prediction accuracy and training speed. Ablation experiments verify the effectiveness of the flexible addition and the other significant modules in Rot2L. We also evaluate our models in different embedding dimensions and analyze the performance differences between RotH and our models in a relation-specific experiment.
The rest of the paper is organized as follows. We discuss the background and definitions in Sec. 2. Sec. 3 introduces the technical details of RotL and Rot2L models. Sec. 4 reports the experimental studies and Sec. 5 further discusses several experimental investigations. The related work is reviewed in Sec. 6. Finally, we offer some concluding remarks in Sec. 7.

Background
In this section, we briefly describe the preliminaries related to this work.

Knowledge Graph Embeddings
In a knowledge graph G = (E, R, T), E and R denote the sets of entities and relations, and T is the collection of factual triples (h, r, t), where the head and tail entities h, t ∈ E and the relation r ∈ R. N_e and N_r refer to the numbers of entities and relations, respectively.
Knowledge graph embeddings aim to represent each entity e and each relation r as a d-dimensional continuous vector. A KGE model is evaluated on the link prediction task, which aims to find e_t ∈ E given an entity-relation query q = (e, r), such that the triple (e, r, e_t) or (e_t, r, e) belongs to the knowledge graph G. Generally, a scoring function F(h, r, t) is designed to measure each candidate triple. Taking the distance-based scoring function F(h, r, t) = D(Q(h, r), t) as an example, it involves two operations: 1) the transformation function Q(h, r) transforms the head vector h using the relation vector r; 2) the distance function D(q, t) measures the distance between the tail vector t and the transformed head vector q = Q(h, r).
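As a concrete illustration, the two operations can be sketched in a few lines (a minimal NumPy example using a TransE-style translation as Q and a negative squared distance as D; the function names are ours, not those of any specific model in this paper):

```python
import numpy as np

# A minimal sketch of a distance-based scoring function F(h, r, t) = D(Q(h, r), t).
# Q is a TransE-style translation and D is the negative squared L2 distance;
# both are illustrative choices.

def transform(h, r):
    """Q(h, r): move the head embedding with the relation embedding."""
    return h + r

def score(h, r, t):
    """F(h, r, t): higher (closer to 0) means a more plausible triple."""
    q = transform(h, r)
    return -np.sum((q - t) ** 2)

h = np.array([0.1, 0.2, 0.3])
r = np.array([0.05, -0.1, 0.0])
t = np.array([0.2, 0.1, 0.3])
print(score(h, r, t))  # negative; a perfect match (q == t) would score 0.0
```

In link prediction, this score is computed for every candidate entity and the candidates are ranked by it.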

Hyperbolic Geometry
Recently, researchers have started to work on effective low-dimensional models in the KGE domain (Sachan, 2020; Wang et al., 2021a,b). Multiple hyperbolic KGE models, such as MuRP, RotH, RefH, and AttH, have achieved good performance in low-dimensional conditions (Balazevic et al., 2019b; Chami et al., 2020). These models employ a hyperbolic geometry model, the d-dimensional Poincaré ball B_c^d (Birman and Ungar, 2001). The hyperbolic space is one of the three kinds of isotropic spaces, and the relevant theoretical research has been carried out for decades (Birman and Ungar, 2001). To achieve vector transformations in hyperbolic space, the Möbius addition ⊕_c and Möbius matrix-vector multiplication ⊗_c are utilized. Möbius addition (Ungar, 2001) is proposed to approximate Euclidean addition in hyperbolic space:

x ⊕_c y = ((1 + 2c⟨x, y⟩ + c‖y‖²)x + (1 − c‖x‖²)y) / (1 + 2c⟨x, y⟩ + c²‖x‖²‖y‖²),

where ⟨·, ·⟩ is the Euclidean inner product. It is clear that Möbius addition requires far more calculations than an ordinary addition. Möbius matrix-vector multiplication (Ganea et al., 2018) is also more complicated than Euclidean multiplication. Before computing the matrix multiplication with M ∈ R^{d×k}, the vector x ∈ B_c^d is projected onto the tangent space at 0 ∈ B_c^d with the logarithmic map log_0^c(x). Then the output of the multiplication is projected back to B_c^d via the exponential map exp_0^c(·), i.e.,

M ⊗_c x = exp_0^c(M log_0^c(x)).
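The cost gap can be seen directly in code (a small NumPy sketch of Möbius addition as defined above, next to plain Euclidean addition; `mobius_add` is our illustrative name):

```python
import numpy as np

# Möbius addition on the Poincaré ball with curvature c > 0,
# compared with ordinary Euclidean addition.

def mobius_add(x, y, c=1.0):
    xy = np.dot(x, y)
    x2 = np.dot(x, x)
    y2 = np.dot(y, y)
    num = (1 + 2 * c * xy + c * y2) * x + (1 - c * x2) * y
    den = 1 + 2 * c * xy + c**2 * x2 * y2
    return num / den

x = np.array([0.1, 0.2])
y = np.array([0.3, -0.1])
print(mobius_add(x, y))          # stays inside the unit ball for c = 1
print(x + y)                     # ordinary Euclidean addition: one vector add

# As c -> 0, the Poincaré ball flattens and Möbius addition
# degenerates to Euclidean addition:
print(mobius_add(x, y, c=1e-9))  # approximately x + y
```

Note how a single Möbius addition involves three inner products, several scalar operations, and a division, versus one vector addition in the Euclidean case.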

The RotH Model
We briefly review RotH (Chami et al., 2020), the state-of-the-art model in low-dimensional KGE. According to the official PyTorch implementation¹, the scoring function of RotH employs a "translation-rotation-translation" transformation and uses the hyperbolic distance as the distance function D. Specifically, let e^H ∈ B_c^d denote the hyperbolic embedding of entity e. For each relation r, two hyperbolic relation vectors r^H, r'^H ∈ B_c^d are defined for the two translation operations. Using a d-dimensional vector r̂, RotH parameterizes a Givens rotation operation with a block-diagonal matrix of the form:

Rot(r̂) = diag(G(r̂_1, r̂_2), ..., G(r̂_{d−1}, r̂_d)), where G(r̂_i, r̂_j) := [[r̂_i, −r̂_j], [r̂_j, r̂_i]].
Then, for a triple (h, r, t), the scoring function F_H of RotH is defined as:

F_H(h, r, t) = −d_{c_r}(Rot(r̂)(h^H ⊕_{c_r} r^H) ⊕_{c_r} r'^H, t^H)² + b_h + b_t,

where c_r > 0 is the relation-specific curvature parameter, d_{c_r}(·, ·) is the hyperbolic distance on B_{c_r}^d, and b_e (e ∈ E) are entity biases which act as margins in the scoring function (Balazevic et al., 2019b; Chami et al., 2020). The other hyperbolic models can be regarded as RotH variants using different relation transformations. In addition, RotE is a Euclidean-based RotH variant, and its scoring function is defined as:

F_E(h, r, t) = −‖Rot(r̂)h + r − t‖² + b_h + b_t,

where h, r, t ∈ R^d. Without complex hyperbolic calculations, F_E can be computed in time linear in the embedding dimension.

¹ https://github.com/HazyResearch/KGEmb
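For illustration, the Euclidean variant can be sketched as follows (a NumPy rendering of a RotE-style score with a block-diagonal Givens rotation; the angle-based parameterization and all names are our assumptions, not the official implementation):

```python
import numpy as np

# Sketch of RotE-style scoring: a block-diagonal Givens rotation applied to
# the head embedding, then a translation, compared against the tail.

def givens_rotate(x, theta):
    """Rotate consecutive coordinate pairs of x by the d/2 angles in theta."""
    x = x.reshape(-1, 2)                       # (d/2, 2) coordinate pairs
    cos, sin = np.cos(theta), np.sin(theta)
    out = np.empty_like(x)
    out[:, 0] = cos * x[:, 0] - sin * x[:, 1]  # each 2x2 block [cos -sin; sin cos]
    out[:, 1] = sin * x[:, 0] + cos * x[:, 1]
    return out.reshape(-1)

def rote_score(h, r_trans, theta, t, b_h=0.0, b_t=0.0):
    q = givens_rotate(h, theta) + r_trans      # rotation-translation transform
    return -np.sum((q - t) ** 2) + b_h + b_t

h = np.array([1.0, 0.0, 0.5, 0.5])
theta = np.array([np.pi / 2, 0.0])             # rotate the first pair by 90 degrees
r_trans = np.zeros(4)
t = np.array([0.0, 1.0, 0.5, 0.5])             # exactly the rotated head here
print(rote_score(h, r_trans, theta, t))        # ~0: the transformed head hits the tail
```

Because the rotation is orthogonal, it preserves vector norms; only the translation and the biases change the scale of the score.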

The Methodology
The goal of this work is to design a high-efficiency low-dimensional KGE model by extracting the effective components in the RotH model and eliminating the redundancy.
We find that RotH performs noticeably well for two reasons. The first is the rotation-translation transformation. As proved in previous research (Sun et al., 2019; Chami et al., 2020), this specific transformation can infer different relation patterns in the KG. The second is flexible normalization. All entity vectors in the hyperbolic space satisfy ‖e‖² < 1/c before and after the transformation, while the curvature c is relation-specific and self-adaptive. As the representation capability of a low-dimensional vector space is limited, the effect of flexible normalization becomes more pronounced. This explains why RotH can outperform its Euclidean-based variant RotE in low-dimensional KGE tasks.
In this section, we first propose a lightweight model, called RotL, which retains the flexible normalization of RotH and simplifies the complex hyperbolic operations. The details of RotL are described in Sec. 3.1. We further design the Rot2L model using two stacked rotation-translation transformations. Rot2L employs a novel parameterization strategy that saves half of the parameters in the two-layer architecture, which is detailed in Sec. 3.2. The architectures of the four models mentioned above are illustrated in Fig. 2.

The RotL Model and Flexible Addition
The RotL model aims to achieve performance similar to RotH while reducing the computational complexity to close to that of RotE. Comparing the scoring functions of RotH and RotE in Eq. 9 and 10, it is clear that the additional calculations of RotH center on the Möbius addition and Möbius matrix-vector multiplication.
Therefore, we first eliminate the hyperbolic embeddings in RotL and initialize the entity vector e and the two relation vectors for rotation and translation in the Euclidean space, such that the relation transformation can be calculated using Euclidean addition and multiplication directly.
To achieve the flexible normalization, we propose Flexible Addition ⊕_α, a simplified form of Möbius addition:

x ⊕_α y = (x + y) / (1 + α⟨x, y⟩),

where α is a relation-specific scaling parameter with a default value of 1. The Flexible Addition provides a self-adaptive normalization of (x + y) and has lower computational complexity than Möbius addition: counting d-dimensional vector operations, the former requires three additions and two multiplications, while the latter needs nine and twelve, respectively. We further discuss the connection between the two operations through Theorem 1.

Figure 2: The architectures of four models: the previous RotE and RotH, and the proposed RotL and Rot2L. A rectangular box denotes a Euclidean-based operation, while a rounded rectangular box denotes a hyperbolic-based one. The inner rectangles denote the embedding vectors or matrices, with the relation-specific ones in orange. ⊗_c, ⊕_c, and ⊕_α refer to Möbius multiplication, Möbius addition, and Flexible Addition, respectively.
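The Flexible Addition can be sketched in a few lines (our reading of the simplified form (x + y)/(1 + α⟨x, y⟩); `flexible_add` is an illustrative name):

```python
import numpy as np

# Sketch of Flexible Addition: an ordinary vector addition followed by a
# relation-specific, self-adaptive rescaling (alpha defaults to 1).

def flexible_add(x, y, alpha=1.0):
    return (x + y) / (1.0 + alpha * np.dot(x, y))

x = np.array([0.2, 0.4])
y = np.array([0.1, -0.3])
print(flexible_add(x, y))           # rescaled Euclidean sum
print(flexible_add(x, y, alpha=0))  # alpha = 0 recovers plain addition
```

A single inner product and one division replace the three inner products and the long numerator of Möbius addition.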
Theorem 1. Given c = α = 1, the Möbius addition ⊕_c and Flexible Addition ⊕_α satisfy x ⊕_c x = x ⊕_α x for any x with ‖x‖ < 1.

Proof. With c = 1, x ⊕_c x = ((1 + 2⟨x, x⟩ + ‖x‖²)x + (1 − ‖x‖²)x) / (1 + 2⟨x, x⟩ + ‖x‖⁴) = 2(1 + ‖x‖²)x / (1 + ‖x‖²)² = 2x / (1 + ‖x‖²). With α = 1, x ⊕_α x = (x + x) / (1 + ⟨x, x⟩) = 2x / (1 + ‖x‖²). Hence the two results coincide.

We emphasize that Theorem 1 indicates the equivalence of the two operations only in this special condition. In general, the proposed Flexible Addition is not equal to the Möbius addition; it imitates the flexible normalization of the latter while eliminating the hyperbolic-space assumption. We then define the transformation function of RotL as Q_L^α(h, r) = Rot(r̂)h ⊕_α r, which can be regarded as the RotE transformation using the flexible addition. To fit this novel operation, we further modify the distance function of RotH in Eq. 8 by designing a simpler non-linear mapping. The distance function and scoring function of RotL are defined as follows:

D_L(q, t) = ϕ(‖q ⊕_{α'_r} (−t)‖),
F_L(h, r, t) = −D_L(Q_L^{α_r}(h, r), t)² + b_h + b_t,

where α_r and α'_r are two different scaling parameters, and ϕ(x) = xe^x is empirically found to replace the arctanh function in RotH with less complexity. Comparing Eq. 9 and 14, it is clear that the hyperbolic calculations are completely eliminated in the RotL model. Thus, RotL reduces the computational complexity of RotH and saves half of the training time, as shown in Fig. 1.
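The special case in Theorem 1 can be checked numerically (a small sanity script implementing both operations as defined earlier in this section):

```python
import numpy as np

# Sanity check for Theorem 1: with c = alpha = 1 and y = x, Möbius addition
# and Flexible Addition both reduce to 2x / (1 + ||x||^2).

def mobius_add(x, y, c=1.0):
    xy, x2, y2 = np.dot(x, y), np.dot(x, x), np.dot(y, y)
    num = (1 + 2 * c * xy + c * y2) * x + (1 - c * x2) * y
    return num / (1 + 2 * c * xy + c**2 * x2 * y2)

def flexible_add(x, y, alpha=1.0):
    return (x + y) / (1.0 + alpha * np.dot(x, y))

x = np.array([0.3, -0.2, 0.1])          # inside the unit ball
closed_form = 2 * x / (1 + np.dot(x, x))
print(np.allclose(mobius_add(x, x), flexible_add(x, x)))  # True
print(np.allclose(mobius_add(x, x), closed_form))         # True
```

For distinct x and y the two operations generally disagree, which is exactly why the theorem is stated for the special condition only.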

The Rot2L Model and Stacked Transformation
Although the lightweight RotL maintains the flexible normalization effect, its performance is limited by the original transformation function of RotH.
In this section, we describe a novel Rot2L model utilizing two stacked rotation-translation transformations.
According to the theory of affine transformations (Berger, 1987), two stacked affine transformations can be collapsed into a single one. Therefore, inspired by neural networks, we design a two-layer architecture with an activation function in the middle, as shown in Fig. 2. The transformation function Q_2L(h, r) in Rot2L is defined as:

Q_2L(h, r) = Q_L^{α_2}(ψ_γ(Q_L^{α_1}(h, r)), r), with ψ_γ(x) = γψ(x) + (1 − γ)x,

where ψ is the middle activation function and γ is a hyper-parameter that balances the two parts. Q_L^{α_1} and Q_L^{α_2} represent the two transformation layers, which have the same form as the transformation function in RotL.
In the Rot2L model, the two layers need different parameters. This would double the number of relation parameters, because each layer requires two N_r × d embedding matrices to represent the translation vectors and rotation matrices for all relations. To reduce the relation parameters, Rot2L employs a novel parameterization strategy, which shares partial parameters among different relations.
Specifically, we utilize one embedding matrix M ∈ R^{N_r×d} and a d-dimensional learnable vector f ∈ R^d for each rotation-translation transformation layer, so that half of the parameters are shared across different relations by replacing the second embedding matrix with the vector f. Given the vector r = M[r] for the relation r, the corresponding translation vector and rotation matrix are constructed from r and the shared vector f. Finally, the scoring function of Rot2L combines the transformation function Q_2L(h, r) with the same distance function as RotL.

Note that it might be feasible to employ more transformation layers in Rot2L, as in deep neural networks. There are two reasons why we do not use more than two layers. First, more layers require more parameters, which goes against our original intention of being lightweight. Second, we find the vector values are gradually magnified when passing through multiple layers; a three-layer Rot2L already suffers a performance decrease. Exploring a deeper model with more effective regularization is left for future work.
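Under our reading of the two-layer design, the stacked transformation can be sketched as follows (the tanh middle activation, the γ-blending, and the per-layer parameter layout are illustrative assumptions, not the paper's exact choices):

```python
import numpy as np

# Illustrative sketch of a two-layer stacked rotation-translation transform
# with a blended nonlinearity in between, in the spirit of Rot2L.

def givens_rotate(x, theta):
    x = x.reshape(-1, 2)
    cos, sin = np.cos(theta), np.sin(theta)
    out = np.empty_like(x)
    out[:, 0] = cos * x[:, 0] - sin * x[:, 1]
    out[:, 1] = sin * x[:, 0] + cos * x[:, 1]
    return out.reshape(-1)

def flexible_add(x, y, alpha=1.0):
    return (x + y) / (1.0 + alpha * np.dot(x, y))

def layer(h, theta, trans, alpha):
    """One RotL-style layer: rotate, then flexible-add the translation."""
    return flexible_add(givens_rotate(h, theta), trans, alpha)

def q_2l(h, params, gamma=0.5):
    """Two stacked layers; gamma blends the middle activation with identity."""
    (theta1, r1, a1), (theta2, r2, a2) = params
    x = layer(h, theta1, r1, a1)
    x = gamma * np.tanh(x) + (1 - gamma) * x   # assumed form of the activation
    return layer(x, theta2, r2, a2)

h = np.array([0.2, 0.1, -0.3, 0.4])
params = [(np.array([0.3, -0.2]), np.full(4, 0.05), 1.0),
          (np.array([0.1, 0.4]), np.full(4, -0.05), 1.0)]
print(q_2l(h, params))
```

Without the middle nonlinearity, the two affine layers would collapse into one, which is exactly the degeneracy the activation is meant to prevent.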

Experimental Setup
Datasets. Our experimental studies are conducted on two widely-used datasets. WN18RR (Bordes et al., 2014) is a subset of the English lexical database WordNet (Miller, 1992), while FB15k237 (Toutanova and Chen, 2015) is extracted from Freebase and includes knowledge facts on movies, actors, awards, and sports. Inverse relations are removed from both datasets, as many test triples could otherwise be obtained simply by inverting triples in the training set. The statistics of the datasets are given in Table 1, where "Train", "Valid", and "Test" refer to the number of triples in the training, validation, and test sets.

Implementation Details. Following previous work, we utilize a binary cross-entropy loss, defined as:

L = −log σ(F(h, r, t)) − Σ_i log(1 − σ(F(h_i, r, t_i))),

where σ(·) refers to the sigmoid function, and (h_i, r, t_i) refers to the negative samples obtained after deleting training triples. All experiments are performed on NVIDIA GeForce GTX 1080 Ti GPUs and implemented in Python using the PyTorch framework.

Evaluation Metrics. For the link prediction experiments, we adopt three evaluation metrics: 1) MRR, the average inverse rank of the test triples; 2) Hits@10, the proportion of correct entities ranked in the top 10; and 3) Hits@1, the proportion of correct entities ranked first. Higher MRR, Hits@10, and Hits@1 mean better performance. Following previous work, we compute all rankings in the filtered setting.
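These metrics can be sketched as follows (a minimal NumPy implementation of filtered ranking, MRR, and Hits@k on toy scores; the function names are ours):

```python
import numpy as np

# Filtered link-prediction metrics: given a test query's scores over all
# entities, rank the correct entity after masking out other known true
# answers, then aggregate ranks into MRR and Hits@k.

def filtered_rank(scores, true_idx, known_true):
    """Rank of true_idx among scores, ignoring other known-true entities."""
    scores = scores.copy()
    mask = [i for i in known_true if i != true_idx]
    scores[mask] = -np.inf                     # the "filter" step
    return int(np.sum(scores > scores[true_idx])) + 1

def mrr_hits(ranks, k=10):
    ranks = np.asarray(ranks, dtype=float)
    return (1.0 / ranks).mean(), (ranks <= k).mean()

# Toy example: 5 candidate entities, entity 2 is correct, entity 0 is
# another known true answer that must be filtered out.
scores = np.array([9.0, 1.0, 3.0, 5.0, 2.0])
rank = filtered_rank(scores, true_idx=2, known_true=[0, 2])
print(rank)                                    # 2: only entity 3 scores higher
print(mrr_hits([rank], k=1))
```

Without the filter step, entity 0 would unfairly push the correct entity down to rank 3.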

Link Prediction Task
We evaluate our models on the link prediction task on the two datasets. The experimental results are shown in Table 2. We select ten compared models of two types: the first six models are Euclidean-based, and the others utilize hyperbolic embeddings. Following the setting of Chami et al. (2020), all models use a 32-dimensional vector space. From the results, we have the following observations. First, the four hyperbolic models generally outperform their Euclidean variants as well as RotatE and TuckER, the state-of-the-art models for high-dimensional knowledge graph embeddings. This confirms the effectiveness of hyperbolic models in low-dimensional knowledge graph embeddings.
RotL outperforms RotE and the other Euclidean-based models. Compared with RotE, the Hits@10 of RotL improves from 0.529 to 0.550 on WN18RR, and from 0.482 to 0.500 on FB15k237. Despite its lightweight architecture, RotL even achieves prediction accuracy similar to RotH on FB15k237, indicating that hyperbolic embeddings can be replaced by the flexible addition and the new distance function. Using the novel two-layer transformation function, Rot2L further improves on RotL and achieves state-of-the-art results on both datasets. In particular, compared with RotH, the MRR of Rot2L improves from 0.314 to 0.326 on FB15k237, and the Hits@1 increases from 0.428 to 0.434 on WN18RR.
It should be noted that improving low-dimensional performance is much harder than improving high-dimensional performance. Our experimental results demonstrate the effectiveness of Rot2L, whose computational complexity is nevertheless lower than that of RotH and AttH.

Ablation Studies
We further conduct a series of ablation experiments to evaluate the different modules of our models. Two main improvements are evaluated: 1) the new distance function (Dis) in Eq. 13, and 2) the middle activation function (Mid) in Eq. 15. Accordingly, we test variants that eliminate one of the two functions (e.g., Rot2L w/o Dis removes the new distance function). The other parts, such as the flexible addition and the stacked transformations, can be verified by comparing RotE (a Euclidean-based variant of RotH), RotL, and Rot2L. The experimental results are shown in Table 4.
From the results, we can see that the Hits@10 of Rot2L is higher than that of Rot2L w/o Dis on both datasets, which proves the effectiveness of the new distance function. A similar result is observed for RotL, though the improvement is relatively small. The Hits@10 of Rot2L w/o Mid is lower than that of Rot2L on FB15k237, while showing no obvious difference on WN18RR. This indicates that the activation function is more effective on FB15k237, which contains many more relations than WN18RR. On FB15k237, RotL outperforms Rot2L w/o Mid, indicating that when facing complex relations, a pure two-layer transformation is no better than a single layer; this further validates the contribution of the activation function. Comparing RotL w/o Dis and RotE, the impact of the flexible addition is obvious: using a simple scaling operation, it provides 1% and 2% improvements in Hits@10 on the two datasets, respectively. Overall, the experimental results indicate the effectiveness of the major modules of our proposed models. Based on the same Euclidean space, our RotL and Rot2L models achieve a significant performance improvement over the RotE model.

Efficiency Analysis
We analyze and compare the computational complexity of RotE, RotH, RotL, and Rot2L in this section. In terms of time complexity, as shown in Fig. 1, RotL is much faster than RotH, mainly because the Flexible Addition requires only about a quarter of the computational cost of the Möbius addition. Although Rot2L repeats the rotation-translation transformation twice, its computational cost is still lower than that of RotH.
In terms of space complexity, a slight difference appears in the number of relation parameters; the parameters related to entities, including the entity embedding vectors and entity biases, are the same in all four models and occupy the vast majority of the total parameters. RotH requires the most relation parameters, (3N_r + 1)d, covering three relation transformation vectors and the learnable curvatures for different relations. By contrast, RotE and RotL cost less, at 2N_r d and 2(N_r + 1)d respectively, where the extra part of RotL comes from the α in the Flexible Addition. Even with its effective parameterization strategy, Rot2L still requires two shared vectors and another α-related vector, so its relation parameter count is (2N_r + 5)d. As the relation number N_r is always greater than four in practice, the Rot2L model requires fewer relation parameters than RotH.
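The counts above can be checked with a few lines (using N_r = 11 relations, as in WN18RR, and d = 32 as an example setting):

```python
# Relation-parameter counts for the four models, as given in the text.

def relation_params(n_r, d):
    return {
        "RotE":  2 * n_r * d,
        "RotL":  2 * (n_r + 1) * d,
        "RotH":  (3 * n_r + 1) * d,
        "Rot2L": (2 * n_r + 5) * d,
    }

counts = relation_params(n_r=11, d=32)
print(counts)

# Rot2L needs fewer relation parameters than RotH whenever N_r > 4:
assert counts["Rot2L"] < counts["RotH"]
```

At exactly N_r = 4 the two counts coincide ((2·4 + 5)d = (3·4 + 1)d = 13d), which is why the condition is strictly "greater than four".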
In summary, the RotL and Rot2L models are highly efficient and better than the RotH model in both time complexity and space complexity.

Discussion
In this section, we further discuss several important questions on the RotL and Rot2L models.
Q1: Which parts of the predictions are improved in our models compared with RotH?
We measure the link prediction performance on relation-specific triples of WN18RR, shown in Table 3, to analyze the improvements of the Rot2L model. The results are generated in the 32-dimensional condition. Comparing across relations, RotE and RotH have their own strengths: RotH has better Hits@10 on most relations but is weaker than RotE on the "member of domain usage" and "member of domain region" relations. RotL performs much like RotH, but fails to predict the "similar to" relation. As the best model, Rot2L obtains the best Hits@10 on 8 out of 11 relations and shows only a small decrease on the other three. Notably, RotL and Rot2L effectively improve on the two "member of" relations compared to RotH; in particular, Rot2L achieves 42.86% and 64.27% improvements over RotH on these relations. By achieving the flexible normalization of RotH in Euclidean space, our models perform well on both RotH-dominant and RotE-dominant relations.
Q2: Can our models encode hierarchical patterns like hyperbolic-based models?
As the benefits of hyperbolic geometry for hierarchical relations have already been demonstrated for RotH, we further analyze whether our models preserve this property. Following the work of Chami et al. (2020), we utilize the Krackhardt hierarchy score (Khs_G) and the estimated curvature (ξ_G) as metrics. The related results can be found in Table 4, in which a relation with a higher Khs_G and lower ξ_G is more hierarchical.
On non-hierarchical relations, such as "verb group", the Euclidean-based models and hyperbolic-based RotH perform similarly. On hierarchical relations satisfying Khs_G = 1, we observe that hyperbolic embeddings work better on relations with low ξ_G, such as "hypernym", "has part", and "member meronym". Meanwhile, RotE and RotL outperform RotH on relations with relatively higher ξ_G, such as "instance hypernym", "member of domain usage", and "member of domain region". Compared with the other three models, Rot2L obtains the best Hits@10 on most relations and works effectively on hierarchical relations across different ξ_G.
The results indicate that the simplified models, RotL and Rot2L, retain a good ability to encode hierarchical relations, preserving the desirable properties of both hyperbolic geometry and the Euclidean-based RotE.
Q3: How about the model performance in other embedding dimensions?
We further compare the four models in different dimensions from 8 to 128. The experimental results are shown in Fig. 3(a). The prediction accuracy of all four models improves as the embedding dimension grows. When the dimension is lower than 32, RotE is obviously weaker than the others, but it performs well in the high-dimensional condition. Apart from RotE, the other three models obtain similar results under high dimensions, with some remaining differences; in particular, RotL performs better in the lower dimensions.

Q4: Can our models accelerate the training speed?

Fig. 3(b) shows the convergence of the training process for the four models under 32 dimensions. We can observe that RotE improves slowly in the first 40 epochs and converges later than the others. RotH converges faster than RotE, which was previously attributed to the hyperbolic space. From our experimental results, it is clear that both RotL and Rot2L show similarly fast convergence. RotL, which differs little from RotE in structure, shows a much faster training speed: although RotE takes less time per epoch, RotL achieves higher performance with fewer training epochs. Comparing RotH and Rot2L, we find that Rot2L leads in almost every epoch; by the 25th epoch, Rot2L already reaches the final performance of RotH. The results indicate that our models can replace the hyperbolic RotH model with comparable prediction accuracy and training speed.
Related Work

Knowledge Graph Embeddings
Various KGE models have been proposed with different scoring functions, such as the translation-based TransE (Bordes et al., 2013), the factorization-based ComplEx (Trouillon et al., 2016), and the CNN-based ConvE (Dettmers et al., 2018). With the rise of deep learning, several DL-based methods have been proposed, such as ConvKB (Nguyen et al., 2018) and CompGCN (Vashishth et al., 2020). Balazevic et al. (2019a) propose a linear model based on the Tucker decomposition of the binary tensor representation of knowledge graph triples. RotatE (Sun et al., 2019), inspired by Euler's identity, represents a relation as a rotation operation between the head and tail entities. DihEdral (Xu and Li, 2019) introduces rotation and reflection operations from the dihedral symmetry group to construct relation embeddings. These models utilize high-dimensional embedding vectors while designing new scoring functions to better distinguish the triples.

Hyperbolic Embeddings
Hyperbolic geometry has recently drawn wide attention because of its potential to learn parsimonious representations of symbolic data by simultaneously capturing hierarchy and similarity (Nickel and Kiela, 2017; Sala et al., 2018; Le et al., 2019).
Recently, researchers have started to apply hyperbolic embeddings in the KGE domain. Balazevic et al. (2019b) propose the MuRP model to embed KG triples in the Poincaré ball model of hyperbolic space using the Möbius matrix-vector multiplication and Möbius addition operations. Similarly, Kolyvakis et al. (2020) extend the translational models by learning embeddings of KG entities and relations in the hyperbolic Poincaré ball model. Sun et al. (2020) propose a hyperbolic relational graph neural network to capture knowledge associations for the KG entity alignment task. Chami et al. (2020) employ rotation and reflection operations to replace the multiplication between the head entity and relation vectors, and propose a series of hyperbolic KGE models with trainable curvature, including RotH, RefH, and AttH.
Compared with the existing hyperbolic KGE models, our models simplify the hyperbolic calculations to improve computational efficiency while achieving competitive performance.

Conclusion
Recently proposed hyperbolic models achieve high prediction accuracy in low-dimensional knowledge graph embeddings, but require complicated calculations for the hyperbolic embeddings. In this paper, we analyze the effective components of those models and propose lightweight variants based on Euclidean calculations. By simplifying the Möbius operations in RotH, our RotL model achieves competitive performance while saving half of the training time. Using a two-layer stacked transformation, we further propose Rot2L, which outperforms the state-of-the-art RotH model in both prediction accuracy and training speed.
These positive results encourage us to explore further research activities in the future. We will theoretically analyze the effectiveness of flexible normalization in the low-dimensional KGE tasks. For the stacked transformations in Rot2L, we will explore multiple-layer architectures and evaluate more different transformation forms. Finally, we plan to apply our models on real-world knowledge graphs in different domains such as mobile healthcare, smart cities, and mobile e-Commerce.