Hyperbolic Temporal Knowledge Graph Embeddings with Relational and Time Curvatures

Knowledge Graph (KG) completion has been extensively studied, with a massive number of models proposed for the Link Prediction (LP) task. The main limitation of such models is their insensitivity to time: the temporal aspect of stored facts is often ignored. To this end, more and more works consider time as a parameter to complete KGs. In this paper, we first demonstrate that, by simply increasing the number of negative samples, the recent AttH model can achieve competitive or even better performance than the state of the art on Temporal KGs (TKGs), despite its non-temporality. We further propose Hercules, a time-aware extension of AttH, which defines the curvature of a Riemannian manifold as the product of both relation and time. Our experiments show that both Hercules and AttH achieve competitive or new state-of-the-art performances on the ICEWS14 and ICEWS05-15 datasets. Therefore, one should carefully check, when learning TKG representations, whether time truly boosts performances.


Introduction
The prevalent manner to store factual information is the (s, p, o) triple data structure, where s, p and o stand for the subject, predicate and object respectively. An entity denotes either a subject or an object, while a relation denotes a predicate that links two entities. A collection of triples defines a Knowledge Graph (KG), noted G(E, R), with E the set of entities, i.e. subjects and objects, corresponding to the nodes of the graph, and R the set of predicates, corresponding to directed edges. By shedding light on the type of connections between entities, KGs are powerful to work with for numerous downstream tasks such as question answering (Bordes et al., 2014; Hao et al., 2017; Saxena et al., 2020), recommendation systems (Yu et al., 2014; Zhang et al., 2016; Zhou et al., 2017), information retrieval (Lao and Cohen, 2010; Rocktäschel et al., 2015; Xiong et al., 2017), or reasoning (Xian et al., 2019). However, KGs are sometimes incomplete and part of the knowledge is missing. A major concern was therefore raised to predict missing connections between entities, stimulating research on the Link Prediction (LP) task. The intuition is to map each entity and relation into a vector space to learn low-dimensional embeddings such that valid triples maximize a defined scoring function while fallacious triples minimize it. An approach is efficient if it can model multiple relational patterns. Some predicates are symmetric (e.g. marriedTo), asymmetric (e.g. fatherOf), the inversion of another relation (e.g. fatherOf and childOf) or a composition (e.g. grandfatherOf). Distinct strategies were introduced to explicitly model those patterns (Bordes et al., 2013; Trouillon et al., 2016; Sun et al., 2019). However, hierarchical relations have remained challenging to model in Euclidean space. As demonstrated in Sarkar (2011), tree structures are better embedded in hyperbolic spaces. Thus, hyperbolic geometry reveals itself to be a strong asset to capture hierarchical patterns.
Nevertheless, the aforementioned approaches represent embeddings as invariant to time. For example, while writing this article, the triple (Donald Trump, presidentOf, U.S.) is correct but will be erroneous at reading time due to the meantime United States presidential inauguration of Joe Biden. To address this issue, recent works consider quadruplets written as (s, p, o, t), adding a time parameter t. We then note a Temporal Knowledge Graph (TKG) as G(E, R, T), with T the set of timestamps. Stating a precise time can be advantageous for diverse applications (disambiguation, reasoning, natural language generation, etc.). Recent works towards TKG representations are essentially extensions of existing timeless KG embeddings that incorporate the time parameter in the computation of their scoring function.
Similar to our work, Han et al. (2020) developed DYERNIE, a hyperbolic model inspired from MURP (Balažević et al., 2019a). DYERNIE uses a product of manifolds and adds a (learned) Euclidean time-varying representation for each entity, such that each entity further possesses an entity-specific velocity vector along with a static (i.e. time-unaware) embedding.
In this paper, we first demonstrate that an optimized number of negative samples enables the ATTH model (Chami et al., 2020) to reach competitive or new state-of-the-art performance on temporal link prediction while being unaware of the temporal aspect. We further introduce HERCULES 1, an extension of ATTH. HERCULES differs from DYERNIE in that:
• Following Chami et al. (2020), we utilize Givens transformations and hyperbolic attention to model different relation patterns.
• A single manifold is used.
• Curvature of the manifold is defined as the product of both relation and time parameters.
To the best of our knowledge, this is the first attempt to leverage the curvature of a manifold to enforce time-aware representations. We also provide an ablation study of distinct curvature definitions to investigate the surprising yet compelling results of ATTH over time-aware models.

Related Work
In this section, we present existing methods on both KG and TKG vector representation.

Timeless Graph Embeddings
Previous works on KG completion essentially focused on undated facts with the (s, p, o) formalism. Bordes et al. (2013) initially proposed the TRANSE model, considering the relation p as a translation between entities in the embedding space. Several variants were then designed. TRANSH (Wang et al., 2014) adds an intermediate projection onto a relation-specific hyperplane, while TRANSR (Lin et al., 2015) maps entities to a relation-specific space of lower rank. However, translation-based approaches cannot model symmetric relations. DISTMULT (Yang et al., 2015) solves this issue by learning a bilinear objective that attributes the same score to the (s, p, o) and (o, p, s) triples. COMPLEX (Trouillon et al., 2016) subsequently came up with complex-valued embeddings. Tensor factorization techniques were also proposed: RESCAL (Nickel and Tresp, 2013) applies a three-way tensor factorization, while TUCKER (Balažević et al., 2019b) uses the Tucker decomposition and demonstrates that it generalizes previous linear models. More recently, ROTATE (Sun et al., 2019) considered relations as rotations in a complex vector space, which can represent symmetric relations as a rotation of π. QUATE (Zhang et al., 2019) further generalizes rotations using quaternions, known as hypercomplex numbers (C ⊂ H). A key advantage of quaternions is their non-commutativity, which allows more flexibility to model patterns. Nonetheless, memory-wise, QUATE requires 4 embeddings for each entity and relation. To this extent, hyperbolic geometry provides an outstanding framework to produce shallow embeddings with striking expressiveness (Sarkar, 2011; Nickel and Kiela, 2017). Both MURP (Balažević et al., 2019a) and ATTH (Chami et al., 2020) learn hyperbolic embeddings on an n-dimensional Poincaré ball. Unlike MURP, ATTH uses a trainable curvature for each relation. Indeed, Chami et al. (2020) have shown that fixing the curvature of the manifold can jeopardize the quality of the returned embeddings.
Therefore, defining a parametric curvature for a given relation helps to learn the best underlying geometry.

Time-Aware Graph Embeddings
The above-mentioned techniques nevertheless disregard the temporal aspect. Indeed, let us consider the two following quadruplets: (Barack Obama, visits, France, 2009-03-11) and (Barack Obama, visits, France, 2014-04-21). Non-temporal models would exhibit the same score for both facts. However, the second quadruplet is invalid 2 and should therefore get a lower score. For this reason, several works contributed to obtain time-aware embeddings. Thanks to the existing advancements on graph representations, many strategies are straightforward extensions of static approaches. TTRANSE (Leblay and Chekol, 2018) alters the scoring function of TRANSE to encompass time-related operations such as time translations. Likewise, TA-TRANSE (García-Durán et al., 2018) uses LSTMs (Hochreiter and Schmidhuber, 1997) to encode a temporal predicate which carries the time feature. By analogy with TRANSH, HYTE (Dasgupta et al., 2018) projects entities and relations onto timestamp-specific hyperplanes. DE-SIMPLE (Goel et al., 2020) provides diachronic entity embeddings inspired from diachronic word embeddings (Hamilton et al., 2016). Recently, ATISE (Xu et al., 2019) embeds entities and relations as multi-dimensional Gaussian distributions which are time-sensitive. An advantage of ATISE is its ability to represent time uncertainty through the covariance of the Gaussian distributions. TERO (Xu et al., 2020b) combines ideas from TRANSE and ROTATE: it defines relations as translations and timestamps as rotations. As far as we are aware, DYERNIE (Han et al., 2020) is the first work to contribute hyperbolic embeddings for TKGs. It achieves state-of-the-art performances on the benchmark datasets ICEWS14 and ICEWS05-15. Time is defined as a translation on a product of manifolds with trainable curvatures, using a velocity vector for each entity. In our work, we demonstrate that using a single manifold with learnable relational and time curvatures is sufficient to reach competitive or new state-of-the-art performances.

Problem Definition
Let us consider a valid quadruplet (s, p, o, t) ∈ S ⊂ E × R × E × T, with E, R and T the sets of entities, relations and timestamps respectively, and S the set of correct facts. A scoring function f : E × R × E × T → R is defined such that f(s, p, o, t) is maximized for any quadruplet in S and minimized for corrupted quadruplets (∉ S). Through the optimization of the foregoing constraint, representations of entities, relations and times are learned accordingly. The resulting embeddings should then capture the multi-relational graph structure. Thus, f gauges the probability that an entity s is connected to an entity o by the relation p at time t.
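As a toy illustration of this setup, the following sketch (entity and relation names are ours, purely for illustration) enumerates corrupted quadruplets obtained by replacing the object of a valid fact:

```python
# Toy TKG: S is the set of valid (s, p, o, t) quadruplets.
E = {"Barack Obama", "France", "U.S."}                     # entities
S = {("Barack Obama", "visits", "France", "2009-03-11")}   # valid facts

def corrupted(quad, entities, valid):
    """Corrupt a quadruplet by replacing its object with every other entity.

    A scoring function f should score the members of `valid` above
    every quadruplet returned here."""
    s, p, o, t = quad
    return [(s, p, e, t) for e in entities if (s, p, e, t) not in valid]

negatives = corrupted(("Barack Obama", "visits", "France", "2009-03-11"), E, S)
```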

Hyperbolic Geometry
Hyperbolic geometry is a non-Euclidean geometry. In contrast to Euclidean geometry, which relies on Euclid's axioms (Heath and Euclid, 1956), non-Euclidean geometry rejects the fifth axiom, known as the parallel postulate. It states that given a point x and a line l_1, there exists a unique line l_2 parallel to l_1 passing through x. This only holds in a space of (constant) zero curvature. The curvature defines how much the geometry differs from being flat: the higher the absolute curvature, the curvier the space. Euclidean space has zero curvature and is hence called flat. When a hyperbolic space is represented in a Euclidean space, straight lines become curves, termed geodesics (Fig. 1).

Figure 1: Illustration of the exponential and logarithmic maps between the Poincaré ball B^{n,c} and the tangent space T_x^c B^{n,c} at x ∈ B^{n,c}.

Hyperbolic geometry comes with a constant negative curvature. In our study, following Nickel and Kiela (2017), Han et al. (2020) and Chami et al. (2020), we make use of the Poincaré ball (B^{n,c}, g^B), an n-dimensional Riemannian manifold B^{n,c} = {x ∈ R^n : ‖x‖^2 < 1/c} of constant curvature −c (c > 0), equipped with the Riemannian metric g^B, where ‖·‖ denotes the L2 norm. The metric tensor g^B indicates how distances should be computed on the manifold and is defined as g^B = (λ_x^c)^2 I_n, with the conformal factor λ_x^c = 2 / (1 − c‖x‖^2). Unlike B^{n,c}, the tangent space T_x^c B^{n,c} locally follows a Euclidean geometry. As illustrated in Fig. 1, we can project v ∈ B^{n,c} onto T_x^c B^{n,c} at x via the logarithmic map. Inversely, we can map a point u ∈ T_x^c B^{n,c} onto B^{n,c} at x via the exponential map. A closed form of those mappings exists when x corresponds to the origin:

exp_0^c(u) = tanh(√c ‖u‖) u / (√c ‖u‖)    (1)
log_0^c(v) = arctanh(√c ‖v‖) v / (√c ‖v‖)    (2)
Contrary to T_x^c B^{n,c}, the Euclidean addition does not hold on the hyperbolic manifold. Instead, we use the Möbius addition (Vermeer, 2005; Ganea et al., 2018), which satisfies the boundary constraints of the manifold. It is however non-commutative and non-associative. Its closed form is:

x ⊕^c y = ((1 + 2c⟨x, y⟩ + c‖y‖^2) x + (1 − c‖x‖^2) y) / (1 + 2c⟨x, y⟩ + c^2 ‖x‖^2 ‖y‖^2)    (3)
The distance between two points x and y on B^{n,c} is the hyperbolic distance d_{B^{n,c}}(x, y), defined as:

d_{B^{n,c}}(x, y) = (2/√c) arctanh(√c ‖−x ⊕^c y‖)    (4)
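A minimal NumPy sketch of these operations — the exponential and logarithmic maps at the origin, the Möbius addition and the induced distance — assuming the closed forms above (function names are ours):

```python
import numpy as np

def expmap0(u, c):
    """Exponential map at the origin of the Poincaré ball of curvature -c (Eq. 1)."""
    sq, norm = np.sqrt(c), np.linalg.norm(u)
    return u if norm == 0 else np.tanh(sq * norm) * u / (sq * norm)

def logmap0(v, c):
    """Logarithmic map at the origin (Eq. 2), inverse of expmap0."""
    sq, norm = np.sqrt(c), np.linalg.norm(v)
    return v if norm == 0 else np.arctanh(sq * norm) * v / (sq * norm)

def mobius_add(x, y, c):
    """Möbius addition on the Poincaré ball (Eq. 3); non-commutative."""
    xy = np.dot(x, y)
    x2, y2 = np.dot(x, x), np.dot(y, y)
    num = (1 + 2 * c * xy + c * y2) * x + (1 - c * x2) * y
    den = 1 + 2 * c * xy + c**2 * x2 * y2
    return num / den

def hyp_distance(x, y, c):
    """Hyperbolic distance between two points of the ball (Eq. 4)."""
    sq = np.sqrt(c)
    return 2 / sq * np.arctanh(sq * np.linalg.norm(mobius_add(-x, y, c)))
```

For instance, `logmap0(expmap0(u, c), c)` recovers `u` for any tangent vector `u`, and `hyp_distance(x, x, c)` is zero.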

From ATTH to HERCULES
Given a quadruplet (s, p, o, t), we note e_s^H, r_p^H and e_o^H the hyperbolic embeddings of the subject, predicate and object respectively. 3 ATTH uses relation-specific embeddings, rotations, reflections and curvatures. The curvature depends on the relation p involved: each relation p is attributed an individual parametric curvature c_p, defined as:

c_p = σ(µ_p)    (5)

where µ_p is a trainable parameter ∈ R and σ is the softplus function, a smooth approximation of the ReLU activation with values in (0, +∞). With such an approach, the geometry of the manifold is learned, thus modified for a particular predicate. The curvature dictates how the manifold is shaped. Changing the curvature of the manifold implies changing the positions of projected points. This means that, for distinct relations, the same entity will have different positions because of the different resulting geometries. For example, let us consider the triples t_1 := (Barack Obama, visit, France) and t_2 := (Barack Obama, cooperate, France). The Euclidean representations of the entities Barack Obama and France from both facts are projected onto the Riemannian manifold. However, the structure (i.e. curvature) of the manifold changes as a function of the relation of each fact (i.e. 'visit' and 'cooperate'). Therefore, the resulting hyperbolic embedding of Barack Obama in t_1 will not be the same as the hyperbolic embedding of Barack Obama in t_2. The same holds for the entity France.
In order to learn rotations and reflections, ATTH uses 2 × 2 Givens transformation matrices. Those transformations preserve relative distances in hyperbolic space and can therefore be applied directly to hyperbolic embeddings (isometries). We note W^rot_Θp and W^ref_Φp the block-diagonal matrices whose i-th diagonal blocks are given by G^+(θ_{p,i}) and G^−(φ_{p,i}) respectively:

G^+(θ) = [[cos θ, −sin θ], [sin θ, cos θ]]    (6)
G^−(φ) = [[cos φ, sin φ], [sin φ, −cos φ]]    (7)
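These block-diagonal transformations can be sketched as follows (a NumPy illustration; function and parameter names are ours). Both matrices are orthogonal, so they preserve vector norms, which is what makes them usable as isometries:

```python
import numpy as np

def givens_rotation_matrix(thetas):
    """Block-diagonal matrix whose 2x2 diagonal blocks are G+(theta) (Eq. 6)."""
    n = 2 * len(thetas)
    W = np.zeros((n, n))
    for i, t in enumerate(thetas):
        W[2 * i:2 * i + 2, 2 * i:2 * i + 2] = [[np.cos(t), -np.sin(t)],
                                               [np.sin(t),  np.cos(t)]]
    return W

def givens_reflection_matrix(phis):
    """Block-diagonal matrix whose 2x2 diagonal blocks are G-(phi) (Eq. 7)."""
    n = 2 * len(phis)
    W = np.zeros((n, n))
    for i, p in enumerate(phis):
        W[2 * i:2 * i + 2, 2 * i:2 * i + 2] = [[np.cos(p),  np.sin(p)],
                                               [np.sin(p), -np.cos(p)]]
    return W
```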
The rotations and reflections are then applied only to the subject embedding, as described in Eq. 8, yielding a rotated and a reflected candidate representation of the subject. These candidates are combined by an attention mechanism computed in the tangent space, following Chami et al. (2019). The attention vector is then mapped back to the manifold using the exponential map (Eq. 1). ATTH finally applies a translation of the hyperbolic relation embedding r_p^H to the resulting attention vector (Eq. 9). As mentioned in Chami et al. (2020), translations help to move between different levels of the hierarchy.
The scoring function is similar to the one used in Balažević et al. (2019a) and Han et al. (2020), defined as:

f(s, p, o, t) = −d_{B^{n,c}}(Q(s, p), e_o^H)^2 + b_s + b_o    (11)

where Q(s, p) denotes the translated attention vector of Eq. 9, and b_s and b_o stand for the subject and object biases, respectively. We then propose HERCULES, a time-aware extension of ATTH. HERCULES redefines the curvature of the manifold as the product of both relation and time:

c_{p,t} = σ(µ_p τ_t)    (12)

with τ_t a learnable parameter ∈ R. A trade-off therefore exists between relation and time: a low value of either µ_p or τ_t leads to a flatter space, while higher values tend to a more hyperbolic space. The main intuition of HERCULES is that both relation and time directly adjust the geometry of the manifold such that the positions of projected entities are relation- and time-dependent. This is advantageous in that no additional temporal parameters per entity are needed. Since the whole geometry changes for a specific relation and time, all projections onto that manifold are aligned with the corresponding relation and timestamp. We investigate different curvature definitions and time translation in our experiments (see Section 6). The scoring function of HERCULES remains the same as for ATTH.
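The two curvature definitions can be sketched as follows, under the softplus reading of σ described with Eq. 5 (function names are ours):

```python
import numpy as np

def softplus(x):
    """Smooth approximation of ReLU with values in (0, +inf)."""
    return np.log1p(np.exp(x))

def atth_curvature(mu_p):
    """Relation-specific curvature c_p = softplus(mu_p) (Eq. 5)."""
    return softplus(mu_p)

def hercules_curvature(mu_p, tau_t):
    """HERCULES curvature built from the product of the relation
    and time parameters (Eq. 12)."""
    return softplus(mu_p * tau_t)
```

Note that the output is always strictly positive, so the manifold always has negative curvature −c; the relation and time parameters only modulate how strongly hyperbolic it is.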
When learning hyperbolic parameters, the optimization requires a Riemannian gradient (Bonnabel, 2013). As this has proven challenging, we instead learn all embeddings in Euclidean space and map them onto the manifold using the exponential map (Eq. 1). This allows the use of standard Euclidean optimization strategies.

Experiments
We outline in this section the experiments and evaluation settings.

Datasets
For fair comparison, we test our model on the same benchmark datasets used in previous works, i.e. ICEWS14 and ICEWS05-15. Both datasets were constructed by García-Durán et al. (2018) from the Integrated Crisis Early Warning System (ICEWS) dataset (Boschee et al., 2018). ICEWS provides geopolitical events with their corresponding date, e.g. (Barack Obama, visits, France, 2009-03-11). More specifically, ICEWS14 includes events that happened in 2014, whereas ICEWS05-15 encompasses facts that appeared between 2005 and 2015. We give the original dataset statistics in Table 2. To increase the number of samples, for each quadruplet (s, p, o, t) we add (o, p^−1, s, t), where p^−1 is the inverse relation of p. This is a standard data augmentation technique commonly used in LP (Balažević et al., 2019a; Goel et al., 2020; Han et al., 2020).
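The inverse-relation augmentation can be sketched as follows (a common encoding, assuming relations are integer ids; names are ours):

```python
def add_inverse_quadruplets(quads, n_relations):
    """For every (s, p, o, t), append the inverse fact (o, p_inv, s, t).

    Relations are integer ids in [0, n_relations); the inverse of
    relation p is encoded as p + n_relations."""
    augmented = list(quads)
    for s, p, o, t in quads:
        augmented.append((o, p + n_relations, s, t))
    return augmented
```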

Evaluation Protocol & Metrics
Given a (golden) test quadruplet (s, p, o, t), for each entity s' ∈ E, we interchange the subject s with s' and apply the scoring function f to the resulting query (s', p, o, t). Since replacing s by all possible entities s' may yield other correct facts, we filter out those valid quadruplets by giving them extremely low scores, to avoid correct quadruplets being ranked higher than the tested quadruplet in the final ranking (Bordes et al., 2013). We then rank the entities by score in descending order and store the rank of the correct entity s, noted z_s. Thus, the model should maximize the returned score for the entity s such that z_s = 1. The same process is applied to the object o.
To evaluate our models, we make use of the Mean Reciprocal Rank (MRR). We also provide the Hits@1 (H@1), Hits@3 (H@3) and Hits@10 (H@10), which measure how often the valid entity is ranked in the top-1, top-3 and top-10 positions, respectively.
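The filtered ranking and these metrics can be sketched as follows (a simplified NumPy illustration; names are ours):

```python
import numpy as np

def filtered_rank(scores, gold, known_true):
    """Rank of the gold entity after masking other known correct answers.

    `scores[i]` is f(s_i, p, o, t); entities in `known_true` (except the
    gold one) receive -inf so they cannot outrank the tested quadruplet."""
    scores = np.asarray(scores, dtype=float).copy()
    for i in known_true:
        if i != gold:
            scores[i] = -np.inf
    # rank = 1 + number of entities scored strictly higher than the gold one
    return 1 + int(np.sum(scores > scores[gold]))

def mrr_and_hits(ranks, ks=(1, 3, 10)):
    """Mean Reciprocal Rank and Hits@k over a list of gold ranks."""
    ranks = np.asarray(ranks, dtype=float)
    out = {"MRR": float(np.mean(1.0 / ranks))}
    for k in ks:
        out[f"H@{k}"] = float(np.mean(ranks <= k))
    return out
```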

Implementation Details
To ensure unbiased comparability, the same training procedure and hyper-parameters are shared across models.

Results
We report model performances on the link prediction task on TKGs. Additional analyses of the results are also given.

Link Prediction results
We provide link prediction results on ICEWS14 and ICEWS05-15 for ATTH, HERCULES and different models from the literature. As in Han et al. (2020), we adopt a dimension analysis to investigate the behavior and robustness of the approaches. When possible, we re-ran the official implementations of the models. 4 5 Otherwise, official or best results from the literature are reported. Results are shown in Table 3. As expected, hyperbolic strategies (i.e. DYERNIE, ATTH and HERCULES) perform much better at lower dimensions, outperforming most other approaches with ten times fewer dimensions. We report an average absolute gain of 11.6 points in MRR with only 10 dimensions over the median performance of other approaches with 100 dimensions. This strengthens the effectiveness of hyperbolic geometry to induce high-quality embeddings with few parameters.

4 We used the official implementation of ATTH available at https://github.com/HazyResearch/KGEmb and adapted it to implement HERCULES.
5 We used the official implementation available at https://github.com/soledad921/ATISE
Astonishingly, we notice that the ATTH model is highly competitive despite the absence of a time parameter. ATTH exhibits new state-of-the-art or statistically equivalent performances compared to DYERNIE and HERCULES; we observe no statistically significant differences between the hyperbolic models. 6 Importantly, unlike other research carried out in this area, time information here does not lead to any notable gain. This seems to indicate that other parameters should be considered. We examine this phenomenon in Section 6.4.2.
On ICEWS14, for dim ∈ {20, 40, 100}, both ATTH and HERCULES outperform DYERNIE by a large margin. We witness an improvement of 2.5 and 5 points in MRR and Hits@1 with 100-dimensional embeddings. On ICEWS05-15, ATTH and HERCULES yield results comparable to the state of the art. In contrast to DYERNIE, it is noteworthy that ATTH and HERCULES utilize a single manifold while reaching top performances.
We also note more tempered results on the Hits@10 metric for the ATTH and HERCULES models. This suggests that during optimization, ATTH and HERCULES favor ranking some entities at the top while harming the representation of others.

Is Time All You Need?
In this section, we investigate the influence of the temporal parameter on performances.
First, besides time translation, we probe different curvature definitions to identify fluctuation in performances. We analyze how time information alters the LP results by adding time as part of the curvature (i.e. HERCULES) and as a translation. We also explore if incorporating the Euclidean dot product of the subject and object embeddings (noted e E s , e E o ) into the curvature helps to learn a better geometry. An ablation study is given in Table 4.
Albeit counter-intuitive, we observe that our results corroborate our initial finding: time information is not responsible for our high performances. More strikingly, a simple relational curvature (i.e. ATTH) is sufficient to perform best on ICEWS14 (dim = 40). Neither the inclusion of a time translation, similarly to TTRANSE, nor the Euclidean dot product provides interesting outcomes.
We then probe the sensitivity of HERCULES to the temporal feature by performing LP with incorrect timestamps. Our intuition is to inspect whether feeding invalid timestamps during evaluation exhibits significant variation compared to the reference performance, i.e. LP results with the initial (non-corrupted) testing samples. To do so, for each testing quadruplet, we replace the (correct) time parameter with each possible timestamp from T. We therefore collect multiple LP performances of HERCULES, one for each distinct timestamp. We plot the distribution of the resulting performances on temporally-corrupted quadruplets from ICEWS14 (dim = 100) in Fig. 2. We can observe that, despite erroneous timestamps, LP results show insignificant discrepancies with the initial HERCULES performance (dashed red line). The standard deviations from the HERCULES reference performance for the MRR, H@1, H@3 and H@10 metrics are 1.78 × 10^−4, 3.18 × 10^−3, 1.05 × 10^−2 and 3.55 × 10^−3 respectively. This indicates that HERCULES gives little importance to the time parameter and thus only relies on the entity and the predicate to perform knowledge graph completion. This further highlights our finding that the timestamp is not responsible for our attractive performances.

Figure 2: Distribution of performances of HERCULES on temporally-corrupted quadruplets from ICEWS14 with dim = 100. A smooth approximation of the distribution is drawn as a dashed black curve. The reference performance is indicated with a dashed red line.
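This corruption protocol can be sketched as follows (a simplified illustration; names are ours). Each yielded copy of the test set is evaluated to obtain one LP performance per timestamp:

```python
def corrupt_timestamps(test_quads, timestamps):
    """For each candidate timestamp tau, yield a copy of the test set in
    which every quadruplet's (correct) timestamp is replaced by tau."""
    for tau in timestamps:
        yield tau, [(s, p, o, tau) for s, p, o, _ in test_quads]
```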
We therefore assume that the optimization procedure may be involved and consequently question the effect of negative sampling. Precisely, we train HERCULES with dim = 40, tuning the number of negative samples between 50 and 500. We plot the learning curves in Fig. 3. Performances reach a plateau around 300 negative samples. We conjecture that this diversity in negative samples is enough to learn good representations.
Notwithstanding that a large number of negative samples heavily constrains the locations of entities in space, the resulting embeddings might benefit from it by being better positioned relative to the others.
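The negative sampling step being tuned here can be sketched as follows (uniform object corruption with rejection of valid facts; a common scheme, and names are ours):

```python
import random

def sample_negatives(quad, entities, valid, k, seed=None):
    """Draw k corrupted quadruplets by replacing the object uniformly at
    random, rejecting corruptions that are themselves valid facts."""
    rng = random.Random(seed)
    s, p, o, t = quad
    negatives = []
    while len(negatives) < k:
        cand = (s, p, rng.choice(entities), t)
        if cand not in valid:
            negatives.append(cand)
    return negatives
```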
We conclude that, despite the presence of the time parameter, an optimal negative sampling enables reaching new state-of-the-art outcomes. Therefore, we argue that time is not the only parameter that should be considered when performing LP. We highlight that one should carefully verify, when training TKG representations, whether time is truly helping to boost performances.

ATTH versus HERCULES
We explore here how the geometry of HERCULES differs from that of ATTH. To do so, we inspect the absolute difference of their learned curvatures, ∆c. We plot ∆c with respect to the relations and timestamps for dim = 40 on ICEWS14 in Fig. 4. 7 Similar plots are given for ICEWS05-15 in Appendix A.1. Besides rare steep discrepancies in curvatures (i.e. ∆c > 1.0), HERCULES is akin to ATTH concerning the learned geometries. We report that 85.0% and 95.6% of the ∆c values are smaller than 0.1 on ICEWS14 and ICEWS05-15 respectively. We point out that some timestamps affect all relations globally, albeit to a very limited extent; this can be seen in Fig. 4 by the aligned vertical strips. It indicates that HERCULES uses its additional time parameter to learn a slightly different manifold that is nonetheless quite similar to ATTH's. 8 We provide further analysis on the shifts of the curvatures while increasing the embedding dimension in Appendices A.2 and A.3.

We depict an example of the learned hyperbolic representations of ICEWS14 entities for ATTH and HERCULES, with the relation 'make a visit' and the timestamp set to 01-01-2014, in Fig. 5. We plot embeddings of ICEWS05-15 in Appendix A.4.

Conclusion
In this paper, we have demonstrated that, without adding either time information or supplementary parameters, the ATTH model astonishingly achieves similar or new state-of-the-art performances on link prediction over temporal knowledge graphs. In spite of the inclusion of time in our proposed time-aware model HERCULES, we have shown that negative sampling is sufficient to learn a good underlying geometry. In the future, we plan to explore new mechanisms to incorporate temporal information to improve the performances of ATTH.

7 For readability, we do not plot ∆c for reverse relations p^−1 (see Section 6.1).
8 Similar geometry implies similar embeddings.

A.2 ATTH versus ATTH
We inspect how the curvature of ATTH fluctuates as the embedding dimension increases. We compare curvature shifts between dim = 40 and dim = 100 and plot ∆c in Fig. 7. The fluctuation in curvature is moderate as the dimension of the embeddings increases: the variation is almost always below 0.2. This underlines that the geometry does not change much for a specific relation when the dimension grows. For some relations, however, we note that the curvature exhibits significant differences.

A.3 HERCULES versus HERCULES
We report the dissimilarities in curvature for different relations and timestamps on ICEWS14 and ICEWS05-15 in Fig. 8 and Fig. 9. We note that around 87.7% and 91.1% of the variations are smaller than 0.1 on ICEWS14 and ICEWS05-15 respectively. This seems to indicate that, despite the dimensionality gap, the learned geometry for each relation does not differ much between dim = 40 and dim = 100.

A.4 Two-Dimensional Hyperbolic Embeddings
We give an illustration of learned embeddings on ICEWS05-15 by ATTH and HERCULES models in