Time-dependent Entity Embedding is not All You Need: A Re-evaluation of Temporal Knowledge Graph Completion Models under a Unified Framework

Various temporal knowledge graph (KG) completion models have been proposed in the recent literature. The models usually contain two parts, a temporal embedding layer and a score function derived from existing static KG modeling approaches. Since the approaches differ along several dimensions, including different score functions and training strategies, the individual contributions of different temporal embedding techniques to model performance are not always clear. In this work, we systematically study six temporal embedding approaches and empirically quantify their performance across a wide range of configurations with about 3000 experiments and 13159 GPU hours. We classify the temporal embeddings into two classes: (1) timestamp embeddings and (2) time-dependent entity embeddings. Despite the common belief that the latter is more expressive, an extensive experimental study shows that timestamp embeddings can achieve on-par or even better performance with significantly fewer parameters. Moreover, we find that when trained appropriately, the relative performance differences between various temporal embeddings often shrink and sometimes even reverse when compared to prior results. For example, TTransE (Leblay and Chekol, 2018), one of the first temporal KG models, can outperform more recent architectures on ICEWS datasets. To foster further research, we provide the first unified open-source framework for temporal KG completion models with full composability, where temporal embeddings, score functions, loss functions, regularizers, and the explicit modeling of reciprocal relations can be combined arbitrarily.


Introduction
* Equal contribution. † Corresponding author.

The Knowledge Graph (KG), a graph-structured knowledge base, has gained increasing interest as
a promising way to store factual knowledge. KGs represent facts in the form of triples (s, r, o), e.g., (Bob, livesIn, New York), in which s (subject) and o (object) denote nodes (entities) and r denotes the edge type (relation) between s and o. Knowledge graphs are commonly static and store facts in their current state. In reality, however, the relations between entities often change over time. For example, if Bob moves to California, the triple (Bob, livesIn, New York) becomes invalid. To this end, temporal knowledge graphs (tKGs) have been introduced to capture the temporal aspects of facts in addition to their multi-relational nature. A tKG represents a temporal fact as a quadruple (s, r, o, t) by extending a static triple with a time t, describing that the fact is valid at time t. Figure 2 in the appendix depicts an exemplary temporal KG. To address the inherent incompleteness of temporal KGs, Tresp et al. (2015) proposed the first tKG model. Afterwards, a line of work emerged that extends static KG completion models by adding temporal embeddings, e.g., TTransE (Leblay and Chekol, 2018), TA-TransE (García-Durán et al., 2018), DE-SimplE (Goel et al., 2019), TNTComplEx (Lacroix et al., 2020), ConT (Ma et al., 2018), and many more. The models generally consist of two parts, a temporal embedding layer to capture the evolving features of tKGs and a score function to examine the plausibility of a given quadruple.
Temporal embeddings are crucial in temporal KG completion models for storing evolving knowledge; without them, the temporal aspect cannot be captured. They can be generally categorized into three classes: (1) timestamp embeddings (TEs): the models learn an embedding for each discrete timestamp in the same vector space as entities and relations (Tresp et al., 2017; Leblay and Chekol, 2018; Dasgupta et al., 2018; Lacroix et al., 2020). (2) time-dependent entity embeddings (TEEs): the models define an entity embedding as a function that takes an entity and a timestamp as input and generates a time-dependent representation of the entity at that time (Goel et al., 2019; Xu et al., 2019; Han et al., 2020a). (3) deep temporal representations (DTRs): the models incorporate temporal information into advanced deep learning models, e.g., recurrent neural networks and graph neural networks, to learn time-aware representations of entities and relations (García-Durán et al., 2018). In many cases, the introduction of new temporal embedding approaches went along with new score functions and new training methods (regularization, the explicit modeling of reciprocal relations, etc.). Ablation studies were provided but were often not thorough. Besides, some temporal embedding papers introduced new datasets. They commonly tune the model architecture and hyperparameters of older temporal embedding approaches on the new datasets using grid search on a small grid with hand-crafted parameter ranges or settings known to work well from prior studies. A grid suitable for one dataset may be suboptimal for another, however. It is thus often difficult to attribute the incremental improvements in performance reported with each new state-of-the-art (SOTA) model to the proposed temporal embeddings rather than to other components.
In this work, we investigate the significance of previously reported temporal embeddings with several thousand experiments and 19000 GPU hours. First, we aim to study which temporal embedding approach generally outperforms the others regardless of score function and dataset. We choose one representative of bilinear score functions, i.e., SimplE (Kazemi and Poole, 2018), and one of translation-based score functions, i.e., TransE (Bordes et al., 2013). Then we benchmark six temporal embedding approaches on two subsets of ICEWS (Boschee et al., 2015) and a subset of GDELT (Leetaru and Schrodt, 2013) with the two representative score functions through an extensive set of experiments. Second, we perform an extensive benchmark study of well-known temporal KG completion models using popular model architectures and training strategies in a unified experimental setup. Following Ruffinelli et al. (2020), we consider many training strategies as well as a large hyperparameter space, and we perform model selection using a quasi-random search followed by Bayesian optimization, which has been shown to find good model configurations with relatively low effort.
Regarding the first aim, we surprisingly find that the TE proposed by Leblay and Chekol (2018) outperforms the other temporal embedding approaches on the ICEWS subsets and achieves on-par results on GDELT. Leblay and Chekol (2018) represent timestamps in the same vector space as entities and relations and learn an embedding for each discrete timestamp. While achieving better results, the TE models only require about half as many parameters as the TEE models. However, the common belief is that TEEs are more expressive and can better capture evolving knowledge. Recall that models with TEEs learn an embedding function for each entity that takes time as input and provides an entity representation as output. In particular, it has been proven that TEEs are fully expressive for tKG completion in combination with certain score functions (Goel et al., 2019), and thus, they should perform better than TEs, which contrasts with our findings. We argue that the sparsity of temporal KG data may cause the poor empirical performance of TEEs. Every entity has time-dependent embeddings of the same dimensionality, but the majority of entities are only involved in a small number of quadruples. As a result, TEEs may suffer from overfitting. To verify this assumption, we learn a unique temporal embedding function shared by all entities instead of learning entity-specific embedding functions. We refer to it as UTEE. Our empirical study shows that UTEE achieves similar or even better results than all other TEE variants, underscoring the overfitting problem of TEEs.
Besides, we empirically find that the performance of a fine-tuned baseline can by far exceed the performance observed in all previous studies. For example, T-TransE (Leblay and Chekol, 2018), one of the first temporal KG completion models, achieves performance metrics in our study that are more than double those reported in recent papers (García-Durán et al., 2018; Goel et al., 2019; Lacroix et al., 2020; Xu et al., 2019). Thus, it is competitive with or even outperforms current SOTA models such as DE-SimplE (Goel et al., 2019) and TComplEx (Lacroix et al., 2020). This suggests that training strategies significantly affect the performance of temporal KG models and are responsible for a substantial fraction of the progress made in recent years. Thus, to fairly compare the effectiveness of different temporal KG models, it is necessary to evaluate them within a unified framework. To this end, our study realizes the first fair benchmarking by investigating the interplay between temporal KG interaction models, loss functions, regularization methods, the use of reciprocal relations, and other training techniques in a unified open-source framework [1]. To ensure the composability of the framework, the temporal embedding layer, score functions, and the various training strategies are implemented as independent submodules. Thus, one can easily assess the individual benefit of a novel temporal embedding approach via our framework. Additionally, we perform an extensive experimental study in which well-known temporal KG models are fine-tuned with popular training strategies and a wide range of hyperparameter settings. The reported results can be directly used for comparison in future work.

Temporal Knowledge Graph Completion
Temporal knowledge graphs (tKGs) are multi-relational, directed graphs with labeled, timestamped edges between entities. Let E, R, and T represent a finite set of entities, relations, and timestamps, respectively. Each fact can be denoted by a quadruple q = (e_s, r, e_o, t), representing a timestamped and labeled edge between a subject entity e_s ∈ E and an object entity e_o ∈ E regarding a relation r ∈ R at a timestamp t ∈ T. Let F represent the set of all quadruples that are facts, i.e., real events in the world; tKG completion (tKGC) is the problem of inferring F from a set of observed facts O ⊂ F. Specifically, the task of tKGC is to predict either a missing subject entity (?, r, e_o, t) given the other three components or a missing object entity (e_s, r, ?, t).
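For concreteness, the quadruple representation above can be sketched in a few lines of Python (an illustrative sketch; the helper names and toy facts are ours, not from the paper):

```python
# Hypothetical sketch: encoding a toy temporal KG as integer quadruples,
# as done before learning embeddings. All facts below are illustrative.

def build_vocab(quads):
    """Map entities, relations, and timestamps to integer ids."""
    ents, rels, times = set(), set(), set()
    for s, r, o, t in quads:
        ents.update((s, o))
        rels.add(r)
        times.add(t)
    e2i = {e: i for i, e in enumerate(sorted(ents))}
    r2i = {r: i for i, r in enumerate(sorted(rels))}
    t2i = {t: i for i, t in enumerate(sorted(times))}
    return e2i, r2i, t2i

facts = [("Bob", "livesIn", "NewYork", "2010"),
         ("Bob", "livesIn", "California", "2014")]
e2i, r2i, t2i = build_vocab(facts)
indexed = [(e2i[s], r2i[r], e2i[o], t2i[t]) for s, r, o, t in facts]
```

A completion model then scores candidate quadruples over these id spaces, e.g., all (e2i["Bob"], r2i["livesIn"], ?, t) for object prediction.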
Our study focuses solely on temporal knowledge graph embedding models for the completion task and does not consider models for the forecasting task (Trivedi et al., 2017; Han et al., 2020b, 2021).

Temporal KG Embedding Models
A tKG embedding (tKGE) model embeds each entity e ∈ E and relation r ∈ R in a vector space. To capture temporal aspects, each model either embeds discrete timestamps into a vector space or learns time-dependent representations of each entity. Besides, each model has a score function that takes the temporal information and the embeddings of the subject, relation, and object as input and computes a score for each potential quadruple. The higher a quadruple's score, the more plausible the model considers it to be. Taking object prediction as an example, we consider all entities in E and learn a score function φ(e_s, r, e_o, t) = f(e_s(t), r, e_o(t)) for models with TEEs and φ(e_s, r, e_o, t) = f(e_s, r, e_o, t) for models with TEs. The bold symbols denote the embeddings of the corresponding entities, relation, and timestamp.

[1] https://github.com/TemporalKGTeam/A_Unified_Framework_of_Temporal_Knowledge_Graph_Models
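The difference between the two scoring interfaces can be sketched as follows (a minimal NumPy illustration with our own naming; the additive composition r + t for TEs is one common choice, e.g., in TTransE, not the only one):

```python
import numpy as np

# Illustrative random embedding tables (not trained parameters).
rng = np.random.default_rng(0)
d, n_ent, n_rel, n_ts = 4, 5, 2, 3
E = rng.normal(size=(n_ent, d))   # static entity embeddings
R = rng.normal(size=(n_rel, d))   # relation embeddings
T = rng.normal(size=(n_ts, d))    # timestamp embeddings (TE)

def f_transe(h, rel, tail):
    # Translation-based score: higher (less negative) = more plausible.
    return -np.linalg.norm(h + rel - tail)

def phi_te(s, r, o, t):
    # TE: the timestamp embedding enters the score function itself.
    return f_transe(E[s], R[r] + T[t], E[o])

def phi_tee(s, r, o, t, ent_at):
    # TEE: entities are first evaluated at time t, then scored statically.
    return f_transe(ent_at(s, t), R[r], ent_at(o, t))
```

With `ent_at` set to a time-independent lookup, the TEE interface degenerates to a static model, which makes the two families directly comparable in one framework.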

Temporal Embeddings
tKGE models differ in their temporal embeddings and score functions. Temporal embedding approaches come in three categories: timestamp embeddings (TEs), where the models learn a representation for each discrete timestamp; time-dependent entity embeddings (TEEs), where an entity embedding function takes time and an entity as inputs and provides a hidden representation as output; and deep temporal representations (DTRs), where the models incorporate time information into deep learning frameworks.
The best-known TE is the vanilla TE (abbreviated to T by its authors) proposed by Leblay and Chekol (2018), where each timestamp is mapped into the same vector space as entities and relations. Later, Lacroix et al. (2020) introduced a new regularization scheme to smooth the representations of neighboring timestamps. Another well-known TE is HyTE (Dasgupta et al., 2018), which associates each timestamp with a corresponding hyperplane and projects the embeddings of entities and relations onto timestamp-specific hyperplanes to incorporate temporal information in entity embeddings:

e_i⊥(t) = e_i − (ω_t · e_i) ω_t,

where e_i represents the global embedding of entity e_i, ⊥ represents the projection operator, and ω_t represents the unit normal vector of the hyperplane associated with timestamp t.

A well-known variant of TEEs is the diachronic entity embedding (DE) proposed by Goel et al. (2019):

e_DE_i(t)[n] = e_i[n] if 1 ≤ n ≤ γd, and a_ei[n] sin(ω_ei[n] t + b_ei[n]) if γd < n ≤ d,  (1)

where e_DE_i(t)[n] denotes the n-th element of the embedding of entity e_i at time t, and a_ei, ω_ei, b_ei are entity-specific vectors with learnable parameters. The first γd elements of the vector in Equation 1 capture static features, and the other (1 − γ)d elements capture temporal features.

ATiSE (Xu et al., 2019) is another popular TEE that adds time information into entity/relation representations by using additive time series decomposition, where the entity representation is defined as

e_i(t) = e_i + α_ei ω_ei t + β_ei sin(2π ω_ei t) + N(0, Σ_ei).  (2)

The term e_i + α_ei ω_ei t is the trend component, where the coefficient α_ei denotes the evolutionary rate and the vector ω_ei represents the corresponding evolutionary direction. β_ei sin(2π ω_ei t) is the corresponding seasonal component, and the Gaussian noise term N(0, Σ_ei) denotes the random component. In principle, other temporal embedding approaches can also be converted into probabilistic approaches by adding Gaussian noise. Thus, to compare fairly with other temporal embeddings and to simplify our study, we do not take the noise term into account.
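The two TEE-related operations above can be sketched compactly (illustrative shapes and random parameters, not the authors' implementations; we use the sine activation, as in the descriptions above):

```python
import numpy as np

# Diachronic embedding (DE) and HyTE projection, as toy functions.
d, gamma = 6, 0.5
k = int(gamma * d)                 # number of static dimensions
rng = np.random.default_rng(1)
e_static = rng.normal(size=d)      # global embedding of one entity
a, w, b = rng.normal(size=d), rng.normal(size=d), rng.normal(size=d)

def de_embedding(t):
    """DE: static block plus an entity-specific sinusoidal temporal block."""
    z = e_static.copy()
    z[k:] = a[k:] * np.sin(w[k:] * t + b[k:])
    return z

def hyte_project(e, w_t):
    """HyTE: project e onto the hyperplane with (normalized) normal w_t."""
    w_t = w_t / np.linalg.norm(w_t)
    return e - (w_t @ e) * w_t
```

The projection removes exactly the component of `e` along the timestamp normal, so projected embeddings of the same timestamp live in a common hyperplane.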
The representation of relations in ATiSE is also time-dependent and is defined similarly to the entity representation. A representative of DTRs is the TA approach (García-Durán et al., 2018) that utilizes recurrent neural networks to learn time-aware representations of relations. Specifically, the relation representation is obtained by r_TA(t) = LSTM(r, t), where the timestamp (date) t is tokenized into digits (year, month, and day). The sequence of temporal tokens and the relation r is used as input to the LSTM. In addition to the five temporal embedding approaches mentioned above, we propose a new TEE in which we learn a unique temporal embedding function shared by all entities to investigate the overfitting problem of DE. We refer to it as UTEE, which is defined as follows:

e_UTEE_i(t)[n] = e_i[n] if 1 ≤ n ≤ γd, and a[n] sin(ω[n] t + b[n]) if γd < n ≤ d,

where the amplitude vector a, frequency vector ω, and bias b are identical for all entities.
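The parameter-sharing idea behind UTEE can be made concrete in a short sketch (our illustration; the formula follows the DE form above, with the temporal parameters shared rather than entity-specific):

```python
import numpy as np

# UTEE vs. DE parameter sharing: DE learns (a, w, b) per entity,
# UTEE shares a single (a, w, b) across all entities.
rng = np.random.default_rng(2)
n_ent, d = 4, 6
E = rng.normal(size=(n_ent, d))          # static part, per entity
a, w, b = rng.normal(size=d), rng.normal(size=d), rng.normal(size=d)

def utee(i, t, gamma=0.5):
    k = int(gamma * d)
    z = E[i].copy()
    z[k:] = a[k:] * np.sin(w[k:] * t + b[k:])   # shared temporal function
    return z

# Extra temporal parameters: DE needs 3*d per entity, UTEE 3*d in total,
# independent of the number of entities.
de_params, utee_params = 3 * d * n_ent, 3 * d
```

Because the temporal block is identical for every entity, UTEE cannot overfit entity-specific temporal noise, which is exactly the property our experiments probe.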

Score Functions
A large number of score functions have been developed for the KG completion task. One class of these models comprises the translation-based approaches, variations of TransE (Bordes et al., 2013; Wang et al., 2014; Nguyen et al., 2016), which model a relation as a translation from subject to object embeddings, i.e., s_TransE(e_s, r, e_o) = −||e_s + r − e_o||_2. Another line of work comprises bilinear score functions (Nickel et al., 2011; Yang et al., 2014; Trouillon et al., 2016; Kazemi and Poole, 2018) that define product-based functions over embeddings, i.e., s_RESCAL(e_s, r, e_o) = e_s^T R e_o, where the relation matrix R ∈ R^{d×d} contains weights r_{i,j} that capture the interaction between the i-th latent factor of e_s and the j-th latent factor of e_o. Among the bilinear models, SimplE (Kazemi and Poole, 2018) is a simple yet fully expressive model that represents each entity e_i ∈ E by two vectors e_i,s and e_i,o. Depending on whether e_i participates in a triple as the subject or object entity, either e_i,s or e_i,o is used. To address the independence of the two vectors for each entity, SimplE takes advantage of reciprocal relations and uses ½(⟨e_i,s, r, e_j,o⟩ + ⟨e_j,s, r⁻¹, e_i,o⟩) as the score of (e_i, r, e_j), where ⟨a, b, c⟩ denotes the element-wise triple product Σ_n a[n] b[n] c[n] and r⁻¹ is the reciprocal relation of r.
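The two representative score functions can be sketched as follows (random placeholder vectors; a minimal illustration of the static forms, not a full model):

```python
import numpy as np

rng = np.random.default_rng(3)
d = 4
es, r, eo = rng.normal(size=d), rng.normal(size=d), rng.normal(size=d)

def transe(es, r, eo):
    # Translation-based: score is highest when es + r lands on eo.
    return -np.linalg.norm(es + r - eo, ord=2)

def triple_prod(a, b, c):
    # <a, b, c> = sum_n a[n] * b[n] * c[n]
    return float(np.sum(a * b * c))

# SimplE averages the forward score and the reciprocal-direction score.
ei_s, ei_o = rng.normal(size=d), rng.normal(size=d)
ej_s, ej_o = rng.normal(size=d), rng.normal(size=d)
r_inv = rng.normal(size=d)

def simple_score():
    return 0.5 * (triple_prod(ei_s, r, ej_o) + triple_prod(ej_s, r_inv, ei_o))
```

Note that `transe` attains its maximum score of 0 exactly when the translation es + r equals eo.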
In the rest of the paper, we examine the above six temporal embeddings in terms of the two representative score functions (TransE and SimplE) on two benchmark tKG datasets. We refer to a specific combination of temporal embedding approach and score function as an interaction model.

Reciprocal Relations
Lacroix et al. (2018) and Dettmers et al. (2018) introduced the use of reciprocal relations for training knowledge graph embeddings. For every quadruple (e_s, r, e_o, t) in the dataset, we add (e_o, r⁻¹, e_s, t), where r⁻¹ denotes the reciprocal relation of r. The idea of reciprocal relations is to use separate scoring functions for object prediction and subject prediction. Reciprocal relations can help translation-based approaches model symmetric patterns and help bilinear approaches model anti-symmetric and inverse patterns (Kazemi and Poole, 2018).
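The augmentation step described above can be sketched as follows (our own id convention: relation id r maps to reciprocal id r + |R|):

```python
# Add a reciprocal quadruple (e_o, r^-1, e_s, t) for every quadruple
# (e_s, r, e_o, t). Relations are integer ids; the reciprocal of
# relation r is encoded as r + n_rel.

def add_reciprocals(quads, n_rel):
    out = list(quads)
    for s, r, o, t in quads:
        out.append((o, r + n_rel, s, t))
    return out

quads = [(0, 0, 1, 5), (2, 1, 0, 7)]
aug = add_reciprocals(quads, n_rel=2)
```

At evaluation time, a subject prediction query (?, r, e_o, t) can then be answered as the object prediction query (e_o, r⁻¹, ?, t).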

Related Work
Previous benchmarking studies (Kadlec et al., 2017; Akrami et al., 2020; Rossi et al., 2021) focus only on static knowledge graph models. For example, Ruffinelli et al. (2020) and Ali et al. (2020) realize fair benchmarking by re-implementing static KGE models and performing an extensive empirical study over a massive search space. However, they do not take temporal knowledge graph models into account. To close this gap, we provide a unified framework that covers the relevant tKGE models and investigate the influence of temporal embeddings as well as other components on model performance. To the best of our knowledge, this is the first benchmarking study for tKGE models.

Experimental Study
In this section, we first introduce the design of our unified framework, which enables us to evaluate a large set of different combinations of interaction models, loss functions, regularization methods, the explicit modeling of reciprocal relations, and position-aware entity embeddings. Then we split our experimental study into two parts. In the first part, we examine six temporal embedding methods combined with two representative score functions by performing an extensive set of experiments using advanced training strategies and a wide range of hyperparameter settings via the unified framework. In the second part, we re-evaluate various well-known tKG models from prior studies. We provide evidence that several old tKG models can obtain results competitive with or even better than the SOTA when configured carefully. We present the best configuration of each model and report its best performance on each benchmark, which future research can directly use for comparison.

Composable Unified Framework
In the proposed framework, a tKGE model is considered a composition of six modules that can be flexibly combined: a temporal embedding layer, a static embedding layer, a score function, a loss function, a regularization method, and the usage of reciprocal relations. In particular, the framework can automatically optimize the embedding method: the temporal embeddings can be combined with entity embeddings, relation embeddings, or both, and there are different ways to combine static and temporal embeddings, i.e., addition, concatenation, and element-wise multiplication. The framework supports six temporal embedding approaches as introduced in Section 2.2.1; seven score functions, e.g., TransE (Bordes et al., 2013), SimplE (Kazemi and Poole, 2018), and DistMult (Yang et al., 2014); three loss functions (MR, CE, and BCE); four regularization methods (L1/L2/L3-norm and dropout); and two initialization methods (Xavier uniform and Xavier normal). For interaction models with TEs, a smoothness regularization for timestamp embeddings is applied, enforcing neighboring timestamps to have close representations (Lacroix et al., 2020). Additionally, Kazemi and Poole (2018) distinguished whether an entity occurs as a head or as a tail entity and learned two embeddings for each entity, which we term position-aware entity embeddings and extend to all interaction models. Position-aware entity embeddings can enhance a model's expressiveness. For example, they can help DistMult (Yang et al., 2014) model anti-symmetric relations: without them, all relations are enforced to be symmetric since ⟨h, r, t, τ⟩ and ⟨t, r, h, τ⟩ share the same score regardless of the properties of r.
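The composition step of the framework can be sketched as a single dispatch function (an illustrative sketch with our own naming, not the framework's actual API):

```python
import numpy as np

# Merge a static embedding with a temporal embedding in one of the
# three ways supported by the framework described above.

def combine(static, temporal, mode):
    if mode == "add":
        return static + temporal
    if mode == "mult":
        return static * temporal
    if mode == "concat":
        return np.concatenate([static, temporal])
    raise ValueError(f"unknown combination mode: {mode}")

s = np.ones(4)          # placeholder static embedding
tau = np.full(4, 2.0)   # placeholder temporal embedding
```

Note that `concat` doubles the dimensionality handed to the score function, while `add` and `mult` preserve it, which is one reason the combination mode interacts with the embedding-dimension hyperparameter.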

Experimental Setup
Datasets The Integrated Crisis Early Warning System (ICEWS) (Boschee et al., 2015) dataset has established itself in the research community as a representative sample of tKGs and has been widely used in recent tKG studies. The ICEWS dataset contains information about political events with specific time annotations, e.g., (Barack Obama, visit, India, 2010-11-06). We apply our models to two subsets of the ICEWS dataset: ICEWS14 contains the events of 2014, and ICEWS11-14 contains the facts from 2011 to 2014. Besides, we also use a subset of the Global Database of Events, Language, and Tone (GDELT) (Leetaru and Schrodt, 2013) dataset as a benchmark. To make the extensive configuration search feasible, we extracted a subset named GDELT-m10 consisting of the factual events of October 2015. The statistics and further details are provided in Appendix C.
Hyperparameters We used a large hyperparameter search space to ensure that suitable hyperparameters for each model can be covered. We consider seven embedding dimensions {64, 100, 128, 256, 512, 1024, 2048}. The learning rate is randomly selected from (0, 0.1]. We use separate regularization weights for the embeddings of entities, relations, and timestamps. A detailed report of the search space is provided in Appendix B.

Interaction models In the first part, we evaluate six temporal embedding methods combined with two representative score functions. The formulas of these twelve interaction models are listed in Table 1. Additionally, we select DE-SimplE/TransE (Goel et al., 2019), TNTComplEx (Lacroix et al., 2020), ATiSE (Xu et al., 2019), TTransE (Leblay and Chekol, 2018), TA-TransE (García-Durán et al., 2018), and HyTE (Dasgupta et al., 2018), the most prominent tKGE models, for the second part of our study.

[Table 1: Formulas of a given quadruple (e_i, r, e_j, t) for each temporal embedding combined with TransE and SimplE. e_i,s denotes the embedding of e_i when the entity is the subject, while e_i,o denotes the embedding of e_i when the entity is the object. In comparison, e_i represents the shared embedding of entity e_i for both subject and object. t and r represent the embeddings of timestamp t and relation r, respectively. ⊥ represents the projection operator. e_i(t) denotes the temporal embedding of e_i at t.]
Evaluation All models are evaluated on the link prediction task. For each test quadruple (s, r, o, t), we create a subject prediction query (?, r, o, t) and an object prediction query (s, r, ?, t). Taking object prediction as an example, all entities e_i ∈ E are ranked according to the score s(s, r, e_i, t). We filter from the candidate list all entities except the ground truth that form a valid quadruple with s, r, and t, i.e., whose quadruple occurs in the training, validation, or test data. We report filtered Mean Reciprocal Rank (MRR) and Hits@1, 3, and 10, averaged over subject prediction and object prediction. For detailed definitions, please see Appendix A.

Computational resources and model selection We perform large-scale benchmarking with about 4000 experiments and 19000 GPU hours of computation time. All experiments are run on NVIDIA Tesla T4 GPUs. For each dataset and interaction model, we first randomly generate 40 configurations from the search space using the Ax framework 2 . After the random hyperparameter search, we search 60 new configurations using Bayesian optimization to further tune the numerical hyperparameters. Each trial runs for 100 epochs, and an early stopping strategy with a patience of 30 epochs is employed. We select the best-performing configuration according to filtered MRR on validation data. The best configuration is then trained further until convergence.

Examining Temporal Embeddings
Performance in prior studies vs. in our study We compare the best performance found in our study with the results reported in prior studies (Goel et al., 2019; Sadeghian et al., 2021). The results suggest that the performance of old temporal embedding approaches can be largely improved by advanced training strategies and hyperparameter tuning, which may account for a large fraction of the progress made in recent years. Figure 1 shows the distribution of filtered MRR for each model on ICEWS14. Each distribution consists of 100 different hyperparameter configurations. We can see that some models show a wide dispersion, and only very few configurations achieve good results. Generally, the impact of the hyperparameter choice is more pronounced for TransE-based models (higher variance) than for SimplE-based models. The hyperparameters of the best-performing models are reported in Table 2 (selected hyperparameters) and Table 12 in the appendix (all parameters). Perhaps unsurprisingly, we find that the optimal choice of hyperparameters is often model- and dataset-dependent. Thus, a grid search on a small search space is not suitable for comparing model performance because the result may be considerably affected by the specific grid points being used. Besides, we find that the use of reciprocal relations (RR) and position-aware entity embeddings (PEE) often improves model performance. To investigate their impact, we conduct ablation studies in which we do not use RR (or PEE) and keep the other hyperparameters the same as in the best configuration. We report the reduction of filtered metrics in Table 5, which confirms our findings.
TE vs. TEE Since timestamp embeddings (TEs) are independent of entities, they can only capture global patterns at each timestamp. In comparison, the time-dependent entity embedding approaches (DE, ATiSE) learn entity-specific temporal functions (e.g., frequency, amplitude, etc.), as shown in Equations 1 and 2. The time-dependent entity embeddings are expected to capture entity-specific temporal features and thus to be more expressive. However, we see that the simple timestamp embedding approach (T) proposed by Leblay and Chekol (2018) achieves the overall best performance. In particular, it outperforms the time-dependent entity embedding approaches (DE, ATiSE), which is in contrast to the common belief. As shown in Table 3, UTEE achieves competitive or even better performance with both translation-based and bilinear score functions on both datasets. Additionally, Table 4 shows the evaluation metrics on the GDELT-m10 dataset. Compared to the ICEWS datasets, the numbers of entities and relations in GDELT-m10 are much smaller, while the number of timestamped edges is about three times larger than in ICEWS14. Thus, the data sparsity issue is alleviated in GDELT-m10. Since TEE approaches need dense data for training, their performance improves on GDELT-m10 and surpasses that of the TEs. The results suggest that even though DE is theoretically fully expressive and provides more degrees of freedom for the temporal movement of each entity representation, its performance deteriorates significantly on sparse data. We tried to add regularization to the entity-specific parameters, e.g., amplitude and frequency, and to adjust the portion γ of the temporal embeddings; however, there were no significant improvements. Thus, time-dependent entity embeddings need to be revisited to realize their theoretical expressiveness.
Table 7 depicts the best performance of well-known tKGE models from prior studies (numbers in parentheses) and that found in our study (numbers outside the parentheses). The configurations of the best-performing models are reported in Table 13 in the appendix. First, we find that the performance of a single model can vary wildly across studies. For example, DE-TransE, T-TransE, and HyTE have been significantly improved using advanced training strategies and hyperparameter tuning. Besides, we see that some recent models cannot consistently outperform older models, in contrast to the conclusions of prior studies. In particular, T-TransE, which constitutes one of the first tKGE models, achieves results competitive with advanced models, i.e., ATiSE and DE-SimplE, in our study. Even compared to TNTComplEx, which is a very large model with 25.12 million learnable parameters (3 times more than T-TransE), the performance difference is not large. We provide an explanation for the performance gap between our study and prior studies regarding TA-TransE and TNTComplEx in Appendix D.

Conclusion
We assessed well-known temporal embeddings of tKGE models via an extensive experimental study. We found that, when trained appropriately, the naive timestamp embedding approach performs similarly to or even outperforms the more advanced time-dependent entity embedding (TEE) approaches, which is in contrast to the results of prior studies. We contribute to the community in at least two ways: i) we provide a unified framework that enables an insightful assessment of novel temporal embedding approaches; ii) we reveal the weakness of TEE approaches.

A Evaluation Metrics
For each test quadruple (e_s, r, e_o, t), we create a subject prediction query (?, r, e_o, t) and an object prediction query (e_s, r, ?, t). Let ψ_es and ψ_eo denote the rank of the ground-truth subject e_s and the ground-truth object e_o for the subject prediction query and the object prediction query, respectively. We evaluate our models using standard metrics from the link prediction literature: mean reciprocal rank (MRR), the mean of the reciprocal ranks 1/ψ_es and 1/ψ_eo over all test queries, and Hits@k (k ∈ {1, 3, 10}), the percentage of times that the true entity candidate appears in the top k ranked candidates.
There are two common filtering settings. The first follows the ranking technique described in (Bordes et al., 2013), where we remove from the list of corrupted triples all triples that appear either in the training, validation, or test set. We name it static filtering. However, this filtering setting is not appropriate for evaluating link prediction on temporal KGs. For example, suppose there is a test quadruple (Barack Obama, visit, India, 2015-01-25), and we perform the object prediction (Barack Obama, visit, ?, 2015-01-25). We have observed the quadruple (Barack Obama, visit, Germany, 2013-01-18) in the training set. According to static filtering, (Barack Obama, visit, Germany) will be considered a genuine triple at the timestamp 2015-01-25 and will be filtered out because the triple (Barack Obama, visit, Germany) appears in the training set in the quadruple (Barack Obama, visit, Germany, 2013-01-18). However, the triple (Barack Obama, visit, Germany) is only temporally valid on 2013-01-18 but not on 2015-01-25. To this end, another filtering scheme was introduced, which is more appropriate for link prediction on temporal KGs. We name it time-aware filtering. In this case, we only filter out the triples that are genuine at the timestamp of the query. In other words, if the triple (Barack Obama, visit, Germany) does not appear at the query time 2015-01-25, the quadruple (Barack Obama, visit, Germany, 2015-01-25) is considered corrupted. In this paper, we focus on time-aware filtering.

Loss functions Prior studies used the margin rank (MR) loss and the cross entropy (CE) loss to align the model distribution and the data distribution. Han et al. (2020a) proposed to use the binary cross entropy (BCE) loss, which applies a sigmoid function to the score of each positive or negative quadruple and takes the cross entropy between the resulting probability and that quadruple's label as the loss.
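The time-aware filtering scheme described above can be sketched as follows (our own function names; the scores and known facts are toy values mirroring the Obama example):

```python
# Time-aware filtered ranking for an object prediction query (s, r, ?, t):
# only entities forming a true quadruple at the SAME timestamp are
# removed from the candidate list; the ground truth is always kept.

def time_aware_rank(scores, truth, known_quads, s, r, t):
    """scores: dict entity -> model score; known_quads: set of (s, r, o, t)."""
    filtered = {o: sc for o, sc in scores.items()
                if o == truth or (s, r, o, t) not in known_quads}
    ranked = sorted(filtered, key=lambda o: filtered[o], reverse=True)
    return ranked.index(truth) + 1

known = {("Obama", "visit", "India", "2015-01-25"),
         ("Obama", "visit", "Germany", "2013-01-18")}
scores = {"India": 0.7, "Germany": 0.9, "France": 0.1}
# Germany is NOT filtered for the 2015-01-25 query (it is only known
# for 2013-01-18), so the ground truth India is ranked second.
rank = time_aware_rank(scores, "India", known, "Obama", "visit", "2015-01-25")
```

Under static filtering, by contrast, Germany would be removed because the triple occurs somewhere in the data, and India would be (incorrectly) ranked first.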
It has been shown in (Ruffinelli et al., 2020;Mohamed et al., 2019) that the loss function has a significant impact on the performance of static KGE models. To provide additional evidence on temporal KGE models, we search the best choice of loss functions for each model on each dataset.
Regularization methods L2 regularization is widely used in the literature (Leblay and Chekol, 2018). Besides, Dasgupta et al. (2018) proposed to use the L1-norm in the regularization term, and Lacroix et al. (2020) used the L3-norm for CP decomposition. Additionally, Lacroix et al. (2020) proposed a smoothness regularization for timestamp embeddings that enforces neighboring timestamps to have close representations. Moreover, Goel et al. (2019) used dropout in its hidden layers. ATiSE normalizes the static embeddings e_i and the trend-component direction ω_ei to unit norm after each update.
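One simple form of the timestamp smoothness regularizer can be sketched as follows (an illustrative penalty on differences of neighboring timestamp embeddings; the exponent p is a placeholder, not necessarily the value used by Lacroix et al. (2020)):

```python
import numpy as np

# Penalize differences between the embeddings of neighboring timestamps,
# so that temporally adjacent representations stay close.

def smoothness_penalty(T, p=3):
    """T: (n_timestamps, d) array of timestamp embeddings."""
    diffs = T[1:] - T[:-1]
    return float(np.sum(np.abs(diffs) ** p))

T_flat = np.ones((4, 2))                             # identical neighbors
T_jump = np.vstack([np.zeros((1, 2)), np.ones((3, 2))])  # one jump
```

Identical neighboring embeddings incur zero penalty, while a single jump of size 1 in each of the 2 dimensions incurs a penalty of 2.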
Other hyperparameters For models with diachronic entity embeddings as their temporal encoding heads, we treat the static feature ratio as an extra searchable hyperparameter. The negative sampling ratio is 500. Namely, for each positive sample (s, p, o, t), we corrupt the subject and object entity by uniformly sampling from E, i.e., we draw negative samples from {(s′, p, o, t) | s′ ∈ E\{s}} ∪ {(s, p, o′, t) | o′ ∈ E\{o}}. We set the batch size to 512. Besides, since the Adam (Kingma and Ba, 2014) optimizer performs well for the majority of models (Ali et al., 2020), we decided to proceed only with Adam in order to reduce the computational costs. Additionally, for translational models, we set the margin γ in the score function to 100.
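The negative sampling policy above can be sketched as follows (the ratio is reduced to 3 here for readability; the study uses 500):

```python
import random

# For each positive quadruple, draw `ratio` subject-corrupted and
# `ratio` object-corrupted negatives by uniform sampling over entities.

def negative_samples(quad, entities, ratio, rng):
    s, p, o, t = quad
    negs = []
    for _ in range(ratio):
        s_neg = rng.choice([e for e in entities if e != s])
        o_neg = rng.choice([e for e in entities if e != o])
        negs.append((s_neg, p, o, t))   # corrupt subject
        negs.append((s, p, o_neg, t))   # corrupt object
    return negs

rng = random.Random(0)
negs = negative_samples((0, 1, 2, 3), entities=range(5), ratio=3, rng=rng)
```

Note that the timestamp is left intact; only entities are corrupted, matching the sampling sets given above.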

D Reproducibility Studies
We were not able to reproduce the results of TA-TransE on ICEWS14. A reason might be differences in the implementation details of the frameworks used to train and evaluate the models. Since there exists no official implementation for TA-TransE, it is not possible to check the implementation difference. Also, García-Durán et al. (2018) did not report the full setup, which impedes the reproduction of results. For example, the regularization method and initialization method have not been reported, which can have a significant effect on the results. Lacroix et al. (2020) provides an official implementation of TNTComplEx. However, we were not able to reproduce the same metric number as reported in their paper. Similarly, Sadeghian et al. (2021) also did not successfully reproduce the results of TNTComplEx. The initialization of the embeddings might be a reason. Table 9 shows the average runtime of each training epoch for each interaction model.