TimeTraveler: Reinforcement Learning for Temporal Knowledge Graph Forecasting

Temporal knowledge graph (TKG) reasoning is a crucial task that has gained increasing research interest in recent years. Most existing methods focus on reasoning at past timestamps to complete missing facts, and only a few works reason over known TKGs to forecast future facts. Compared with the completion task, the forecasting task is more difficult and faces two main challenges: (1) how to effectively model the time information to handle future timestamps? (2) how to make inductive inferences to handle previously unseen entities that emerge over time? To address these challenges, we propose the first reinforcement learning method for forecasting. Specifically, the agent travels on historical knowledge graph snapshots to search for the answer. Our method defines a relative time encoding function to capture timespan information, and we design a novel time-shaped reward based on the Dirichlet distribution to guide model learning. Furthermore, we propose a novel representation method for unseen entities to improve the inductive inference ability of the model. We evaluate our method on the link prediction task at future timestamps. Extensive experiments on four benchmark datasets demonstrate substantial performance improvements, together with higher explainability, less calculation, and fewer parameters, when compared with existing state-of-the-art methods.


Introduction
Storing a wealth of human knowledge and facts, Knowledge Graphs (KGs) are widely used in many downstream Artificial Intelligence (AI) applications, such as recommendation systems (Guo et al., 2020), dialogue generation (Moon et al., 2019), and question answering (Zhang et al., 2018). KGs store facts in the form of triples (s, r, o), i.e., (subject entity, predicate/relation, object entity), such as (LeBron_James, plays_for, Cleveland_Cavaliers). Each triple corresponds to a labeled edge of a multi-relational directed graph. However, facts constantly change over time. To reflect the timeliness of facts, Temporal Knowledge Graphs (TKGs) additionally associate each triple with a timestamp, forming a quadruple (s, r, o, t), e.g., (LeBron_James, plays_for, Cleveland_Cavaliers, 2014). Usually, we represent a TKG as a sequence of static KG snapshots. TKG reasoning is the process of inferring new facts from known facts, and it can be divided into two types: interpolation and extrapolation. Most existing methods (Jiang et al., 2016a; Dasgupta et al., 2018; Goel et al., 2020; Wu et al., 2020) focus on interpolated TKG reasoning, which completes missing facts at past timestamps. In contrast, extrapolated TKG reasoning focuses on forecasting future facts (events). In this work, we focus on extrapolated TKG reasoning by designing a model for link prediction at future timestamps. E.g., "which team will LeBron James play for in 2022?" can be seen as a link prediction query at a future timestamp: (LeBron_James, plays_for, ?, 2022). Compared with the interpolation task, extrapolation faces two challenges. (1) Unseen timestamps: the timestamps of the facts to be forecast do not exist in the training set. (2) Unseen entities: new entities may emerge over time, and the facts to be predicted may contain previously unseen entities. Hence, interpolation methods cannot handle the extrapolation task.
The recent extrapolation method RE-NET (Jin et al., 2020) uses a Recurrent Neural Network (RNN) to capture information from temporally adjacent facts to predict future facts. CyGNet (Zhu et al., 2021) focuses on repeated patterns and counts the frequency of similar facts in history. However, these methods only use random vectors to represent previously unseen entities and view link prediction as a multi-class classification task, which makes them unable to handle the second challenge. Moreover, they cannot explicitly indicate the impact of historical facts on predicted facts.
Inspired by path-based methods (Das et al., 2018;Lin et al., 2018) for static KGs, we propose a new temporal-path-based reinforcement learning (RL) model for extrapolated TKG reasoning. We call our agent the "TIme Traveler" (TITer), which travels on the historical KG snapshots to find answers for future queries. TITer starts from the query subject node, sequentially transfers to a new node based on temporal facts related to the current node, and is expected to stop at the answer node. To handle the unseen-timestamp challenge, TITer uses a relative time encoding function to capture the time information when making a decision. We further design a novel time-shaped reward based on Dirichlet distribution to guide the model to capture the time information. To tackle the unseen entities, we introduce a temporal-path-based framework and propose a new representation mechanism for unseen entities, termed the Inductive Mean (IM) representation, so as to improve the inductive reasoning ability of the model.
Our main contributions are as follows:
• This is the first temporal-path-based reinforcement learning model for extrapolated TKG reasoning; it is explainable and can handle unseen timestamps and unseen entities.
• We propose a new method to model the time information. We utilize a relative time encoding function for the agent to capture time information and use a time-shaped reward to guide model learning.
• We propose a novel representation mechanism for unseen entities, which leverages the query and trained entity embeddings to represent untrained (unseen) entities. This stably improves inductive inference performance without increasing the computational cost.
• Extensive experiments indicate that our model substantially outperforms existing methods with less calculation and fewer parameters.
Related Work

Static Knowledge Graph Reasoning
Embedding-based methods represent entities and relations as low-dimensional embeddings in different representation spaces, such as Euclidean space (Nickel et al., 2011; Bordes et al., 2013), complex vector space (Trouillon et al., 2016), and manifold space (Chami et al., 2020). These methods predict missing facts by scoring candidate facts based on entity and relation embeddings. Other works use deep learning models to encode the embeddings, such as Convolutional Neural Networks (CNNs) (Dettmers et al., 2018; Vashishth et al., 2020) to obtain deeper semantics, or Graph Neural Networks (GNNs) (Schlichtkrull et al., 2018; Nathani et al., 2019; Zhang et al., 2020) to encode multi-hop structural information. In addition, path-based methods are widely used in KG reasoning. Lin et al. (2015) and Guo et al. (2019) use RNNs to compose the implications of paths. Reinforcement learning methods (Xiong et al., 2017; Das et al., 2018; Lin et al., 2018) view the task as a Markov decision process (MDP) and find paths between entity pairs, which makes them more explainable than embedding-based methods.

Temporal Knowledge Graph Reasoning
A considerable number of works extend static KG models to the temporal domain. These models redesign embedding modules and score functions to account for time (Jiang et al., 2016b; Dasgupta et al., 2018; Goel et al., 2020; Lacroix et al., 2020; Han et al., 2020a). Some works leverage message-passing networks to capture graph snapshot neighborhood information (Wu et al., 2020; Jung et al., 2020). These works are designed for interpolation.
For extrapolation, Know-Evolve (Trivedi et al., 2017) and GHNN (Han et al., 2020b) use temporal point processes to model facts evolving in the continuous time domain. Additionally, TANGO (Ding et al., 2021) explores neural ordinary differential equations to build a continuous-time model. RE-NET (Jin et al., 2020) considers the multi-hop structural information of snapshot graphs and uses an RNN to model entity interactions at different times. CyGNet (Zhu et al., 2021) finds that many facts show repeated patterns and makes reference to known facts in history. These approaches cannot explain their predictions and cannot handle previously unseen entities. The explanatory model xERTE uses a subgraph sampling technique to build an inference graph. Although its representation method, which refers to GraphSAGE (Hamilton et al., 2017), makes it possible to deal with unseen nodes, the continuous expansion of the inference graph severely restricts inference speed.

Methodology
Analogously to previous work on KGs (Das et al., 2018), we frame the RL formulation as "walk-based query-answering" on a temporal graph: the agent starts from the source node (the subject entity of the query) and sequentially selects outgoing edges to traverse to new nodes until reaching a target. In this section, we first define the task, then describe the reinforcement learning framework and how we incorporate time information into the on-policy reinforcement learning model. The optimization strategy and the inductive mean representation method for previously unseen entities are provided at the end. Figure 2 gives an overview of our model.

Task Definition
Here we formally define the task of extrapolation on a TKG. Let E, R, T, and F denote the sets of entities, relations, timestamps, and facts, respectively. A fact in a TKG can be represented as a quadruple (e_s, r, e_o, t), where r ∈ R is a directed labeled edge between a subject entity e_s ∈ E and an object entity e_o ∈ E at time t ∈ T. We can represent a TKG by its graph snapshots over time: a TKG can be described as G_(1,T) = {G_1, G_2, ..., G_T}, where G_t = {E_t, R, F_t} is a multi-relational directed TKG snapshot, and E_t and F_t denote the entities and facts that exist at time t. In order to distinguish graph nodes at different times, we let a node be a two-tuple of entity and timestamp: n = (e, t). Thus, a fact (or event) (e_s, r, e_o, t) can also be seen as an edge of type r from source node (e_s, t) to destination node (e_o, t).
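To make the snapshot view concrete, the following minimal sketch (entity names are purely illustrative) groups quadruples into per-timestamp snapshots:

```python
from collections import defaultdict

def build_snapshots(quadruples):
    """Group (subject, relation, object, timestamp) facts into per-timestamp
    KG snapshots, mirroring the G_(1,T) = {G_1, ..., G_T} view above."""
    snapshots = defaultdict(list)
    for s, r, o, t in quadruples:
        snapshots[t].append((s, r, o))
    # Return the snapshots ordered by timestamp.
    return dict(sorted(snapshots.items()))

facts = [
    ("LeBron_James", "plays_for", "Cleveland_Cavaliers", 2014),
    ("LeBron_James", "plays_for", "LA_Lakers", 2018),
]
snaps = build_snapshots(facts)
```

Each value of `snaps` is one static snapshot G_t; the full TKG is the ordered collection of them.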
Extrapolated TKG reasoning is the task of predicting the evolution of a KG over time; we perform link prediction at future times, i.e., forecasting events occurring in the near future. Given a query (e_q, r_q, ?, t_q) or (?, r_q, e_q, t_q), we have a set of known facts {(e_s, r, e_o, t_i) | t_i < t_q}. These known facts constitute the known TKG, and our goal is to predict the missing object or subject entity of the query.

Reinforcement Learning Framework
Because there are no edges between the typical TKG snapshots, the agent cannot transfer from one snapshot to another. Hence, we add three types of edges. (i) Reversed Edges. For each quadruple (e_s, r, e_o, t), we add (e_o, r^-1, e_s, t) to the TKG, where r^-1 denotes the reciprocal relation of r. Thus, we can predict the subject entity by converting (?, r, e_o, t) to (e_o, r^-1, ?, t) without loss of generality. (ii) Self-loop Edges. Self-loop edges allow the agent to stay in place and work as a stop action, since the agent's search unrolls for a fixed number of steps. (iii) Temporal Edges. The agent can walk from node (e_s, t_j) to node (e_o, t_i) through an edge of type r if (e_s, r, e_o, t_i) exists and t_i < t_j ≤ t_q. Temporal edges indicate the impact of past facts on an entity and help the agent find the answer among historical facts. Figure 1 shows the graph with temporal edges.
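The first two edge types can be materialized directly on the quadruple list; the sketch below assumes a simple `_inv` naming convention for reciprocal relations and a dedicated self-loop token (both illustrative choices), while temporal edges are realized later when the agent's actions are enumerated:

```python
def augment_graph(quadruples):
    """Add reversed edges and self-loop edges to a list of
    (subject, relation, object, timestamp) quadruples."""
    augmented = list(quadruples)
    entities = set()
    # (i) Reversed edges: (o, r^-1, s, t) for every (s, r, o, t).
    for s, r, o, t in quadruples:
        augmented.append((o, r + "_inv", s, t))
        entities.update([s, o])
    # (ii) Self-loop edges: let the agent stay put (acts as a stop action).
    timestamps = sorted({t for _, _, _, t in quadruples})
    for e in entities:
        for t in timestamps:
            augmented.append((e, "self_loop", e, t))
    return augmented
```

With the reversed edges in place, a subject-prediction query can always be rewritten as an object-prediction query over the inverse relation.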
Our method can be formulated as a Markov Decision Process (MDP), and the components of which are elaborated as follows.
States. Let S denote the state space. A state is represented by a quintuple s_l = (e_l, t_l, e_q, t_q, r_q) ∈ S, where (e_l, t_l) is the node visited at step l and (e_q, t_q, r_q) are the elements of the query. (e_q, t_q, r_q) can be viewed as the global information, while (e_l, t_l) is the local information. The agent starts from the source node of the query, so the initial state is s_0 = (e_q, t_q, e_q, t_q, r_q).
Actions. Let A denote the action space, and A_l the set of optional actions at step l; A_l ⊂ A consists of the outgoing edges of node (e_l, t_l). Concretely, A_l should be {(r', e', t') | (e_l, r', e', t') ∈ F, t' ≤ t_l, t' < t_q}, but an entity usually has many related historical facts, leading to a large number of optional actions. Thus, the final set of optional actions A_l is sampled from this set of outgoing edges.
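A possible sketch of this action enumeration, assuming quadruples are stored as a flat list and that "sampling" keeps the most recent edges (the strategy described in our implementation details):

```python
def candidate_actions(graph, e_l, t_l, t_q, max_actions=50):
    """Enumerate outgoing temporal edges of node (e_l, t_l) and keep the
    most recent max_actions of them as the agent's option set A_l.
    `graph` is a list of (s, r, o, t) quadruples; details are illustrative."""
    actions = [
        (r, o, t) for s, r, o, t in graph
        if s == e_l and t <= t_l and t < t_q
    ]
    # Prefer the latest outgoing edges: recent facts tend to matter most.
    actions.sort(key=lambda a: a[2], reverse=True)
    return actions[:max_actions]
```

In practice a self-loop action would also be appended here so the agent can stop early.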
Transition. The environment state is transferred to a new node through the edge selected by the agent. The transition function δ: S × A → S is defined by δ(s_l, A_l) = s_{l+1} = (e_{l+1}, t_{l+1}, e_q, t_q, r_q), where A_l is the set of sampled outgoing edges of the current node (e_l, t_l). Figure 2 illustrates the policy network: it shows the process of scoring one candidate action (e_1, r_1, t_1); TITer samples an action based on the transition probabilities calculated from all candidate scores, and when the search is completed, the time-shaped reward function gives the agent a reward based on the estimated Dirichlet distribution Dir(α_r).
Rewards with shaping. The agent receives a terminal reward of 1 if it arrives at the correct target entity at the end of the search and 0 otherwise. If s_L = (e_L, t_L, e_q, t_q, r_q) is the final state and (e_q, r_q, e_gt, t_q) is the ground-truth fact, the reward formulation is:

R(s_L) = 1 if e_L = e_gt, and 0 otherwise.

Usually, the quadruples containing the same entity are concentrated in specific periods, which causes temporal variability and temporal sparsity (Wu et al., 2020). Due to this property, the answer entity of a query has a distribution over time, and we can introduce this prior knowledge into the reward function to guide the agent's learning. The time-shaped reward lets the agent know in which snapshot it is more likely to find the answer. Based on the training set, we estimate a Dirichlet distribution for each relation. Then, we shape the original reward with the Dirichlet distributions:

R̃(s_L) = (1 + p_{t_L}) R(s_L), with (p_1, ..., p_K) ∼ Dir(α_r),

where α_r ∈ R^K is the parameter vector of the Dirichlet distribution for relation r. We can estimate α_r from the training set. For each quadruple with relation r in the training set, we count the number of times the object entity appears in each of the K most recent historical snapshots, obtaining a multinomial sample x = (x_1, ..., x_K). To maximize the likelihood

p(D_r | α_r) = ∏_{x ∈ D_r} [Γ(Σ_k α_{r,k}) / Γ(n_x + Σ_k α_{r,k})] ∏_k [Γ(x_k + α_{r,k}) / Γ(α_{r,k})], with n_x = Σ_k x_k,

we can estimate α_r. The maximum can be computed via a fixed-point iteration, and further formulas are provided in Appendix A.5. Because the Dirichlet distribution is the conjugate prior of the multinomial, we can update it easily when we have more observed facts to train the model.
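As a rough sketch, the shaping can be implemented by scaling the binary terminal reward with the Dirichlet-mean probability of the snapshot offset the agent stopped in; the exact shaping form used by the model may differ, and the names here are illustrative:

```python
def shaped_reward(e_L, e_gt, snapshot_offset, alpha):
    """Time-shaped terminal reward: the binary hit reward is scaled by the
    Dirichlet-mean probability of the historical snapshot the agent stopped
    in. `alpha` holds the Dirichlet parameters estimated for the query
    relation; this shaping form is a simplifying assumption."""
    base = 1.0 if e_L == e_gt else 0.0
    # Mean of Dir(alpha): E[p_k] = alpha_k / sum(alpha).
    p = alpha[snapshot_offset] / sum(alpha)
    return (1.0 + p) * base
```

A wrong answer still earns 0, so the shaping only modulates how strongly correct stops in "likely" snapshots are reinforced.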

Policy Network
We design a policy network π(a_l | s_l) = π_θ(a_l | s_l) to model the agent in a continuous space, where a_l ∈ A_l and θ denotes the model parameters. The policy network consists of the following three modules.
Dynamic embedding. We assign each relation r ∈ R a dense vector embedding r ∈ R^{d_r}. As the characteristics of entities may change over time, we adopt a relative time representation for entities. We use a dynamic embedding to represent each node n = (e, t) in G, and use e_e ∈ R^{d_e} to represent the latent time-invariant features of an entity. We then define a relative time encoding function Φ(Δt) ∈ R^{d_t} to represent the time information, where Δt = t_q − t and Φ(Δt) is formulated as follows:

Φ(Δt) = σ(w Δt + b),

where w, b ∈ R^{d_t} are vectors of learnable parameters and σ is an activation function. d_r, d_e, and d_t denote the dimensions of the embeddings. Then, we obtain the representation of a node n as e_n = [e_e ; Φ(Δt)].
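A minimal sketch of the relative time encoding and node representation, assuming a cosine activation for σ (the actual activation is a hyperparameter, see the implementation details):

```python
import numpy as np

def relative_time_encoding(t_q, t, w, b):
    """Phi(dt) = sigma(w * dt + b): encode the timespan between the query
    time t_q and a node's timestamp t as a d_t-dimensional vector.
    Cosine is used here purely for illustration."""
    dt = t_q - t
    return np.cos(w * dt + b)

def node_representation(e_emb, t_q, t, w, b):
    """A node n = (e, t) is represented as the concatenation
    [e_e ; Phi(t_q - t)]."""
    return np.concatenate([e_emb, relative_time_encoding(t_q, t, w, b)])
```

Because the encoding depends only on the timespan Δt, it generalizes to query timestamps never seen during training.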
Path encoding. The search history h_l = ((e_q, t_q), r_1, (e_1, t_1), ..., r_l, (e_l, t_l)) is the sequence of actions taken. The agent encodes the history h_l with an LSTM:

h_l = LSTM(h_{l−1}, [r_{l−1} ; e_{n_{l−1}}]).

Here, r_0 is a special start relation, and we keep the LSTM state unchanged when the last action is a self-loop.
Action scoring. We score each optional action and calculate the state transition probabilities. Let a_n = (e_n, r_n, t_n) ∈ A_l represent an optional action at step l. Future events are usually uncertain, and for some queries there is no strong causal chain, so the correlation between an entity and the query is sometimes more important. Thus, we use a weighted action scoring mechanism to help the agent pay more attention either to the attributes of the destination nodes or to the types of the edges. Two Multi-Layer Perceptrons (MLPs) encode the state information and output expected destination-node and outgoing-edge representations. The agent then obtains the destination-node score and the outgoing-edge score of a candidate action by calculating similarities, and combines them with a weighted sum to obtain the final candidate action score φ(a_n, s_l), with the gate

β = sigmoid(W_β [h_l ; e_{n_l} ; r_q ; e_{e_q} ; r_n]),

where W_1, W_2, W_3, and W_β are learnable matrices. After scoring all candidate actions in A_l, π_θ(a_l | s_l) is obtained through a softmax over the scores.
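The gate-and-similarity computation can be sketched as follows, taking the two MLP outputs ("expected" node and edge vectors) and the gate value β as given; shapes and names are illustrative:

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D score vector."""
    z = x - x.max()
    e = np.exp(z)
    return e / e.sum()

def score_actions(beta, node_embs, rel_embs, expected_node, expected_rel):
    """Weighted action scoring: each candidate's node and edge similarities
    (dot products with the MLP-predicted 'expected' vectors) are mixed by
    the gate beta, then a softmax yields transition probabilities."""
    node_scores = node_embs @ expected_node   # one score per candidate node
    edge_scores = rel_embs @ expected_rel     # one score per candidate edge
    phi = beta * node_scores + (1.0 - beta) * edge_scores
    return softmax(phi)
```

With β near 1 the agent trusts destination-node evidence; with β near 0 it falls back on the relation types, which matters for unseen entities.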
To summarize, the parameters of the LSTM, the MLPs and W_β, and the embedding matrices of relations and entities form the parameters θ.

Optimization and Training
We fix the search path length to L, so an L-length trajectory {a_1, a_2, ..., a_L} is generated by the policy network π_θ. The policy network is trained by maximizing the expected reward over all training samples F_train:

J(θ) = E_{(e_s, r, e_o, t) ∼ F_train} E_{a_1, ..., a_L ∼ π_θ} [R̃(s_L)].

We then use the policy gradient method to optimize the policy. The REINFORCE algorithm (Williams, 1992) iterates through all quadruples in F_train and updates θ with the following stochastic gradient:

∇_θ J(θ) ≈ ∇_θ Σ_{l=1}^{L} R̃(s_L) log π_θ(a_l | s_l).

Figure 3: Illustration of the IM mechanism. For an unseen entity e, "t − 3 : r_1, r_3" indicates that e has co-occurrence relations r_1 and r_3 at t − 3; e updates its representation based on r_1 and r_3, and finally gets its IM representation at t − 1. Then, to answer a query (e, r_2, ?, t), we do a prediction shift based on r_2.
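The REINFORCE update above can be sketched as the following gradient estimator, assuming per-step grad-log-probability vectors are available and that a discount factor is applied to the terminal reward (as in our training setup); the discounting convention shown is one simple choice:

```python
import numpy as np

def reinforce_gradient(log_prob_grads, terminal_reward, gamma=0.95):
    """REINFORCE with a discounted terminal reward: each step's
    grad log pi(a_l | s_l) is weighted by gamma^(L-1-l) * R(s_L).
    `log_prob_grads` is a list of per-step gradient vectors."""
    L = len(log_prob_grads)
    total = np.zeros_like(log_prob_grads[0])
    for l, g in enumerate(log_prob_grads):
        total += (gamma ** (L - 1 - l)) * terminal_reward * g
    return total
```

A zero terminal reward zeroes the whole update, which is why the time-shaped reward (never below the binary reward) helps learning along successful paths.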

Inductive Mean Representation
As new entities constantly emerge over time, we propose a new entity representation method for previously unseen entities. Previous work (Bhowmik and de Melo, 2020) can represent unseen entities through neighbor information aggregation. However, newly emerging entities usually have very few links, which means that only limited information is available. E.g., for the query (Evan_Mobley, plays_for, ?, 2022), the entity "Evan_Mobley" does not exist at previous times, but we could infer that this entity is a player through the relation "plays_for", and assign "Evan_Mobley" a more reasonable initial embedding that facilitates inference. Here we provide another approach to represent unseen entities by leveraging the query information and the embeddings of trained entities, named Inductive Mean (IM), as illustrated in Figure 3.
Let G_(T, T+m−1) represent the snapshots of the TKG covered by the test set. Suppose the query entity e first appears in G_{t'} and gets a randomly initialized representation vector. We regard r as a co-occurrence relation of e if there exists a quadruple containing (e, r). Note that an entity may have different co-occurrence relations over time. We denote by R_t(e) the co-occurrence relation set of entity e at time t. Let E_r represent the set of all trained entities having co-occurrence relation r. Then, we can obtain the inductive mean representation of the entities with the same co-occurrence relation r:

ē_r = (1 / |E_r|) Σ_{e' ∈ E_r} e_{e'}.

Entities with the same co-occurrence relation have similar characteristics, so IM can utilize ē_r to gradually update the representation of e along the time flow, where 0 ≤ μ ≤ 1 is a hyperparameter:

e_{e,t} = μ e_{e,t−1} + (1 − μ) · (1 / |R_t(e)|) Σ_{r ∈ R_t(e)} ē_r.

For the query relation r_q, we do a prediction shift based on ē_{r_q} to make the entity representation more suitable for the current query. To answer a query (e, r_q, ?, t), we use a combination of e's representation at time t − 1 and the inductive mean representation ē_{r_q}:

e_{e, r_q, t} = μ e_{e, t−1} + (1 − μ) ē_{r_q}.
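The IM computations reduce to a few vector means and convex combinations; here is a sketch with a single hyperparameter μ used for both the temporal update and the prediction shift (an assumption for compactness):

```python
import numpy as np

def inductive_mean(entity_embs, E_r):
    """Mean embedding of all trained entities sharing co-occurrence
    relation r (the set E_r). `entity_embs` maps entity -> vector."""
    return np.mean([entity_embs[e] for e in E_r], axis=0)

def im_update(prev_emb, rel_means, mu=0.1):
    """Temporal update of an unseen entity's representation with the IMs
    of its current co-occurrence relations."""
    agg = np.mean(rel_means, axis=0)
    return mu * prev_emb + (1.0 - mu) * agg

def query_shift(e_prev, e_bar_rq, mu=0.1):
    """Prediction shift toward the query relation's inductive mean:
    e = mu * e_{t-1} + (1 - mu) * e_bar_{r_q}."""
    return mu * e_prev + (1.0 - mu) * e_bar_rq
```

Because every quantity is a mean over already-trained embeddings, no extra parameters or gradient steps are needed at test time.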

Experimental Setup
Datasets. We use four public TKG datasets for evaluation: ICEWS14, ICEWS18 (Boschee et al., 2015), WIKI (Leblay and Chekol, 2018a), and YAGO (Mahdisoltani et al., 2015). The Integrated Crisis Early Warning System (ICEWS) is an event dataset; ICEWS14 and ICEWS18 are two subsets of ICEWS events that occurred in 2014 and 2018, with a time granularity of days. WIKI and YAGO are two knowledge bases that contain facts with time information, and we use subsets with a time granularity of years. We adopt the same dataset split strategy as (Jin et al., 2020) and split each dataset into train/valid/test by timestamps such that time_of_train < time_of_valid < time_of_test. Appendix A.2 summarizes further statistics on the datasets.
Evaluation metrics. We evaluate our model on TKG forecasting, a link prediction task at future timestamps. Mean Reciprocal Rank (MRR) and Hits@1/3/10 are the performance metrics. For each quadruple (e_s, r, e_o, t) in the test set, we evaluate two queries, (e_s, r, ?, t) and (?, r, e_o, t). We use the time-aware filtering scheme (Han et al., 2020b), which only filters out quadruples that hold at the query time t. The time-aware scheme is more reasonable than the filtering scheme used in (Jin et al., 2020; Zhu et al., 2021). Appendix A.1 provides detailed definitions.
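The time-aware filtered rank can be sketched as follows (names and storage layout are illustrative); MRR is then the mean of 1/rank over all queries, and Hits@k is the fraction of queries with rank ≤ k:

```python
def time_aware_filtered_rank(scores, gt, query, known_facts):
    """Rank of the ground-truth entity after removing other correct answers
    that hold at the SAME query time (the time-aware filtering scheme).
    `scores` maps candidate entities to scores, `query` is (s, r, t_q),
    and `known_facts` is a set of (s, r, o, t) quadruples."""
    s, r, t_q = query
    filtered = {
        o: sc for o, sc in scores.items()
        if o == gt or (s, r, o, t_q) not in known_facts
    }
    ranked = sorted(filtered, key=filtered.get, reverse=True)
    return ranked.index(gt) + 1
```

Unlike the static filtering scheme, an entity that was a correct answer only at some other timestamp is still treated as a wrong answer here.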
Baselines. As many previous works have verified that static methods underperform temporal methods on this task, we do not compare TITer with them. We compare our model with existing interpolated TKG reasoning methods, including TTransE (Leblay and Chekol, 2018b), TA-DistMult (García-Durán et al., 2018), DE-SimplE (Goel et al., 2020), and TNT-ComplEx (Lacroix et al., 2020), and with state-of-the-art extrapolated TKG reasoning approaches, including RE-NET (Jin et al., 2020), CyGNet (Zhu et al., 2021), TANGO (Ding et al., 2021), and xERTE. An overview of these methods is given in the Related Work section.

Implementation Details
Our model is implemented in PyTorch. We set the entity embedding dimension to 80, the relation embedding dimension to 100, and the relative time encoding dimension to 20. At each step, TITer chooses the latest N outgoing edges as candidate actions, where N is 50 for ICEWS14 and ICEWS18, 60 for WIKI, and 30 for YAGO. The reasoning path length is 3. The discount factor of REINFORCE is 0.95. We use the Adam optimizer with a learning rate of 0.001. The batch size is set to 512 during training. We use beam search for inference, with a beam size of 100. For IM, μ is 0.1. For full details, including the choice of activation function for Φ, please refer to Appendix A.3.

Results and Discussion
Performance on the TKG datasets. Our method can search multiple candidate answers via beam search. Table 1 reports the TKG forecasting performance of TITer and the baselines on the four TKG datasets. TITer outperforms all baselines on all datasets under the MRR and Hits@1 metrics, and in most cases (except ICEWS18) TITer also exhibits the best performance under the other two metrics. TTransE, TA-DistMult, DE-SimplE, and TNT-ComplEx cannot deal with the unseen timestamps in the test set, so they perform worse than the others. The performance of TITer is much higher than that of RE-NET, CyGNet, and TANGO on WIKI and YAGO. Two reasons cause this phenomenon: (1) on WIKI and YAGO, nodes usually have a small number of neighbors, which gives a neighbor-search algorithm an advantage; (2) on WIKI and YAGO, a large number of quadruples in the test set contain unseen entities (see Table 2), which these baselines cannot handle.

Inductive inference. When a query contains an unseen entity, models should infer the answer inductively. For all such queries in the ICEWS14 test set, we present experimental results in Figure 4 and Table 6. The performance of RE-NET and CyGNet decays significantly compared with their results on the whole ICEWS14 test set (see Table 1). Due to the lack of training of unseen entities' embeddings and of the classification layer over all entities, RE-NET and CyGNet cannot even reach the performance of a 3-hop neighborhood random-search baseline, illustrated by the dotted line in Figure 4. In contrast, xERTE can tackle such queries by dynamically updating new entities' representations based on temporal message aggregation, and TITer can tackle them through its temporal-path-based RL framework. We also observe that TITer outperforms xERTE whether or not it adopts the IM mechanism. The IM mechanism further boosts performance, demonstrating its effectiveness in representing unseen entities.
Case study. Table 4 visualizes the reasoning paths of several examples from the test set of ICEWS18. We notice that TITer tends to select a recent fact (outgoing edge) to search for the answer. Although the first two queries have the same subject entity and relation, TITer reaches the respective ground truths according to their different timestamps. As shown in Eq. (6), β increases when TITer emphasizes neighbor nodes more than edges. After training, the representations of entities accumulate rich semantic information, which helps TITer select the answer directly with little extra information for queries 3 and 4. In comparison, TITer needs more historical information when unseen entities appear. Query 5 is an example of multi-hop reasoning; it indicates that TITer can tackle combinational logic problems.
Efficiency analysis. RE-NET and CyGNet have many more parameters than the other methods. To achieve its best results, xERTE adopts a graph expansion mechanism and a temporal relational graph attention layer to perform local representation aggregation at each step, leading to a vast amount of calculation. Compared with xERTE, TITer reduces the number of parameters by at least half, and the number of multiply-add operations (MACs) is greatly reduced to 0.225M, which is much less than its counterpart, indicating the high efficiency of the proposed model.
In summary, compared to the previous state-ofthe-art models, TITer has saved at least 50.3% parameters and 94.9% MACs. Meanwhile, TITer still exhibits better performance.

Ablation Study
In this subsection, we study the effect of different components of TITer through ablation studies. The results are shown in Table 5 and Figure 5.
Relative time encoding. The relative time representation is a crucial component of our method. Figure 5 shows the notable gap between the temporal and static models on ICEWS18 and WIKI; we obtain the static model by removing the relative time encoding module. On Hits@1, the temporal model improves over the static model by 13.19% on ICEWS18 and 18.51% on WIKI. This indicates that our relative time encoding function helps TITer choose the correct answer more precisely.
Weighted action scoring mechanism. We observe that setting β to a constant 0.5 leads to a drop of 5.49% on Hits@1, indicating that the learned gate lets TITer better choose its source of evidence when making an inference. After training, TITer learns the latent relationships among entities. As expounded in Table 4, TITer prefers to pay more attention to the node when there is enough information to make a decision, and focuses on edges (relations in history) to assist inference for complex queries or unseen entities.
Reward shaping. We observe that TITer outperforms the variant without reward shaping by 5% on Hits@1. By using the Dirichlet prior distribution to direct the decision process, TITer acquires knowledge about the probability distribution of the target's appearance over the whole time span.

Conclusion
In this work, we propose a temporal-path-based reinforcement learning model named TimeTraveler (TITer) for temporal knowledge graph forecasting. TITer travels over the historical TKG snapshots and searches along a temporal evidence chain to find the answer. TITer uses a relative time encoding function and a time-shaped reward to model the time information, and the IM mechanism to update unseen entities' representations at test time. Extensive experimental results show that our model outperforms state-of-the-art baselines with less calculation and fewer parameters. Furthermore, the inference process of TITer is explainable, and TITer has good inductive reasoning ability.

Details of Training. As in MINERVA, we use an additive control variate baseline to reduce the variance and add an entropy regularization term, scaled by a constant, to the cost function to encourage diversity in the paths sampled by the policy. The scaling constant is initialized to 0.01 and decays exponentially with the number of training epochs; the attenuation coefficient is 0.9. The discount factor of REINFORCE is 0.95. We use the Adam optimizer with a learning rate of 0.001 and a weight decay of 0.000001. We clip gradients greater than 10 to avoid gradient explosion. The batch size is set to 512.
Details of Testing. We use beam search to obtain a list of predicted entities with the corresponding scores. The beam size is set to 100. Multiple paths obtained through beam search may lead to the same target entity, and we keep the highest path score among them as the final entity score. For the IM module, we set the time decay factor to 0.1.
Details of other Methods. We use the released code to run DE-SimplE, TNT-ComplEx, CyGNet, RE-NET, and xERTE, with the default parameters in each codebase. Partial results in Table 1 are taken from (Ding et al., 2021). The authors of CyGNet only performed object prediction when evaluating their model; we find that subject prediction is more difficult than object prediction on these four datasets, so we use the CyGNet code to train two models that predict objects and subjects, respectively. TANGO (Ding et al., 2021) has not released its code, so we use the results reported in their paper.

A.4 Model Robustness
We run TITer on all datasets five times using five different random seeds with fixed hyperparameters. Table 9 reports the mean and standard deviation of TITer's results on these datasets. The small standard deviations indicate TITer's robustness.
A.5 Dirichlet Parameter Estimation

Here x_k represents the count of the corresponding category k in a multinomial sample x. The gradient of the log-likelihood is

∂ log p(D_r | α_r) / ∂ α_{r,k} = Σ_{x ∈ D_r} [Ψ(Σ_j α_{r,j}) − Ψ(n_x + Σ_j α_{r,j}) + Ψ(x_k + α_{r,k}) − Ψ(α_{r,k})],

where Ψ is the digamma function. The maximum can be computed via the fixed-point iteration:

α_{r,k} ← α_{r,k} · (Σ_{x ∈ D_r} [Ψ(x_k + α_{r,k}) − Ψ(α_{r,k})]) / (Σ_{x ∈ D_r} [Ψ(n_x + Σ_j α_{r,j}) − Ψ(Σ_j α_{r,j})]).

We can also maximize the leave-one-out likelihood, and the digamma function Ψ can be inverted efficiently using a Newton-Raphson method.
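The fixed-point iteration can be sketched with only the standard library, approximating the digamma function by a central difference of log-gamma (function and variable names are illustrative):

```python
import math

def digamma(x, h=1e-5):
    """Numerical digamma via a central difference of log-gamma."""
    return (math.lgamma(x + h) - math.lgamma(x - h)) / (2 * h)

def dirichlet_mle(counts, iters=50):
    """Minka-style fixed-point iteration for the Dirichlet parameters of a
    Dirichlet-multinomial, given per-sample category counts x_dk.
    A compact sketch of the update quoted above."""
    K = len(counts[0])
    alpha = [1.0] * K
    for _ in range(iters):
        a0 = sum(alpha)
        denom = sum(digamma(sum(x) + a0) - digamma(a0) for x in counts)
        alpha = [
            a * sum(digamma(x[k] + a) - digamma(a) for x in counts) / denom
            for k, a in enumerate(alpha)
        ]
    return alpha
```

In practice one would use a library digamma (e.g., from SciPy) and a convergence tolerance instead of a fixed iteration count.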