TR-Rules: Rule-based Model for Link Forecasting on Temporal Knowledge Graph Considering Temporal Redundancy



Introduction
Knowledge Graphs (KGs) store various real-world facts in the form of (s, r, o), in which s and o represent subject and object entities while r denotes a relation between them. KGs are critical in many downstream applications such as question answering (Hao et al., 2017), recommender systems (Hildebrandt et al., 2019) and information retrieval (Liu et al., 2018). A traditional knowledge graph can be viewed as a static knowledge base; however, the plausibility of most real-world facts depends on time, i.e., facts in KGs may not stand perpetually. Thus, Temporal Knowledge Graphs (TKGs) are introduced to resolve this limitation, where each fact is represented by a quadruple (s, r, o, t) with a timestamp t incorporated. Link forecasting, or extrapolation, on TKGs, which aims at predicting possible links at future timestamps, has been one of the main tasks studied. Recently, embedding-based models like RE-NET (Jin et al., 2019) and CyGNet (Zhu et al., 2021) have been proposed and perform well for extrapolation on TKGs. However, compared with rule-based models, embedding-based models are believed to lack interpretability, which limits their downstream applications. TLogic (Liu et al., 2022) is the first model that leverages rule-based methods for TKG link forecasting and achieves competitive results as well as a more explainable prediction process, since it generates human-readable rules. It mines cyclic rules by first performing temporal random walks on TKGs, and then extracting and generalizing the walks into symbolic rules.
Nevertheless, existing models only mine cyclic rules in TKGs and ignore acyclic rules, which also contain important information for link prediction. Besides, previous works do not notice that under temporal settings, a triplet (s, r, o), which has to be unique in static KGs, might appear repeatedly with different timestamps. This temporal property can cause the model to obtain inappropriate rule confidence through the conventional confidence calculation algorithm, which further leads to incorrect predictions. We call this issue Temporal Redundancy.
In this paper, we first define Temporal Redundancy and give an intuitive illustration of why it causes inappropriate confidence calculation. Then we propose TR-Rules, a rule-based TKG link forecasting model which resolves Temporal Redundancy through a simple but effective aggregation strategy: during rule support counting, TR-Rules aggregates all matched rule bodies that occur before any new rule head appears as a single body. Unlike previous works that merely mine cyclic rules on TKGs, by performing temporal random walks TR-Rules also mines acyclic rules, which are crucial for link prediction but unexplored by existing models. Experimental results show that, without considering temporal redundancy, simply using acyclic rules together with cyclic rules gives worse performance, which is the probable reason why acyclic rules have not been involved and studied by existing models. The ablation study also demonstrates that acyclic rules are more sensitive to the number of random walks sampled for rule mining, which consequently results in more promising performance.
Our contributions are as follows: • We discover and define the problem of temporal redundancy which results in inappropriate rule confidence calculation.
• We propose TR-Rules, a rule-based temporal knowledge graph extrapolation model which resolves temporal redundancy with a simple algorithm and, to the best of our knowledge, is the first model that mines acyclic rules on temporal knowledge graphs.
• We evaluate TR-Rules on three benchmarks and the results show that TR-Rules achieves state-of-the-art performance.
• We also study the performance of acyclic rules and cyclic rules of different lengths respectively, and experimental results prove the effectiveness of TR-Rules in solving temporal redundancy and demonstrate that the performance of acyclic rules is more promising due to their higher sensitivity to the number of sampled walks.
Related Work

TKG Extrapolation
Most existing TKG extrapolation models are embedding-based: they learn representations of entities, relations and timestamps in vector space and then score facts through proposed score functions to measure their plausibility. RE-NET (Jin et al., 2019) employs an RNN-based encoder and an RGCN-based encoder to capture sequential and structural information of facts for future event prediction. CyGNet (Zhu et al., 2021) introduces a copy-generation mechanism to learn more information from repetitive facts in history. HIP (He et al., 2021) proposes three score functions to pass information from temporal, structural and repetitive patterns. CluSTeR (Li et al., 2021) adopts reinforcement learning to search clues from history for link forecasting. CENET (Xu et al., 2022) focuses more on non-historical facts and uses contrastive learning to distinguish whether historical or non-historical information is more important for event prediction. Apart from embedding-based models, xERTE (Han et al., 2021) and TLogic (Liu et al., 2022) provide more explainable predictions. xERTE extracts subgraphs around the queries to model structural dependencies and temporal dynamics. TLogic mines temporal rules by adopting temporal random walks on TKGs, which are then applied to forecast plausible events at future timestamps.

On static KGs, (Galárraga et al., 2013) proposes one of the earliest rule mining systems, which learns closed rules on knowledge bases. AnyBURL (Meilicke et al., 2019) is a random-walk-based model which transforms sampled walks from a KG into general rules and then applies the obtained rules to predict missing facts. SAFRAN (Ott et al., 2021) focuses on the effect of rule redundancy on aggregation, especially noisy-OR aggregation, and proposes an approach to cluster rules before application.

TKG Extrapolation
In this paper, we use E, R and T to represent the sets of all entities, relations and timestamps respectively, while |E| and |R| denote the numbers of entities and relations. A TKG can be considered as a sequence of graphs, i.e., G = {G_1, ..., G_n}, in which G_i = {(s, r, o, t_i)} where s, o ∈ E, r ∈ R and t_i ∈ T. TKG extrapolation is to answer a query (s, r, ?, t_q) or (?, r, o, t_q) based on the previously observed facts {G_i | t_i < t_q}. Normally, the model generates a ranked list of candidate entities according to their plausibility.
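The data model above can be sketched in a few lines of code (a minimal, illustrative example; the entity and relation names are hypothetical): a TKG is a set of quadruples (s, r, o, t), and extrapolation may only use facts strictly earlier than the query timestamp.

```python
def observable_history(tkg, t_q):
    """Return the facts {(s, r, o, t) | t < t_q} a model may use for a
    query at timestamp t_q."""
    return [quad for quad in tkg if quad[3] < t_q]

tkg = [
    ("America", "Intent_to_negotiate", "China", 1),
    ("America", "Make_a_visit", "China", 3),
    ("Jason", "Major", "Education", 2),
]

# For a query at t_q = 3, the fact holding at t = 3 itself is not visible.
history = observable_history(tkg, t_q=3)
```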

Random Walk on TKG
Following the definition given by TLogic, a non-increasing random walk on a TKG with length l ∈ N can be defined as ((e_{l+1}, r_l, e_l, t_l), (e_l, r_{l-1}, e_{l-1}, t_{l-1}), ..., (e_2, r_1, e_1, t_1)) with t_l ≥ t_{l-1} ≥ ... ≥ t_1. The non-increasing restriction ensures that random walks only go back in time, forming chains that comply with temporal causality.

Temporal Rules
Here, we give the definitions of temporal cyclic and acyclic rules. Let E_i and T_i be variables and let r_i, e_i represent fixed constants (a fixed constant has to be a specific entity or relation). A temporal cyclic rule can be represented as:

(E_1, r_h, E_{n+1}, T_{n+1}) ← (E_1, r_1, E_2, T_1) ∧ ... ∧ (E_n, r_n, E_{n+1}, T_n),

where T_1 ≤ T_2 ≤ ... ≤ T_n < T_{n+1}. And a temporal acyclic rule can be defined as:

(E, r_h, e_h, T_h) ← (E, r_b, e_b, T_b),

where T_h > T_b and e_h, e_b are fixed entities. It is notable that in TR-Rules, just as in the definition above, we confine the length of acyclic rules to 1. The left side of a rule is called the rule head and the right side the rule body. A temporal rule in the above forms indicates that when the rule body stands (satisfying the variable constraints and temporal restrictions), the rule head at the future timestamp T_{n+1} (resp. T_h) will be true. When applied, each variable in the rule head or rule body can be instantiated with specific entities or relations.
For better illustration, we give examples of a cyclic and an acyclic rule, both of which are reasonable and critical in link forecasting. From the facts (America, Intent to negotiate, China, T_1) and (America, Make a visit, China, T_2) where T_1 < T_2, we can generalize a cyclic rule of length 1:

(X, Make a visit, Y, T_2) ← (X, Intent to negotiate, Y, T_1).

From the facts (Jason, Major, Education, T_1) and (Jason, Job, Teacher, T_2), a reasonable temporal acyclic rule can be generalized:

(X, Job, Teacher, T_2) ← (X, Major, Education, T_1).

Confidence Estimation
Generally, temporal rules may not always be tautologies. Thus, we need to estimate the confidence of each temporal rule before application. The existing rule-based TKG extrapolation model TLogic leverages the standard confidence, which is widely used in static KG completion models, and takes timestamps into consideration. Take a cyclic rule R with body relations r_{b1}, ..., r_{bl} and head relation r_h. The body support of R can be defined as the number of rule body instances, i.e., the number of sequences ((e_1, r_{b1}, e_2, t_1), ..., (e_l, r_{bl}, e_{l+1}, t_l)) where (e_i, r_{bi}, e_{i+1}, t_i) ∈ G and t_i ≤ t_{i+1} for i ∈ [1, l − 1]. The rule support of R is defined as the number of body instances whose rule head stands at a future timestamp t_{l+1}, i.e., (e_1, r_h, e_{l+1}, t_{l+1}) ∈ G. The confidence is calculated as

conf(R) = w · supp_rule(R) / (supp_body(R) + c),   (1)

where c is a hyper-parameter for smoothing and w is a weight for rules of different types and different lengths. This confidence estimation algorithm works for static KG completion, but under temporal settings it suffers from Temporal Redundancy. In the following, we first define Temporal Redundancy and show how TR-Rules resolves it. Then we introduce the rule learning module of TR-Rules, where specifically we present the acyclic rule mining algorithm. Finally, the rule application module of TR-Rules is introduced to show how we combine cyclic and acyclic rules to predict future events.
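The standard confidence can be sketched as follows for a length-1 cyclic rule (a hedged, simplified grounding: a body instance is any fact with the body relation, and it counts toward the rule support if the head relation holds for the same entity pair at a strictly later timestamp; entity names are illustrative):

```python
def standard_confidence(tkg, r_h, r_b, c=0, w=1.0):
    """Eq. (1) with supp_body = #body instances and supp_rule = #body
    instances whose head holds later (the counting TLogic uses)."""
    bodies = [(s, o, t) for (s, r, o, t) in tkg if r == r_b]
    heads = [(s, o, t) for (s, r, o, t) in tkg if r == r_h]
    rule_support = sum(
        1 for (s, o, t) in bodies
        if any(hs == s and ho == o and ht > t for (hs, ho, ht) in heads)
    )
    return w * rule_support / (len(bodies) + c)

# Figure-2-style scenario: bodies at T1, T3, T6 and heads at T2, T7, T9
# give rule support 3 over body support 3, i.e. confidence 1 despite the
# redundancy among the three intentions.
facts = (
    [("US_envoy", "Intent_to_negotiate", "China", t) for t in (1, 3, 6)]
    + [("US_envoy", "Make_a_visit", "China", t) for t in (2, 7, 9)]
)
conf = standard_confidence(facts, "Make_a_visit", "Intent_to_negotiate")
```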

Temporal Redundancy
As mentioned above, standard confidence counts the number of body instances whose rule heads stand at future timestamps. It works well for static KGs; however, it is notable that in TKGs a triplet can appear more than once, which means that the same body instance can be counted multiple times toward the rule support, as long as it holds true at different timestamps before the timestamp where its corresponding rule head stands. In particular, when multiple body instances hold true between two temporally adjacent rule head instances, we call this phenomenon Temporal Redundancy.
In Figure 2, Temporal Redundancy takes place between T_2 and T_7. When Temporal Redundancy happens, the confidence of rules is overestimated, because many interactions between facts are strongly connected with the accumulation of their causes. In Figure 2, we take as an example a cyclic rule mined from ICEWS14, (X, Make a visit, Y) ← (X, Intent to negotiate, Y), together with some relevant facts. In this case, when estimating the confidence of this rule, suppose there are no other relevant facts; according to the definition of standard confidence, the result should be 1. There are 3 body instances (The US envoy, Intent to negotiate, China) at (T_1, T_3, T_6), so the body support is 3. For the body instance at T_1, there are head instances (The US envoy, Make a visit, China) holding true at (T_2, T_9), making it count as a rule support; for the body instances at (T_3, T_6), the head instance at T_9 makes them count as rule supports. Thus the rule support is also 3, and ignoring the smoothing parameter, the confidence of this rule is 1. However, in fact, the head instance (The US envoy, Make a visit, China) standing at T_7 might be the result of the two intentions at (T_3, T_6) together. In other words, a single intention alone may not necessarily result in a visit, which intuitively contradicts the confidence being 1, since a confidence of 1 means that the rule head will surely hold true at a future timestamp as long as there exists a valid body instance. From this point of view, standard confidence calculation is not suitable for temporal settings.
In TR-Rules, we solve this issue by proposing a simple but effective window-based algorithm for confidence estimation called window confidence. Instead of counting the number of body instances whose rule heads stand in the future as the rule support, window confidence calculates the rule support based on the rule head instances. During the confidence estimation of each rule, we sort the timestamps of all the rule head instances and divide the timeline into windows by those timestamps. Each window starts at timestamp 0 or at the timestamp of a rule head instance, and ends at the timestamp next to its start point or at the last timestamp. In Figure 2, we obtain 3 windows according to the rule head instances. Starting from the first window, if there exists at least one body instance, we add 1 to the rule support and shift to the next window. By using window confidence, the confidence of the example rule above is 1/2, since only window 1 and window 2 involve valid body instances. What window confidence does is aggregate the temporal redundancy inside windows with a designed function. In TR-Rules, the function we utilize simply aggregates all body instances inside a window as one. From this point of view, the standard confidence applied in TLogic is a trivial version of window confidence that uses no aggregation function.
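The window-based support counting above can be sketched as follows (a minimal reading of the aggregation strategy for one grounded entity pair; since the exact denominator of the confidence is a design choice, only the support count is shown):

```python
from bisect import bisect_right

def window_rule_support(body_times, head_times):
    """Head timestamps delimit windows; every window containing at least
    one body instance contributes exactly 1 to the rule support, however
    many body instances fall inside it."""
    head_times = sorted(head_times)
    supported = set()
    for t in body_times:
        idx = bisect_right(head_times, t)  # first head strictly after t
        if idx < len(head_times):
            supported.add(idx)             # that head's window absorbs t
    return len(supported)
```

On the Figure 2 numbers (bodies at T1, T3, T6; heads at T2, T7, T9), this yields a support of 2 rather than 3: the two intentions at T3 and T6 are aggregated into a single support.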

Rule Learning
In this section we introduce the learning procedures of cyclic and acyclic rules respectively. Our cyclic rule mining algorithm is similar to the rule learning stage in TLogic. We first randomly sample a quadruple (e_1, r_h, e_{n+1}, t_{n+1}), which is considered to be the rule head. Then we perform random walks traversing back in time, starting from the entity e_{n+1}. For the first step, to satisfy the temporal restriction mentioned in Section 3.3, the random walker is confined to sample edges connecting to the current node at timestamps t_n with t_{n+1} > t_n. In the following steps, the walker iteratively samples adjacent edges whose timestamps t_i satisfy t_i ≤ t_{i+1} for 2 ≤ i ≤ n − 1. At the last step, the walker is restricted to sample a valid edge that connects to e_1, so that the walk together with the rule head forms a cyclic path. If there are no available edges at any step, the current random walk stops and we start a new one by randomly sampling a rule head. Since we traverse back in time during the learning procedure, when generating temporal rules we need to rebuild the path in reverse order by substituting relations in edges with their inverses and exchanging the objects and subjects, which gives: (e_1, r′_1, e_2, t_1), ..., (e_n, r′_n, e_{n+1}, t_n). Finally, we replace the entities and timestamps with variables, and the generated rules of length n are specific to relation r_h. Specifically, if an entity appears multiple times in a rule, it has to be replaced by the same variable.
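The backward walk above can be sketched as follows (an illustrative simplification: uniform sampling instead of the exponentially weighted distribution, and no inverse-relation traversal; the graph and entity names are hypothetical):

```python
import random

def sample_cyclic_walk(tkg, head, length, rng=None):
    """head = (e_1, r_h, e_{n+1}, t_{n+1}); walk backward in time from
    e_{n+1}, closing the cycle at e_1. Returns None on a dead end."""
    rng = rng or random.Random(0)
    e_start, _, e_cur, t_cur = head
    walk = [head]
    for step in range(length):
        # First step: strictly earlier edges; later steps: non-increasing time.
        if step == 0:
            cands = [q for q in tkg if q[2] == e_cur and q[3] < t_cur]
        else:
            cands = [q for q in tkg if q[2] == e_cur and q[3] <= t_cur]
        if step == length - 1:  # last step must reconnect to e_1
            cands = [q for q in cands if q[0] == e_start]
        if not cands:
            return None  # dead end: caller samples a fresh rule head
        s, r, o, t = rng.choice(cands)
        walk.append((s, r, o, t))
        e_cur, t_cur = s, t
    return walk

graph = [
    ("America", "Make_a_visit", "China", 5),
    ("America", "Intent_to_negotiate", "China", 2),
]
walk = sample_cyclic_walk(graph, ("America", "Make_a_visit", "China", 5), 1)
```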
In TR-Rules, when sampling a new edge from the current node, we use the exponentially weighted transition distribution, which can be defined as

P(t) = exp(t − t_c) / Σ_{t′ ∈ T_c} exp(t′ − t_c),

where t_c represents the timestamp of the current edge C, and T_c denotes the set of possible timestamps associated with edges available in the next step, defined in (2), where obj_c represents the object entity of the current edge, e ∈ E and G denotes the whole TKG. This distribution gives higher probability when t is close to t_c, which makes the random walker more likely to select temporally closer edges during sampling. This is based on the intuition that closer facts might contribute more to predictions than facts that happen at more distant timestamps.
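A sketch of this weighting (the normalized softmax over time differences; function name is illustrative):

```python
import math

def transition_probs(candidate_times, t_c):
    """P(t) ∝ exp(t − t_c) over the admissible timestamps T_c (all earlier
    than or equal to t_c), so timestamps nearer t_c get higher probability."""
    weights = [math.exp(t - t_c) for t in candidate_times]
    z = sum(weights)
    return [w / z for w in weights]

# Candidates further back in time receive exponentially smaller probability.
probs = transition_probs([1, 4, 5], t_c=5)
```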
In TR-Rules, we also propose a learning algorithm for acyclic rules. Similarly, we first sample a random quadruple (e_h, r_h, e_i, t_h) to be the rule head. Then, unlike the cyclic mining procedure, starting from e_h, the random walker selects an edge adjacent to e_h, namely (e_h, r_b, e_j, t_b) where t_b < t_h. We then replace e_h with a variable and set e_i, e_j to be fixed. Consequently, an acyclic rule specific to r_h is obtained.
When estimating confidence, we first sample a fixed number of instantiated paths and then leverage the window confidence described above. The output of the rule learning module is a set of rules with their corresponding confidences.

Rule Application
Given a query Q: (s, r, ?, t), we first find the rules whose head relations match r in Q and sort them in descending order of confidence. Then, for each rule, we traverse the TKG to find matching body instances which satisfy the temporal constraints; the object entity in each grounded rule head is considered an answer candidate if there exist valid body instances. For a candidate answer c generated by rule R, we leverage the score function f proposed in TLogic to measure the plausibility of (s, r, c, t) being true:

f(R, c) = a · conf(R) + (1 − a) · exp(−λ (t − t_1(B(R, c)))),   (3)

where a, λ are hyper-parameters, B(R, c) denotes the body instances of rule R that lead to candidate c, and t_1(b) is the earliest timestamp in the body instance b. If there are multiple body instances, t_1 returns the one closest to t. This score function considers not only the rule confidence but also the time difference. Obviously, the confidence part enables candidates derived by rules with higher confidence to receive higher scores.
The time difference part assigns higher scores to candidates generated by temporally close body instances, based on the observation that edges in rules tend to hold true when the time difference is low. Normally, one candidate can be generated by multiple rules, so we use noisy-OR to aggregate the scores of c obtained by different rules. The final score of c is calculated as

score(c) = 1 − Π_{x ∈ C} (1 − x),

where C represents all the scores of candidate c derived from multiple rules. The intuition behind noisy-OR is that the result represents the possibility that at least one rule holds for candidate c. It is possible that no learned rule can answer the given query. In this paper, we follow the simple baseline proposed in TLogic, which generates candidate answers according to the object distribution in the training set.
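The scoring and aggregation above can be sketched as follows (a hedged sketch of the TLogic-style score in Eq. (3) and the noisy-OR combination; function names are illustrative):

```python
import math

def rule_score(conf, t_query, t_body, a=0.5, lam=1.0):
    """a·conf + (1−a)·exp(−λ(t_query − t_body)): high-confidence rules and
    temporally close body instances both raise the score."""
    return a * conf + (1 - a) * math.exp(-lam * (t_query - t_body))

def noisy_or(scores):
    """1 − Π(1 − s): probability that at least one rule fires for c."""
    p = 1.0
    for s in scores:
        p *= 1.0 - s
    return 1.0 - p
```

For instance, two rules each scoring a candidate at 0.5 combine to 0.75 under noisy-OR, and for a fixed confidence, a body instance closer to the query time always scores higher.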

Experimental Setup
Datasets We evaluate TR-Rules on three datasets: ICEWS14, ICEWS18 and ICEWS05-15. All three are subsets of the Integrated Crisis Early Warning System (Boschee et al., 2015), which store political events taking place in 2014, 2018 and from 2005 to 2015, respectively. Under TKG extrapolation settings, each dataset is split into train, validation and test sets according to timestamps in ascending order, so that the splits do not overlap in time. The statistics of the three datasets are listed in Table 1.

Metrics and Implementation Details
The metrics we use to evaluate TR-Rules are mean reciprocal rank (MRR) and Hits@1/3/10. For N test queries, MRR = (1/N) Σ_i 1/rank_i, and Hits@K = (1/N) Σ_i I(rank_i ≤ K), where I is the indicator function. In this paper, we use the time-aware filtering protocol proposed in xERTE (Han et al., 2021), which, given a query q: (s, r, ?, t), filters out all the true entities {o | (s, r, o, t) ∈ G_t} except the answer entity of q. Compared with the traditional filtering protocol, which filters out every true entity o as long as the triplet (s, r, o) holds at any timestamp, time-aware filtering is more reasonable.
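Given the (time-aware filtered) rank of each answer entity, the two metrics reduce to a few lines:

```python
def mrr(ranks):
    """Mean reciprocal rank over the filtered ranks of the true answers."""
    return sum(1.0 / r for r in ranks) / len(ranks)

def hits_at_k(ranks, k):
    """Fraction of queries whose answer is ranked within the top k."""
    return sum(1 for r in ranks if r <= k) / len(ranks)
```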
As for the implementation details, we learn cyclic rules of lengths 1, 2 and 3, and acyclic rules. We set the number of sampled walks during the rule learning stage to 200 for cyclic rules on all datasets; for acyclic rules, we set it to 5000 on ICEWS14 and 1000 on ICEWS05-15 and ICEWS18. During confidence estimation, we sample 500 body instances for all kinds of rules. The smoothing parameter c is 3, and we filter out rules with confidence less than 0.01 or fewer than 2 body instances. We set w in (1) to 1 for cyclic rules of all lengths and 0.5 for acyclic rules. a, λ in (3) are set to 0.5 and 1 for all three datasets. Following TLogic, we also set time windows during rule application, which filter out facts outside the period spanning from t_q − w_size to t_q, where t_q represents the timestamp of the query and w_size denotes the time window size. We set w_size to 200 for ICEWS18 and 1000 for ICEWS05-15, and use all facts on ICEWS14, the same as TLogic does.

Main Results
Table 2 reports the results of TR-Rules and all baseline models under the time-aware filtering protocol on the three datasets. TR-Rules achieves state-of-the-art performance on all metrics except Hits@10 on ICEWS14 and Hits@1 on ICEWS05-15, where it gives the second-best results. Table 3 presents some high-confidence rules of various types mined by TR-Rules, which demonstrates the better interpretability of TR-Rules.

Ablation Study
To study the contribution of acyclic rules and the impact of Temporal Redundancy on different kinds of rules, we perform an ablation study on the three datasets; the results are given in Table 4. As for the notation, C1 means we only mine cyclic rules of length 1, A means we only mine acyclic rules, C means we mine cyclic rules of lengths 1, 2 and 3, and R means we use window confidence to address Temporal Redundancy. TR-Rules(C) is equivalent to TLogic, which mines cyclic rules in TKGs.
We can see that the introduction of window confidence, which addresses Temporal Redundancy, improves the performance of TR-Rules in most cases, whether mining only cyclic rules, only acyclic rules, or both. Besides, window confidence improves the performance of acyclic rules more than that of cyclic rules on all three datasets. A possible explanation is that, first, since we mine more acyclic rules, the superiority of window confidence is amplified. In addition, without the cyclic restriction, the matched body instances of acyclic rules are slightly less likely to be relevant to the corresponding head instances. In other words, acyclic rules suffer more from temporal redundancy due to their acyclic form, and the aggregation operation in window confidence alleviates this issue. This also explains why TR-Rules(C+A), which merely adds acyclic rules compared to TLogic, gives worse results than TLogic does. In other words, under temporal settings, window confidence provides the proper way to make use of acyclic rules by fixing the bias in confidence calculation caused by temporal redundancy. As discussed in TLogic, we can also see that cyclic rules of length 1 alone give competitive performance on all datasets.

Acyclic Rules Performance Analysis
We also study the performance of both kinds of rules on ICEWS14 when the number of walks sampled during rule learning is set to different values. In Figure 3, the orange line displays the performance of acyclic rules with standard confidence, the blue line that of acyclic rules with window confidence, and the green line that of cyclic rules with standard confidence. As we can see in Figure 3, the performance of acyclic rules with either confidence calculation algorithm improves dramatically as the number of sampled walks grows. The improvements are up to 14.89% (Hits@10) and 10.07% (MRR). In comparison, the performance of cyclic rules changes by no more than 0.2% as the number of walks grows. A possible reason is that acyclic rules involve fixed entities (e.g., South Korea instead of Country) while entities are replaced with variables in cyclic rules. Hence, acyclic rules need more samples to capture enough information in TKGs, whereas cyclic rules need fewer sampled walks because only some specific walks can be generalized into a paradigm. The performance of cyclic rules even gets slightly worse, because 200 sampled walks are enough for mining strong and general rules, and sampling more walks yields rare but accidental rules which might hurt inference accuracy. Thus, we conclude that acyclic rules are more sensitive to the number of walks and yield promising results.

Temporal Redundancy Impact Analysis
As discussed in Section 5.3, window confidence benefits the performance of both cyclic and acyclic rules. In Figure 3, we can clearly see that the difference between the orange and blue lines increases, which means the impact of Temporal Redundancy also grows as the number of sampled walks gets larger, and window confidence resolves it effectively. Figure 4 depicts the distribution of rule confidence calculated by the standard and window confidence algorithms. As shown in the left figure, which displays the confidence distribution of all rules, confidence computed by the window confidence algorithm tends to fall in the lower range. This phenomenon coincides with the intuition that by solving temporal redundancy, TR-Rules gives lower but more practical confidence under temporal settings, which can be viewed as a refinement of temporal rules in which the original inflated confidence is corrected. It demonstrates that our method indeed works and verifies our motivation. The remaining two figures display the confidence distributions of acyclic and cyclic rules respectively. The difference between the two distributions is larger in the middle figure, which supports our conclusion that acyclic rules suffer more from temporal redundancy. It further demonstrates that our proposed window confidence serves as an effective way to utilize acyclic rules in TKGs for the extrapolation task.

Conclusion
In this paper, we define the problem of Temporal Redundancy and propose TR-Rules, a rule-based TKG extrapolation model which solves Temporal Redundancy by replacing standard confidence with window confidence. TR-Rules is also the first to mine acyclic rules in TKGs, which are shown to have more promising performance than cyclic rules because of their higher sensitivity to the number of sampled random walks. Experimental results show that TR-Rules achieves state-of-the-art performance on most metrics across three datasets. As for future work, it is promising to explore more sophisticated aggregation methods for window confidence calculation, such as employing clustering or machine learning methods.

Limitations
As discussed in Section 4.1, the aggregation function in window confidence is quite simple. It cannot model the interactions among body instances in different windows, which might result in inaccurate confidence in some cases. Besides, for rule-based models, especially random-walk-based models like TR-Rules and TLogic, it is impractical to mine very long rules due to the unaffordable time consumption. In TR-Rules, we mine cyclic rules of lengths 1, 2 and 3 and acyclic rules of length 1. Thus, as discussed in the conclusion, developing more sophisticated aggregation functions for window confidence might yield better performance. Moreover, employing rule learning algorithms other than random walk sampling might obtain more high-quality rules.

Figure 1 :
Figure 1: An example subgraph of a temporal knowledge graph involving political facts; cyclic and acyclic rules exist in it which provide crucial information for link forecasting.

Figure 2 :
Figure 2: Illustration of the issue of Temporal Redundancy and window confidence.

Figure 3 :
Figure 3: Results of cyclic and acyclic rules with different numbers of sampled walks on ICEWS14.

Figure 4 :
Figure 4: The confidence distribution of rules mined by TR-Rules(window) and TLogic(standard) on ICEWS0515.
Equation (2), the set of admissible timestamps T_c for the next step:

T_c = {t | (obj_c, r, e, t) ∈ G and t < t_c}    for step 1,
      {t | (obj_c, r, e_1, t) ∈ G and t ≤ t_c}  for step n,
      {t | (obj_c, r, e, t) ∈ G and t ≤ t_c}    otherwise.   (2)

Table 2 :
Results of TR-Rules on ICEWS14, ICEWS05-15 and ICEWS18. Best results are in bold and second-best results are underlined.

Table 4 :
Ablation study of TR-Rules on ICEWS14, ICEWS05-15 and ICEWS18. Best results are in bold and second-best results are underlined.