Pre-trained Language Model with Prompts for Temporal Knowledge Graph Completion

Temporal knowledge graph completion (TKGC) is a crucial task that involves reasoning at known timestamps to complete missing parts of facts, and it has attracted increasing attention in recent years. Most existing methods focus on learning representations based on graph neural networks while inaccurately extracting information from timestamps and insufficiently utilizing the information implied in relations. To address these problems, we propose a novel TKGC model, namely Pre-trained Language Model with Prompts for TKGC (PPT). We convert a series of sampled quadruples into pre-trained language model inputs and convert intervals between timestamps into different prompts to form coherent sentences with implicit semantic information. We train our model with a masking strategy that converts the TKGC task into a masked token prediction task, which can leverage the semantic information in pre-trained language models. Experiments on three benchmark datasets and extensive analysis demonstrate that our model is highly competitive with other models on four metrics, and that it can effectively incorporate information from temporal knowledge graphs into language models.


Introduction
In recent years, temporal knowledge graphs (TKGs) have attracted much attention. TKGs describe each fact as a quadruple (subject, relation, object, timestamp). Compared to static knowledge graphs, TKGs need to consider the impact of timestamps on events. For example, (Donald Trump, PresidentOf, America, 2018) holds while (Donald Trump, PresidentOf, America, 2022) does not. Since there are missing entities or relations in TKGs, temporal knowledge graph completion (TKGC) is one of the most important tasks on temporal knowledge graphs. The TKGC task can be divided into two categories: the interpolation setting and the extrapolation setting (Jin et al., 2020). The interpolation setting aims to predict missing facts at known timestamps, while the extrapolation setting attempts to infer future facts at unknown ones. The latter is much more challenging, and in this work we focus on the extrapolation setting. Some TKGC methods are developed from static knowledge graph completion (KGC), such as adding time-aware score functions to KGC models (Jiang et al., 2016; Dasgupta et al., 2018), adding time-aware relational encoders to graph neural networks (Jin et al., 2020; He et al., 2021), and adding a new time dimension to tensor decomposition (Lacroix et al., 2020; Shao et al., 2022). In addition to these KGC-based models, reinforcement learning (Sun et al., 2021), time-aware neural network modeling (Zhu et al., 2021), and other methods have also been applied to TKGC. However, the methods mentioned above have some drawbacks, as follows:

(1) Insufficient temporal information extraction from timestamps. Most existing TKGC methods model timestamps explicitly or implicitly. Explicit modeling utilizes low-dimensional vectors to represent timestamps. However, real-life timestamps are infinite, so explicit modeling cannot learn representations of all timestamps or predict events with unseen timestamps. Implicit modeling does not represent timestamps directly but uses timestamps to connect multiple knowledge graphs by determining their sequential relationship. This approach often requires modeling the knowledge graphs one by one, which demands a lot of computation, and the timestamps are used only to determine the order in which things happen. None of the above methods makes full use of the temporal information in timestamps.
(2) Insufficient mining of associations between relations in TKGC. Existing methods often focus on the structural information of triples or quadruples when modeling KGs, without sufficient consideration of the information implied in relations. This problem is particularly evident in TKGs because some relations carry potential temporal hints. As shown in Figure 1, for three different pairs of subject and object entities, after establishing the relation Discuss by telephone, they all establish the relation Consult one day later. If the relation Discuss by telephone is established between a pair of entities, there is a high probability that they will establish the relation Consult within a short period. Among the entity pairs in ICEWS14, there are 10,887 types of relation pairs, of which 2,652 exhibit obvious temporal correlations: one relation in the pair tends to occur before the other, with a stable time interval between them.
To address these problems, we propose a novel temporal knowledge graph completion method based on pre-trained language models (PLMs) and prompts. TKGs contain timestamps, and events occurring at different times have sequential relationships with each other, which makes them well-suited as inputs to sequence models. Inspired by the successful application of pre-trained language models to static knowledge graph representation (Yao et al., 2019; Kim et al., 2020; Petroni et al., 2019; Lv et al., 2022), we apply PLMs to temporal knowledge graph completion to obtain implicit semantic information. However, simply splicing entities and relations in the input of PLMs generates incoherent sentences, which prevents PLMs from being exploited fully (Lv et al., 2022). Therefore, we sample the quadruples in TKGs and construct prompts for each type of timestamp, which we call time-prompts. We then train PLMs with a masking strategy. In this way, TKGC can be converted into a masked token prediction task.
The contributions of our work can be summarized as follows:
• To the best of our knowledge, we are the first to convert the temporal knowledge graph completion task into a pre-trained language model masked token prediction task.
• We construct prompts for each type of interval between timestamps to better extract semantic information from timestamps.
• We evaluate our model on a series of ICEWS datasets and achieve satisfactory results compared to graph neural network learning methods.
Related Work

Static KG representation
Static KG representation learning can roughly be divided into distance-based models, semantic matching models, graph neural network models, and PLM-based models.
PLM-based models have also been considered for static KG representation in recent years due to their ability to capture contextual information. KG-BERT (Yao et al., 2019) first introduces PLMs into static KG representation. Among PLM-based models, prompt learning has attracted much attention in recent years and has been shown to be effective on many NLP tasks. LAMA (Petroni et al., 2019) first probes the knowledge stored in PLMs with prompts.
Other prompt-based models building on LAMA are dedicated to improving the representation of KGs through automatic prompt generation or by adding soft prompts (Shin et al., 2020; Zhong et al., 2021; Liu et al., 2021). PKGC (Lv et al., 2022) proposes a new prompt-learning method based on KG-BERT to accommodate the open-world assumption.

Temporal KG representation
Temporal KG representation requires consideration of how facts are modeled in time series. Some temporal KG representation models are extended from static models. TTransE (Jiang et al., 2016) incorporates temporal information into the scoring function of TransE (Bordes et al., 2013), and HyTE (Dasgupta et al., 2018) extends TransH (Wang et al., 2014) similarly. TNT-ComplEx (Lacroix et al., 2020) extends ComplEx (Trouillon et al., 2016), inspired by the CP decomposition of order-4 tensors.
These extended approaches treat timestamps as an additional dimension but lack consideration from a temporal perspective. Some models attempt to combine message passing and temporal information to solve this problem. RE-NET (Jin et al., 2020) applies R-GCN (Schlichtkrull et al., 2018) for message passing within each snapshot and then uses temporal aggregation across multiple snapshots. HIP Network (He et al., 2021) utilizes structural information passing and temporal information passing to model snapshots. RE-GCN (Li et al., 2021) uniformly encodes the evolutional representations of entities and relations corresponding to different timestamps for the extrapolation TKGC task.
Besides, some models use other strategies to model TKGs. CyGNet (Zhu et al., 2021) uses a copy mode and a generation mode to predict missing entities with neural networks and a time dictionary. TITer (Sun et al., 2021) introduces reinforcement learning into TKG representation learning.

Preliminary
A temporal knowledge graph G is a network of entities and relations with timestamps. It can be defined as G = {E, R, T, Q}, where E is the set of entities, R is the set of relations, and T is the set of timestamps. Q = {(s, r, o, t)} ⊆ E × R × E × T is the quadruple set, where s and o are the subject entity (head entity) and object entity (tail entity), and r is the relation between them at timestamp t. G_t = {(s, r, o) ⊆ E × R × E} is called the TKG snapshot at t, and it can be viewed as a static KG obtained by filtering the triple set of G at t.
Temporal knowledge graph completion (TKGC) is the task of predicting the evolution of future KGs given the KGs of a known period. Given a quadruple (s, r, ?, t_n) or (?, r, o, t_n), we use the set of known facts from the TKG snapshots G_{t_i < t_n} to predict the missing object or subject entity. The probability of predicting the missing entity o in the quadruple (s, r, ?, t_n) can be formalized as follows:

p(o | s, r, t_n, G_{t_i < t_n}),  (1)

where G_{t_i < t_n} denotes all snapshots before t_n.
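To make the notation concrete, here is a minimal sketch (ours, not from the paper) of the quadruple set Q and the snapshot filtering that defines G_t:

```python
from collections import namedtuple

# A fact in a temporal KG: (subject, relation, object, timestamp).
Quadruple = namedtuple("Quadruple", ["s", "r", "o", "t"])

def snapshot(quads, t):
    """Return the TKG snapshot G_t: the set of triples that hold at timestamp t."""
    return {(q.s, q.r, q.o) for q in quads if q.t == t}

quads = [
    Quadruple("Donald Trump", "PresidentOf", "America", 2018),
    Quadruple("Joe Biden", "PresidentOf", "America", 2022),
]
print(snapshot(quads, 2018))  # {('Donald Trump', 'PresidentOf', 'America')}
```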

Methodology
In this paper, we propose PPT, a novel PLM-based model with prompts for the TKGC task. The framework of our model is illustrated in Figure 2. We sample quadruples and convert them into pre-trained language model inputs. The prediction for the [MASK] token is the completion result.

Prompts
We design different prompts for entities (ent-prompts), relations (rel-prompts), and timestamps (time-prompts) to convert quadruples into a form suitable for input to PLMs. We add a soft prompt [EVE] before the beginning of each fact tuple, because introducing soft prompts into the input sentences can improve their expressiveness (Han et al., 2021).

Ent-prompts. We convert each entity into a special token [ENT-i] according to its index. We use a special token instead of the entity's name because, in the prediction task, we need to predict the whole entity rather than a part of it. To retain the semantic information of entities, we average-pool the embeddings of all words in each entity as the initial embedding of its token.

Rel-prompts. We convert each relation into its original phrase. It is worth noting that, to maintain the coherence of sentences, we supplement each relation with the preposition it is missing; for example, we supplement the relation Make a visit to Make a visit to.

Time-prompts. We convert the time interval between two timestamps into a phrase that describes the period. We construct a dictionary called the interval-dictionary, which maps each period to a prompt; as shown in Figure 3, each interval is converted into its corresponding time-prompt (e.g., [SHT] on the same day).

Figure 2: Illustration of PPT for TKGC. Quadruples are sampled and normalized to be converted into PLM inputs with prompts. We calculate the time intervals of adjacent quadruples in a TSG to get a TIG. We use the prompts to convert the TIG into the input of the PLM and then make the prediction for the mask. In this way, the TKGC task is converted into a pre-trained language model masked token prediction task.
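As an illustration of the interval-dictionary, a lookup function might look like the sketch below. The bucket thresholds follow the Figure 3 caption ([SHT] for Δt ≤ 60, [MID] for 60 < Δt ≤ 365, [LNG] for Δt > 365); the exact phrase set and sub-bucket boundaries are our assumption, not the paper's.

```python
def time_prompt(delta_days: int) -> str:
    """Map a time interval (in days) to a time-prompt phrase.
    Thresholds follow Figure 3; phrases are illustrative."""
    if delta_days == 0:
        return "[SHT] on the same day"
    if delta_days <= 7:
        return "[SHT] during the week"
    if delta_days <= 60:
        return "[SHT] after a few weeks"
    if delta_days <= 365:
        return "[MID] after a few months"
    return "[LNG] after more than one year"
```

Such a fixed dictionary keeps the prompt vocabulary small, so the PLM sees a handful of recurring temporal phrases rather than an unbounded set of timestamps.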

Construction for Graphs
Unlike some static knowledge graph models that sample one fact tuple as the input to a pre-trained language model (Yao et al., 2019; Lv et al., 2022), we sample multiple fact tuples simultaneously because we need to model the temporal relationships between facts. For each quadruple in the training dataset, we take its head/tail entity and randomly sample quadruples from the entire training dataset while fixing that head/tail entity. The sampled quadruples are then arranged in chronological order. We demonstrate different sampling strategies in A.1. The sampled list is called a Temporal Specialization Graph (TSG), which can be described as a time-ordered sequence TSG = [q_0, q_1, ..., q_n]. We have three types of TSG, namely TSG^s_obj, TSG^r_sub, and TSG^o_rel, e.g.,

TSG^s_obj(n) = [q^s_0, q^s_1, ..., q^s_n],  (2)

where we fix the object entity obj to sample TSG^s_obj, fix the subject entity sub to sample TSG^r_sub, and fix the relation rel to sample TSG^o_rel. We set a minimum sampling quadruple number K_min and a maximum sampling quadruple number K_max.
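The sampling step can be sketched as follows (a simplification under our own reading: quadruples are plain tuples, and the fixed element is selected by field index; the paper's actual sampler may weight entities differently, see A.1):

```python
import random

def sample_tsg(quads, field, value, k_min, k_max, rng=random):
    """Sample a Temporal Specialization Graph (TSG): pick between k_min and
    k_max quadruples sharing a fixed element, sorted by timestamp.
    `field` is the index of the fixed element: 0=subject, 1=relation, 2=object."""
    pool = [q for q in quads if q[field] == value]
    k_hi = min(k_max, len(pool))
    k = rng.randint(min(k_min, k_hi), k_hi)
    return sorted(rng.sample(pool, k), key=lambda q: q[3])
```

Sorting by the timestamp field (index 3) gives the chronological ordering the TSG requires.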
The timestamps in TSGs are independent and cannot reflect the temporal relationships between events. We convert each TSG into a Time Interval Graph (TIG) by calculating the time intervals of adjacent quadruples. We take the earliest time in the TSG as the initial time τ_0 and calculate the interval Δ_i between the timestamp in (s_i, r_i, o_i, t_i) and the timestamp of the preceding quadruple, replacing each timestamp with its interval:

TIG = [(q^*_0(s, r, o), Δ_0), (q^*_1(s, r, o), Δ_1), ..., (q^*_n(s, r, o), Δ_n)],  (3)

where q^*_i(s, r, o) means keeping the fact triple (s_i, r_i, o_i) of q^*_i.
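One simple reading of this conversion measures every interval from the earliest timestamp τ_0 (a sketch under that assumption; the paper also speaks of intervals between adjacent quadruples):

```python
def tsg_to_tig(tsg):
    """Convert a TSG into a Time Interval Graph (TIG): keep each fact triple
    and replace its timestamp with the offset from the earliest timestamp."""
    tau0 = min(t for (_s, _r, _o, t) in tsg)
    return [((s, r, o), t - tau0) for (s, r, o, t) in tsg]
```

With the paper's own example, the quadruples (49, 62, 12, 2) and (49, 38, 18, 130) yield intervals 0 and 128 days, matching the 128-day gap described in Section "Objective optimization discussion".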

Training
Our training strategy is summarized in Algorithm 1. We do not train on each quadruple separately in each epoch, because we believe that independent quadruples cannot provide the temporal information in TKGs. We sample each entity multiple times by fixing it at the object entity position and at the subject entity position, thus generating the TSGs of entities. Similarly, we fix the relations in the quadruples and generate the TSGs of each relation. We then convert all TSGs to TIGs. For each quadruple in a TIG, we convert the entities, relation, and time interval into PLM inputs with the prompts described in Section 4.1. We use a pre-trained language model with the masking strategy (also known as a masked language model, MLM) (Devlin et al., 2019) to train our model. Masked language models aim to predict masked parts based on their surrounding context. During training, we mask 30% of the tokens in an input sequence.
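The 30% masking step can be sketched as below (our simplification: tokens are strings and targets are returned as a position-to-token map; a real implementation would operate on token ids in a tensor):

```python
import random

def mask_tokens(tokens, rate=0.30, rng=random):
    """Randomly replace `rate` of the tokens with [MASK]; return the masked
    sequence and a dict of {position: original token} as MLM targets."""
    n = max(1, round(len(tokens) * rate))
    positions = rng.sample(range(len(tokens)), n)
    masked, targets = list(tokens), {}
    for i in positions:
        targets[i] = masked[i]
        masked[i] = "[MASK]"
    return masked, targets
```

The model is then trained to recover each target token from its surrounding context, which is exactly the prediction interface PPT reuses at test time for the missing entity.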

Objective optimization discussion
The distribution of all facts in Eq 1 can be considered as the joint distribution of facts over all timestamps:

p(o | s, r, t_n, G_{t_i < t_n}) = p(o | s, r, t_n, {G_{t_0}, G_{t_1}, ..., G_{t_{n-1}}}).  (4)

It is not realistic to attend to all quadruples in the TKG. When predicting missing subject entities, we fix the object entities, because the relations in the neighborhood are of most interest to entities. Further, we approximate the original quadruple distribution by sampling, so Eq 4 can be approximated as

p(o | s, r, t_n, G_{t_i < t_n}) ≈ p(o | s, r, t_n, {q_1, q_2, ..., q_K}),  (5)

where K is the number of samples.
We calculate the generation probability of the quadruples through the pre-trained language model's ability to predict unknown words. We use seq_k to denote the converted input, with prompts, of the k-th sampled sequence. For example, as illustrated in Figure 2, consider two quadruples in a TSG: (49, 62, 12, 2) at timestamp t_1 and (49, 38, 18, 130) at timestamp t_{n−1}. The time interval between them is Δ_1 = t_{n−1} − t_1 = 128 days, so the quadruple (49, 38, 18, 130) becomes (49, 38, 18, 128) in the TIG. The formalization of prediction can be defined as follows:

p(o | seq_k) = PLM(seq_k),  (7)

where PLM(·) means inputting a sequence into the pre-trained language model. Combining Eq 1 and Eq 7, we convert the TKGC task into an MLM prediction task:

p(o | s, r, t_n, G_{t_i < t_n}) ≈ PLM(Prompt(s, r, t_n, {q_1, ..., q_K})),  (8)

where Prompt(·) means converting entities, relations, and timestamps into input sequences for the PLM.
By Eq 8, the original knowledge-completion task can be equated to the pre-trained language model masked token prediction task.

Experimental Setup
Datasets. The Integrated Crisis Early Warning System (ICEWS) (Boschee et al., 2015) is a repository that contains coded, timestamped interactions between socio-political actors. We use three TKG datasets based on ICEWS for evaluation: ICEWS05-15 ((García-Durán et al., 2018); from 2005 to 2015), ICEWS14 ((García-Durán et al., 2018); from 1/1/2014 to 12/31/2014), and ICEWS18 ((Boschee et al., 2015); from 1/1/2018 to 10/31/2018). Statistics of these datasets are listed in Table 1.

Evaluation Protocols. Following prior work (Li et al., 2021), we split each dataset into training, validation, and testing sets in chronological order, following the extrapolation setting; thus we guarantee that timestamps of train < timestamps of valid < timestamps of test. Some methods (Jin et al., 2020; Zhu et al., 2021; Wu et al., 2020) apply the filtered schema, which evaluates results after removing from the ranking list all valid facts that appear in the training, validation, or test sets. Since TKGs evolve over time, the same event can occur at different times (Li et al., 2021). Therefore, we apply the raw schema, removing nothing. We report Mean Reciprocal Rank (MRR) and Hits@1/3/10 (the proportion of correct test cases ranked within the top 1/3/10) for our approach and the baselines under the raw schema.

Baselines. We compare our model with two categories of models: static KGC models and TKGC models. We select DistMult (Yang et al., 2015), ComplEx (Trouillon et al., 2016), R-GCN (Schlichtkrull et al., 2018), ConvE (Dettmers et al., 2018), ConvTransE (Shang et al., 2019), and RotatE (Sun et al., 2019) as static models. We select HyTE (Dasgupta et al., 2018), TTransE (Jiang et al., 2016), TA-DistMult (García-Durán et al., 2018), RGCRN (Seo et al., 2018), CyGNet (Zhu et al., 2021), RE-NET (Jin et al., 2020), and RE-GCN (Li et al., 2021) as TKGC baselines.

Hyperparameters. We use bert-base-cased as our pre-trained model. It has been pre-trained on a large corpus of English data in a self-supervised fashion and has a parameter size of 110M, with 12 layers, 12 attention heads, and a hidden embedding size of 768. Without loss of generality, we also report results with other pre-trained models in A.4. The input sequence length, minimum sampling number, and maximum sampling number for each dataset are listed in Table 2. During training, we mask 30% of tokens randomly, and we choose AdamW as our optimizer. The learning rate is set to 5e-5. We provide a detailed analysis of the parameters in A.2.
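For reference, the raw-schema metrics reduce to a simple computation over the rank of each ground-truth entity (a standard formulation, not code from the paper):

```python
def mrr_and_hits(ranks, ks=(1, 3, 10)):
    """Compute MRR and Hits@k from the raw rank (1 = best) of each
    ground-truth entity; no filtering is applied (raw schema)."""
    n = len(ranks)
    mrr = sum(1.0 / r for r in ranks) / n
    hits = {k: sum(r <= k for r in ranks) / n for k in ks}
    return mrr, hits
```

Under the filtered schema, each rank would first be reduced by the number of other valid facts scored above the ground truth; the raw schema skips that adjustment.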

Results
We report the results of PPT and the baselines in Table 3. PPT substantially outperforms all static models. Compared with ConvTransE, which achieves the best results among static models, we obtain 28.3%, 21.97%, and 14.69% improvements in the MRR metric on the three datasets, respectively. We believe temporal information matters in TKGC tasks, while static models do not utilize it.
PPT also performs better than HyTE, TTransE, and TA-DistMult, which are designed under the interpolation setting. For instance, we achieve 41.22%, 46.53%, and 62.18% improvements in the MRR metric on the three datasets compared to TA-DistMult. We believe that HyTE and TA-DistMult only focus on independent graphs and do not establish temporal correlations between graphs, while TTransE embeds timestamps into the scoring function without taking full advantage of them.
On ICEWS05-15 and ICEWS14, PPT achieves the best results among TKGC models in the MRR, Hits@1, and Hits@3 metrics. For instance, PPT improves on the second-best result by 6.5% in the Hits@1 metric. On ICEWS18, there is a slight gap between PPT and the best model, RE-GCN. We believe this is because ICEWS18 has more entities than the other datasets; GNN-based models using the message-passing mechanism have a better learning ability for graphs with many nodes. Furthermore, RE-GCN adds additional edges for the static parts of the graph to assist learning.
Besides the masking strategy, we also attempt other ways of applying pre-trained language models, which are illustrated in A.3.

Ablation study
To investigate the contribution of time-prompts in our model, we conduct ablation studies by testing different variants on all datasets under the same parameter settings. The experimental results are shown in Table 4.
PPT w/o prompts denotes PPT without time-prompts. In this variant, we set all timestamps to 0. To ensure that the sequence length does not affect the experiments, we replace all time-prompts with on the same day. PPT w/o prompts gets worse results than raw PPT on all metrics on the three datasets, except for Hits@10 on ICEWS14. ICEWS14 has fewer entities and less data than the other two datasets, so it is possible to achieve better results on some metrics after removing the timestamps.
PPT rand prompts denotes PPT with randomly set timestamps: we replace the raw timestamps in quadruples with other, random timestamps. If our model did not correctly learn the timestamp information, random timestamps should not affect the results. As shown in Table 4, the raw model beats this variant on all metrics. These experiments demonstrate that applying time-prompts in our model benefits the learning of temporal information between events.

Attention analysis
To show visually that our model can learn from temporal knowledge graphs, we visualize the attention patterns of PPT in Figure 4. We need to complete the missing tail entity in the test quadruple (263, 104, ?, 7536). As mentioned, we sample data from before timestamp 7536 to form the input sequence and obtain the attention weights from the pre-trained model. In this example, the ground truth is . We observe that our model predicts the [MASK] token by considering all the previous samples together. PPT notes that the same relation physical assault occurred a day earlier and captures the temporal information from the tokens the, next, and day. Therefore, PPT can make correct predictions based on historical events and chronological relationships.

Time-sensitive relation analysis
Using ICEWS05-15 as an example, we analyze the time-sensitive relations present in the dataset.
For different relations between the same pairs of entities, there is a clear order of occurrence among some of them.For example, the relation Obstruct passage, block is always followed by ones related to assistance, such as Appeal for aid, Appeal for humanitarian aid, and Provide humanitarian aid.
Similarly, the relation Acknowledge or claim responsibility is always followed by those related to negotiation, such as Express intent to cooperate militarily, Meet at a 'third' location, and Demand material cooperation.We provide more examples in A.5.

Conclusions
This paper proposes a novel temporal knowledge graph completion model, Pre-trained Language Model with Prompts for TKGC (PPT). We use prompts to convert entities, relations, and timestamps into pre-trained model inputs and turn the TKGC problem into a masked token prediction problem. In this way, we can accurately extract temporal information from timestamps and sufficiently utilize the information implied in relations. Our method achieves promising results compared to other temporal graph representation learning methods on three benchmark TKG datasets. For future work, we plan to improve the sampling method in temporal knowledge graphs to obtain more time-specific inputs. We are also interested in combining GNNs and pre-trained language models for temporal knowledge graph representation learning.

Limitations
This paper proposes a pre-trained language model with prompts for temporal knowledge graph completion. However, our method has some limitations: 1) The prompts for the temporal knowledge graphs, especially the time-prompts, are built manually and need to be reconstructed for different knowledge graphs. We are exploring a way to build prompts for temporal knowledge graphs automatically. 2) Our model uses a random sampling method, which suffers from few high-quality training samples and high sample noise. For future work, a more effective sampling method is worth exploring.

A Appendix
A.1 Sampling Analysis

We design two sampling strategies: a uniform sampling strategy and a frequency-based sampling strategy. The uniform sampling strategy assigns equal sampling weights to all entities. The frequency-based sampling strategy assigns different weights to entities based on how frequently each entity appears in the dataset, so entities with more occurrences have a higher probability of being sampled.
The comparison of the two strategies is shown in Table 6.

A.2 Hyperparameter Analysis
To test the effect of different sequence lengths and maximum sampling numbers on model performance, we analyze these hyperparameters on ICEWS14. Due to GPU performance limitations, we do not run experiments with longer sequences.

As shown in Table 7, we obtain the best results with seq_len = 256 and max_sample = 12. We believe that the effect of sequence length is small while the number of samples matters: a larger number of samples provides more contextual semantic information for the prediction, but overly long sampling causes a decline in effectiveness by diverting attention from the most useful information.

A.3 Variants
In addition to the model proposed in the paper, we also try some variants; all experiments use seq_len = 256 and max_sample = 12 on ICEWS14. As demonstrated in Table 8, PPT_CLS does not use the mask training strategy but instead classifies with the [CLS] token and a fully connected layer as the decoder; PPT_LSTM uses a bi-directional LSTM to encode all tokens, max-pools the output embeddings, and uses a fully connected layer as the decoder. Neither model achieves satisfactory results compared to our raw model. For PPT_CLS, using only the sequence embedding to predict the result is not enough, because the sequence embedding suits classification tasks that focus on the whole input sequence, whereas our task needs to consider the impact of each token. For PPT_LSTM, we believe that the representation learned by the pre-trained language model is high-level semantic knowledge, especially when additional tokens (entities and relations) are added; simple neural network models cannot capture this high-level semantic knowledge and instead decrease effectiveness.

A.4 Different PLMs
Besides bert-base-cased, we also try other pre-trained language models: bert-base-uncased and bert-large-cased. All experiments use seq_len = 128, min_sample = 2, and max_sample = 8 on ICEWS14; the results are shown in Table 9. We find that the results with different PLMs are similar, indicating that our approach does not rely on a specific pre-trained language model and generalizes well.

Figure 1: An example of the time-related semantic information between relations for three pairs of entities.
[Figure 3 entries: [SHT] during the week / after a week / after two weeks; [MID] after half a year; [LNG] after one year]

Figure 3: Illustration of the interval-dictionary. The left side of the vertical axis indicates the interval between two timestamps, and the right side indicates the time-prompt corresponding to that interval: [SHT] for short intervals (∆t ≤ 60), [MID] for medium intervals (60 < ∆t ≤ 365), and [LNG] for long intervals (∆t > 365).

Figure 4: Illustration of the attention patterns of PPT. The quadruple to be completed is (263, 104, ?, 7536); we sample 2 quadruples with earlier timestamps than the test example and fixed object entities. The transparency of the colors reflects the attention scores of other tokens with respect to [MASK].
Algorithm 1: Training for PPT
Input: TKG G with training data; maximum number of epochs max_epochs; maximum number B of sampled TSGs per entity or relation; minimum sampling sequence length Kmin; maximum sampling sequence length Kmax.
repeat
    foreach entity ent and relation rel do
        k = random(Kmin, Kmax);
        Sample a TSG with length = k (fixing ent or rel);
        Convert the TSG into a TIG;
        Add the TIG to S;
    end
    foreach TIG ∈ S do
        // convert TIG into input with prompts
        seq = Prompt(TIG);
        // train in PLM with masking strategy
        MASK_TRAIN(PLM(seq));
    end
    epoch ← epoch + 1;
until epoch = max_epochs;

Table 1: Statistics of the datasets we use.

Table 3: Results on the three ICEWS datasets. The best results are boldfaced, and the second best ones are underlined. The results of the baselines are from RE-GCN (Li et al., 2021).

Table 4: Ablation experiment results of PPT. The best results are boldfaced and the second best ones are underlined.

Table 6: Results of different sampling strategies of PPT on ICEWS14. The frequency-based sampling strategy has better results on ICEWS14; we believe this is because an entity that appears frequently is more likely to have relations with other entities and should receive more attention.

Table 7: Results of different hyperparameters of PPT on ICEWS14. The best results are boldfaced and the second best ones are underlined.