Question Answering Over Temporal Knowledge Graphs

Temporal Knowledge Graphs (Temporal KGs) extend regular Knowledge Graphs by providing temporal scopes (start and end times) on each edge in the KG. While Question Answering over KG (KGQA) has received some attention from the research community, QA over Temporal KGs (Temporal KGQA) is a relatively unexplored area. Lack of broad coverage datasets has been another factor limiting progress in this area. We address this challenge by presenting CRONQUESTIONS, the largest known Temporal KGQA dataset, clearly stratified into buckets of structural complexity. CRONQUESTIONS expands the only known previous dataset by a factor of 340x. We find that various state-of-the-art KGQA methods fall far short of the desired performance on this new dataset. In response, we also propose CRONKGQA, a transformer-based solution that exploits recent advances in Temporal KG embeddings, and achieves performance superior to all baselines, with an increase of 120% in accuracy over the next best performing method. Through extensive experiments, we give detailed insights into the workings of CRONKGQA, as well as situations where significant further improvements appear possible. In addition to the dataset, we have released our code as well.


Introduction
Temporal Knowledge Graphs (Temporal KGs) are multi-relational graphs where each edge is associated with a time duration. This is in contrast to a regular KG, where no time annotation is present. For example, a regular KG may contain a fact such as (Barack Obama, held position, President of USA), while a temporal KG would contain the start and end time as well: (Barack Obama, held position, President of USA, 2008, 2016). Edges may be associated with a set of non-contiguous time intervals as well. These temporal scopes on facts can be either automatically estimated (Talukdar et al., 2012) or user contributed. Several such Temporal KGs have been proposed in the literature, where the focus is on KG completion (Dasgupta et al. 2018; García-Durán et al. 2018; Leetaru and Schrodt 2013; Lacroix et al. 2020; Jain et al. 2020).
The task of Knowledge Graph Question Answering (KGQA) is to answer natural language questions using a KG as the knowledge base. This is in contrast to reading comprehension-based question answering, where typically the question is accompanied by a context (e.g., text passage) and the answer is either one of multiple choices (Rajpurkar et al., 2016) or a piece of text from the context (Yang et al., 2018). In KGQA, the answer is usually an entity (node) in the KG, and the reasoning required to answer questions is either single-fact based (Bordes et al., 2015), multi-hop (Yih et al. 2015, Zhang et al. 2017) or conjunction/comparison based reasoning (Talmor and Berant, 2018). Temporal KGQA takes this a step further: the underlying KG is a temporal KG, the answer can be either an entity or a time, and answering may require temporal reasoning.
Jia et al. (2018a). We do not have an explicit number of temporal questions for ComplexWebQuestions, but since it is constructed automatically using questions from WebQuestions, we expect the percentage to be similar to WebQuestions (16%). Please refer to Section 2.1 for details.
Temporal KG embeddings are another upcoming area where entities, relations and timestamps in a temporal KG are embedded in a low-dimensional vector space (Dasgupta et al. 2018, Lacroix et al. 2020, Jain et al. 2020, Goel et al. 2019). Here too, the main application so far has been temporal KG completion. In our work, we investigate whether temporal KG embeddings can be applied to the task of Temporal KGQA, and how they fare compared to non-temporal embeddings or off-the-shelf methods without any KG embeddings.
In this paper we propose CRONQUESTIONS, a new dataset for Temporal KGQA. CRONQUESTIONS consists of both a temporal KG and accompanying natural language questions. There were three main guiding principles while creating this dataset:
1. The associated KG must provide temporal annotations.
2. Questions must involve an element of temporal reasoning.
3. The number of labeled instances must be large enough that it can be used for training models, rather than for evaluation alone.
Guided by the above principles, we present a dataset consisting of a Temporal KG with 125k entities and 328k facts, along with a set of 410k natural language questions that require temporal reasoning.
On this new dataset, we apply approaches based on deep language models (LM) alone, such as T5 (Raffel et al., 2020), BERT (Devlin et al., 2019), and KnowBERT (Peters et al., 2019), and also hybrid LM+KG embedding approaches, such as Entities-as-Experts (Févry et al., 2020) and EmbedKGQA (Saxena et al., 2020). We find that these baselines are not suited to temporal reasoning. In response, we propose CRONKGQA, an enhancement of EmbedKGQA, which outperforms baselines across all question types. CRONKGQA achieves very high accuracy on simple temporal reasoning questions, but falls short when it comes to questions requiring more complex reasoning. Thus, although we get promising early results, CRONQUESTIONS leaves ample scope to improve complex Temporal KGQA. Our source code along with the CRONQUESTIONS dataset can be found at https://github.com/apoorvumang/CronKGQA.

Temporal QA data sets
There have been several KGQA datasets proposed in the literature (Table 1). In SimpleQuestions (Bordes et al., 2015) one needs to extract just a single fact from the KG to answer a question. MetaQA (Zhang et al., 2017) and WebQuestionsSP (Yih et al., 2015) require multi-hop reasoning, where one must traverse over multiple edges in the KG to reach the answer. ComplexWebQuestions (Talmor and Berant, 2018) contains both multi-hop and conjunction/comparison type questions. However, none of these are aimed at temporal reasoning, and the KGs they are based on are non-temporal.
Temporal QA datasets have mostly been studied in the area of reading comprehension. One such dataset is TORQUE (Ning et al., 2020), where the system is given a question along with some context (a text passage) and is asked to answer a multiple choice question with five choices. This is in contrast to KGQA, where there is no context, and the answer is one of potentially hundreds of thousands of entities.
TempQuestions (Jia et al., 2018a) is a KGQA dataset specifically aimed at temporal QA. It consists of a subset of questions from WebQuestions, Free917 (Cai and Yates, 2013) and ComplexQuestions (Bao et al., 2016) that are temporal in nature. They gave a definition for "temporal question" and used certain trigger words (for example 'before', 'after') along with other constraints to filter out questions from these datasets that fell under this definition. However, this dataset contains only 1271 questions (useful only for evaluation), and the KG on which it is based (a subset of FreeBase (Bollacker et al., 2008)) is not a temporal KG. Another drawback is that FreeBase has not been under active development since 2015, so some of the information stored in it is outdated, which is a potential source of inaccuracy.

Temporal QA algorithms
To the best of our knowledge, recent KGQA algorithms (Miller et al. 2016; Sun et al. 2019; Cohen et al. 2020; Sun et al. 2020) work with non-temporal KGs, i.e., KGs containing facts of the form (subject, relation, object). Extending these to temporal KGs containing facts of the form (subject, relation, object, start time, end time) is a non-trivial task. TEQUILA (Jia et al., 2018b) is one method aimed specifically at temporal KGQA. TEQUILA decomposes and rewrites the question into non-temporal sub-questions and temporal constraints. Answers to sub-questions are then retrieved using any KGQA engine. Finally, TEQUILA uses constraint reasoning on temporal intervals to compute final answers to the full question. A major drawback of this approach is the use of pre-specified templates for decomposition, as well as the assumption of having temporal constraints on entities. Also, since it is made for non-temporal KGs, there is no direct way of applying it to temporal KGs where facts are temporally scoped.

CRONQUESTIONS: The new Temporal KGQA dataset
CRONQUESTIONS, our Temporal KGQA dataset, consists of two parts: a KG with temporal annotations, and a set of natural language questions requiring temporal reasoning.

Temporal KG
To prepare our temporal KG, we started by taking all facts with temporal annotations from the WikiData subset proposed by Lacroix et al. (2020). We removed some instances of the predicate "member of sports team" in order to balance out the KG, since this predicate constituted over 50 percent of the facts. Timestamps were discretized to years. This resulted in a KG with 323k facts, 125k entities and 203 relations. However, this filtering of facts misses out on important world events. For example, the KG subset created using the aforementioned technique contains the entity World War II but no associated fact that tells us when World War II started or ended. This knowledge is needed to answer questions such as "Who was the President of the USA during World War II?" To overcome this shortcoming, we first extracted entities from WikiData that have a "start time" and "end time" annotation. From this set, we then removed entities which were game shows, movies or television series (since these are not important world events, but do have a start and end time annotation), and then removed entities with fewer than 50 associated facts. This final set of entities was then added as facts in the format (WWII, significant event, occurred, 1939, 1945). The final Temporal KG consisted of 328k facts, out of which 5k are event-facts.
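As a rough illustration of the fact format described above, the sketch below stores year-discretized facts as 5-tuples and uses an event-fact to answer a "during WWII" style question. The `Fact` class, the helpers `overlaps` and `holders_during`, and the toy facts are assumptions made for this example; they are not part of the released dataset or code.

```python
from typing import List, NamedTuple


class Fact(NamedTuple):
    """A temporally scoped edge; start and end are years (inclusive)."""
    subject: str
    relation: str
    obj: str
    start: int
    end: int


# Toy facts for illustration only. The last entry follows the event-fact format.
KG: List[Fact] = [
    Fact("Franklin D. Roosevelt", "held position", "President of USA", 1933, 1945),
    Fact("Harry S. Truman", "held position", "President of USA", 1945, 1953),
    Fact("Barack Obama", "held position", "President of USA", 2008, 2016),
    Fact("WWII", "significant event", "occurred", 1939, 1945),
]


def overlaps(a: Fact, b: Fact) -> bool:
    """True if the two year intervals share at least one year."""
    return a.start <= b.end and b.start <= a.end


def holders_during(kg: List[Fact], position: str, event: str) -> List[str]:
    """Subjects that held `position` at some point during `event`."""
    event_facts = [f for f in kg
                   if f.subject == event and f.relation == "significant event"]
    return sorted({
        f.subject
        for f in kg
        if f.relation == "held position" and f.obj == position
        and any(overlaps(f, e) for e in event_facts)
    })


print(holders_during(KG, "President of USA", "WWII"))
# ['Franklin D. Roosevelt', 'Harry S. Truman']
```

Without the event-fact, the query has no interval to compare against, which is the gap the added event-facts are meant to close.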

Temporal Questions
To generate the QA dataset, we started with a set of templates for temporal reasoning. These were made using the five most frequent relations from our WikiData subset, namely
• member of sports team
• employer
This resulted in 30 unique seed templates over five relations and five different reasoning structures (please see Table 2 for some examples). Each of these templates has a corresponding procedure that could be executed over the temporal KG to extract all possible answers for that template. However, similar to Zhang et al. (2017), we chose not to make this procedure a part of the dataset, to remove unwelcome dependence of QA systems on such formal candidate collection methods. This also allows easy augmentation of the dataset, since only question-answer pairs are needed.
In the same spirit as ComplexWebQuestions, we then asked human annotators to paraphrase these templates in order to generate more linguistic diversity. Annotators were given slot-filled templates with dummy entities and times, and asked to rephrase the question such that the dummy entities/times were present in the paraphrase and the question meaning did not change. This resulted in 246 unique templates.
We then used the monolingual paraphraser developed by Hu et al. (2019) to automatically generate paraphrases using these 246 templates. After verifying their correctness through annotators, we ended up with 654 templates. These templates were then filled using entity aliases from WikiData to generate 410k unique question-answer pairs.
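The slot-filling step can be pictured with a minimal sketch like the one below. It assumes that a template is a Python format string with `{head}` and `{tail}` slots and that the answer to a "when" question is the list of years covered by the fact; the function name `fill_templates` and the toy year interval are hypothetical and only serve the illustration.

```python
from typing import List, Tuple

# (head, relation, tail, start year, end year)
Fact = Tuple[str, str, str, int, int]


def fill_templates(templates: List[str], facts: List[Fact]) -> List[Tuple[str, List[int]]]:
    """Pair every template with every fact, producing (question, answers)."""
    pairs = []
    for head, _relation, tail, start, end in facts:
        years = list(range(start, end + 1))  # answer set for a time question
        for template in templates:
            pairs.append((template.format(head=head, tail=tail), years))
    return pairs


templates = [
    "When did {head} play in {tail}",
    "Which years did {head} play in {tail}",
]
# Toy interval, chosen only to keep the example short.
facts = [("Messi", "member of sports team", "FC Barcelona", 2004, 2006)]

pairs = fill_templates(templates, facts)
for question, answers in pairs:
    print(question, answers)
```

Because only the resulting question-answer pairs are kept, new templates or paraphrases can be added without changing anything downstream.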
Finally, while splitting the data into train/test folds, we ensured that
1. Paraphrases of train questions are not present in test questions.
2. There is no entity overlap between test questions and train questions. Event overlap is allowed.
The second requirement implies that, if the question "Who was president before Obama" is present in the train set, the test set cannot contain any question that mentions the entity 'Obama'. While this policy may appear like an overabundance of caution, it ensures that models are doing temporal reasoning rather than guessing from entities seen during training. Lewis et al. (2020) noticed an issue in WebQuestions where they found that almost 30% of test questions overlapped with training questions. The issue has been seen in the MetaQA dataset as well, where there is significant overlap between test/train entities and test/train question paraphrases, leading to suspiciously high performance on baseline methods even with partial KG data (Saxena et al., 2020), which suggests that models that apparently perform well are not necessarily performing the desired reasoning over the KG.
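One simple way to enforce the entity-disjointness requirement is sketched below. It assumes a set of entities has already been reserved for the test fold, and it drops questions that mix train and test entities; both choices, the function name `entity_disjoint_split`, and the example records are assumptions for illustration. The paraphrase constraint and the exemption for event entities are not modelled here.

```python
from typing import Dict, List, Set, Tuple

Example = Dict[str, object]


def entity_disjoint_split(examples: List[Example],
                          test_entities: Set[str]) -> Tuple[List[Example], List[Example]]:
    """Split so that no entity is mentioned in both folds."""
    train, test = [], []
    for ex in examples:
        ents = set(ex["entities"])
        if ents.isdisjoint(test_entities):
            train.append(ex)   # mentions no reserved entity
        elif ents <= test_entities:
            test.append(ex)    # mentions only reserved entities
        # questions mixing both kinds of entities are dropped
    return train, test


examples = [
    {"question": "Who was president before Obama", "entities": ["Obama"]},
    {"question": "When did Obama hold the position of President of USA",
     "entities": ["Obama", "President of USA"]},
    {"question": "When did Messi play in FC Barcelona",
     "entities": ["Messi", "FC Barcelona"]},
]
train, test = entity_disjoint_split(examples, test_entities={"Messi", "FC Barcelona"})

train_entities = {e for ex in train for e in ex["entities"]}
test_seen = {e for ex in test for e in ex["entities"]}
print(len(train), len(test), train_entities & test_seen)
```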
A drawback of our data creation protocol is that question/answer pairs are generated automatically. Therefore, the question distribution is artificial from a semantic perspective. (ComplexWebQuestions has a similar limitation.) However, since developing models that are capable of temporal reasoning is an important direction for natural language understanding, we feel that our dataset provides an opportunity to both train and evaluate KGQA models because of its large size, notwithstanding its lower-than-natural linguistic variety. In Section 6.4, we show the effect that training data size has on model performance.
Summarizing, each of our examples contains
1. A paraphrased natural language question.
2. A set of entities/times in the question.
3. A set of 'gold' answers (entity or time).
The entities are specified as WikiData IDs (e.g., Q219237), and times are years (e.g., 1991). We include the set of entities/times in the test questions as well. As in other KGQA datasets (MetaQA, WebQuestions, ComplexWebQuestions) and in methods that use these datasets (PullNet, EmQL), entity linking is considered a separate problem and complete entity linking is assumed. We also include the seed template and head/tail/time annotation in the train fold, but omit these from the test fold.

Question Categorization
In order to aid analysis, we categorize questions into "simple reasoning" and "complex reasoning" questions (please refer to Table 4 for the distribution statistics).
Simple reasoning: These questions require a single fact to answer, where the answer can be either an entity or a time instance. For example, the question "Who was the President of the United States in 2008?" requires a single fact to answer, namely (Barack Obama, held position, President of USA, 2008, 2016).
Complex reasoning: These questions require multiple facts to answer and can be more varied. For example, "Who was the first President of the United States?" requires reasoning over multiple facts pertaining to the entity "President of the United States". In our dataset, all questions that are not "simple reasoning" questions are considered complex questions. These are further categorized into the types "before/after", "first/last" and "time join"; please refer to Table 2 for examples of these questions.
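To make the difference between the reasoning types concrete, the following sketch answers a "first/last" and a "before/after" question over a handful of position-holding facts. Each answer needs several facts to be compared, unlike a simple question that reads off one fact. The tuple layout, the function names, and the year intervals are illustrative assumptions, and "first" here only means first within this toy list.

```python
from typing import List, Optional, Tuple

# (holder, start year, end year) for a single position; illustrative intervals.
Term = Tuple[str, int, int]

terms: List[Term] = [
    ("Bill Clinton", 1993, 2001),
    ("George W. Bush", 2001, 2008),
    ("Barack Obama", 2008, 2016),
]


def first_holder(terms: List[Term]) -> str:
    """First/last reasoning: compare start years across all facts."""
    return min(terms, key=lambda t: t[1])[0]


def last_holder(terms: List[Term]) -> str:
    return max(terms, key=lambda t: t[2])[0]


def holder_before(terms: List[Term], name: str) -> Optional[str]:
    """Before/after reasoning: latest term that ended by the time `name` started."""
    start = min(s for n, s, _ in terms if n == name)
    earlier = [t for t in terms if t[0] != name and t[2] <= start]
    return max(earlier, key=lambda t: t[2])[0] if earlier else None


print(first_holder(terms))                   # Bill Clinton
print(holder_before(terms, "Barack Obama"))  # George W. Bush
```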

Temporal KG Embeddings
We investigate how we can use KG embeddings, both temporal and non-temporal, along with pretrained language models to perform temporal KGQA. We will first briefly describe the specific KG embedding models we use, and then go on to show how we use them in our QA models. In all cases, the scores are turned into suitable losses with regard to positive and negative tuples in an incomplete KG, and these losses are minimized to train the entity, time and relation representations.

ComplEx
ComplEx (Trouillon et al., 2016) represents each entity $e$ as a complex vector $u_e \in \mathbb{C}^D$. Each relation $r$ is represented as a complex vector $v_r \in \mathbb{C}^D$ as well. The score $\phi$ of a claimed fact $(s, r, o)$ is
$$\phi(s, r, o) = \Re(\langle u_s, v_r, u_o^\star \rangle) = \Re\Big(\sum_{d=1}^{D} u_s[d]\, v_r[d]\, u_o[d]^\star\Big)$$
where $\Re(\cdot)$ denotes the real part and $c^\star$ is the complex conjugate of $c$. Despite further developments, ComplEx, along with refined training protocols (Lacroix et al., 2018), remains among the strongest KB embedding approaches (Ruffinelli et al., 2020).
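The ComplEx score, i.e., the real part of the element-wise product of the subject embedding, the relation embedding, and the conjugated object embedding summed over dimensions, can be sketched in a few lines with Python's built-in complex numbers. The function name `complex_score` and the toy vectors are assumptions for illustration; a real implementation would operate on batched tensors.

```python
from typing import Sequence


def complex_score(u_s: Sequence[complex],
                  v_r: Sequence[complex],
                  u_o: Sequence[complex]) -> float:
    """Re( sum_d u_s[d] * v_r[d] * conj(u_o[d]) )."""
    return sum(s * r * o.conjugate() for s, r, o in zip(u_s, v_r, u_o)).real


# Toy 2-dimensional embeddings.
u_s = [1 + 1j, 2 + 0j]
v_r = [1j, 1 + 0j]
u_o = [1 - 1j, 0.5 + 0j]

print(complex_score(u_s, v_r, u_o))  # -1.0
```

Because of the conjugate on the object side, swapping subject and object generally changes the score, which lets the model represent asymmetric relations.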

TComplEx, TNTComplEx
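Lacroix et al. (2020) took an early step to extend ComplEx with time. Each timestamp $t$ is also represented as a complex vector $w_t \in \mathbb{C}^D$. For a claimed fact $(s, r, o, t)$, their TComplEx scoring function is
$$\phi(s, r, o, t) = \Re(\langle u_s, v_r, u_o^\star, w_t \rangle) \qquad (2)$$
Their TNTComplEx scoring function uses two representations of relations $r$: $v_r^T$, which is sensitive to time, and $v_r$, which is not. The scoring function is the sum of a time-sensitive and a time-insensitive part:
$$\phi(s, r, o, t) = \Re(\langle u_s, v_r^T, u_o^\star, w_t \rangle + \langle u_s, v_r, u_o^\star \rangle)$$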
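A small sketch may help to see how the two temporal scores relate to ComplEx: TComplEx multiplies in the timestamp embedding, and TNTComplEx adds a time-insensitive ComplEx term on top. The function names and toy vectors below are assumptions for illustration only.

```python
from typing import Sequence

Vec = Sequence[complex]


def complex_score(u_s: Vec, v_r: Vec, u_o: Vec) -> float:
    """Time-agnostic ComplEx score."""
    return sum(s * r * o.conjugate() for s, r, o in zip(u_s, v_r, u_o)).real


def tcomplex_score(u_s: Vec, v_r: Vec, u_o: Vec, w_t: Vec) -> float:
    """TComplEx: the timestamp embedding enters as a fourth factor."""
    return sum(s * r * o.conjugate() * t
               for s, r, o, t in zip(u_s, v_r, u_o, w_t)).real


def tntcomplex_score(u_s: Vec, v_r_time: Vec, v_r: Vec, u_o: Vec, w_t: Vec) -> float:
    """TNTComplEx: time-sensitive part plus time-insensitive part."""
    return sum(s * (rt * t + r) * o.conjugate()
               for s, rt, r, o, t in zip(u_s, v_r_time, v_r, u_o, w_t)).real


u_s = [1 + 1j, 2 + 0j]
v_r_time = [1j, 1 + 0j]
v_r = [0.5 + 0j, -1j]
u_o = [1 - 1j, 0.5 + 0j]
w_t = [2 + 0j, 1j]

tnt = tntcomplex_score(u_s, v_r_time, v_r, u_o, w_t)
parts = tcomplex_score(u_s, v_r_time, u_o, w_t) + complex_score(u_s, v_r, u_o)
print(tnt, parts)
```

The two printed values agree up to floating-point error, mirroring the decomposition into a time-sensitive and a time-insensitive part.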

TimePlex
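TimePlex (Jain et al., 2020) augmented ComplEx with embeddings $u_t \in \mathbb{C}^D$ for discretized time instants $t$. To incorporate time, TimePlex uses three representations for each relation $r$, viz., $(v_r^{SO}, v_r^{ST}, v_r^{OT})$, and writes the base score of a tuple $(s, r, o, t)$ as
$$\phi(s, r, o, t) = \langle u_s, v_r^{SO}, u_o^\star \rangle + \alpha \langle u_s, v_r^{ST}, u_t^\star \rangle + \beta \langle u_o, v_r^{OT}, u_t^\star \rangle + \gamma \langle u_s, u_o, u_t^\star \rangle$$
where $\alpha, \beta, \gamma$ are hyperparameters.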

CRONKGQA: Our proposed method
We start with a temporal KG, apply a time-agnostic or time-sensitive KG embedding algorithm (ComplEx, TComplEx, or TimePlex) to it, and obtain entity, relation, and timestamp embeddings for the temporal KG. We will use the following notation.
• E is the matrix of entity embeddings
• T is the matrix of timestamp embeddings
• E.T is the concatenation of the E and T matrices. This is used for scoring answers, since the answer can be either an entity or a timestamp.
In case entity/timestamp embeddings are complex valued vectors in $\mathbb{C}^D$, we expand them to real valued vectors of size 2D, where the first half is the real part and the second half is the imaginary part of the original vector.
We first apply EmbedKGQA (Saxena et al., 2020) directly to the task of Temporal KGQA. In its original implementation, EmbedKGQA uses ComplEx (Section 4.1) embeddings and can only deal with non-temporal KGs and single entity questions.
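The expansion of a complex vector in C^D to a real vector of size 2D can be sketched as follows, with the real parts first and the imaginary parts second. The helper name `to_real` is hypothetical.

```python
from typing import List, Sequence


def to_real(vec: Sequence[complex]) -> List[float]:
    """Map a D-dimensional complex vector to a 2D-dimensional real vector."""
    return [z.real for z in vec] + [z.imag for z in vec]


print(to_real([1 + 2j, 3 - 4j]))  # [1.0, 3.0, 2.0, -4.0]
```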
In order to apply it to CRONQUESTIONS, we set the first entity encountered in the question as the "head entity" needed by EmbedKGQA. Along with this, we set the entity embedding matrix E to be the ComplEx embedding of our KG entities, and initialize T to a random learnable matrix. EmbedKGQA then performs prediction over E.T.
Next, we modify EmbedKGQA so that it can use temporal KG embeddings. We use TComplEx (Section 4.2) for getting entity and timestamp embeddings. CRONKGQA (Figure 1) utilizes two scoring functions, one for predicting entity and one for predicting time. Using a pre-trained LM (BERT in our case), CRONKGQA finds a question embedding $qe$. This is then projected to get two embeddings, $qe_{ent}$ and $qe_{time}$, which are question embeddings for entity and time prediction respectively.
Entity scoring function: We extract a subject entity $s$ and a timestamp $t$ from the question. If either is missing, we use a dummy entity/time. Then, using the scoring function $\phi(s, r, o, t)$ from equation 2, we calculate a score for each entity $e \in \mathcal{E}$ as
$$\phi_{ent}(e) = \Re(\langle u_s, qe_{ent}, u_e^\star, w_t \rangle) \qquad (4)$$
where $\mathcal{E}$ is the set of entities in the KG. This gives us a score for each entity being an answer.
Time scoring function: Similarly, we extract a subject entity $s$ and object entity $o$ from the question, using dummy entities if none are present. Then, using equation 2, we calculate a score for each timestamp $t$ in the KG as
$$\phi_{time}(t) = \Re(\langle u_s, qe_{time}, u_o^\star, w_t \rangle) \qquad (5)$$
The scores for all entities and times are concatenated, and softmax is used to calculate answer probabilities over this combined score vector. The model is trained using cross entropy loss.
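The scoring and prediction steps described above can be sketched in plain Python as follows. This is a minimal illustration under several assumptions: the question embeddings `qe_ent` and `qe_time` are taken as given (the BERT encoder and projection layers are omitted), the dummy entity and dummy time are represented by hand-picked toy vectors, and all names and numbers are hypothetical rather than taken from the released code.

```python
import math
from typing import Dict, Hashable, List, Sequence

Vec = Sequence[complex]


def score(a: Vec, b: Vec, c: Vec, d: Vec) -> float:
    """Re(<a, b, conj(c), d>), the four-way product used by TComplEx."""
    return sum(x * y * z.conjugate() * w for x, y, z, w in zip(a, b, c, d)).real


def softmax(scores: List[float]) -> List[float]:
    m = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(v - m) for v in scores]
    total = sum(exps)
    return [e / total for e in exps]


def answer_probabilities(ent_emb: Dict[Hashable, Vec], time_emb: Dict[Hashable, Vec],
                         qe_ent: Vec, qe_time: Vec, s, o, t) -> Dict[Hashable, float]:
    entities, times = list(ent_emb), list(time_emb)
    # Entity scores: the question embedding takes the place of the relation.
    ent_scores = [score(ent_emb[s], qe_ent, ent_emb[e], time_emb[t]) for e in entities]
    # Time scores: every timestamp is scored against subject and object.
    time_scores = [score(ent_emb[s], qe_time, ent_emb[o], time_emb[ts]) for ts in times]
    # One softmax over the concatenated entity and time scores.
    probs = softmax(ent_scores + time_scores)
    return dict(zip(entities + times, probs))


ent_emb = {"<dummy_ent>": [0j, 0j], "Obama": [1 + 1j, 0.5j],
           "President of USA": [0.2 - 1j, 1 + 0j]}
time_emb = {"<dummy_time>": [1 + 0j, 1 + 0j], 2008: [0.3 + 0.1j, 1j], 2016: [1 - 0.5j, 0.2 + 0j]}
qe_ent, qe_time = [0.5 + 0.5j, 1 - 1j], [1 + 0j, 0.5j]

probs = answer_probabilities(ent_emb, time_emb, qe_ent, qe_time,
                             s="Obama", o="<dummy_ent>", t=2008)
loss = -math.log(probs["President of USA"])  # cross entropy for one gold answer
print(probs, loss)
```

Scoring entities and timestamps in one softmax is what allows a single model to answer both "who" and "when" questions.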

Experiments and diagnostics
In this section, we aim to answer the following questions:
1. How do baselines and CRONKGQA perform on the CRONQUESTIONS task? (Section 6.2.)
2. Do some methods perform better than others on specific reasoning tasks? (Section 6.3.)
3. How much does the training dataset size (number of questions) affect the performance of a model? (Section 6.4.)
4. Do temporal KG embeddings confer any advantage over non-temporal KG embeddings? (Section 6.5.)

Other methods compared
It has been shown by Petroni et al. (2019) and Raffel et al. (2020) that large LMs, such as BERT and its variants, capture real world knowledge (collected from their massive, encyclopedic training corpus) and can directly be applied to tasks such as QA. In these baselines, we do not specifically feed our version of the temporal KG to the model; we instead expect the model to have the real world knowledge to compute the answer.
BERT: We experiment with BERT, RoBERTa (Liu et al., 2019) and KnowBERT (Peters et al., 2019), which is a variant of BERT where information from knowledge bases such as WikiData and WordNet has been injected into BERT. We add a prediction head on top of the [CLS] token of the final layer and do a softmax over it to predict the answer probabilities.
T5: In order to apply T5 (Raffel et al., 2020) to temporal QA, we transform each question in our dataset to the form 'temporal question: question ?'. For evaluation there are two cases:
1. Time answer: We do exact string matching between T5 output and correct answer.
2. Entity answer: We compare the system output to the aliases of all entities in the KG. The entity having an alias with the smallest edit distance (Levenshtein, 1966) to the predicted text output is taken as the predicted entity.
Entities as experts: Févry et al. (2020) proposed EaE, a model which aims to integrate entity knowledge into a transformer-based language model. For temporal KGQA on CRONQUESTIONS, we assume that all grounded entity and time mention spans are marked in the question. We will refer to this model as T-EaE-add. We try another variant of EaE, T-EaE-replace, where instead of adding the entity/time and BERT token embeddings, we replace the BERT embeddings with the entity/time embeddings for entity/time mentions.
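The two evaluation cases for a text-to-text model can be sketched as below: exact string match for time answers, and nearest alias by edit distance for entity answers. The alias table, the entity identifiers `E1` and `E2`, and the function names are hypothetical; the edit distance is a standard dynamic-programming Levenshtein implementation.

```python
from typing import Dict, List


def levenshtein(a: str, b: str) -> int:
    """Edit distance with unit-cost insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute or match
        prev = cur
    return prev[-1]


def nearest_entity(predicted_text: str, aliases: Dict[str, List[str]]) -> str:
    """Entity whose closest alias has the smallest edit distance to the output."""
    return min(aliases,
               key=lambda ent: min(levenshtein(predicted_text, a) for a in aliases[ent]))


def time_correct(predicted_text: str, gold_year: int) -> bool:
    """Time answers are scored by exact string match."""
    return predicted_text.strip() == str(gold_year)


# Hypothetical alias table keyed by placeholder entity IDs.
aliases = {"E1": ["Barack Obama", "Obama"], "E2": ["Lionel Messi", "Messi"]}

print(nearest_entity("Barak Obama", aliases))  # E1
print(time_correct("2008", 2008))              # True
```

Comparing against every alias in a large KG is expensive with this quadratic routine; it is shown only to make the matching rule explicit.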

Main results
Table 5 shows the results of various methods on our dataset. We see that methods based on large pre-trained LMs alone (BERT, RoBERTa, T5), as well as KnowBERT, perform significantly worse than methods that are augmented with KG embeddings (temporal or non-temporal). This is probably because having KG embeddings specific to our temporal KG helps the model to focus on those entities/timestamps. In our experiments, BERT performs slightly better than KnowBERT, even though KnowBERT has entity knowledge in its parameters. T5-3B performs the best among the LMs we tested, possibly because of the large number of parameters and pre-training.
Even among methods that use KG embeddings, CRONKGQA performs the best on all metrics, followed by T-EaE-replace. Since EmbedKGQA has non-temporal embeddings, its performance on questions where the answer is a time is very low, comparable to that of BERT, which is the LM used in our EmbedKGQA implementation.
Another interesting thing to note is the performance on simple reasoning questions. CRONKGQA far outperforms baselines for simple questions, achieving close to 0.99 hits@1, compared with a much lower 0.329 for T-EaE. We believe there might be a few reasons that contribute to this:
1. There is the inductive bias of combining embeddings using the TComplEx scoring function in CRONKGQA, which is the same one used in creating the entity and time embeddings, thus making the simple questions straightforward to answer. However, not relying on a scoring function means that T-EaE can be extended to any KG embedding, whereas CRONKGQA cannot.
2. Another contributing reason could be that there are fewer parameters to be trained in CRONKGQA, while a 6-layer Transformer encoder needs to be trained from scratch in T-EaE. Transformers typically require large amounts of varied data to train successfully.
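For readers unfamiliar with the metric, hits@k can be computed along the following lines. The sketch assumes the common convention that a question counts as a hit when any of its top-k ranked predictions is in the gold answer set; the exact convention used for the reported numbers is not spelled out here, and the function name and toy data are hypothetical.

```python
from typing import List, Sequence, Set


def hits_at_k(ranked_predictions: Sequence[Sequence[str]],
              gold_answers: Sequence[Set[str]], k: int) -> float:
    """Fraction of questions with at least one gold answer in the top k."""
    hits = sum(
        1 for preds, gold in zip(ranked_predictions, gold_answers)
        if any(p in gold for p in preds[:k])
    )
    return hits / len(ranked_predictions)


ranked: List[List[str]] = [["a", "b", "c"], ["x", "y", "z"]]
gold: List[Set[str]] = [{"b"}, {"q"}]

print(hits_at_k(ranked, gold, 1))  # 0.0
print(hits_at_k(ranked, gold, 2))  # 0.5
```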

Performance across question types
Table 6 shows the performance of KG embedding based models across different types of reasoning. As stated above in Section 6.2, CRONKGQA performs very well on simple reasoning questions (simple entity, simple time). Among complex question types, all models (except EmbedKGQA) perform the best on time join questions (e.g., 'Who played with Roberto Dinamite on the Brazil national football team'). This is because such questions typically have multiple answers (such as all the players when Roberto Dinamite was playing for Brazil), which makes it easier for the model to make a correct prediction. In the other two question types, the answer is always a single entity/time. Before/after questions seem most challenging for all methods, with the best method achieving only 0.288 hits@1.

Effect of training dataset size
Figure 2 shows the effect of training dataset size on model performance. As we can see, for T-EaE-add, increasing the training dataset size from 10% to 100% steadily increases its performance for both simple and complex reasoning type questions. This effect is somewhat present in CRONKGQA for complex reasoning, but not so for simple reasoning type questions. We hypothesize that this is because T-EaE has more trainable parameters (it has a 6-layer transformer that needs to be trained from scratch), in contrast to CRONKGQA, which merely needs to fine-tune BERT and train some shallow projection layers. These results affirm our hypothesis that having a large, even if synthetic, dataset is useful for training temporal reasoning models.

Temporal vs. non-temporal KG embeddings
We conducted further experiments to study the effect of temporal vs. non-temporal KG embeddings. We replaced the temporal entity embeddings in T-EaE-replace with ComplEx embeddings, and treated timestamps as regular tokens (not associated with any entity/time mentions). CRONKGQA-CX is the same as EmbedKGQA. The results can be seen in Table 7.
Another observation is that questions having temporal answers achieve very low accuracy (0.057 and 0.062 respectively) in both CRONKGQA-CX and T-EaE-replace-CX, which is much lower than what these models achieve with TComplEx. This shows that having temporal KG embeddings is essential for achieving good performance for KG embedding-based methods.

Conclusion
In this paper we introduce CRONQUESTIONS, a new dataset for Temporal Knowledge Graph Question Answering. While there exist some Temporal KGQA datasets, they are all based on non-temporal KGs (e.g., Freebase) and have relatively few questions. Our dataset consists of both a temporal KG as well as a large set of temporal questions requiring various structures of reasoning. In order to develop such a large dataset, we used a synthetic generation procedure, leading to a question distribution that is artificial from a semantic perspective. However, having a large dataset provides an opportunity to train models, rather than just evaluate them. We experimentally show that increasing the training dataset size steadily improves the performance of certain methods on the TKGQA task.
We first apply large pre-trained LM based QA methods on our new dataset. Then we inject KG embeddings, both temporal and non-temporal, into these LMs and observe significant improvement in performance. We also propose a new method, CRONKGQA, that is able to leverage Temporal KG Embeddings to perform TKGQA. In our experiments, CRONKGQA outperforms all baselines. These results suggest that KG embeddings can be effectively used to perform temporal KGQA, although there remains significant scope for improvement when it comes to complex reasoning questions.

Figure 2 :
Model performance (hits@10) vs. training dataset size (percentage) for CRONKGQA and T-EaE-add. Solid line is for simple reasoning and dashed line is for complex reasoning type questions. For each dataset size, models were trained until validation hits@10 did not increase for 10 epochs. Please refer to Section 6.4 for details.

Table 1 :
KGQA dataset comparison. Statistics about percentage of temporal questions for WebQuestions are taken from Jia et al. (2018a). We do not have an explicit number of temporal questions for ComplexWebQuestions, but since it is constructed automatically using questions from WebQuestions, we expect the percentage to be similar to WebQuestions (16%). Please refer to Section 2.1 for details.

Table 2 :
Example questions for different types of temporal reasoning. {head}, {tail} and {time} correspond to entities/timestamps in facts of the form (head, relation, tail, timestamp). {event} corresponds to entities in event facts, e.g., WWII. {type} can be one of before/after and {adj} can be one of first/last. Please refer to Section 3.2 for details.
Simple time
  Example: When did Obama hold the position of President of USA
Simple entity
  Template: Which award did {head} receive in {time}
  Example: Which award did Brad Pitt receive in 2001
Before/After
  Template: Who was the {tail} {type} {head}
  Example: Who was the President of USA before Obama
First/Last
  Template: When did {head} play their {adj} game
  Example: When did Messi play their first game
Time join
  Template: Who held the position of {tail} during {event}
  Example: Who held the position of President of USA during WWII

Table 3 :
Slot-filled paraphrases generated by humans and machine. Please refer to Section 3.2 for details.
Template: When did {head} play in {tail}
Seed Qn: When did Messi play in FC Barcelona
Human Paraphrases:
• When was Messi playing in FC Barcelona
• Which years did Messi play in FC Barcelona
• When did FC Barcelona have Messi in their team
• What time did Messi play in FC Barcelona
Machine Paraphrases:
• When did Messi play for FC Barcelona
• When did Messi play at FC Barcelona
• When has Messi played at FC Barcelona

Table 4 :
Number of questions in our dataset across different types of reasoning required and different answer types. Please refer to Section 3.2.1 for details.

Figure 1 :
The CRONKGQA method. (i) A temporal KG embedding model (Section 4) is used to generate embeddings for each timestamp and entity in the temporal knowledge graph. (ii) BERT is used to get two question embeddings: qe ent and qe time. (iii) Embeddings of entity/time mentions in the question are combined with question embeddings using equations 4 and 5 to get score vectors for entity and time prediction. (iv) Score vectors are concatenated and softmax is used to get answer probabilities. Please refer to Section 5 for details.

Table 5 :
Performance of baselines and our methods on the CRONQUESTIONS dataset. Methods above the midrule do not use any KG embeddings, while the ones below use either temporal or non-temporal KG embeddings. Hits@10 are not available for T5-3B since it is a text-to-text model and makes a single prediction. Please refer to Section 6.2 for details.

Table 6 :
Hits@1 for different reasoning type questions. 'Simple Entity' and 'Simple Time' correspond to the simple question type in Table 5, while the others correspond to the complex question type. Please refer to Section 6.3 for more details.

Table 7 :
Hits@1 for CRONKGQA and T-EaE-replace using ComplEx (CX) and TComplEx (TCX) KG embeddings. Please refer to Section 6.5 for more details.