MTGER: Multi-view Temporal Graph Enhanced Temporal Reasoning over Time-Involved Document

Facts and time are intricately intertwined in documents, making temporal reasoning over documents challenging. Previous work models time implicitly, making it difficult to handle such complex relationships. To address this issue, we propose MTGER, a novel Multi-view Temporal Graph Enhanced Temporal Reasoning framework for temporal reasoning over time-involved documents. Concretely, MTGER explicitly models the temporal relationships among facts by multi-view temporal graphs. On the one hand, the heterogeneous temporal graphs explicitly model the temporal and discourse relationships among facts; on the other hand, the multi-view mechanism captures both time-focused and fact-focused information, allowing the two views to complement each other through adaptive fusion. To further improve the implicit reasoning capability of the model, we design a self-supervised time-comparing objective. Extensive experimental results demonstrate the effectiveness of our method on the TimeQA and SituatedQA datasets. Furthermore, MTGER gives more consistent answers under question perturbations.


Introduction
In the real world, many facts change over time, and these changes are archived in documents such as Wikipedia. Facts and time are intertwined in documents through complex relationships, so temporal reasoning is required to find the facts that hold at a specific time. To investigate this problem, Chen et al. (2021) propose the TimeQA dataset and Zhang and Choi (2021) propose the SituatedQA dataset. For example, Figure 1 illustrates a question involving implicit temporal reasoning. From the human perspective, to answer this question, we first need to find the relevant facts in the document and derive new facts from the existing ones (left of Figure 1(d)), and then deduce the answer from these derived facts. To this end, we propose the Multi-view Temporal Graph Enhanced Reasoning framework (MTGER) for temporal reasoning over time-involved documents. We construct a multi-view temporal graph to establish the correspondence between facts and time and to explicitly model the temporal relationships between facts. The explicit and implicit temporal relations between facts in the heterogeneous graph enhance the temporal reasoning capability, and the cross-paragraph interactions between facts alleviate inadequate interaction.
Specifically, each heterogeneous temporal graph (HTG) contains factual and temporal layers. Nodes in the factual layer are events, and nodes in the temporal layer are the timestamps (or time intervals) corresponding to the events. Different nodes are connected according to discourse relations and relative temporal relations. In addition, we construct a time-focused HTG and a fact-focused HTG to capture information with different focuses, forming a multi-view temporal graph. We complement the two views through adaptive fusion to obtain more adequate information. At the decoder side, we use a question-guided fusion mechanism to dynamically select the temporal graph information that is more relevant to the question. Finally, we feed the time-enhanced representation into the decoder to get the answer. Furthermore, we introduce a self-supervised time-comparing objective to enhance the temporal reasoning capability.
Extensive experimental results demonstrate the effectiveness of our proposed method on the TimeQA and SituatedQA datasets, with a performance boost of up to 9% over the state-of-the-art QA model, and our method gives more consistent answers when encountering input perturbations.
The main contributions of our work can be summarized as follows: • We propose to enhance temporal reasoning by modeling time explicitly. As far as we know, this is the first attempt to model time explicitly in temporal reasoning over documents.
• We devise a document-level temporal reasoning framework, MTGER, which models the temporal relationships between facts through heterogeneous temporal graphs with a complementary multi-view fusion mechanism.
• Extensive experimental results demonstrate the effectiveness and robustness of our method on the TimeQA and SituatedQA datasets.

Task Definition
Document-level textual temporal reasoning tasks take a long document (typically thousands of characters) with a time-sensitive question as input and output the answer based on the document. Formally, given a time-sensitive question Q and a document D, the goal is to obtain the answer A which satisfies the time and relation constraints in question Q. The document D = {P_1, P_2, ..., P_k} contains k paragraphs, which are either from a Wikipedia page or retrieved from Wikipedia dumps.
The answer can be either an extracted span or generated text, and we take the generation approach in this paper. Please refer to Appendix A.1 for the definition of time-sensitive questions.

Overview
As depicted in Figure 2, MTGER first encodes paragraphs and constructs a multi-view temporal graph, then applies temporal graph reasoning over the multi-view temporal graph with the time-comparing objective and adaptive fusion, and finally feeds the time-enhanced features into the decoder to get the answer.

Text Encoding and Graph Construction
Textual Encoder We use the pre-trained FiD (Izacard and Grave, 2021b) model to encode long text. FiD consists of a bi-directional encoder and a decoder: it encodes paragraphs individually on the encoder side and concatenates the paragraph representations as the decoder input. Given a document D = {P_1, P_2, ..., P_k} containing k paragraphs, each paragraph P_i = {x_1, x_2, ..., x_m} contains m tokens, and h denotes the hidden dimension. Following the previous method (Izacard and Grave, 2021a), the question and paragraph are concatenated in a "question: title: paragraph:" fashion. The textual representation H_text ∈ R^{k×m×h} is obtained by encoding all paragraphs individually with the FiD encoder.
Graph Construction We construct a multi-view heterogeneous temporal graph based on the relationships between facts and time in the document, as illustrated in Figure 2. The multi-view graph consists of a fact-focused graph G_fact = (V, E_fact) and a time-focused graph G_time = (V, E_time), which share the same nodes but have different edges.
Procedure Constructing a multi-view temporal graph consists of document segmentation, node extraction and edge construction. Firstly, we segment the document into chunks (or paragraphs) based on headings, with similar content within each chunk. Afterward, we extract time and fact nodes, representing time intervals and events, using regular expressions. Finally, we construct edges based on the relationships between nodes and chunks. Besides, to mitigate the graph sparsity problem, we introduce a global node to aggregate information among fact nodes. We introduce nodes, edges, and views in the following.
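The time-node extraction step above can be sketched as follows. The regular expressions are our illustrative assumptions (the paper does not give its exact patterns), and month-level granularity (e.g. "March 1988" → 1988.25) is omitted for brevity:

```python
import re

# Illustrative TimeX pattern covering the four time-node categories
# (before / after / between / in); an assumption, not the paper's exact regex.
TIMEX = re.compile(
    r"\b(?P<rel>before|after|in|between)\s+"
    r"(?P<y1>\d{4})(?:\s+and\s+(?P<y2>\d{4}))?",
    re.IGNORECASE,
)

def extract_time_nodes(sentence):
    """Return (category, time-interval) pairs found in one sentence."""
    nodes = []
    for m in TIMEX.finditer(sentence):
        rel = m.group("rel").lower()
        y1 = float(m.group("y1"))
        if rel == "between":
            if m.group("y2"):  # need both endpoints
                nodes.append(("between", (y1, float(m.group("y2")))))
        elif rel == "before":
            nodes.append(("before", (float("-inf"), y1)))
        elif rel == "after":
            nodes.append(("after", (y1, float("inf"))))
        else:  # "in 1995" covers the whole year
            nodes.append(("in", (y1, y1 + 1.0)))
    return nodes
```

Each extracted interval becomes a time node in the temporal layer, attached to the fact node for the sentence it came from.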
Nodes According to the roles they play in temporal reasoning, we define six node types: the global node is responsible for aggregating the overall information; fact nodes represent events at the sentence-level granularity (e.g. He enrolled at Stanford); time nodes are divided into four categories according to their temporal states, before, after, between, and in, which represent the time interval of the events (e.g. Between 1980 and March 1988 → [1980, 1988.25]). In the temporal graph, the global node and fact nodes are located at the factual layer, and time nodes at the temporal layer.
Edges In the temporal layer, we build edges according to the temporal relationships between time nodes, including before, after and overlap. In the factual layer, we build edges according to the discourse relationships between fact nodes. Facts in the same paragraph usually share a common topic; accordingly, we construct densely-connected intra-paragraph edges among these fact nodes. For facts in different paragraphs, we pick two fact nodes from each of the two paragraphs to construct inter-paragraph edges. The temporal and factual layers are bridged by time-to-fact edges, which are uni-directional, from times to facts. The global node is connected with all fact nodes, from facts to the global node. Appendix A.8 and A.9 provide an example of a temporal graph and a more detailed graph construction process.
Views We construct two views, the fact-focused view and the time-focused view. Multiple views make it possible to model both absolute relationships between time expressions (e.g. 1995 is before 2000 because 1995 < 2000) and relative relationships between events (e.g. Messi joins Inter Miami CF after his World Cup championship). With only one view, it is difficult to model both relationships simultaneously. In the time-focused view, time comparisons occur between time nodes, and fact nodes interact indirectly through time nodes as bridges; in the fact-focused view, the relative temporal relationships between facts are modeled directly. The model can sufficiently capture the temporal relationships among facts by letting the two views complement each other. To obtain the fact-focused view, we replace the discourse-relation edges between fact nodes in the time-focused view with the temporal-relation edges between the corresponding time nodes, and replace the edges between time nodes with the discourse relations of the fact nodes in a similar way. The comparison of the two views is shown at the top of Figure 2(b).
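The edge-swapping idea behind the two views can be sketched with a minimal graph representation. The dict layout and the assumption of a one-to-one fact-time pairing are ours, for illustration only:

```python
def build_fact_focused_view(time_view):
    """Derive the fact-focused view from the time-focused one.

    `time_view` is a dict with:
      - "fact_edges": discourse-relation edges between fact nodes
      - "time_edges": temporal-relation edges between time nodes
      - "time_of":    maps each fact node to its time node
    In the fact-focused view the two in-layer edge sets swap roles:
    temporal relations are re-attached to the corresponding fact nodes,
    and discourse relations to the corresponding time nodes.
    (A sketch of the view-swapping idea, not the paper's exact procedure.)
    """
    time_of = time_view["time_of"]
    fact_of = {t: f for f, t in time_of.items()}  # assumes 1-to-1 pairing
    fact_edges = [(fact_of[u], rel, fact_of[v])
                  for u, rel, v in time_view["time_edges"]]
    time_edges = [(time_of[u], rel, time_of[v])
                  for u, rel, v in time_view["fact_edges"]]
    return {"fact_edges": fact_edges, "time_edges": time_edges,
            "time_of": time_of}
```

With this construction the relative temporal relation (e.g. before) sits directly between the fact nodes in the fact-focused view, while the time-focused view keeps it between the time nodes.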

Multi-view Temporal Graph Reasoning
Temporal Graph Reasoning First, we initialize the node representations using the text representation and then apply a linear transformation to the nodes according to their types.
where h_1, ..., h_l are the textual representations corresponding to node v, W_t ∈ R^{h×h} is the linear transformation corresponding to node type t, n is the number of nodes, and V^(0) ∈ R^{n×h} serves as the first-layer input to the graph neural network.
We follow heterogeneous graph neural networks (Schlichtkrull et al., 2018; Busbridge et al., 2019) to deal with the different relations between nodes. In this paper, we use a heterogeneous graph attention network.
We define the notation as follows: W_r, W_r^Q and W_r^K represent the node transformation, query transformation and key transformation under relation r, respectively. They are all learnable parameters.
First, we perform a linear transformation on the nodes according to the relations under which they occur.
Afterwards, we calculate the attention scores 1 in order to aggregate the information.
Finally, we obtain the updated node representation by aggregating based on the attention score.
Here we use the multi-head attention mechanism (Vaswani et al., 2017), where K is the number of attention heads.
We obtain the final graph representation through an L-layer heterogeneous graph neural network.
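The steps above (relation-specific transforms, attention scoring, and aggregation) can be sketched as a single-head numpy layer. This is a simplified sketch of the described mechanism, not the paper's implementation; the multi-head machinery and per-type node initialization are omitted:

```python
import numpy as np

def hga_layer(V, edges, W_node, W_q, W_k):
    """One heterogeneous graph attention layer (single-head sketch).

    V: (n, h) node features; edges: list of (src, rel, dst) triples;
    W_node / W_q / W_k: dicts mapping each relation r to an (h, h)
    matrix, i.e. the relation-specific node / query / key transforms.
    Each node aggregates messages from its in-neighbours, weighted by
    softmax-normalised scaled dot-product attention scores.
    """
    n, h = V.shape
    out = np.zeros_like(V)
    for dst in range(n):
        incoming = [(s, r) for s, r, d in edges if d == dst]
        if not incoming:          # isolated node: keep its features
            out[dst] = V[dst]
            continue
        scores, msgs = [], []
        for src, r in incoming:
            q = V[dst] @ W_q[r]                  # query under relation r
            k = V[src] @ W_k[r]                  # key under relation r
            scores.append(q @ k / np.sqrt(h))    # scaled dot product
            msgs.append(V[src] @ W_node[r])      # relation-specific message
        a = np.exp(np.array(scores) - max(scores))
        a /= a.sum()                             # softmax over in-neighbours
        out[dst] = sum(w * m for w, m in zip(a, msgs))
    return out
```

Stacking L such layers (with residual connections and multi-head attention, as in the paper) yields the final graph representation.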
1 Formula of multi-head attention is omitted for readability.
Adaptive Multi-view Graph Fusion Through the graph reasoning module introduced above, we obtain the fact-focused and time-focused graph representations, V_f ∈ R^{n×h} and V_t ∈ R^{n×h}, respectively. These two graph representations have their own focus, and to consider both perspectives concurrently, we adopt an adaptive fusion mechanism (Li et al., 2022; Zhang et al., 2023) to model the interaction between them, as illustrated in Figure 2(b).
where W_f ∈ R^{h×h} and W_t ∈ R^{h×h} are learnable parameters.
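A common formulation of such a gate, consistent with the W_f, W_t ∈ R^{h×h} parameters given above, is sketched below. The exact gating equation is our assumption, since the formula itself is not reproduced in the text:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def adaptive_fusion(V_f, V_t, W_f, W_t):
    """Gated fusion of fact-focused and time-focused graph features.

    A sigmoid gate decides, per node and per dimension, how much of
    each view to keep; the output is a convex combination of the views.
    """
    g = sigmoid(V_f @ W_f + V_t @ W_t)   # gate values in (0, 1)
    return g * V_f + (1.0 - g) * V_t     # element-wise view mixing
```

When the gate is uninformative (zero weights), the fusion reduces to an even average of the two views.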
Self-supervised Time-comparing Objective To further improve the implicit reasoning capability of models over absolute time, we design a self-supervised time-comparing objective. First, we transform each extracted TimeX into a time interval represented by floating-point numbers. After that, the fact nodes involving a TimeX in the graph are paired two by two, and three labels are generated according to the relationship between the two time intervals: before, after, and overlap. Assuming that there are N fact nodes in the graph, N(N − 1)/2 self-supervised examples can be constructed, and we use cross-entropy to optimize them.
where h_i, h_j are the node representations and y_{i,j} is the pseudo-label. [;] denotes vector concatenation.
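The pseudo-label construction described above can be sketched directly; the interval-comparison rules are the natural reading of before/after/overlap on float intervals:

```python
from itertools import combinations

def compare_intervals(a, b):
    """Label the relation between two (start, end) time intervals."""
    if a[1] <= b[0]:
        return "before"
    if a[0] >= b[1]:
        return "after"
    return "overlap"

def time_comparing_examples(node_intervals):
    """Build the N(N-1)/2 self-supervised pairs with pseudo-labels.

    `node_intervals` maps a fact-node id to its float time interval,
    e.g. "Between 1980 and March 1988" -> (1980.0, 1988.25).
    """
    return [(i, j, compare_intervals(ai, aj))
            for (i, ai), (j, aj) in combinations(node_intervals.items(), 2)]
```

The resulting (i, j, label) triples are the training signal for the cross-entropy time-comparing loss.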

Time-enhanced Decoder
Question-guided Text-graph Fusion Now we have the text representation H_text and the multi-view graph representation V_fuse. Next, we dynamically fuse the text representation and the graph representation guided by the question to get the time-enhanced representation H_fuse, which will be fed into the decoder to generate the final answer, as illustrated in Figure 2(c).
where q_query is the query embedding, MHCA is multi-head cross-attention, and AdapGate is the adaptive gate introduced in the graph fusion section.
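A minimal single-head sketch of this question-guided step is given below. It reduces MHCA to one attention head and simplifies AdapGate to a plain residual addition, so it illustrates only the information flow, not the paper's exact module:

```python
import numpy as np

def question_guided_fusion(H_text, V_graph, q):
    """Single-head sketch of question-guided text-graph fusion.

    H_text: (m, h) token features; V_graph: (n, h) node features;
    q: (h,) question query embedding. The query attends over the graph
    nodes (a stand-in for multi-head cross-attention), and the pooled,
    question-relevant graph context is added to every token feature.
    """
    scores = V_graph @ q / np.sqrt(len(q))   # (n,) attention logits
    a = np.exp(scores - scores.max())
    a /= a.sum()                             # softmax over graph nodes
    graph_ctx = a @ V_graph                  # (h,) question-relevant context
    return H_text + graph_ctx                # broadcast over the m tokens
```

The output plays the role of the time-enhanced representation H_fuse fed to the decoder.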
Time-enhanced Decoding Finally, the time-enhanced representation H_fuse is fed into the decoder to predict the answer, and we use the typical teacher-forcing loss to optimize the model.

Training Objective
The training process has two optimization objectives: a teacher-forcing loss for generating answers by maximum likelihood estimation and a time-comparing loss for reinforcing the time-comparing ability of the graph module. We use a multi-task learning approach to optimize them.
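The two objectives can be combined as a weighted sum; the formula below is our sketch of the omitted equation, with L_answer and L_tc as our labels for the teacher-forcing and time-comparing losses:

```latex
\mathcal{L} \;=\; \mathcal{L}_{\text{answer}} \;+\; \lambda\,\mathcal{L}_{\text{tc}}
```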
where λ is a hyper-parameter.
Experiments

SituatedQA is an open-domain QA dataset in which each question has a specified temporal context; it is built on the NQ-Open dataset. Following the original experimental setups, we use EM and F1 on the TimeQA dataset and EM on the SituatedQA dataset as evaluation metrics. In addition, we calculate metrics for each question type. For more information about the datasets, please refer to Appendix A.1.
For the large-scale language model baselines, we sample 10 answers per question; @mean and @max represent the average and best results, respectively. We use gpt-3.5-turbo@max in subsequent experiments unless otherwise stated.
Implementation Details We use the PyTorch framework, Hugging Face Transformers for pre-trained models and PyG for graph neural networks. We use base-size models with hidden dimension 768 for all experiments. For all linear transformations in the graph neural networks, the dimension is 768×768. The learning-rate schedule is warm-up for the first 20% of steps, followed by cosine decay. We use AdamW as the optimizer (Loshchilov and Hutter, 2019). All experiments are conducted on a single Tesla A100 GPU. The training cost on TimeQA and SituatedQA is approximately 4 hours and 1.5 hours, respectively. Please refer to Appendix A.3 for detailed hyperparameters.

Main Result
Results on TimeQA Table 1 shows the overall experimental results of the baselines and our methods on the TimeQA dataset. The first part shows the results of the LLM baselines. Even though the LLMs demonstrate amazing reasoning ability, they perform poorly in temporal reasoning. The second part shows the results of the supervised baselines, where BigBird performs on par with gpt-3.5-turbo, and FiD performs better than BigBird.
Our method MTGER outperforms all baselines, achieving 60.40 EM / 69.44 F1 on the easy split and 53.19 EM / 61.42 F1 on the hard split. MTGER obtains a 2.43 EM / 1.92 F1 and 3.39 EM / 2.95 F1 performance boost compared to the state-of-the-art QA model, FiD. We attribute this significant performance boost to the explicit modelling of time and the interaction across paragraphs. In addition, MTGER++ further improves performance through long-context adaptation pre-training for the reader model, achieving 60.95 EM / 69.89 F1 on the easy split and 54.11 EM / 62.40 F1 on the hard split. Please refer to Appendix A.4 for more details about MTGER++.

Results on TimeQA Human-Paraphrased The questions in TimeQA human-paraphrased are rewritten by human workers in a more natural manner, making the questions more diverse and fluent. As shown in Table 2, the performance of the LLMs increases rather than decreases, which we think may be related to the questions becoming more natural and fluent. All supervised baselines show some performance degradation. Our method still outperforms all baselines and shows less performance degradation than the supervised baselines. These experimental results illustrate that our method is able to adapt to more diverse questions.
Results on SituatedQA To investigate the generalization of MTGER for temporal reasoning, we evaluate it on the SituatedQA dataset; the results are shown in Table 3. According to the results, our method is particularly good at handling hard questions that require implicit reasoning. We believe the explicit temporal modelling provided by the heterogeneous temporal graphs improves the implicit temporal reasoning capability of models. In the next section, we perform an ablation study on the different modules of MTGER to verify this conjecture.

Ablation Study
We investigate the effects of the text-graph fusion, the time-comparing objective and the multi-view temporal graph. For the multi-view temporal graph, we conduct a more detailed analysis by sequentially removing the multi-view graph, the heterogeneous temporal graph and the homogeneous temporal graph. Experimental results are shown in Table 5.

Effect of Text-graph Fusion
The question-guided text-graph fusion module dynamically selects the graph information that is more relevant to the question. As shown in Table 5(a), removing this module results in a performance degradation of 0.54 EM / 0.43 EM and 0.87 EM / 0.80 EM on the dev and test sets, respectively. This suggests that the question-guided text-graph fusion module can improve overall performance by dynamically selecting more useful information.
Effect of Time-comparing Objective As shown in Table 5(b), removing the time-comparing objective has almost no effect on the easy split, while the performance degradation on the dev and test sets of the hard split is 0.78 EM / 0.54 EM, indicating that the time-comparing objective mainly improves the implicit temporal reasoning capability of models.

Effect of Multi-view Temporal Graph
We sequentially remove the multi-view temporal graph (c.1; keeping only one heterogeneous temporal graph), replace the heterogeneous temporal graph with a homogeneous temporal graph (c.2; not distinguishing the relations between nodes), and remove the graph structure entirely (c.3) to explore the effect of the graph structure in detail.
Removing the multi-view temporal graph (c.1) brings an overall performance degradation of 0.48 EM / 0.78 EM and 0.37 EM / 0.77 EM on the dev set and test set, respectively, implying that the complementary nature of the multi-view mechanism helps to capture sufficient temporal relationships between facts, especially implicit relationships. Replacing the heterogeneous temporal graph with a homogeneous temporal graph (c.2) causes the GNN to lose the ability to explicitly model the temporal relationships between facts, leaving only the ability to interact across paragraphs. The performance degradation is slight on the easy split, but significant on the hard split, at 1.52 EM / 1.64 EM compared with (c.1), which indicates that explicit modelling of the temporal relationships between facts can significantly improve the implicit reasoning capability.
Removing the graph structure also means removing both the text-graph fusion and the time-comparing objective, which degrades the model to a FiD model (c.3). At this point, the model loses the cross-paragraph interaction ability, and there is an overall degradation in performance, which suggests that cross-paragraph interaction can improve overall performance by establishing connections between facts and times.

Consistency Analysis
To investigate whether the model can consistently give the correct answer when the time specifier of a question is perturbed, we conduct a consistency analysis. The experimental results in Table 7 show that our method exceeds the baseline by up to 18%, which indicates that our method is more consistent and robust than the baselines. Please refer to Appendix A.7 for details of the consistency analysis.

Case Study
We show two examples from the consistency analysis to illustrate the consistency of our model in the face of question perturbations, as shown in Table 6. Both examples are from the TimeQA hard dev set.
The first example shows the importance of implicit temporal reasoning. From the context, we know that Beatrix received her bachelor's degree in 1954 and her master's degree in 1956. The master's came after the bachelor's, so we can infer that she was enrolled in a master's program between 1954 and 1956; 1954-1955 lies within this interval, so she was enrolled at Brown University during this period. Since FiD lacks explicit temporal modelling, its implicit temporal reasoning ability is weak and it fails to predict the correct answer.
The second example shows the importance of question understanding. The question asks which school he attended, which is not mentioned in the context. FiD incorrectly interprets the question as asking where he worked, failing to understand the question and giving the wrong answer. Our method, which uses a question-guided fusion mechanism, allows for better question understanding and consistently gives the correct answer.
Related Work

Temporal Reasoning in NLP

Knowledge Base Temporal Reasoning Knowledge base QA retrieves facts from a knowledge base using natural language queries (Berant et al., 2013; Bao et al., 2016; Lan and Jiang, 2020; He et al., 2021; Srivastava et al., 2021). In recent years, some benchmarks have specifically focused on temporal intents, including TempQuestions (Jia et al., 2018a) and TimeQuestions (Jia et al., 2021). TEQUILA (Jia et al., 2018b) decomposes complex questions into simple ones by heuristic rules and then solves the simple questions via general KBQA systems. EXAQT (Jia et al., 2021) uses Group Steiner Trees to find a subgraph and reasons over it with an RGCN. SF-TQA (Ding et al., 2022) generates query graphs by exploring the relevant facts of entities to retrieve answers.
Textual Temporal Reasoning Textual temporal reasoning pays more attention to temporal understanding of real-world text, including both absolute timestamps (e.g. before 2019, in the late 1990s) and relative temporal relationships (e.g. A occurs before B). To address this challenge, Chen et al. (2021) propose the TimeQA dataset, and Zhang and Choi (2021) propose the SituatedQA dataset. Previous work directly adopts long-context QA models (Izacard and Grave, 2021b; Zaheer et al., 2020) and lacks explicit temporal modelling.
In this paper, we focus on temporal reasoning over documents and explicitly model the temporal relationships between facts by graph reasoning over multi-view temporal graphs.

Question Answering for Long Context
Since the computation and memory overhead of Transformer-based models grows quadratically with the input length, additional means are required to reduce the overhead when dealing with long-context input. ORQA (Lee et al., 2019) and RAG (Lewis et al., 2020b) select a small number of relevant contexts to feed into the reader through a retriever, but much useful information may be lost this way. FiD (Izacard and Grave, 2021b) and M3 (Wen et al., 2022) reduce the overhead at the encoder side by splitting the context into paragraphs and encoding them independently. However, this may lead to insufficient interaction at the encoder side, and FiE (Kedia et al., 2022) introduces a global attention mechanism to alleviate this problem. Longformer, LED, LongT5, and BigBird (Beltagy et al., 2020b; Guo et al., 2022; Zaheer et al., 2020) reduce the overhead by sliding-window and sparse attention mechanisms.

Conclusion
In this paper, we devise MTGER, a novel temporal reasoning framework over documents. MTGER explicitly models temporal relationships through multi-view temporal graphs. The heterogeneous temporal graphs model the temporal and discourse relationships among facts, and the multi-view mechanism integrates information from the time-focused and fact-focused perspectives. Furthermore, we design a self-supervised objective to enhance implicit reasoning and dynamically aggregate text and graph information through the question-guided fusion mechanism. Extensive experimental results demonstrate that MTGER achieves better performance than state-of-the-art methods and gives more consistent answers in the face of question perturbations on two document-level temporal reasoning benchmarks.

Limitations
Although our proposed method exhibits excellent performance in document-level temporal reasoning, research in this field still has a long way to go. We discuss the limitations as well as possible directions for future work. First, the automatically constructed sentence-level temporal graphs are slightly coarse in granularity; a fine-grained temporal graph could be constructed by combining an event extraction system in future work to accurately capture fine-grained event-level temporal clues. Second, our method does not produce a temporal reasoning process; in future work, one could consider adding a neural symbolic reasoning module to provide better interpretability.

A Appendix
A.1 Datasets Details TimeQA TimeQA contains a main dataset and a human-paraphrased dataset, where the questions in the main dataset are synthesized from templates and the questions in the human-paraphrased dataset are rewritten manually by humans. The questions in the dataset include both explicit and implicit forms. Explicit question types comprise in-explicit and between-explicit, and implicit question types comprise in-implicit, between-implicit, before-implicit, and after-implicit. The easy split contains only explicit questions, and the hard split contains mostly implicit questions and a few explicit questions. The statistics of the dataset are shown in Table 8 and Table 9.
SituatedQA The same question may have different answers depending on the context. SituatedQA is an open-domain QA dataset that requires a specific context to produce the correct answer. The dataset contains two types of context, temporal and geographic, and we use the questions with the temporal context in this paper. The evaluation metrics in the original paper include EM-One (the answer exactly matches the correct context) and EM-Any (the answer matches any of the annotated contexts), and we use EM-One as the evaluation metric. The original task form of SituatedQA is open-domain QA, and we transform it into long-context QA by retrieving relevant fragments using a dense retriever. Please refer to Appendix A.2 for the retrieval details.
Explicit and Implicit Questions Explicit questions: the timestamps in the question appear in the document. Implicit questions: (a) the timestamps in the question do not appear in the document; (b) vague timestamps, such as the 1990s and the 21st century; (c) timestamps involving commonsense knowledge, such as World War II lasting from 1939 to 1945.

Time-sensitive Questions We refer to the definition in Chen et al. (2021): (a) each question contains a time identifier; (b) changing the time identifier causes the answer to change; (c) temporal reasoning is required to answer the question.

A.2 Details of SituatedQA Retrieval
Following DPR (Karpukhin et al., 2020), we divide Wikipedia into non-overlapping windows of length 100. We use the Pyserini (Lin et al., 2021) library, with DPR as the dense retriever. We select the top 20 most similar segments for each question as the context. The English Wikipedia dump we use is dated Feb 20, 2021.

A.3 Hyperparameters
The hyperparameters we used during our experiments are shown in Table 10. We search through the listed hyperparameters and end up using the bolded ones in each row.

A.4 Stronger Reader with Long-context Adaptation
We perform long-context adaptation pre-training for models that do not support long-context input. We replace the reader backbone with UnifiedQA (Khashabi et al., 2020), a strong QA model. However, UnifiedQA is built on vanilla T5 (Raffel et al., 2020) and does not support long-context input, which requires additional adaptation. We use an approach similar to FiD (Izacard and Grave, 2021b), forcing the model to encode paragraphs separately at the encoder side and delaying the interaction across paragraphs to the decoder side, and train for 3000 steps on a long-context QA task with a FiD-style input format for adaptation. Finally, we obtain a UnifiedQA model adapted to long-context input. We replace the reader of MTGER with the long-context-adapted UnifiedQA to obtain MTGER++.

A.5 Prompts
The prompts we use are shown in Figure 4 and Figure 5. We try three different prompts and choose the one that performs best on the TimeQA dev set. We maintain a pool of few-shot examples. For the k-shot prompt, k examples are drawn from it at a time, and we ensure that these k examples cover all question types in the current dataset split. Due to the maximum length limitation of the LLMs, the few-shot examples cannot take the full context. Therefore, we filter out context irrelevant to the answer based on the original annotation data to ensure that the context does not exceed the maximum length limit.

A.6 Entire Results of LLMs
We conduct experiments using gpt-3.5-turbo (OpenAI, 2022) and text-davinci-003 (Ouyang et al., 2022) on the TimeQA dataset. The entire experimental results are shown in Table 11. Gpt-3.5-turbo significantly outperforms text-davinci-003 in most cases, but both still have a large gap compared to our method.

Figure 1 :
Figure 1: An example of a time-sensitive question involving implicit reasoning, best viewed in color. (a) describes the question, answer and constraints; (b) shows a time-involved document; (c) depicts the timeline of events in the document; (d) illustrates the reasoning process from the human perspective, including three steps.

Figure 2 :
Figure 2: Overview of MTGER. Best viewed in color. (a) Encoding the context and constructing the multi-view temporal graph; (b) temporal graph reasoning over the multi-view temporal graph with the time-comparing objective and adaptive multi-view fusion; (c) question-guided text-graph fusion and answer decoding.

Table 1 :
Main results on the TimeQA dataset. All supervised models are base size and we report the average results of three runs. Best and second-best results are highlighted in bold and underline.

Table 2 :
Results on the TimeQA Human-Paraphrased dataset. Gap represents the performance gap compared with the main dataset.

Table 3 :
Results on the SituatedQA Temp test set.Evaluation metric is EM from the correct temporal context.

Table 4 :
Results for different question types on the TimeQA test set.Evaluation metric is EM.

Table 5 :
Results of the ablation study on the TimeQA dataset. Evaluation metric is EM.

Table 6 :
Case study from the consistency analysis. Q stands for the original question, Q′ for the perturbed question and Answer′ for the answer after question perturbation.

Table 9 :
The proportion of questions of different categories in the TimeQA training set.

Table 11 :
Entire results of LLM baseline on the TimeQA dataset.