Fusing Temporal Graphs into Transformers for Time-Sensitive Question Answering

Answering time-sensitive questions from long documents requires temporal reasoning over the times in questions and documents. An important open question is whether large language models can perform such reasoning solely using a provided text document, or whether they can benefit from additional temporal information extracted using other systems. We address this research question by applying existing temporal information extraction systems to construct temporal graphs of events, times, and temporal relations in questions and documents. We then investigate different approaches for fusing these graphs into Transformer models. Experimental results show that our proposed approach for fusing temporal graphs into input text substantially enhances the temporal reasoning capabilities of Transformer models with or without fine-tuning. Additionally, our proposed method outperforms various graph convolution-based approaches and establishes a new state-of-the-art performance on SituatedQA and three splits of TimeQA.


Introduction
Long-document time-sensitive question answering (Chen et al., 2021) requires temporal reasoning over the events and times in a question and an accompanying long context document. Answering such questions is a challenging task in natural language processing (NLP), as models must comprehend and interpret the temporal scope of the question as well as the associated temporal information dispersed throughout the long document. For example, consider the time-sensitive questions about George Washington's position provided in Figure 1. The relevant events and temporal information regarding George Washington's position are scattered across many different sentences in the context document. Since no single text segment contains the answer, the model must integrate and reason over events and times throughout the context document. Additionally, this example illustrates how changing the time expression in the question may also change the answer: in this case, replacing "between 1776 - 1780" with "from 1790 to 1797" changes the answer from Commander-in-Chief to Presidency and Chancellor.

Though not designed directly for question answering, there is a substantial amount of research on temporal information extraction (Chambers et al., 2014; Ning et al., 2018a; Zhang and Xue, 2018, 2019; Han et al., 2019; Ning et al., 2019; Vashishtha et al., 2019; Ballesteros et al., 2020; Yao et al., 2020; Zhang et al., 2022a). Such models can help reveal the structure of the timeline underlying a document. However, there is little existing research on combining such information extraction systems with question answering Transformer models (Izacard and Grave, 2021) to effectively reason over temporal information in long documents.
In this work, we utilize existing temporal information extraction systems to construct temporal graphs and investigate different fusion methods to inject them into Transformer models.
We evaluate the effectiveness of each temporal graph fusion approach on long-document time-sensitive question answering datasets. Our contributions are as follows:

1. We introduce a simple but novel approach to fuse temporal graphs into the input text of question answering Transformer models.
2. We compare our method with prior approaches such as fusion via graph convolutions, and show that our input fusion method outperforms these alternative approaches.
3. We demonstrate that our input fusion approach can be used seamlessly with large language models in an in-context learning setting.
4. We perform a detailed error analysis, revealing the efficacy of our method in fixing temporal reasoning errors in Transformer models.
Related Work

Extracting Temporal Graphs
Research on extracting temporal graphs from text can be grouped into the extraction of event and time graphs (Chambers et al., 2014; Ning et al., 2018b), contextualized event graphs (Madaan and Yang, 2021), and temporal dependency trees and graphs (Zhang and Xue, 2018, 2019; Yao et al., 2020; Ross et al., 2020; Choubey and Huang, 2022; Mathur et al., 2022). Additionally, some prior work has focused on the problem of extracting temporal relations between times and events (Ning et al., 2018a, 2019; Vashishtha et al., 2019; Han et al., 2019; Ballesteros et al., 2020; Zhang et al., 2022a). Huang et al. (2022) and Shang et al. (2021) answer temporal questions on text, but they focus on temporal event ordering questions over short texts rather than time-sensitive questions over long documents. Li et al. (2023) focus on exploring large language models for information extraction in structured temporal contexts. They represent the extracted time-sensitive information in code and then execute Python scripts to derive the answers. In contrast, we concentrate on temporal reasoning in the reading comprehension setting, using unstructured long documents to deduce answers. This poses more challenges in information extraction and involves more complex reasoning, which motivates our integration of existing temporal information extraction systems with Transformer-based language models. The most similar work to ours is Mathur et al. (2022), which extracts temporal dependency graphs and merges them with Transformer models using learnable attention mask weights. We compare directly to this approach, and also explore both graph convolutions and input modifications as alternatives for fusing temporal graphs into Transformer models.

Fusing Graphs into Transformer Models
The most common approaches for fusing graphs into Transformer models are graph neural networks (GNNs) and self-attention. In the GNN-based approach, a GNN is used to encode and learn graph representations which are then fused into the Transformer model (Yang et al., 2019; Feng et al., 2020; Yasunaga et al., 2021; Zhang et al., 2022b). In the self-attention approach, the relations in the graphs are converted into token-to-token relations and are then fused into the self-attention mechanism. For example, Wang et al. (2020) use relative position encoding (Shaw et al., 2018) to encode a database schema graph into the BERT representation. Similarly, Bai et al. (2021) utilize attention masks to fuse syntax trees into Transformer models. We explore GNN-based fusion of temporal graphs into question answering models, comparing this approach to the attention-based approach of Mathur et al. (2022), as well as our simpler approach which fuses the temporal graph directly into the Transformer model's input.

Method
Our approach applies temporal information extraction systems to construct temporal graphs and then fuses the graphs into pre-trained Transformer models. We consider two fusion methods:

1. Explicit edge representation fusion (ERR): a simple but novel approach that fuses the graphs into the input text.
2. Graph neural network fusion: a GNN is used to fuse the graphs into the token embeddings or the last hidden layer representations (i.e., contextualized embeddings) of the Transformer model.

The overall approach is illustrated in Figure 2.

Graph Construction
Given a time-sensitive question and a corresponding context document, we construct a directed temporal graph where events and time expressions are nodes and temporal relations are edges of the type BEFORE, AFTER, INCLUDES, INCLUDED BY, SIMULTANEOUS, or OVERLAP.
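Concretely, the constructed graph is just a set of typed, directed edges over event and time nodes. A minimal sketch using networkx (node IDs and attribute names here are illustrative, not a prescribed format):

```python
import networkx as nx

# The six temporal relation types used as edge labels.
RELATIONS = {"BEFORE", "AFTER", "INCLUDES", "INCLUDED_BY", "SIMULTANEOUS", "OVERLAP"}

g = nx.MultiDiGraph()
g.add_node("q_time", kind="question-time", text="between 1776 - 1780")
g.add_node("e1", kind="document-event", text="created")
g.add_node("t1", kind="document-time", text="June 14, 1775")

# Temporal relations become typed, directed edges.
g.add_edge("e1", "t1", relation="INCLUDED_BY")  # the creating happened on that date
g.add_edge("t1", "q_time", relation="BEFORE")   # June 14, 1775 precedes 1776 - 1780
```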
We extract the single timestamp included in each question, which is either explicitly provided by the dataset (as in SituatedQA (Zhang and Choi, 2021)) or alternatively is extracted via simple regular expressions (the regular expressions we use achieve 100% extraction accuracy on TimeQA (Chen et al., 2021)).We add a single question-time node to the graph for this time.
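As an illustration of the question-time extraction, a sketch with a couple of regular expressions in the spirit of (but not necessarily identical to) the ones we use:

```python
import re

# Illustrative patterns for TimeQA-style question times,
# e.g. "between 1776 - 1780", "from 1790 to 1797", "in 1788".
RANGE = re.compile(r"(?:between|from)\s+(\d{4})\s*(?:-|to)\s*(\d{4})")
SINGLE = re.compile(r"\b(?:in|as of|before|after)\s+(\d{4})\b")

def extract_question_time(question: str):
    """Return a (start_year, end_year) pair for the question-time node."""
    m = RANGE.search(question)
    if m:
        return int(m.group(1)), int(m.group(2))
    m = SINGLE.search(question)
    if m:
        year = int(m.group(1))
        return year, year
    return None

# extract_question_time("What was George Washington's position between 1776 - 1780?")
# -> (1776, 1780)
```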
For the document, we apply CAEVO to identify the events, time expressions, and the temporal relations between them. CAEVO follows the standard convention in temporal information extraction that events are only simple actions (typically verbs), with linking of these actions to subjects, objects, and other arguments left to dependency parsers (Pustejovsky et al., 2003; Verhagen et al., 2009, 2010; UzZaman et al., 2013). We add document-event and document-time nodes for each identified event and time, respectively, and add edges between nodes for each identified temporal relation.
To link the question-time node to the document-time nodes, we use SUTime to normalize time expressions to time intervals, and deterministically compute temporal relations between the question time and document times as edges. For example, given a document-time node "the year 2022" and a question-time node "from 1789 to 1797" from the question "What was George Washington's position from 1789 to 1797?", the times will be normalized to [2022-01-01, 2022-12-31] and [1789-01-01, 1797-12-31] respectively, and the temporal relation between them can then be computed as AFTER.
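Once both times are normalized to [start, end] intervals, the edge label follows from simple interval comparisons. A minimal sketch (ISO-format date strings compare correctly as plain strings, so no date library is needed):

```python
def interval_relation(q, d):
    """Relation of document time d to question time q, where each is an
    (start, end) pair of ISO date strings, e.g. ("1789-01-01", "1797-12-31")."""
    q_start, q_end = q
    d_start, d_end = d
    if d_end < q_start:
        return "BEFORE"
    if d_start > q_end:
        return "AFTER"
    if (d_start, d_end) == (q_start, q_end):
        return "SIMULTANEOUS"
    if d_start <= q_start and d_end >= q_end:
        return "INCLUDES"
    if d_start >= q_start and d_end <= q_end:
        return "INCLUDED BY"
    return "OVERLAP"

# interval_relation(("1789-01-01", "1797-12-31"), ("2022-01-01", "2022-12-31"))
# -> "AFTER"
```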
To link the question-time node to document events, for each document-event node we calculate the shortest path in the temporal graph between it and the question-time node, and recursively apply standard transitivity rules (see Appendix A.1) along the path to infer the temporal relation. For example, given a path where A is BEFORE B and B INCLUDES C, we can infer that the relation between A and C is BEFORE. An example of a constructed temporal graph for Q1 in Figure 1 is illustrated in Figure 3.
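A sketch of this path-based inference, with a deliberately partial composition table (the full set of transitivity rules is given in Appendix A.1):

```python
# Partial composition table: COMPOSE[(r1, r2)] gives the relation of A to C
# when A r1 B and B r2 C. Only a few illustrative entries are shown.
COMPOSE = {
    ("BEFORE", "BEFORE"): "BEFORE",
    ("BEFORE", "INCLUDES"): "BEFORE",
    ("AFTER", "AFTER"): "AFTER",
    ("INCLUDES", "INCLUDES"): "INCLUDES",
    ("SIMULTANEOUS", "BEFORE"): "BEFORE",
}

def infer_relation(path_relations):
    """Fold the composition table over the relations along a shortest path;
    return None if any step is undefined (no relation can be inferred)."""
    relation = path_relations[0]
    for nxt in path_relations[1:]:
        relation = COMPOSE.get((relation, nxt))
        if relation is None:
            return None
    return relation

# infer_relation(["BEFORE", "INCLUDES"]) -> "BEFORE"
```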

Graph Fusion
For both fusion methods, we concatenate the question and corresponding context document as an input sequence to the Transformer model. For example, given the question and document from Figure 1 Q1, the input is:

question: What was George Washington's position between 1776 - 1780? context: . . . Congress created the Continental Army on June 14, 1775 . . .

Explicit Edge Representation
In the ERR method, we mark a temporal graph's nodes and edges in the input sequence, using <question time> and </question time> to mark the question-time node, and relation markers such as <before> and </before> to mark the nodes in the context document and their relations to the question time. Thus, the ERR input for the above example is:

question: What was George Washington's position <question time>between 1776 - 1780</question time>? context: . . . Congress <before>created</before> the Continental Army on <before>June 14, 1775</before> . . .

This approach aims to make the model learn to attend to parts of the input sequence that may contain answer information. For instance, the model may learn that information related to the answer may be found near markers such as <overlap>, <includes>, <included by>, and <simultaneous>. Additionally, the model may learn that answer-related information may exist between <before> and <after>, even if the answer does not have any nearby temporal information.
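As a sketch, the marked input can be produced by inserting marker pairs at each node's character span, working right to left so earlier offsets remain valid (the function and span format are illustrative):

```python
def mark_err_input(text, nodes):
    """Insert ERR markers into `text`.
    `nodes`: list of (start, end, relation) character spans, where relation is
    e.g. "before", "after", or "question time"; spans must not overlap."""
    for start, end, rel in sorted(nodes, key=lambda n: n[0], reverse=True):
        text = (text[:start] + f"<{rel}>" + text[start:end]
                + f"</{rel}>" + text[end:])
    return text

doc = "Congress created the Continental Army on June 14, 1775"
nodes = [(9, 16, "before"), (41, 54, "before")]  # "created", "June 14, 1775"
# mark_err_input(doc, nodes) ->
# "Congress <before>created</before> the Continental Army on
#  <before>June 14, 1775</before>"
```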

GNN-based Fusion
In GNN-based fusion, we add <e> and </e> markers around each node and apply a relational graph convolution (RelGraphConv; Schlichtkrull et al., 2018) over the marked nodes. RelGraphConv is a variant of graph convolution (GCN; Kipf and Welling, 2017) that can learn different transformations for different relation types. We employ RelGraphConv to encode a temporal graph and update the Transformer encoder's token embedding layer or last hidden layer representations (i.e., contextualized embeddings). We utilize RelGraphConv in its original form without any modifications.
Formally, given a temporal graph $G = (V, E)$, we use representations of the <e> markers from the Transformer model's token embedding layer or the last hidden layer as initial node embeddings. The output of layer $l+1$ for node $i \in V$ is:

$$h_i^{(l+1)} = \sigma\Bigg(W_0 h_i^{(l)} + \sum_{r \in \mathcal{R}} \sum_{j \in N_r(i)} \frac{1}{c_{i,r}} W_r h_j^{(l)}\Bigg)$$

where $N_r(i)$ denotes all neighbor nodes that have relation $r$ with node $i$, $\frac{1}{c_{i,r}}$ is a normalization constant that can be learned or manually specified, $\sigma$ is the activation function, $W_0$ is the self-loop weight, and $W_r$ are the learnable relation-specific weights. We refer readers to Schlichtkrull et al. (2018) for more details.
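A minimal sketch of this fusion step with DGL's off-the-shelf RelGraphConv (the graph size, hidden width, and write-back step here are illustrative):

```python
import torch
import dgl
from dgl.nn import RelGraphConv

NUM_RELATIONS = 6  # BEFORE, AFTER, INCLUDES, INCLUDED BY, SIMULTANEOUS, OVERLAP
HIDDEN = 768       # must match the Transformer's embedding size

# Toy temporal graph: directed edges (src -> dst) with a relation type per edge.
src, dst = torch.tensor([0, 1, 2]), torch.tensor([1, 2, 0])
etypes = torch.tensor([0, 2, 1])  # e.g. BEFORE, INCLUDES, AFTER
g = dgl.graph((src, dst), num_nodes=3)

conv = RelGraphConv(HIDDEN, HIDDEN, NUM_RELATIONS, self_loop=True)

# Initial node embeddings: the Transformer's representations of each node's
# <e> marker token (random stand-ins here).
node_feats = torch.randn(3, HIDDEN)
updated = conv(g, node_feats, etypes)  # shape (3, HIDDEN)

# The updated node states are then written back onto the corresponding
# <e> marker positions in the token-embedding or last-hidden-layer sequence.
```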

Datasets
We evaluate on two time-sensitive question answering datasets: Time-Sensitive Question Answering (TimeQA; Chen et al., 2021) and SituatedQA (Zhang and Choi, 2021).We briefly describe these two datasets below and provide statistics on each dataset in Table 9 of Appendix A.6.
TimeQA The TimeQA dataset is comprised of time-sensitive questions about time-evolving facts, paired with long Wikipedia pages as context. The dataset has two non-overlapping splits generated by diverse templates, referred to as TimeQA Easy and TimeQA Hard, with each split containing 20k question-answer pairs regarding 5.5k time-evolving facts and 70 relations. TimeQA Easy contains questions which typically include time expressions that are explicitly mentioned in the context document. In contrast, TimeQA Hard has time expressions that are not explicitly specified and therefore require more advanced temporal reasoning. For example, both questions in Figure 1 are hard questions, but if we replace the time expressions in the questions with "In 1788", they become easy questions. In addition, smaller human-paraphrased Easy and Hard splits are also provided.

SituatedQA
SituatedQA is an open-domain question answering dataset comprising two subsets: Temporal SituatedQA and Geographical SituatedQA. We focus on Temporal SituatedQA, which we hereafter refer to as SituatedQA. Each question in SituatedQA is accompanied by a temporal annotation that could change the answer to the question if it were modified. For instance, the question "Which COVID-19 vaccines have been authorized for adults in the US as of Jan 2021?" has a corresponding answer of "Moderna, Pfizer." However, if we change the time to "Apr 10, 2021," the answer becomes "Moderna, Pfizer, J&J." As SituatedQA is a re-annotation of a subset of NQ-Open (Kwiatkowski et al., 2019a; Lee et al., 2019), we use the top 100 passages retrieved from Wikipedia by FiD (Izacard and Grave, 2021) for each question as context documents.
Evaluation Metrics We use the official evaluation methods and metrics provided in the code release of the datasets. For TimeQA, we report the exact match and F1 scores. For SituatedQA, we report the exact match score.
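For reference, a simplified sketch of the SQuAD-style exact match and token-level F1 computations that such evaluation scripts typically implement (the official scripts additionally strip punctuation and English articles during normalization):

```python
from collections import Counter

def normalize(s: str) -> str:
    # Simplified normalization; official scripts also strip
    # punctuation and articles.
    return " ".join(s.lower().split())

def exact_match(pred: str, gold: str) -> int:
    return int(normalize(pred) == normalize(gold))

def f1(pred: str, gold: str) -> float:
    p, g = normalize(pred).split(), normalize(gold).split()
    common = Counter(p) & Counter(g)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(p), overlap / len(g)
    return 2 * precision * recall / (precision + recall)

# exact_match("Presidency", "presidency") -> 1
# f1("George Washington", "Washington") -> 0.67 (rounded)
```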

Baselines
We compare our models with the previous state-of-the-art on TimeQA and SituatedQA, which we describe in this section and summarize in Table 1.
For TimeQA, we compare against:

FiD & BigBird Chen et al. (2021) adapt two long-document question answering models, BigBird (Zaheer et al., 2020) and FiD (Izacard and Grave, 2021), to the TimeQA dataset. Before fine-tuning the models on TimeQA, they also fine-tune on Natural Questions (Kwiatkowski et al., 2019b), finding that training on Natural Questions results in the best performance. The best model from Chen et al. (2021) on TimeQA Easy and Hard is FiD, while BigBird performs best on the Human-paraphrased TimeQA Easy and Hard splits.

Our base model is LongT5 (Guo et al., 2022), an efficient extension of T5 (Raffel et al., 2020) commonly used on long-document question answering tasks. To fairly compare with previous work (Chen et al., 2021; Mathur et al., 2022), we pre-train the LongT5 model on Natural Questions (Kwiatkowski et al., 2019a) and then fine-tune it on TimeQA and SituatedQA, respectively. Appendix A.5 provides other implementation details such as hyperparameters, graph statistics, software versions, and external tool performance. We perform model selection before evaluating on the test sets, exploring different graph subsets with both the ERR and GNN based fusion approaches introduced in Section 3.2. Table 6 in Appendix A.2 shows that the best ERR method uses a document-time-to-question-time (DT2QT) subgraph and the best GNN method uses the full temporal graph by fusing it into the token embedding layer representations of the Transformer model. We hereafter refer to the LongT5 model fused with a DT2QT graph using the ERR method as LongT5 ERR, and the LongT5 model fused with a full temporal graph using the GNN method as LongT5 GNN.

Main Results
We summarize the performance of baseline models and those trained with our graph fusion methods in Table 2.
Which baseline models perform best? On TimeQA, our LongT5 model without temporal graph fusion performs better than or equivalent to all other baseline models across every split and metric except for the Easy split. The best-performing model reported on TimeQA Easy is DocTime. On SituatedQA, LongT5 with no fusion performs as well as the best-reported results on this dataset.

Which graph fusion methods perform best?
Using LongT5, we consider both of our ERR and GNN fusion methods described in Section 3.2. On TimeQA, the LongT5 GNN model fails to outperform LongT5 without fusion, while the LongT5 ERR model improves over LongT5 on every split and dataset, exhibiting particularly large gains on the Hard splits. On SituatedQA, both LongT5 ERR and LongT5 GNN models improve over the no-fusion LongT5 baseline, with ERR again providing the best performance. The somewhat inconsistent performance of the GNN fusion method across datasets (beneficial on SituatedQA while detrimental on TimeQA) suggests the need for a different GNN design for TimeQA, which we leave to future work.
To explore the differences between LongT5 ERR and LongT5 GNN models, we analyze 20 randomly sampled examples from TimeQA Hard where LongT5 ERR is correct but LongT5 GNN is incorrect. In our manual analysis, all 20 examples share the same pattern: LongT5 GNN fails to capture explicit temporal expressions in the context and relate them to the question's timeline, which is crucial for deducing the right answer. This suggests that directly embedding precomputed temporal relations between time nodes into the input is more effective than fusing them implicitly through the GNN, allowing the model to utilize them more easily. Table 14 of Appendix A.11 shows three of the analyzed examples.

On TimeQA, the LongT5 ERR model achieves a new state-of-the-art on three of the four splits, with TimeQA Easy being the exception. On SituatedQA, the LongT5 ERR model achieves a new state-of-the-art, outperforming the best reported results on this dataset. Our model excels on datasets that require more temporal reasoning skills, like TimeQA Hard (where our model achieves a 5.8-point higher exact match score than DocTime) and SituatedQA (where our model achieves a 4.3-point higher exact match score than TSM).
Our approach offers further advantages over alternative models due to its simplicity, as summarized in Table 1. The best prior work on TimeQA Easy, DocTime, requires training a temporal dependency parser on additional data, using CAEVO, and modifying the Transformer model's attention mechanism. The best prior work on SituatedQA, TSM, requires an 11-billion parameter T5 model which is pre-trained on the entirety of Wikipedia. In contrast, our approach only uses SUTime to construct a graph, requires only minor adjustments to the Transformer model's input, and outperforms prior work using a model with only 250 million parameters.
Why does our model not achieve state-of-the-art performance on TimeQA Easy as it does on other splits and datasets? On TimeQA Easy, there is a performance gap between our LongT5 ERR model and DocTime. Because the DocTime model has not been released, we cannot directly compare with its predicted results. Instead, we randomly select 50 errors from our LongT5 ERR model's output on the TimeQA Easy development set for error analysis. Table 3 shows that most of the errors are false negatives, where the model's predicted answers are typically co-references to the correct answers (as in Table 3, example 1) or additional correct answers that are applicable in the given context but are not included in the gold annotations (as in Table 3, example 2). The remaining errors are primarily related to semantic understanding, including the inability to accurately identify answer entities (e.g., identifying Greek Prime Minister George Papandreou as an employer in Table 3, example 3), the inability to interpret negation (e.g., in Table 3, example 4, where "rejected" implies that Veloso did not join Benfica), and the inability to reason about time with numerical calculations (e.g., "a year later" in Table 3, example 5 implies 1846). Addressing the semantic understanding errors may require incorporating additional entities and their types into the graphs, as well as better processing of negation information and relative times.
To better understand the extent of false negatives on TimeQA Easy, we re-annotated the 392 test examples where the predictions of the replicated FiD model and our LongT5 ERR model are partially correct (i.e., EM = 0 and F1 > 0). We then incorporated additional coreferent mentions into the gold label set for these examples. For instance, if the original gold answer was "University of California," we added its coreferent mention "University of California, Los Angeles" to the gold answers. We then evaluate both the replicated FiD model (the best-performing model we can reproduce) and our LongT5 ERR model on the re-annotated TimeQA Easy split. The last two rows of Table 2 show that while the exact match score for FiD increases by 8.7, the exact match score for our LongT5 ERR model increases by 11.9. This suggests that our model may be incurring greater penalties for finding valid coreferent answers than baseline methods.
Does our ERR method benefit large language models using in-context learning? We have focused so far on temporal graph fusion when fine-tuning models, but large language models such as ChatGPT (OpenAI, 2022) and LLaMA (Touvron et al., 2023) can achieve impressive performance without additional fine-tuning via in-context learning. Therefore, we tested the performance of ChatGPT (gpt-3.5-turbo) both with and without ERR for fusing the question-time-to-document-time graph. Following previous work (Khattab et al., 2022; Trivedi et al., 2022; Yoran et al., 2023), and considering the cost of ChatGPT's commercial API, we randomly sample 500 examples from TimeQA Easy and TimeQA Hard for evaluation. The prompt format for ChatGPT remains the same as the input format described in Section 3.2.1, except that we concatenate in-context learning few-shot exemplars and task instructions before the input. We evaluate ChatGPT with and without graph fusion using an 8-shot setting. Examples of prompts are provided in Table 11 of Appendix A.8. Table 4 shows that our ERR graph fusion method improves the performance of ChatGPT on TimeQA Easy and particularly on TimeQA Hard. We note that this improvement is possible because our method can easily integrate with state-of-the-art large language models, as our approach to temporal graph fusion modifies only the input sequence. Prior work which relies on modifying attention mechanisms or adding graph neural network layers is incompatible with this in-context learning setting.
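A sketch of how an ERR-marked example is wrapped for in-context learning (instruction wording and exemplar formatting here are illustrative; Table 11 of Appendix A.8 shows the actual prompts), using the 0.x-style openai client available at the time of these experiments:

```python
import openai  # 0.x-style client; assumes OPENAI_API_KEY is set in the environment

INSTRUCTION = (
    "Answer the question based on the context. Markers such as <before> and "
    "<after> indicate each event's or time's relation to the question time."
)

# One illustrative few-shot exemplar; the experiments use 8.
EXEMPLARS = [
    ("question: What was George Washington's position <question time>between "
     "1776 - 1780</question time>? context: ... Congress <before>created"
     "</before> the Continental Army on <before>June 14, 1775</before> ...",
     "Commander-in-Chief"),
]

def build_prompt(err_input: str) -> str:
    shots = "\n\n".join(f"{x}\nanswer: {y}" for x, y in EXEMPLARS)
    return f"{INSTRUCTION}\n\n{shots}\n\n{err_input}\nanswer:"

response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": build_prompt("question: ... context: ...")}],
    temperature=0,
)
print(response["choices"][0]["message"]["content"])
```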

Analysis
In this section, we analyze our LongT5 ERR model on the TimeQA development set.

How do predictions differ compared to FiD?
We compare the predictions of LongT5 ERR to those of the replicated FiD model in Table 5. While LongT5 ERR and FiD both correct about the same number of each other's errors on TimeQA Easy (269 vs. 243), LongT5 ERR corrects many more of FiD's errors than the reverse on TimeQA Hard (494 vs. 260). To further analyze these cases, we sampled 10 errors from the set where LongT5 ERR was correct while FiD was incorrect, as well as from the set where FiD was correct while LongT5 ERR was incorrect. We did this across both TimeQA Easy and TimeQA Hard, totaling 40 examples. Among the 20 examples in which LongT5 ERR was correct and FiD was incorrect, 17 have node markers near the answers in the ERR input sequence, and the most frequent ones are <included by> and <overlap>. The remaining 3 examples have unanswerable questions.
In the examples in which FiD was correct while LongT5 ERR was incorrect, we observe that 13 examples are additional correct answers (i.e., false negatives), while the other 7 examples are semantic understanding errors similar to those discussed previously. These results suggest that our ERR graph fusion approach is providing the model with useful targets for attention which allow it to produce more correct answers.
How does the length of the document affect performance? We compare the performance of LongT5 ERR to the replicated FiD model on various document lengths, as depicted in Figure 4.
LongT5 ERR performs less competitively than FiD on the Easy split for longer documents. This could be attributed to a high frequency of false negatives from LongT5 ERR, as discussed previously. Additionally, it could be that LongT5 ERR is less efficient at string matching on longer documents than FiD: most of the question times in the Easy split are explicitly mentioned in the context document, so such questions can be solved via string matching rather than temporal reasoning. However, our LongT5 ERR model shows a substantial improvement on TimeQA Hard, outperforming the FiD model across most lengths.

Conclusion
In this paper, we compared different methods for fusing temporal graphs into Transformer models for time-sensitive question answering. We found that our ERR method, which fuses the temporal graph into the Transformer model's input, outperforms GNN-based fusion as well as attention-based fusion models. We also showed that, unlike prior work on temporal graph fusion, our approach is compatible with in-context learning and yields improvements when applied to large language models such as ChatGPT. Our work establishes a promising research direction on fusing structured graphs with the inputs of Transformer models. In future work, we intend to use better-performing information extraction systems to construct our temporal graphs, enhance our approach by adding entities and entity type information to the graphs, and extend our method to spatial reasoning.

Limitations
We use CAEVO and SUTime to construct temporal graphs because of their scalability and availability.
Using more accurate neural network-based temporal information extraction tools may provide better temporal graphs but may require domain adaptation and retraining. While we did not find graph convolutions to yield successful fusion on TimeQA, we mainly explored variations of such methods proposed by prior work. We also did not explore self-attention based fusion methods, as preliminary experiments with those methods yielded no gains, and DocTime provided an instance of such methods for comparison. But there may exist other variations of graph convolution and self-attention based fusion methods beyond those used in prior work that would make such methods more competitive with our input-based approach to fusion.
We also did not deeply explore the places where graph neural networks failed. For example, the graph convolution over the final-layer contextualized embeddings using the full temporal graph yielded performance much lower than all other graph convolution variants we explored. We limited our exploration of failures to the more successful explicit edge representation models.

A.9 ERR with GPT-4

We also tested our ERR method with the GPT-4 model. However, due to the costs of calling the commercial GPT-4 API, we only randomly sampled 100 examples each from TimeQA Easy and TimeQA Hard, resulting in a total of 200 samples for the experiment. The results are shown in Table 12. The ERR method achieves similar performance improvements with GPT-4 as it does with ChatGPT (gpt-3.5-turbo).

A.10 Additional Error Analysis
Table 13 presents error examples, in addition to the examples shown in Table 3.

A.11 Analysis of the GNN Fusion Method
Table 14 shows examples from TimeQA Hard where LongT5 ERR is correct but LongT5 GNN is incorrect.

A.12 Other Experimented Methods
The following are some of the methods we tried that yielded lower performance than our main methods reported in Table 6.

1. We tried using coreference resolution tools to process the context documents and then constructing and fusing temporal graphs based on the processed text, but we found that the coreference resolution preprocessing hurt the performance of the models.
2. We tried fusing the constructed temporal graphs into FiD models using relation-aware attention (Wang et al., 2020), but we found that the fused FiD models performed almost the same as the non-fused FiD models.
3. Our preliminary study found that the performance of the models can be significantly improved if only the most relevant paragraph in the context document is used as the context.

Figure 1: Time-sensitive question-answer pairs with a context document. The times and answers are in red and blue, respectively.

Figure 4: Comparison of the performance on different lengths of context documents.

Table 1: Comparison of methods, base models' size, external tools, additional data, and temporal knowledge fusion methods. NaturalQ: Natural Questions, TDGP: Temporal Dependency Graph Parser, TT: BERT-based Temporal Tagger, TR: timestamps retriever.

Table 2: Performance on the test sets. TimeQA HP denotes the human-paraphrased splits of TimeQA. The highest-performing model is bolded. Confidence intervals for our results are provided in Table 7 in Appendix A.3.

Table 3: Error categories and examples on the TimeQA Easy development set. Model predictions are in bold. Underlines denote the correct answers.

Table 4: Performance of ChatGPT on 500 sampled test examples, with and without our ERR method.


Table 5: Comparison of the predictions of LongT5 ERR vs. the predictions of FiD.

Table 8: Statistics of the constructed graphs. HP = Human-Paraphrased.

Table 9: Statistics for the four datasets. Avg. # Tokens is the average number of tokens in the context document. We use the LongT5 model's tokenizer to tokenize context documents. * The length is based on the retrieved top 100 paragraphs from Wikipedia, which serve as a context document for each question. If the total length of a context document exceeds 10,000 tokens, we truncate the context accordingly.