Mitigating Temporal Misalignment by Discarding Outdated Facts

While large language models are able to retain vast amounts of world knowledge seen during pretraining, such knowledge is prone to going out of date and is nontrivial to update. Furthermore, these models are often used under temporal misalignment, tasked with answering questions about the present, despite having only been trained on data collected in the past. To mitigate the effects of temporal misalignment, we propose fact duration prediction: the task of predicting how long a given fact will remain true. In our experiments, we demonstrate that identifying which facts are prone to rapid change can help models avoid reciting outdated information and determine which predictions require seeking out up-to-date knowledge sources. We also show how modeling fact duration improves calibration for knowledge-intensive tasks, such as open-retrieval question answering, under temporal misalignment, by discarding volatile facts. Our data and code are released publicly at https://github.com/mikejqzhang/mitigating_misalignment.


Introduction
A core challenge in deploying NLP systems lies in managing temporal misalignment, where a model that is trained on data collected in the past is evaluated on data from the present (Lazaridou et al., 2021). Temporal misalignment causes performance degradation in a variety of NLP tasks (Luu et al., 2021; Dhingra et al., 2022; Zhang and Choi, 2021). This is particularly true for knowledge-intensive tasks, such as open-retrieval question answering (QA) (Chen et al., 2017), where models must make predictions based on world knowledge which can rapidly change. Furthermore, such issues are only exacerbated as the paradigm for creating NLP systems continues to shift toward relying on large pretrained models (Zhang et al., 2022; Chowdhery et al., 2022) that are prohibitively expensive to retrain and prone to reciting outdated facts.

[Figure 1 appears here. It depicts two questions asked at t_q = 2021 and answered by a QA system trained at t_M = 2018 with evidence retrieved from t_E = 2018: "What's the tallest building in the world of all time?" (answered "The Burj Khalifa is the tallest building…") and "Who sang the American Anthem at the Super Bowl?" (answered "Pink sang the American…"). In the example on the left, the temporal misalignment between when the system was trained and evaluated has no effect on the answer. On the right, the answer has changed, causing the system to output an outdated answer with high confidence. To account for this, we apply our fact duration prediction system to adjust the system's confidence accordingly.]
Prior work has attempted to address these issues by updating the knowledge stored within the parameters of an existing pretrained model (Cao et al., 2021; Mitchell et al., 2022; Onoe et al., 2023). Another line of work has proposed using retrieval-based systems, which utilize a nonparametric corpus of facts that can be updated over time (Karpukhin et al., 2020; Guu et al., 2020; Lewis et al., 2021). Both methods, however, are incomplete solutions, as they rely on an oracle to identify which facts need to be updated and to continuously curate a corpus of up-to-date facts.
Given the difficulty of keeping existing models up to date, we propose an alternative solution where we abstain from presenting facts that we predict are out of date. To accomplish this, we introduce fact duration prediction, the task of predicting how frequently a given fact changes, and establish several classification- and regression-based baselines. We also explore large-scale sources of distant supervision for our task, including fact durations extracted from temporal knowledge bases (Chen et al., 2021b) and duration-related news text (Yang et al., 2020). We provide a rich discussion of this challenging task, exploring the relationship between fact duration prediction and temporal commonsense (Zhou et al., 2020).
We provide two sets of evaluations for our fact duration prediction systems. First, as intrinsic evaluations, we report how close our systems' duration estimates are to ground-truth labels. We find that models trained with only distant supervision can predict the duration of 65% of temporally dependent facts from real search queries in NaturalQuestions (Kwiatkowski et al., 2019) to within 3 years, compared to 11% for a simple average-duration baseline. Second, in extrinsic evaluations, we measure how systems' duration estimates can improve an open-retrieval QA system's predictions under temporal misalignment. We mainly focus on improving calibration (as depicted in Figure 1). Our approach can reduce expected calibration error by 50-60% over using system confidence alone for two QA systems (Roberts et al., 2020; Karpukhin et al., 2020) on the SituatedQA dataset (Zhang and Choi, 2021).
Lastly, we also explore other ways of applying our fact duration systems in QA. We experiment with adaptive inference in ensembled open-/closed-book QA systems, using duration prediction to decide when retrieval is necessary due to temporal misalignment. We also apply fact duration prediction in a scenario where retrieval is performed over a heterogeneous corpus containing both outdated and recent articles, and systems must weigh the relevance of an article against its recency. In sum, we present the first focused study on mitigating temporal misalignment in QA through estimating the duration of facts.

Settings
We aim to address temporal misalignment (Luu et al., 2021) in knowledge-intensive tasks, such as open-retrieval QA. Figure 1 illustrates our setting. We assume a QA model that is developed in the past and is evaluated on a query from a later date. This system suffers from temporal misalignment, returning outdated answers for some questions whose answer has changed in the meantime. Table 1 reports the QA performance of existing systems on SituatedQA (Zhang and Choi, 2021), a subset of questions from NQ-Open (Kwiatkowski et al., 2019; Lee et al., 2019) that has been reannotated with the correct answer as of 2021. In this dataset, 48% of questions are updated within the temporal gap (2018 to 2021). We can see that current models, without considering temporal misalignment, experience performance degradation in both answer accuracy (EM) and calibration. In this table, we also explore using an oracle that identifies which answers have changed and zeroes the QA system's confidence in such predictions. While this does not change the system's accuracy, it helps identify incorrect predictions, improving calibration metrics across the board. In real-world scenarios, however, we do not know which facts are outdated. Thus, in this work we build a fact duration model that predicts which facts are likely outdated and use it to adjust the confidence of the QA model. We introduce our fact duration prediction and QA settings in detail below.

Fact Duration Prediction
We define the fact duration prediction task as follows: given a fact f, systems must predict its duration d, the amount of time that the fact remained true. We consider datasets that represent facts in a variety of formats: QA pairs, statements, and knowledge-base relations. For modeling purposes, we convert all facts to statements. For example, the fact f = "The last Summer Olympic Games were held in Athens." has a duration of d = 4 years.
Error Metrics We evaluate fact duration systems by measuring error against the gold reference duration: Year MAE is the mean absolute error of the predictions in years, and Log-Sec MSE is the mean squared error in log-seconds.
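As an illustration, here is a minimal sketch of how these two error metrics can be computed (our own helper, not the released evaluation code):

```python
import numpy as np

SECONDS_PER_YEAR = 365.25 * 24 * 3600

def duration_errors(pred_years, gold_years):
    """Year MAE and Log-Sec MSE for durations expressed in years."""
    pred = np.asarray(pred_years, dtype=float)
    gold = np.asarray(gold_years, dtype=float)
    year_mae = np.mean(np.abs(pred - gold))
    log_sec_mse = np.mean(
        (np.log(pred * SECONDS_PER_YEAR) - np.log(gold * SECONDS_PER_YEAR)) ** 2
    )
    return year_mae, log_sec_mse

# e.g., predicting 2 years for a fact that actually held for 4 years
print(duration_errors([2.0], [4.0]))  # (2.0, log(2)^2 ≈ 0.48)
```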

QA under Temporal Misalignment
The open-retrieval QA task is defined as follows: given a question q, a system must produce the corresponding answer a, possibly relying on retrieved knowledge from an evidence corpus E. When taking temporal misalignment into consideration, several critical timestamps can affect performance:
• Model Training Date (t_M): when the training data for M was collected or annotated.
• Evidence Date (t_E): when E was authored.
• Query Date (t_q): when q was asked.
For studying QA under temporal misalignment, we further specify that systems must produce the answer that is appropriate at the time of the query, a_{t_q}. For example, the question q = "Where are the next Summer Olympics?" asked at t_q = 2006 has answer a_2006 = "Beijing". We define the magnitude of the temporal misalignment (m) to be the amount of time between a model's training date and the query date (m = t_q − t_M). We compare this with the duration d of the fact f = (q, a_{t_q}) being asked about. If m > d, we should lower the model's confidence on this question.
For simplicity, we do not take an answer's start date into account. Ideally, determining whether a given QA pair (q, a) has gone out of date should also consider the answer's start date (t_s) in addition to the model's training date (t_M), and confidence should be lowered if t_s + d < t_M + m (i.e., if the answer expired before the query date). While we expect this approximation to have less of an impact in settings where the misalignment period is small relative to the distribution of durations, we perform error analysis on examples where this simplification hurts performance in Appendix C.
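To make the comparison above concrete, here is a small sketch of the misalignment check, including the optional start-date refinement (dates and helper names are our own):

```python
from datetime import date
from typing import Optional

def is_likely_outdated(t_M: date, t_q: date, duration_years: float,
                       t_s: Optional[date] = None) -> bool:
    """Return True if the fact behind (q, a) has likely changed by query time.

    Basic check: the misalignment m = t_q - t_M exceeds the fact's duration d.
    If the answer's start date t_s is known: check whether t_s + d < t_q.
    """
    years = lambda delta: delta.days / 365.25
    if t_s is not None:
        return years(t_q - t_s) > duration_years   # answer expired before the query
    m = years(t_q - t_M)                            # magnitude of temporal misalignment
    return m > duration_years

# A fact expected to hold ~1 year, asked 3 years after training: likely outdated.
print(is_likely_outdated(date(2018, 1, 1), date(2021, 1, 1), duration_years=1.0))  # True
```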
Calibration Metrics Even without temporal misalignment, models will not always know the correct answer. Well-calibrated model predictions, however, allow us to identify low-confidence predictions and avoid presenting users with incorrect information (Kamath et al., 2020). Under temporal misalignment, calibration further requires identifying which predictions should receive reduced confidence because the answer has likely changed. We consider the following calibration metrics:
• AUROC: Area under the ROC curve evaluates a calibration system's performance at classifying correct and incorrect predictions over all possible confidence thresholds (Tran et al., 2022).
• Expected Calibration Error (ECE): Computed by ordering predictions by estimated confidence and partitioning them into 10 equally sized buckets. ECE is then the macro-averaged absolute difference between each bucket's average confidence and accuracy (see the sketch following this list).
• Risk Control (RC@XX): Uncertainty estimates are often used for selective prediction, where models withhold low-confidence predictions below some threshold (< τ), where τ is set to achieve a target accuracy (XX%) on some evaluation set. We measure how well τ generalizes to a new dataset (Angelopoulos et al., 2022). To compute RC@XX, we set τ based on predictions from t_M, then compute the accuracy on predictions from t_q with confidence ≥ τ. In the ideal case, the difference (|∆|) between RC@XX and XX should be zero.
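The following sketch illustrates ECE with equally sized buckets and the selection of τ for RC@XX, under our own simplifying assumptions (e.g., a simple scan over candidate thresholds):

```python
import numpy as np

def expected_calibration_error(conf, correct, n_buckets=10):
    """Order predictions by confidence, split into equally sized buckets, and
    macro-average |bucket confidence - bucket accuracy|."""
    conf, correct = np.asarray(conf, float), np.asarray(correct, float)
    order = np.argsort(conf)
    buckets = np.array_split(order, n_buckets)
    return float(np.mean([abs(conf[b].mean() - correct[b].mean())
                          for b in buckets if len(b)]))

def risk_control_threshold(conf_dev, correct_dev, target_acc=0.90):
    """Smallest tau such that predictions with confidence >= tau reach the
    target accuracy on the t_M development set."""
    conf_dev, correct_dev = np.asarray(conf_dev, float), np.asarray(correct_dev, float)
    for tau in np.sort(np.unique(conf_dev)):
        keep = conf_dev >= tau
        if correct_dev[keep].mean() >= target_acc:
            return float(tau)
    return float("inf")  # target accuracy unreachable on this set
```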

Data
We first describe the datasets used for evaluation, split by task. We then describe our two large-scale sources of distant supervision. Appendix B contains further preprocessing details and examples.

Evaluation Datasets
QA under Misalignment Our primary evaluations are on SituatedQA (Zhang and Choi, 2021), a dataset of questions from NQ-Open (Kwiatkowski et al., 2019) with temporally or geographically dependent answers. We use the temporally dependent subset, where each question has been annotated with a brief timeline of answers that includes the correct answer as of 2021, the prior answer, and the dates when each answer started to be true. We evaluate misalignment between t_M = 2018 and t_q = 2021, using the answers from NQ-Open as a_2018 and the answers from SituatedQA as a_2021.
While several recent works have proposed new datasets for studying temporal shifts in QA (Kasai et al., 2022; Liška et al., 2022), these works focus on questions about new events, where answers do not necessarily change (e.g., "How much was the deal between Elon Musk and Twitter worth?"). We do not study such shifts in the input distribution over time. We instead study methods for managing the shift in the output distribution (i.e., answers changing over time). Adjusting model confidence due to changes in the input distribution has been explored (Kamath et al., 2020); however, to the best of our knowledge, this is the first work on calibrating over shifts in the output distribution in QA.
Fact Duration Following the QA evaluations above, we also evaluate fact duration prediction on SituatedQA. To generate fact-duration pairs, we use the annotated previous answer as of 2021, converting the question/answer pair into a statement using an existing T5-based (Raffel et al., 2020) conversion model (Chen et al., 2021a). We then use the distance between the start dates of the 2021 answer and the previous answer as the fact's duration, d.
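For instance, a duration label can be derived from a timeline of answer start dates as in the following sketch (the field names are illustrative, not the dataset's actual schema; the fact matches the running example from above):

```python
# Illustrative timeline for the running Olympics example (hypothetical field names).
timeline = [
    {"answer": "Athens", "start_year": 2004},
    {"answer": "Beijing", "start_year": 2008},
]

# The earlier answer's duration is the gap until the next answer took over.
prev, curr = timeline[0], timeline[1]
duration_years = curr["start_year"] - prev["start_year"]   # -> 4 years
statement = "The last Summer Olympic Games were held in Athens."
print(statement, duration_years)
```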
Temporal Commonsense Temporal commonsense focuses on inferences about generic events (e.g., identifying that glaciers move over centuries and that a college tour lasts hours). In contrast, fact duration prediction requires making inferences about specific entities. For instance, determining the duration of an answer to a question like "Who does LeBron James play for?" requires entity knowledge to determine that LeBron James is a basketball player and commonsense knowledge to determine that basketball players often change teams every few years. Previous work (Onoe et al., 2021) has demonstrated the non-trivial nature of combining entity-specific and commonsense knowledge.
Due to the differences described above, we do not use temporal commonsense datasets for evaluating fact duration prediction. We do, however, still evaluate on them to explore how these tasks compare. In particular, we evaluate our fact duration systems on the event duration subset of MCTACO (Zhou et al., 2019). Each MCTACO example consists of a multiple-choice question about the duration of some event in a context sentence, which we convert into duration statements. We evaluate using the metrics proposed by the original authors. Following Yang et al. (2020), we select all multiple-choice options whose duration falls within a tuned threshold of the predicted duration. EM measures accuracy, evaluating whether the gold and predicted answer sets exactly match. F1 measures the average F1 between the gold and predicted answer sets.

Distant Supervision Sources
Temporal Knowledge Bases have been used in numerous prior works for studying how facts change over time. TimeQA (Chen et al., 2021b) is one such work that curates a dataset of 70 different temporally dependent relations from Wikidata and uses handcrafted templates to convert them into decontextualized QA pairs, where the question specifies a time period. To convert this dataset into fact-duration pairs (f, d), we first convert their QA pairs into factual statements by removing the date and using a QA-to-statement conversion model (Chen et al., 2021a). We then determine the duration of each fact to be the length of time between the start date of one answer to the question and the next.
News Text contains a vast array of facts and rich temporal information. The Time-Aware Pretraining dataset (TimePre) (Yang et al., 2020) curates such texts from CNN and Daily Mail news articles, using regular expressions to match duration-specifying phrases (e.g., "Crystal Palace goalkeeper Julian Speroni has ended the uncertainty over his future by signing a new 12 month contract to stay at Selhurst Park."). Pretraining on this dataset has previously been shown to improve performance on temporal commonsense tasks.

Dataset Summary
Table 2 reports data statistics and Figure 2 presents the distribution of durations for each dataset. While most facts in SituatedQA and TimeQA change over the course of months to decades, facts in MCTACO and TimePre cover a wider range.

Modeling Fact Duration
Here, we describe our fact duration prediction systems. We include two simple lower-bound baselines: Random samples a duration, and Average uses the average duration from each dataset.
Following prior work on temporal commonsense reasoning (Yang et al., 2020), we develop BERT-based (Devlin et al., 2018) models. We frame fact duration prediction as cloze questions to more closely match the system's pretraining objective (Schick and Schütze, 2020). To this end, we append ", lasting [MASK][MASK]" onto each fact, eliciting the model to fill in the masked tokens with a duration. We use two mask tokens because duration information typically requires at least two tokens, one for the value and another for the unit. For our TimePre and MCTACO datasets, we similarly replace the target durations with two mask tokens. Table 8 in Appendix B contains examples. Predictions are made by averaging the encoded representations of the two "[MASK]" tokens, then using this representation as input to a single-hidden-layer network. Using this same representation, we train two models with the regression-based and classification-based learning objectives described below.
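Before turning to the two objectives, here is a minimal sketch of the cloze-style input and mask pooling described above, written with Hugging Face Transformers (our own sketch, not the authors' released code):

```python
import torch
from transformers import BertModel, BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
encoder = BertModel.from_pretrained("bert-base-uncased")

fact = "The last Summer Olympic Games were held in Athens"
inputs = tokenizer(fact + ", lasting [MASK] [MASK].", return_tensors="pt")

with torch.no_grad():
    hidden = encoder(**inputs).last_hidden_state              # (1, seq_len, 768)

# Average the encoded representations of the two [MASK] tokens.
mask_idx = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero(as_tuple=True)
pooled = hidden[mask_idx].mean(dim=0)                          # (768,)
# `pooled` is fed to a single-hidden-layer head (see the sketch below).
```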
Classification Model frames the task as 13-way classification, where each class corresponds to a duration in Figure 2. We train using cross-entropy loss, selecting the closest duration class as the pseudo-gold label for each fact. Because this model can only predict a limited set of durations, we report its upper bound by always selecting the class closest to the gold duration.
Regression Model uses the mean squared error loss in units of log seconds, where the output from the hidden layer predicts a scalar value.
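Under the same assumptions, the two output heads might look like the following sketch, with 13 classes for the classification variant (the duration classes of Figure 2) and a log-second scalar for the regression variant:

```python
import torch.nn as nn

class DurationHead(nn.Module):
    """Single-hidden-layer head over the pooled [MASK] representation."""
    def __init__(self, hidden_size=768, n_classes=13, mode="classification"):
        super().__init__()
        out_dim = n_classes if mode == "classification" else 1
        self.net = nn.Sequential(
            nn.Linear(hidden_size, hidden_size), nn.Tanh(),
            nn.Linear(hidden_size, out_dim),
        )
        # classification: cross-entropy against the nearest duration class;
        # regression: MSE against the gold duration in log-seconds.
        self.loss_fn = nn.CrossEntropyLoss() if mode == "classification" else nn.MSELoss()

    def forward(self, pooled):
        return self.net(pooled)
```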

Results
We experiment with training on TimeQA and TimePre individually and on the union of both datasets. Figure 3 reports duration prediction and temporal commonsense performance. Overall, we find that our trained systems outperform the simple random and average baselines on SituatedQA. This is indicative of strong generalizability from our distantly supervised fact-duration systems, even when the baselines benefit from access to the gold label distribution. We also provide a histogram of errors from our systems in Figure 3, where we can see that over 60% of our classification-based system's predictions are within 3 years of the gold duration, while predicting the exact duration remains challenging. Below, we reflect on the impact of our different modeling choices and research questions.
Regression vs. Classification Regression-based models tend to outperform their classification-based counterparts. The instances where this is not true can be attributed to an insufficient amount of training data. In Figure 3, we can see the different types of errors each model makes. The classification system predicts the duration to within 1 year more frequently, but the regression system predicts the duration to within 4 years more frequently.
Supervision from KB vs. News Text We find that training on temporal knowledge-base relations (TimeQA) alone vastly outperforms training on news text (TimePre) alone for fact-duration prediction; however, the opposite is true when comparing performance on temporal commonsense (MCTACO). Training on both datasets tends to improve our regression-based system, but yields mixed results for our classification-based system. We hypothesize that the closeness in label distribution (see Figure 2) between the training and evaluation sets significantly impacts performance.

Fact Duration vs. Temporal Commonsense
While fact duration prediction and temporal commonsense are conceptually related, we find that strong performance on either task does not necessarily transfer to the other. As discussed above, this can be attributed to differences in label distributions; however, the label distribution also serves as a proxy for the type of information being queried in either task. Commonsense knowledge primarily differentiates events that take place over different orders of magnitude of time (e.g., seconds versus years). Differentiating whether an event takes place over finer-grained ranges (e.g., one versus two years), however, cannot be resolved with commonsense knowledge alone and further requires fact retrieval. We find that NQ contains queries for facts that change over a smaller range of durations (between 1-10 years), and, therefore, commonsense knowledge alone is insufficient.

Calibrating QA under Temporal Misalignment
Here, we return to our motivating use case of using fact duration prediction to calibrate open-retrieval QA systems under temporal misalignment.
We assume access to a base calibration system c(q, a) ∈ [0, 1] that has the same training date as the QA system it is calibrating. We then use fact duration to generate a misalignment-aware confidence score c_m through simple post-hoc augmentation, scaling down the original confidence score by a discount factor based on the degree of misalignment and the predicted fact duration. We compute this factor differently for each of our fact duration systems.
• Classification: Here, the system's output is a probability distribution over duration classes, P(d | q, a). We set the discount factor to the probability that the fact's duration exceeds the misalignment (i.e., one minus the CDF of this distribution evaluated at m): c_m = c(q, a) · Σ_{d>m} P(d | q, a).
• Regression: Here, the output is a single predicted duration d. We set the discount factor to the binary value indicating whether the predicted duration exceeds the misalignment period: c_m = c(q, a) · 1{d > m}.
Because classification systems predict a distribution over fact durations, we are able to use this distribution to make granular adjustments to confidence over time. In contrast, our regression systems predict a single value for the fact duration, so confidence adjustments over time are abrupt, either leaving confidence unchanged or setting it to zero.
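As a minimal sketch of this post-hoc adjustment (the duration classes in years are illustrative placeholders for those in Figure 2):

```python
import numpy as np

# Illustrative duration classes in years (stand-ins for the 13 classes of Figure 2).
DURATION_CLASSES_YEARS = np.array(
    [0.01, 0.1, 0.5, 1, 2, 3, 4, 5, 10, 20, 50, 100, 200])

def adjust_conf_classification(conf, class_probs, m_years):
    """Scale confidence by the probability that the fact's duration exceeds m."""
    p_still_valid = class_probs[DURATION_CLASSES_YEARS > m_years].sum()
    return conf * p_still_valid

def adjust_conf_regression(conf, pred_duration_years, m_years):
    """Keep confidence only if the predicted duration exceeds m; otherwise zero it."""
    return conf if pred_duration_years > m_years else 0.0
```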

Models
Base QA and Calibration Systems We experiment with three QA systems throughout our study:
• ① T5: We use T5-large (Roberts et al., 2020), which has been trained with salient-span masking and closed-book QA on NQ-Open.
• ② DPR (t_e = 2018): We use DPR (Karpukhin et al., 2020), an open-book system which retrieves passages from a t_e = 2018 Wikipedia snapshot and is also trained on NQ-Open.
• Ⓝ DPR (t_e = 2021): We use the same model as ②, but swap the retrieval corpus with an updated Wikipedia snapshot that matches the query timestamp (t_e = 2021), following Zhang and Choi (2021), which showed partial success in returning up-to-date answers.
For each QA system, we train a calibrator that predicts the correctness of the QA system's answer. We follow Zhang et al. (2021) for the design of and input features to the calibrator, using the model's predicted likelihood and encoded representations of the input (details in Appendix A).
Fact Duration Systems For both our regression- and classification-based models, we use the systems trained on TimeQA. We also include results using an oracle fact duration system, which zeroes the confidence for all questions whose answers have been updated since the training date.

Results
Table 3 reports the results of our calibration experiments. Both QA models suffer from temporal degradation, and zeroing out the confidence of outdated facts with oracle information improves calibration performance. Using model-predicted durations shows similar gains. Both regression and classification duration predictions lower the confidence of the models, improving calibration metrics across the board. We find that our classification-based model consistently outperforms our regression-based model on our calibration task, despite the opposite being true in our fact-duration evaluations. We attribute this behavior to our classification-based system's error distribution, as it gets more examples correct to within 1 year (Figure 3). The classification-based system also hedges over different duration classes by predicting a distribution, which we use to compute the CDF.
Retrieval-Based QA: Update or Adjust In Table 3, we compare the performance of DPR with static and hot-swapped retrieval corpora from t_e = 2018 and t_e = 2021. While updating the retrieval corpus improves EM accuracy, adjusting model confidence using fact duration on a static corpus performs better on all calibration metrics. This suggests that, when users care about having accurate confidence estimates or seeing only high-confidence predictions, confidence adjustments can be more beneficial than swapping the retrieval corpus.
Ablations: Per-Example vs. Uniform Adjustment We compare our system, which adjusts confidence on a per-example basis, against one that uniformly decreases confidence by the same value υ across the entire test set: c_m = max(c(q, a) − υ, 0). These ablations still depend on our fact-duration systems to determine the υ such that the total confidence over the entire test set is the same for both methods. Table 4 reports the results of this ablation study.
[Table 4: We first adjust confidence Per-Example, which is our full system. We then adjust confidence Uniformly across all examples, such that the net decrease in confidence across the entire test set is equivalent.]
We find that uniformly adjusting confidence improves ECE, which is expected given the decrease in the QA system's EM accuracy after misalignment. We find, however, that our per-example adjustment methods outperform uniform confidence adjustments.
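A small sketch of how υ can be chosen for the uniform ablation so that the total confidence reduction matches the per-example method (a simple grid search; our own assumption about how the matching is done):

```python
import numpy as np

def uniform_upsilon(conf, conf_per_example):
    """Find upsilon so that sum(max(conf - upsilon, 0)) matches the per-example total."""
    conf = np.asarray(conf, float)
    target = float(np.sum(conf_per_example))       # total confidence after per-example adjustment
    grid = np.linspace(0.0, 1.0, 10001)
    totals = np.array([np.maximum(conf - u, 0.0).sum() for u in grid])
    return float(grid[np.argmin(np.abs(totals - target))])

def adjust_uniformly(conf, upsilon):
    return np.maximum(np.asarray(conf, float) - upsilon, 0.0)
```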

Comparisons against Prompted LLMs
While we primarily focus on calibrating fine-tuned QA models, recent work has also explored calibrating prompted large language models (LLMs) for QA (Si et al., 2022; Kadavath et al., 2022; Cole et al., 2023). Furthermore, recent general-purpose LLMs (e.g., ChatGPT) have demonstrated the ability to abstain from answering questions on the basis that their knowledge cutoff date is too far in the past; however, it is not publicly known how these systems exhibit this behavior.
In this experiment, we investigate one such system, GPT-4 (OpenAI, 2023), and its ability to abstain from answering questions with rapidly changing answers. We prompt GPT-4 to answer questions from SituatedQA and find that it abstains from answering 86% of all questions (95% of questions whose answers have been updated between 2018 and 2021, and 79% of those that have not). Overall, this behavior suggests that GPT-4 overestimates how frequently it must abstain from answering user queries. Furthermore, GPT-4 does not indicate how frequently the answer is expected to change. Collectively, GPT-4's tendency to over-abstain and its lack of transparency limit its usefulness to users. In contrast, our approach provides users with a duration estimate indicating why a prediction may not be trustworthy. Further experimental details and example outputs are reported in Appendix D.

Beyond Calibration: Adaptive Inference
In this section, we explore using our misalignment-aware confidence scores to decide how to answer a question. Below, we motivate and describe two adaptive inference scenarios where systems choose between two methods for answering a question using their fact duration predictions.
Hybrid: Closed + Open (① + Ⓝ): Besides the computational benefits of not always having to use retrieval, forgoing retrieval for popular questions can also improve answer accuracy (Mallen et al., 2022). We use our fact duration predictions to decide when retrieval is necessary: we first predict an answer using T5 and run our fact duration prediction system on this answer. We then use the CDF of the predicted duration distribution to determine whether it is at least 50% likely that the fact has changed: Σ_{d≤m} P(d | q, a) ≥ 0.5. If so, we run retrieval with DPR using the updated corpus (t_e = 2021) and present its predicted answer. We report our results in the first row of Table 5, which shows that this outperforms either system on its own, while running retrieval on less than half of all examples.
Two Corpora: Relevancy vs. Recency (② + Ⓝ): While most work on QA has focused on retrieving over Wikipedia, many questions require retrieving over other sources such as news or web text. One challenge in moving away from Wikipedia lies in managing temporal heterogeneity across different articles. Unlike Wikipedia, news and web articles are generally not maintained to stay current, requiring retrieval-based QA systems to identify out-of-date information in articles. Systems that retrieve such resources must consider the trade-off between the recency and relevancy of an article. In these experiments, we use fact duration prediction as a method for weighing this trade-off in retrieval.
Our experimental setup is as follows: instead of computing misalignment from the model's training date, we compute it relative to when the article was authored (m = 3 years for 2018 Wikipedia and m = 0 years for 2021 Wikipedia). After performing inference using both corpora, we re-rank answers according to their misalignment-adjusted confidence estimates. We report results in Table 5. We find that our method recovers performance comparable to always using up-to-date articles, while using them just under half the time.
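The two adaptive-inference rules can be sketched as follows, reusing the classification adjustment helper from the calibration section (again, a sketch under our own assumptions rather than the exact implementation):

```python
def needs_retrieval(class_probs, m_years):
    """Hybrid (1 + N): retrieve only if the fact is at least 50% likely to have changed."""
    p_changed = class_probs[DURATION_CLASSES_YEARS <= m_years].sum()
    return p_changed >= 0.5

def pick_across_corpora(candidates):
    """Two corpora (2 + N): re-rank candidate answers by misalignment-adjusted confidence.

    `candidates` holds (answer, conf, class_probs, article_age_years) tuples, where the
    article's age plays the role of m (3 years for 2018 Wikipedia, 0 for 2021 Wikipedia).
    """
    return max(candidates,
               key=lambda c: adjust_conf_classification(c[1], c[2], c[3]))[0]
```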

Related Work
Commonsense and Temporal Reasoning Recent works have proposed forecasting benchmarks (Zou et al., 2022; Jin et al., 2021a) related to our fact duration prediction task. While our task asks models to predict when a fact will change, these forecasting tasks ask how a fact will change. Qin et al. (2021) study temporal commonsense reasoning in dialogues. Quantitative reasoning has been explored in other works as quantitative relations between nouns (Forbes and Choi, 2017; Bagherinezhad et al., 2016), distributions over quantitative attributes (Elazar et al., 2019), and representing numbers in language models (Wallace et al., 2019).
Calibration Abstaining from providing a QA system's answers has been explored in several recent works. Chen et al. (2022) examine instances where knowledge conflicts exist between a model's memorized knowledge and retrieved documents. As the authors note, such instances often arise due to temporal misalignment. Prior work (Kamath et al., 2020; Zhang et al., 2021; Varshney and Baral, 2023) has explored abstaining from answering questions by predicting whether or not the test question comes from the same distribution as the QA system's training questions. While fact duration prediction also predicts a shift in distribution, it focuses on a shift in a question's output distribution of answers rather than a shift in the input distribution of questions; these two approaches therefore address orthogonal challenges in robustness to distribution shift and are complementary.
Keeping Systems Up-to-Date Several works have explored continued pretraining to address temporal misalignment in pretrained models (Dhingra et al., 2022; Jin et al., 2021b). Other works have explored editing specific facts into models (Cao et al., 2021; Mitchell et al., 2022; Meng et al., 2022). These works, however, have only focused on synthetic settings and assume access to the updated facts. Furthermore, such systems have yet to be successfully applied to new benchmarks for measuring whether language models have acquired emergent information (Onoe et al., 2022; Padmanabhan et al., 2023). Recent works on retrieval-based QA systems have found improved adaptation when updated with up-to-date retrieval corpora (Izacard et al., 2022; Lazaridou et al., 2022).

Conclusion
We improve QA calibration under temporal misalignment by introducing the fact duration prediction task, alongside several datasets and baseline systems for it. Future work may build upon this evaluation framework to further improve QA calibration under temporal misalignment. For instance, future work may examine modeling different classes of fact duration distributions, such as modeling whether a fact changes at regular, periodic intervals.

Limitations
We only evaluate temporal misalignment between 2018 and 2021, a three-year time difference, on the relatively small-scale SituatedQA dataset (N=322). This is mainly due to a lack of benchmarks that support studying temporal misalignment. Exploring this in more diverse setups, including different languages, text domains, and a wider range of temporal gaps, would be a fruitful direction for future work.
As is the case with all systems that attempt to faithfully relay world knowledge, treating model predictions as fact runs the risk of propagating misinformation. While the goal of our fact duration prediction systems is to prevent models from reciting outdated facts, they do not always succeed, and facts may change earlier than expected. Even though a given fact may be expected to change only once every decade, an improbable outcome may occur and the fact may change after only a year. In such an event, our misalignment-aware calibration system may erroneously maintain high confidence in the outdated answer. Furthermore, our system, as it stands, does not take the answer start date into account. Our system can also make errors due to changes in the typical duration of a given fact. For instance, the answer to "What's the world's tallest building?" changes more frequently over time as the rate of technological advancement increases. We provide examples of such system errors in Appendix C.
Table 7: Additional calibration results on SituatedQA, comparing different methods of adjusting model confidence for temporal misalignment using the output of our classification-based fact-duration system. All other settings are the same as in Table 3.
simply assume that all such question-answer pair instances have a duration of 1 month.

B.2 Temporal Commonsense Datasets Preprocessing
As noted above, each MCTACO example consists of a multiple-choice question about the duration of some event in a provided context sentence. During preprocessing, we use the same question conversion model as above to transform each QA pair into a statement and prepend the context sentence onto each question. We use the metrics proposed by the original authors, and we select all multiple-choice options whose duration falls within some absolute threshold of the predicted duration, measured in log-seconds (Yang et al., 2020). This threshold is selected based on development set performance.
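A small sketch of this option-selection rule, with the threshold expressed in log-seconds (the threshold value here is only an example):

```python
import math

def select_options(pred_log_sec, option_log_secs, threshold):
    """Select every option whose duration is within `threshold` log-seconds of the prediction."""
    return [i for i, opt in enumerate(option_log_secs)
            if abs(opt - pred_log_sec) <= threshold]

# A prediction of ~2 hours against options of 10 minutes, 2 hours, and 3 days.
pred = math.log(2 * 3600)
options = [math.log(10 * 60), math.log(2 * 3600), math.log(3 * 86400)]
print(select_options(pred, options, threshold=1.0))  # -> [1]
```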

C Additional Results
Different Pretrained Models for Fact Duration Prediction In Table 6, we report results using DeBERTa-v3-base (He et al., 2021) for our fact duration prediction system. We also experiment with the large variants of both BERT and DeBERTa, but do not find substantial improvements.

Adjusting Confidence with Expected Duration
In addition to adjusting confidence using the CDF of the predicted duration distribution from our classification-based system, we also experiment with using the expected duration as the discounting factor. We incorporate this by zeroing the confidence estimate if the expected duration is exceeded by the degree of misalignment: c_m = c(q, a) · 1{ Σ_d d · P(d | q, a) ≥ m }.
In Table 7, we report additional results on our calibration evaluation. We include the calibration performance of our best-performing fact duration models finetuned on SituatedQA: trained only on SituatedQA for our classification-based model, and first trained on TimeQA + TA Pretrain for our regression-based model.
Error Analysis In Table 9, we highlight sampled errors from our fact duration system and discuss their causes and impact.

D ChatGPT and GPT-4 outputs
Table 10 includes two examples of ChatGPT informing users that the answer to a given question may have changed. It, however, does not provide users with an estimate of how likely the answer is to have changed, or how often the answer is expected to change. This lack of a duration estimate results in less transparency and interpretability for users. To get results on SituatedQA, we prompt GPT-4 with the following system prompt (recommended by their documentation): "You are ChatGPT, a large language model trained by OpenAI. Answer as con-

Figure 1: We depict the critical timestamps at play in open-retrieval QA systems. In the example on the left, the temporal misalignment between when the system was trained and evaluated has no effect on the answer. On the right, the answer has changed, causing the system to output an outdated answer with high confidence. To account for this, we apply our fact duration prediction system to adjust the system's confidence accordingly.

Figure 2: Duration statistics on each dataset's development set. Columns represent the different duration classes used by our classification model, with units abbreviated as Seconds, Minutes, Days, Weeks, Months, Years, Decades, and Centuries. Cells contain the % of examples in each dataset in the column's duration class.

Figure 3: Fact Duration Prediction Results. On the left, we report our full results, with performance split by model type and training data. Performance on SituatedQA and TimeQA is given as the mean absolute error in years (Y) and the mean squared error in log-seconds (LS), the latter matching the regression system's training loss. On the right, we depict error histograms evaluated on SituatedQA, with systems trained on TimeQA.

Table 1: NQ-Open QA performance evaluated on answers from 2018 (from NQ-Open) and 2021 (from SituatedQA). Confidence estimates are taken from calibration models that have been trained for each QA system. All models are trained on 2018 answers from NQ-Open. On the bottom, we compare against an oracle system which zeroes the confidence of predictions whose answers have changed between 2018 and 2021.

Table 2: Dataset statistics for our QA misalignment calibration and duration prediction tasks. We report the number of examples used in our QA calibration experiments along with how many examples have answers that have changed/unchanged between 2018 and 2021.

Table 3: Results for calibrating QA under temporal misalignment on SituatedQA. All systems' training dates are 2018 and evaluation dates are 2021. We report each system's EM accuracy, evaluated against the answers from 2018 and 2021. We also report how much model confidence changes on average (Conf % ∆) with each adjustment method (for DPR with t_e = 2021 we compare average confidence against using t_e = 2018).