SituatedQA: Incorporating Extra-Linguistic Contexts into QA

Answers to the same question may change depending on the extra-linguistic contexts (when and where the question was asked). To study this challenge, we introduce SituatedQA, an open-retrieval QA dataset where systems must produce the correct answer to a question given the temporal or geographical context. To construct SituatedQA, we first identify such questions in existing QA datasets. We find that a significant proportion of information seeking questions have context-dependent answers (e.g. roughly 16.5% of NQ-Open). For such context-dependent questions, we then crowdsource alternative contexts and their corresponding answers. Our study shows that existing models struggle with producing answers that are frequently updated or from uncommon locations. We further quantify how existing models, which are trained on data collected in the past, fail to generalize to answering questions asked in the present, even when provided with an updated evidence corpus (a roughly 15 point drop in accuracy). Our analysis suggests that open-retrieval QA benchmarks should incorporate extra-linguistic context to stay relevant globally and in the future. Our data, code, and datasheet are available at https://situatedqa.github.io/.


Introduction
Language reflects our ever-changing world; thus, the meaning of a sentence depends on a variety of extra-linguistic contexts, including when it was stated, who stated it, and where it was stated. Figure 1 depicts examples of this, where the answer to a question varies based on the temporal and geographical contexts. In this work, we study such questions by exploring the effects of temporal and geographical contexts in open-retrieval QA.
Ideally, QA systems should provide up-to-date, localized answers to questions like those in Figure 1 that are situated in the inquirer's temporal and geographical contexts. The existing paradigm for evaluating QA systems, however, makes implicit assumptions about time and location and does not measure how well QA models adapt to new contexts. Recent work studying other NLP tasks has shown that this static paradigm misrepresents how models perform in practical settings (Osborne et al., 2014; Wang et al., 2008; Huang and Paul, 2018; Rijhwani and Preotiuc-Pietro, 2020). To address this challenge, we introduce SITUATEDQA, a dataset comprised of questions from existing open-retrieval QA datasets (Kwiatkowski et al., 2019) that have been re-annotated for their context-dependence and answers across different situations (i.e., temporal or geographical contexts).

Figure 1: Examples of questions with answers that change depending on the temporal or geographical context.

As we will see in Section 3, a significant proportion of information-seeking questions are sensitive to the two extra-linguistic contexts we study. We annotate 9K questions from four existing datasets (Kwiatkowski et al., 2019; Berant et al., 2013; Clark et al., 2020; Campos et al., 2016) with their temporal dependence and 2K questions for their geographical dependence. Annotators find that up to 30% of questions in existing datasets have answers that change over time, and rule-based heuristics using an NER tagger identify that about 5% of questions specify a location. We collect answers from different temporal and geographic contexts for such context-dependent questions (see Table 1). Using our collected data, we evaluate whether existing open-retrieval QA systems adapt to new temporal or geographical contexts by providing answers that are up-to-date or from new locations.
While it is often assumed that retrieval based systems for QA (Karpukhin et al., 2020; Guu et al., 2020) can adapt to updated facts during inference by simply updating the retrieval corpus, we find that this is not the case. State-of-the-art retrieval based systems with updated corpora are 15 percentage points less accurate on questions whose answers have been updated versus those that have remained constant since the time when their large-scale training dataset was collected. We also observe that models fail to generalize to answering questions from new locations (Shankar et al., 2017), with accuracy dropping by 10 percentage points when asked to provide the answer from a rare location versus a common one.
To support future research developing open-retrieval QA systems which situate questions within the inquirer's extra-linguistic contexts, we propose two tasks for modeling what facts change across different extra-linguistic contexts and how those facts change. We also provide fine-grained evaluations for measuring how well models adapt to new temporal and geographical contexts. We establish initial performance levels on our tasks by adapting state-of-the-art methods for open-retrieval QA (Karpukhin et al., 2020; Lewis et al., 2020a). Finally, we provide rich analysis of existing QA systems, which suggests that benchmark construction should incorporate extra-linguistic contexts to remain relevant globally and in the future.

Defining Extra-Linguistic Contexts
We begin by defining the scope of contexts studied in this work. For a given question q, we say that a i is its answer when asked in the context c i . Each context consists of a type c t i and a value c v i . We study two context types: temporal (TEMP) and geographical (GEO). TEMP defines each context value as a timestamp (e.g., a date or year), where a i is the answer to q if it was asked at the time of c v i . GEO defines each context value as a geopolitical entity, where a i is the answer to q in the location c v i . See Table 1 for examples of each context type. We limit our study to valid questions, ignoring presupposition cases (Kim et al., 2021) (e.g., asking who is the CEO of Google before Google was founded).

Situated Question Answering
Given a question q and context c i , the task is to produce the corresponding answer a i for the provided context. This task requires models to situate the question within the provided extra-linguistic context to produce an appropriate answer. Models are evaluated on exact string match between the annotated answer a i and the predicted answer â i after minor normalization (Rajpurkar et al., 2016).
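The normalization referenced above follows the SQuAD-style evaluation convention (lowercasing, dropping punctuation and articles, collapsing whitespace). Below is a minimal sketch; the `exact_match` helper, which credits a prediction that matches any annotated answer after normalization, is our own naming:

```python
import re
import string

def normalize_answer(s: str) -> str:
    """Lowercase, strip punctuation and articles, collapse whitespace
    (SQuAD-style normalization; Rajpurkar et al., 2016)."""
    s = s.lower()
    s = "".join(ch for ch in s if ch not in set(string.punctuation))
    s = re.sub(r"\b(a|an|the)\b", " ", s)
    return " ".join(s.split())

def exact_match(prediction: str, gold_answers) -> bool:
    """A prediction is correct if it matches any annotated answer."""
    return any(normalize_answer(prediction) == normalize_answer(g)
               for g in gold_answers)
```

For example, "The New England Patriots!" and "new england patriots" are treated as the same answer, while partial strings like "Patriots" are not.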
By evaluating on different sets of extra-linguistic contexts of the same question, we can measure whether models are able to generalize to new contexts. For instance, we can evaluate how models perform on commonly versus rarely asked about locations, or on questions whose answers changed recently or long ago. Furthermore, this task allows us to train systems that explicitly model how facts change across different extra-linguistic contexts.

Context-Dependent Question Identification
In context-dependent question identification, we evaluate whether models can determine whether the answer to a given question depends on its extra-linguistic context. More formally, given a question q and context type c t , models must determine whether there exist two distinct context values, (c v i , c v j ), with different respective answers, a i ≠ a j . We cast this as binary classification and evaluate models on their classification accuracy, F1, precision, and recall.
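The definition above can be checked mechanically once a question's answers are known per context value; a minimal sketch, assuming answers are given as a mapping from context value to answer string (the function name and the light answer normalization are ours):

```python
def is_context_dependent(answers_by_context: dict) -> bool:
    """True iff there exist two context values c_i, c_j whose
    answers differ (after trivial normalization)."""
    distinct = {a.strip().lower() for a in answers_by_context.values()}
    return len(distinct) > 1

# Illustrative examples (ours): the first answer changes over time,
# the second does not.
temporal = {"2015": "Barack Obama", "2018": "Donald Trump"}
static = {"2015": "William Shakespeare", "2018": "william shakespeare"}
```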
Identifying which facts change depending on the extra-linguistic context is nontrivial even for human annotators, as it often requires extensive background knowledge on a subject. For example, determining whether the capital of Kazakhstan has changed requires specific knowledge of the nation's history. Identifying these questions has many practical applications, such as identifying suitable questions for QA systems that can only provide static answers, an idea we will explore later in Section 6.

Data Collection
Here we describe the process for identifying context-dependent questions and collecting answers from alternate contexts. We split up data collection into three stages which are depicted in Figure 2: identification, {Context / Answer} collection, and validation. Additional data collection details with interface screenshots can be found in Appendix A.

Identification
We source questions from a variety of datasets for open-retrieval QA: Natural Questions (NQ-Open) (Kwiatkowski et al., 2019), WebQuestions (Berant et al., 2013), TyDi-QA (Clark et al., 2020), and MS MARCO (Campos et al., 2016). All of these datasets are in English and contain questions that are answerable by Wikipedia documents.
Temporally dependent questions are abundant in existing datasets, as they are unambiguous during annotation: annotators simply provide the current answer at the time of collection. In contrast, geographically dependent questions are often rejected during annotation due to their ambiguity, as there is no consensus among annotators on the assumed geographical context. To correct for this scarcity, we generate geographically dependent questions by modifying existing questions from NQ-Open. We identify questions with phrases that specify a location by running an NER tagger, and remove the location entity using heuristics based on its syntactic role as identified by a dependency parser (Dozat and Manning, 2017). See Appendix A for full details.
For each question, we collect 3-way yes / maybe / no annotations for whether it is temporally or geographically dependent. We label questions as context-dependent if at least two of the three annotators label them so. We discard questions with split annotations or majority maybe annotations, and map the remaining labels to a binary yes/no label.
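The aggregation step described above amounts to a small voting function; a sketch under the stated scheme (three votes per question; discard splits and majority-maybe), with names of our own choosing:

```python
from collections import Counter

def aggregate_label(votes):
    """Map 3-way yes/maybe/no annotations to a binary label.

    Returns "yes" or "no" when at least two of the three annotators
    agree, and None for split or majority-"maybe" questions, which
    are discarded.
    """
    assert len(votes) == 3
    label, count = Counter(votes).most_common(1)[0]
    if count < 2 or label == "maybe":
        return None
    return label
```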

{Context / Answer} Collection
After identifying context-dependent questions, we then move on to collecting multiple {Context / Answer} pairs via crowdsourcing. For this part, we exclusively use questions from the NQ-Open dataset. We allow crowdworkers to query an up-to-date version of English Wikipedia using the Google Search API. We outline the annotation process for each context type below: TEMP To construct SITUATEDQA examples with temporal context, we take two steps: (1) crowdsourcing a timeline for each question and (2) generating (q, c v , a) triples from the annotated timeline.
Crowdworkers are asked to provide a brief timeline of answers for a given question, consisting of the current answer, the previous answer, as well as the start and end transition timestamps for each answer, given as dates or years (see Figure 2). From the annotated timelines, we construct (q, c v , a) examples in one of three ways, which we refer to as Start, Sampled, and Static.

Figure 2: Data collection pipeline: Crowdworkers are first asked to identify context-dependent questions. We then collect brief answer timelines for temporally dependent questions and location/answer pairs for geographically dependent questions, each of which is then verified by another worker.

Start examples simply use each answer's start transition timestamp as c v . This is intended to simulate the common scenario of asking about information that is new or has recently changed.

Sampled examples use a timestamp that lies between an answer's start and end transitions as c v . As each answer in a timeline can result in many valid (q, c v , a) pairs, we uniformly sample up to two timestamps between each answer's start and end transitions, using each sampled timestamp to create a new (q, c v , a) triple.

Static examples utilize questions that were annotated as not temporally dependent, simulating realistic settings where context-dependent and context-independent questions coexist. For each static question, we uniformly sample a single value of c v , resulting in one (q, c v , a) triple per static question.

GEO We construct our evaluation datasets with crowdsourcing. For each stripped question that was annotated as geographically dependent during the identification stage, annotators were presented with the question, the original answer(s), and the stripped location. Annotators first validate whether the original location and answer pair is correct for the stripped question before identifying up to two additional location and answer pairs. We then use each validated or identified pair to construct a (q, c v , a) triple for the stripped question.
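The Start and Sampled constructions can be sketched as follows, assuming for simplicity that a timeline is a list of (answer, start year, end year) tuples at year granularity; the function names and the illustrative Kazakhstan timeline are ours:

```python
import random

def start_examples(question, timeline):
    """Start: use each answer's start-transition year as the context value."""
    return [(question, start, ans) for ans, start, end in timeline]

def sampled_examples(question, timeline, max_per_answer=2, seed=0):
    """Sampled: up to two years drawn uniformly between each answer's
    start and end transitions, one (q, c_v, a) triple per sampled year."""
    rng = random.Random(seed)
    triples = []
    for ans, start, end in timeline:
        years = range(start, end + 1)
        k = min(max_per_answer, len(years))
        for year in sorted(rng.sample(years, k)):
            triples.append((question, year, ans))
    return triples

# Illustrative timeline (ours), matching the paper's capital-of-Kazakhstan example.
timeline = [("Astana", 1997, 2018), ("Nur-Sultan", 2019, 2021)]
q = "what is the capital of kazakhstan"
```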
We construct our training set via distant supervision. Using our best performing context-dependent question identification model which we introduce in Section 4, we classify unlabeled examples from the NQ-Open training set. For each example that was classified as geographically-dependent, we use its original answer, location, and stripped question to construct a (q, c v , a) triple.

Quality Control
Validation The {Context / Answer} annotations collected above are reviewed by another annotator in a final validation stage. Presented with a question along with all answer timelines or location/answer pairs that have been generated by other workers, annotators are asked to validate or revise each response by marking it as correct or incorrect. We collect 2-way {Context / Answer} annotations and have one validator for each question from the development and test sets. We collect a single {Context / Answer} annotation and skip validation for answer timelines from the training set.
Worker Qualifications / Inter-annotator agreement Table 2 reports Fleiss's kappa for the context-dependent question identification and validation stages. We find moderate to high agreement on all tasks, with lower agreement on datasets with highly ambiguous queries (e.g., how much to trust friends) or questions that require extensive domain knowledge (e.g., what is the cause of smog in china?). For geographically dependent question answering, questions like "when were electric trains introduced" had split votes from the annotators, as both geographically dependent and independent interpretations are valid. Overall, we observe high quality data with agreement comparable to prior work, while reaffirming the prevalent ambiguity in open retrieval QA (Min et al., 2021).

Table 2 shows the statistics of our collected dataset. We annotated over 11K questions, and roughly 30-40% of them were identified as context-dependent. Temporally dependent questions comprised at least 10% of examples in all datasets we looked at, even without any filtering. We also found that examples from Natural Questions where at least one answer span is within a table in its evidence document are more often temporally dependent. To construct our final dataset, we upsample such questions.

Table 2: Dataset statistics. For our identification dataset, we report both the total number of questions (# q) and the percent of questions that are context-dependent (%). For our {Context / Answer} dataset, we report the number of unique (question, context value, answer) triples (# q, c v , a). Each context type and split's (q, c v , a) triples are collected slightly differently; see Section 3.2 for details. We also report inter-annotator agreement on context-dependent question identification and {Context / Answer} validation.
We followed the original train, development, and test split for each dataset, which caused slight inconsistencies in the proportion of examples from each dataset across splits, simulating a domain shift. For 2.8K of those context-dependent questions (2.4K temporal, 0.5K geographical), we collected a total of 5.9K answers from alternate contexts (4.0K temporal, 1.9K geographical). From those alternate temporal context / answer pairs, we construct (q, c v , a) triples by sampling valid dates, creating 6K examples. The final TEMP dataset also includes 6.7K examples from temporally-independent questions. More details and statistics on individual datasets can be found in Appendix A.

Data Statistics / Analysis
How often do temporally dependent facts change? We investigate how frequently answers change by measuring the distance between the two most recent answers. Figure 3 depicts how long the previous answer was valid for. We observe a long-tailed distribution, with a large proportion of answers changing around the one-year mark.
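The quantity plotted in Figure 3 is simply the time between the two most recent start transitions in a timeline; a minimal sketch using Python dates (function name ours, with an illustrative roughly-one-year example):

```python
from datetime import date

def previous_answer_duration(prev_start: date, curr_start: date) -> int:
    """Days between the two most recent answers' start transitions,
    i.e. how long the previous answer remained valid."""
    return (curr_start - prev_start).days

# An answer that turned over after roughly one year (example dates ours).
days = previous_answer_duration(date(2019, 4, 10), date(2020, 4, 12))
```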

Models
In this section, we describe our baseline models for context-dependent question identification and situated QA. Details on our learning settings and hyperparameters are in Appendix D.

Context Dependent Question Identification
As a lower bound, we provide a naive random-choice baseline, where labels are randomly assigned to match the label distribution in the training data. We also train separate BERT-based classifiers, which encode the input question q using BERT and classify based on each question's encoded [CLS] token representation. We experiment with both BERT-base and BERT-large. We approximate human performance by sampling 50 examples each for the temporal and geographical context types and re-collecting annotations from a new set of crowdworkers.
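The random-choice lower bound can be sketched in a few lines: estimate the positive-label rate from the training data, then assign labels at random with that probability (names and seeding are ours):

```python
import random

def random_choice_baseline(train_labels, test_size, seed=0):
    """Assign "yes"/"no" labels at random, matching the empirical
    label distribution observed in the training data."""
    rng = random.Random(seed)
    p_yes = sum(1 for y in train_labels if y == "yes") / len(train_labels)
    return ["yes" if rng.random() < p_yes else "no" for _ in range(test_size)]
```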

Situated Question Answering
There are two types of models for open retrieval QA: retrieval based (open book) and closed book. Retrieval based models (Karpukhin et al., 2020; Guu et al., 2020) answer questions by retrieving relevant passages from a corpus before producing an answer based on the retrieved passages. Closed book approaches (Lewis et al., 2020a; Févry et al., 2020) do not rely on access to a corpus during inference, instead using the knowledge stored in the model's parameters. These models, which typically have a large number of parameters and are trained over a single snapshot of web data, generate answers directly from each question. We present two competitive baselines: one retrieval-based model, DPR (Karpukhin et al., 2020), and one closed-book model, BART-large (Lewis et al., 2020a). Both baselines are trained on NQ-Open (Kwiatkowski et al., 2019). We use the best performing model configuration from the DPR paper, but swap the retrieval corpus with an up-to-date Wikipedia dump (timestamped 2021-Feb-20). In standard settings, DPR achieves an accuracy of 41.5 and BART achieves an accuracy of 24.1 on NQ-Open. We use these baselines in three settings which we introduce below: unmodified, query modified, and query modified with finetuning.

Table 3: Examples of modified queries. Text shown in bold is concatenated onto the original question. To query for the answer for some context where c t = TEMP or c t = GEO, we append the phrase "as of c v " or "in c v ", replacing c v with the context value.

Context (c t , c v )	Augmented Question
(TEMP, 2017)	Who went number 1 in the WNBA draft as of 2017?
(GEO, Tokyo)	Where do we rank among the world's largest cities in Tokyo?
We also approximate human performance on situated QA using 100 randomly sampled examples from each context type, which are annotated by the authors of this paper. While our estimated human performance appears somewhat low, it is in line with agreement rates in the original Natural Questions study (Kwiatkowski et al., 2019) and is tied to the challenges of evaluating open-retrieval QA. Discrepancies are often due to ambiguous questions or equivalent answer forms (e.g., July 5, 1945 vs. 5 July 1945, The Patriots vs. New England Patriots). The latter is especially problematic for geographically dependent questions, where answers involving time expressions occur much more frequently, comprising over 50% of examples. See Appendix A for more details.

Unmodified Baseline
We evaluate open-retrieval QA models on our task without any alterations. Each model is only given the question, ignoring the extra-linguistic context, and thus predicts the same answer no matter the context. For temporally dependent questions, models are expected to output the most up-to-date answer. For geographically dependent questions, models must assume some geographical context to produce an answer.
Query Modified One simple method for incorporating extra-linguistic contexts into question answering is to concatenate the context onto the question, separated by some special token. We adopt a slightly different approach that leverages what models have already learned about producing answers from specific contexts: concatenating a phrase that specifies the relevant context onto each question, transforming it into a context-independent question. The concatenation templates for each context type are described in Table 3. We find that these simple modifications mostly generate valid, fluent questions that closely resemble examples found in our models' training data. We estimate that 10% of questions in NQ-Open contain similar augmentations, inserting or concatenating a phrase that specifies the context (See Appendix B for details).
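The concatenation templates amount to a one-line string transformation; a minimal sketch (function name and the naive handling of the trailing question mark are ours):

```python
def modify_query(question: str, c_type: str, c_value: str) -> str:
    """Append a phrase specifying the extra-linguistic context,
    turning a context-dependent question into a context-independent one:
    "as of c_v" for TEMP, "in c_v" for GEO."""
    suffix = {"TEMP": f" as of {c_value}", "GEO": f" in {c_value}"}[c_type]
    return question.rstrip(" ?") + suffix + "?"
```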

Query Modified w/ Finetuning
We also experiment with finetuning our query modified baselines. For our retrieval based approach (DPR), we finetune separate reader and retriever models for each context type. We finetune retriever models using gold and hard-negative passages from the retrieval results of our Query Modified DPR baseline. Likewise, we also finetune separate closed book models for each context type.

Table 4 reports our results on context dependent question identification. We find that pretrained language models perform competitively on this binary classification task, matching human agreement. They may thus be a useful tool for identifying context dependent questions, which closed book systems are poorly suited for, or for identifying examples in benchmark datasets that require re-annotation.

Table 6: Breakdown of errors from context dependent QA. We report EM accuracy evaluated against answers from the correct context (One) and answers from the union of all our annotated contexts (Any).

Finetuning models on answers from multiple extra-linguistic contexts with modified questions improved performance across the board, especially for the TEMP context type.

Situated Question Answering
Adding TEMP context to static questions distracts the model, significantly decreasing performance, especially for the closed book baseline. This suggests that models are not robust to semantically equivalent edits, as observed in Ribeiro et al. (2020). Models also perform better when provided with an answer's start date as context compared to a sampled date, which falls between an answer's start and end transitions. This gap is especially pronounced for our retrieval based model.

Table 7: Results of our unmodified baseline run on GEO. We further break down performance on common locations by separating the two most common ones, "United States" and "India", from the "Other" common locations.
Does system performance vary based on the frequency of the geographic context? Table 5 splits results on GEO questions into those from common and rare locations, many of which are unseen during training. Both methods have greater difficulty with rare locations; however, the drop in performance for closed book models is significantly worse. Closed book models perform 10 percentage points worse on questions from rare locations, while retrieval based models only perform 3 points worse. These findings show that retrieval based systems generalize better to new locations and are better equipped for answering geographically dependent questions. We also see that the gap in performance between common and rare locations shrinks for both models after finetuning, suggesting that explicitly modeling geographical contexts helps with generalization.

What is the assumed context for a geographically dependent question? When presented with a geographically dependent question without its context, open retrieval QA models must assume some geographical context to produce an answer. We hypothesize that models will exhibit bias toward assuming the few geographical contexts that are most frequently asked about. We test this by reporting the results of our unmodified baseline, further breaking down the results from Table 5 by individual location. Each time the unmodified baseline correctly answers a question for some location, it does so by assuming that location as its context. We report our results in Table 7, which shows that models are heavily biased toward assuming the question was asked by someone in the United States or India.
How often do models provide answers from the wrong context? In Table 6, we report error analysis for our situated QA baselines, showing how often models provide the answer from the specified context versus the union of all annotated contexts. We see that models often fail to incorporate extra-linguistic contexts into the question, producing the answer from another context.
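The One-versus-Any breakdown can be computed as below; a sketch assuming predictions are keyed by (question id, context value) and gold answers by question id then context value (the data layout, names, and trivial normalization are ours):

```python
def em_breakdown(predictions, gold_by_context):
    """EM against the answer from the specified context ("One") versus
    the union of answers across all annotated contexts ("Any").

    A large Any-minus-One gap indicates the model often produces a
    correct answer, but for the wrong context.
    """
    one = any_ = total = 0
    for (qid, cv), pred in predictions.items():
        contexts = gold_by_context[qid]
        norm = pred.strip().lower()
        one += norm == contexts[cv].strip().lower()
        any_ += norm in {a.strip().lower() for a in contexts.values()}
        total += 1
    return one / total, any_ / total
```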

Analysis
Can QA systems differentiate and recall previous answers to a question? To answer this, we study a new setting where we query models for the current and previous answer to a given question. Following the same steps as above for the TEMP and GEO context types, we develop a suite of retrieval-based and closed-book baselines for this new setting. Our query-modified baselines in this setting prepend the word "previously" to the beginning of each question to query for the previous answer. In the existing framing of QA, all questions seek the current answer, so we apply no augmentations when querying for the current answer. For our finetuned baselines, we train separate reader models and closed book models for querying current and previous answers. We report our results for these experiments in Table 8. We find that models are far better at producing the current answer than past answers to a question. Finetuning, however, greatly increases performance on queries for previous answers.
Can QA systems trained on outdated answers adapt to the present? We investigate whether QA systems that are pretrained on outdated corpora and trained on large-scale QA datasets containing outdated answers can adapt to answering questions situated in the present. Lewis et al. (2020b) show that updating the corpora of retrieval based models can be an effective method for updating simple facts, such as current heads of state. We test whether this extends to a broader range of temporally dependent facts by evaluating our baselines introduced above on their ability to predict the current answer in our dataset, which is up-to-date as of Feb 2021. We split the current answers into two categories: stable, where the answer did not change since 2018, and updated, where the answer changed after 2018. We chose 2018 as the threshold as it is when Natural Questions (Kwiatkowski et al., 2019) was collected and roughly matches the timestamp of the most recent data our closed book model (Lewis et al., 2020a) was pretrained on (Feb 2019). Table 9 shows that, while retrieval based models were able to update some world knowledge after swapping the retrieval corpus, significant gains on updated questions only come after finetuning with newer data. This suggests that simply updating the corpora that models retrieve passages from is not sufficient to keep models up-to-date.
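The stable/updated split described above can be expressed as a simple threshold on each current answer's start-transition year; a sketch with names of our own choosing:

```python
def split_by_update(start_years, threshold_year=2018):
    """Split question ids into "stable" (current answer's start
    transition at or before the threshold year, i.e. unchanged since
    the training data was collected) and "updated" (changed after it).

    `start_years` maps question id -> current answer's start year.
    """
    stable, updated = [], []
    for qid, year in start_years.items():
        (updated if year > threshold_year else stable).append(qid)
    return stable, updated
```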

Related Work
Comparison to other QA datasets While a few prior studies mention that answers to a question can change over time, no prior work has investigated the prevalence of temporally dependent queries or the performance of existing models on them. Work addressing ambiguous questions overlaps with our study, as ambiguity can arise from extra-linguistic contexts; one such study found 13% of examples were ambiguous due to temporal deixis. However, we found that context-dependence can co-occur with inherent ambiguities in the question which cannot be resolved with context, so this figure underestimates the true proportion of temporally dependent questions. Furthermore, humans do not always consider context-dependent questions ambiguous when there is only one answer given the context, which is normally assumed to be the present time and location. Ambiguities that arise due to lack of context should therefore be modeled separately from semantic ambiguities of the question.
Recent work has also studied generating temporally dependent questions with timestamped answers from temporal knowledge bases (Chen et al., 2021; Saxena et al., 2021). Similarly, Dhingra et al. (2021) generate temporally dependent cloze-style prompts and present a temporally aware language model to address them. These works, however, synthetically generate examples and are therefore limited in scope and diversity. In contrast, we manually annotate temporally and geographically dependent questions.
Temporal understanding Temporal understanding in NLP has typically been studied in an intra-document setting, i.e., ordering events or finding the temporal relationship between two events in the same document (Pustejovsky et al., 2003; Bethard et al., 2007; Cassidy et al., 2014; Llorens et al., 2015).

Dynamic evaluation Dynamic evaluation has been studied in language modeling (Osborne et al., 2014; Yogatama et al., 2014), topic modeling (Wang et al., 2008), and entity linking (Rijhwani and Preotiuc-Pietro, 2020). Recent work (Lazaridou et al., 2021) studied temporal drift in the context of large-scale pretrained language models, hinting that reusing models from a previous snapshot can cause performance decay, as we observe in our study as well.
Adversarial, phased data collection has been proposed (Paperno et al., 2016; Zellers et al., 2018; Potts et al., 2020) to drive model development, constantly feeding models examples that current models are unable to address. We suggest collecting questions dynamically for a slightly different goal: to accurately reflect the changing world by identifying temporally dependent facts in such benchmarks and continuously updating them. Building a benchmark based on a fixed snapshot (Petroni et al., 2021) can be a viable alternative.
Researchers have also studied keeping knowledge sources up-to-date. Konovalov et al. (2017) looked at constantly changing facts from the perspective of automatically extracting knowledge-base revisions. Schuster et al. (2021) present contrastive sentences based on Wikipedia revisions, showing how entailment decisions can change over time.

Conclusion & Future Work
We present the first study of how extra-linguistic contexts affect open retrieval QA. Our study reveals that current systems fail to adapt to shifts in the temporal or geographical context. We therefore propose tasks and create a dataset for training and evaluating QA systems on modeling how facts change across contexts. Our dataset will support ample future work on developing models that can gracefully update their predictions based on new temporal and geographical contexts. Future research may address incorporating temporally and geographically dependent source documents, such as news articles, or considering other extra-linguistic contexts such as who is asking the question, taking individuals' preferences into account.

A Data Collection Details

In addition to allowing annotators to search over English Wikipedia, we also provide a list of articles that were used by workers in the prior stage. In the open retrieval setting, the same question may have multiple interpretations, and we therefore ask annotators to mark answers as correct if they are consistent with one plausible interpretation. Figures 4, 5, 6, 7, and 8 show our annotation interface.
Generating TEMP (q, c, a) examples We sample values of c v to match the granularity of the annotated dates (year-month-day or year) from our collected answer timelines. For each static question, we uniformly sample a single value of c v between the original dataset's creation (2018) [...] distribution we find in our training set annotations.
Inter-annotator agreement Inter-annotator agreement during the identification phase can be found in Table 12. At the validation phase, 70% of temporal question-context-answer pairs and 85% of geographical question-context-answer pairs were annotated as correct, similar to the validation phase (76%) in AmbigQA.

Agreement with Natural Questions
The answer span exact match between our collected answers and the original answers from NQ, for cases where the annotated start and end transition dates overlap with NQ's creation (2018), was around 40%, similar to the 49.2% average agreement rate on NQ-Open test data in the original study (Kwiatkowski et al., 2019). We manually analyzed 50 randomly sampled errors and find that only 12% of the errors can be attributed to annotation errors. 70% of the other errors are due to ambiguities in the question resulting in multiple possible answers (e.g., "Three movies made from Agatha Christie's novels"), or the same answer being given in different forms (e.g., "Chamberlain, Wilt" vs. "Wilt Chamberlain"). The remaining errors (18%) are the result of inconsistencies in NQ annotations.

B Data Analysis
Naturally occurring query modifications In open retrieval QA datasets, questions often ask about the answer in some specific temporal or geographical context. For instance, when people want to ask about the previous answer to a question, they often add words like "previously" or "last" to their question to specify it. We find that such questions comprise about 4.1% of NQ-Open (Kwiatkowski et al., 2019). We also find that 5.1% of questions specify a specific point in time, as determined by whether there is a year expression in the question. Finally, we estimate the number of questions that specify a geographical context by counting the number of stripped questions that were identified as geographically dependent. These questions, which have a phrase specifying the geographical context, comprise about 4.4% of NQ-Open. Our modifications closely resemble these naturally occurring questions, but can sometimes produce ungrammatical sentences, usually due to verb tense agreement.
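The year-expression and previous-answer checks described here can be approximated with simple regular expressions; a rough sketch (the patterns and category names are ours, not the exact heuristics used in the paper):

```python
import re

def classify_query_modification(question: str):
    """Roughly categorize naturally occurring context specifications:
    previous-answer markers ("previously", "last") and explicit
    year expressions (e.g. 1998, 2017)."""
    q = question.lower()
    if re.search(r"\b(previous|previously|last)\b", q):
        return "previous-answer"
    if re.search(r"\b(1[89]\d{2}|20\d{2})\b", q):
        return "year-specified"
    return None
```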

C Retriever Analysis
Are errors in retrieval based systems due to poor retrieval? In Table 13, we report the retrieval performance from the settings in Section 5. Comparing against the results from Table 5 and Table 6, we see similar trends in end-to-end performance reflected in our retriever performance.
We also explore whether the retriever model is able to adapt to updated corpora, reporting retrieval performance in Table 14. Here, we see that retriever performance also suffers on queries with updated answers, suggesting that retrieval systems are perhaps implicitly learning to situate questions within the time period of their large-scale training datasets.

D Implementation Details
Context Dependent Question Identification Baselines We use the pytorch-transformers (Wolf et al., 2020) library to implement our classification models. The training batch size is set to 8 and 64 for BERT-base and BERT-large, respectively. We train for 10 epochs using a learning rate of 5e-5 and 500 warmup steps, and select the best performing model measured on dev after each epoch.
Context Dependent Question Answering Baselines We finetune our closed book baselines for 10 epochs with a batch size of 256, using the AdamW optimizer with a learning rate of 1e-5. We select the best performing model evaluated after each training epoch. We keep all other hyperparameters the same from the original implementation.
For our retrieval based baselines, we finetune both the retriever and reader components. The retriever models are trained using in-batch negatives plus one hard-negative passage per question. We use hard-negative and gold passages from the query modified model's predictions before finetuning. We train the retriever model for 8 epochs, using a learning rate of 1e-5, 100 warmup steps, and a batch size of 16. We then finetune our reader models for 16 epochs with a batch size of 16. We select the best model evaluated after each training epoch for both reader and retriever models, and select the best top-k retrieval results, with k ∈ {10, 20, 50}, evaluated on the development set. All other hyperparameters for the DPR reader and retriever models are kept the same as in the original work.