FOCUSQA: Open-Domain Question Answering with a Context in Focus



Introduction
As more and more information-seeking activities move to visual interfaces, the way users interact with QA systems is changing. An example is given by screen-based virtual assistants (e.g., Google Assistant, Alexa and Siri), where the interaction with the user can be conditioned on the information shown on the screen.
To study this setting, we introduce question answering with a context in focus, a task where an information-seeking user interacts with a QA system and, after reading some information shown on the screen, asks a follow-on question. The information may be the result of a previous exploratory interaction (e.g., tell me about Lady Gaga) or some input provided by the system (e.g., news feeds and daily contents). We refer to this information as the context and, for the scope of this work, we define it as the title and the first paragraph of a web page shown on the screen.
Previous works approached this problem as a Machine Reading (MR) task, limiting the application to questions focused on a document (a questioner creates them while reading that document) containing the answer (Rajpurkar et al., 2016; Trischler et al., 2017). In contrast, our scenario allows for a free interaction with the system, where users can decide whether to use the context or not. They can ask questions that are (i) grounded in the context, e.g., related to entities in the context, (ii) based on additional knowledge, i.e., the user has prior knowledge about the topic and uses it to formulate the question, or (iii) self-standing, i.e., related or unrelated to the context, and containing all the information needed to identify the correct answer. All these questions can be answered with in-page text, i.e., from the document, and/or off-page text, i.e., from other documents. Figure 1 shows an example of this free information-seeking interaction.
The task poses two key challenges: (i) understanding when and if the question is related to the context, and (ii) ensuring that the answers are contextually relevant when required. In this scenario, ambiguity arises frequently and the answers to the same question may change depending on the context in which they are asked. Even though MR is a common approach to answer extraction for Open-Domain Question Answering (ODQA), we model this problem as an Answer Sentence Selection (AS2) task, where the answer is selected by ranking over all possible candidates (Garg et al., 2020). We believe that AS2 is more relevant to a production scenario since humans converse in compact and complete sentences.
Current QA datasets have limitations that do not allow us to study question answering with a context in focus. They either (i) do not rely on the context at all (Kwiatkowski et al., 2019; Clark et al., 2019; Nguyen et al., 2016), (ii) prompt users to write probing questions based on a given text (Rajpurkar et al., 2016; Trischler et al., 2017), or (iii) assume a conversational user engagement at each turn (Adlakha et al., 2022; Anantha et al., 2021; Choi et al., 2018). To overcome such limitations, we construct FOCUSQA. Our dataset contains 12,165 unique (question, context) pairs and a total of 109,940 annotated answers. Instead of asking annotators to write questions, we developed a new methodology that takes existing ones and pairs them automatically with multiple contexts. We sampled questions from NATURALQUESTIONS (NQ) (Kwiatkowski et al., 2019), QReCC (Anantha et al., 2021), and Answer-Sentence Natural Questions (ASNQ) (Garg et al., 2020), a dataset for answer sentence selection derived from NQ.
To show the benefits of our dataset construction methodology, we compare its questions against a sample of 100 (question, context) pairs, where the questions are written by humans after reading the context. The analysis shows that this common way of crowdsourcing questions leads to examples where all the questions are only related to entities in the context. In contrast, with our approach, we obtain a mix of questions that are grounded in the context and questions that contain additional knowledge with respect to the context.
To assess the initial performance on our task and data, we experimented with models for AS2. We show that the task sets several challenges: (i) when models do not use contextual information, state-of-the-art systems achieve a precision of only 29.37%; (ii) when we introduce a set of strong baselines that incorporate the context and additional context from the answer page into Transformer models (Vaswani et al., 2017), the combination of these contexts achieves a precision of 50.7%. However, the task requires further research to get closer to the human precision of 81.4% on the test set.
In brief, our contribution is threefold:
• We study a realistic scenario where users can freely interact with a QA system that provides contextual information to facilitate information seeking.
• We construct FOCUSQA, a dataset for AS2 with 12,165 unique (question, context) pairs and 109,940 annotated answers. To build the dataset, we developed a novel methodology to elicit more realistic questions.
• We introduce strong baselines that incorporate context from different sources (e.g., screen content and answer page) into Transformer models and show that contextual sentence selection models outperform state-of-the-art AS2 models by up to 21.36 absolute percentage points in precision.

Related Work
Contextual Question Answering (CQA). Contextual Question Answering (CQA) leverages additional context instead of treating the questions as self-contained inputs. The context consists of many factors, such as cognitive and social ones, that are related to a user's intentions, tasks, and needs (Allen, 1997). Taking the context into account is crucial to better interpret the questions (Min et al., 2020). Earlier works and datasets focused on leveraging external context like the time, location, and user profiles (Krulwich and Burkey, 1997; Limbu et al., 2009; Zhang and Choi, 2021).
Recently, the advance of various smart devices has allowed users to interact with the system in a much richer way. This has influenced the creation of CQA datasets leveraging wider forms of context such as newspapers (Trischler et al., 2017), web pages (Chen et al., 2021), visual information (Zhu et al., 2016; Biten et al., 2019), and conversational history (Choi et al., 2018; Qu et al., 2020; Anantha et al., 2021). Trischler et al. (2017) is the closest to our research in that we use the same form of context. However, they collect questions by showing context to annotators explicitly, which leads to shallow, superficial questions far from real scenarios.
Contextual Sentence Selection (CS2). The AS2 task was originally defined in the TREC competition (Wang et al., 2007) and has the advantage of high efficiency, which enables its use in real-world applications (Garg et al., 2020). Previous AS2 models treated each sentence as an independent unit and applied neural network models to select the sentence with the highest score (He and Lin, 2016; Yang et al., 2016; Garg et al., 2020). However, such approaches are sub-optimal, as sentences extracted from web documents are typically not self-contained and can have issues related to pronouns, anaphora, or other linguistic limitations (Tan et al., 2017). Contextual Sentence Selection (CS2) addresses these limitations by encoding additional information, i.e., context, into the scoring function. Recently, Lauriola and Moschitti (2021) showed that local context, defined as the two sentences surrounding the candidate answer, substantially improves non-contextual state-of-the-art models. Han et al. (2021) proposed to use non-consecutive multi-sentences coming from the document containing the answer. Differently from these approaches, this paper defines, analyzes, and evaluates cross-document context by combining the context in focus and retrieved documents simultaneously.

Defining Context in Focus
In this work, we consider the screen content as a context. We say that a question is context-dependent if its answer $a_i$ can change depending on the context $c_i$. Typically, such questions contain pronouns, entities, and concepts that can be disambiguated only if the context is provided. In contrast, we say that a question is context-independent if it provides all the information needed to find an unambiguous answer. In this work, the context is defined as the title and the first paragraph of a web page. See Table 1 for examples in which the same questions can have different answers depending on the context.

FOCUSQA task: QA with a Context in Focus
The task simulates the real use-case scenario where there is a user and a context in focus. The user is free to interact with the system and, after reading the context, can ask a question that is either context-dependent or not. The answer can be extracted from the document in focus or from other pages of the web. We call in-page candidates those extracted from the document containing the context in focus and off-page candidates those extracted from other pages. We cast this problem as an AS2 task and we extend the formulation to include the context in focus. Thus, given a question $q$ and a set of answer candidates $S = \{s_1, \dots, s_n\}$, the task of QA with a context in focus is to select a sentence $s_i$ that correctly answers $q$ for the provided context $c$. Formally, let $Q$ be the set of questions, $C$ the set of contexts paired with the questions, and $S$ the set of sentences. The task can be defined as a ranking problem. Given a pair $(q, c) \in Q \times C$ and a set of possible answers $S_{(q,c)} \subseteq S$ for $q$ given $c$, the sentence selector returns the answer $a \in S_{(q,c)}$ for which

$$a = \arg\max_{s \in S_{(q,c)}} r(q, s, c),$$

where $r : Q \times S \times C \to \mathbb{R}$ is the scoring function, which can be estimated using Transformer models, as explained in Section 6.
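The ranking formulation above can be illustrated with a minimal sketch. The scoring function here is a toy token-overlap heuristic that stands in for the Transformer-based scorer; the function names, the example question, and the shortened candidate sentences are assumptions made for illustration only.

```python
from typing import Callable, Sequence


def select_answer(
    question: str,
    context: str,
    candidates: Sequence[str],
    score: Callable[[str, str, str], float],
) -> str:
    """Return the candidate s that maximizes score(q, s, c)."""
    if not candidates:
        raise ValueError("need at least one candidate sentence")
    return max(candidates, key=lambda s: score(question, s, context))


def overlap_score(question: str, sentence: str, context: str) -> float:
    """Toy stand-in for r(q, s, c): tokens shared with question + context."""
    query = set((question + " " + context).lower().split())
    return float(len(query & set(sentence.lower().split())))


# Hypothetical example: the context in focus decides between two candidates.
context = "Constitution of United States of America 1789"
candidates = [
    "Ambedkar is considered the father of the modern Indian Constitution.",
    "James Madison is credited as Father of the Constitution of the United States.",
]
best = select_answer(
    "who is the father of the constitution", context, candidates, overlap_score
)
```

Because the context is an argument of the scoring function, the same question can receive a different top-ranked sentence under a different context, which is the behavior the task definition asks for.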

Data Collection
Our approach to building the FOCUSQA dataset is divided into two stages: (i) question/context pairing, and (ii) answer collection. To collect candidate answers, we implemented an Open-Domain Question Answering (ODQA) system for AS2. For this work, we used a BM25/lexical-based sentence retriever and a Transformer-based AS2 reader, as this combination already implements a reliable QA system. We retrieve documents using an open-domain index containing ∼100M web pages from the open repository of Common Crawl. See Appendix A for more implementation details. All the annotation tasks are performed by a team of professional annotators with a project lead. Additional details on the guidelines can be found in Appendix A.

Table 1: Examples of contexts for which the same question has different answers.
Context: The Scheduled Tribes In India. The Scheduled tribes are groups of people that are officially recognized in the Indian Constitution. They are also referred to as Dalit, which translates as broken or scattered.
Answer: Ambedkar, who lived from 1891 to 1956, was an Indian economist and is considered the father of the modern Indian Constitution.
Context: Constitution of United States of America 1789. While the primary authorship of a vast array of documentations and publication may be cited with ease, the process of identifying the Father of the Constitution may prove to be a far more difficult ...
Answer: James Madison, alongside fellow Federalist Alexander Hamilton, is considered to be one of the individuals credited with being the Father of the Constitution.

We construct (question, context) pairs by sourcing questions from NATURALQUESTIONS (Kwiatkowski et al., 2019), QReCC (Anantha et al., 2021), and Answer-Sentence Natural Questions (ASNQ) (Garg et al., 2020).
Instead of requesting annotators to formulate questions about the context, we propose to link the question to a valid context by working backwards from candidate answers. For each question, we use the end-to-end ODQA system to collect a ranked list of candidate sentences. Then, given a question $q$ and an answer sentence $s_i$ from the top-$k$ scored candidates, with $i = 1, \dots, k$, we pair $q$ with the title and the first paragraph of the web page that contains $s_i$. We pair each question with up to 3 different contexts using the top-3 answers extracted from different documents. Because this is an automatic procedure, we ask annotators to validate each (question, context) pair with three annotations:
• context-dependent: the answer to the question can change depending on the context.
• context-connected: the question refers to concepts, entities, or events related to the context. Most likely, the answer to the question can be found in-page or in documents with similar contents.
• context-answered: the question is partially or fully answered by the context.
We consider a (question, context) pair valid if it is context-dependent or context-connected, but not context-answered. We stop the annotation of a pair as soon as these conditions are not satisfied.
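The validity rule for a pair can be written as a single boolean expression. The sketch below is only an illustration of that rule; the class and field names are hypothetical and do not come from the annotation tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PairAnnotation:
    """The three annotator judgments for one (question, context) pair."""
    context_dependent: bool
    context_connected: bool
    context_answered: bool


def is_valid_pair(a: PairAnnotation) -> bool:
    """Valid if dependent or connected, and not already answered by the context."""
    return (a.context_dependent or a.context_connected) and not a.context_answered
```

A pair that is both context-dependent and context-answered is rejected, since the context alone already answers the question.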
In addition, we collected self-standing questions from the test set of ASNQ. Given the question and the context from its answer page, we compute the similarity of such context with a set of contexts retrieved from the index. Then, we pair the question with the least and the most similar contexts. This way, the resulting pairs contain questions that can be answered without additional context, posing the challenge for the model to recognize whether the context is needed.
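Pairing a self-standing question with its least and most similar contexts can be sketched as follows. The text does not name the similarity measure used at this step, so the bag-of-words cosine below is an assumption chosen to keep the example dependency-free; the function names are hypothetical.

```python
import math
from collections import Counter
from typing import Sequence, Tuple


def cosine(a: str, b: str) -> float:
    """Cosine similarity between bag-of-words vectors of two texts."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[t] * vb[t] for t in va)
    norm = math.sqrt(sum(v * v for v in va.values())) * math.sqrt(
        sum(v * v for v in vb.values())
    )
    return dot / norm if norm else 0.0


def least_and_most_similar(
    answer_context: str, retrieved: Sequence[str]
) -> Tuple[str, str]:
    """Return the (least similar, most similar) retrieved contexts."""
    if not retrieved:
        raise ValueError("need at least one retrieved context")
    ranked = sorted(retrieved, key=lambda c: cosine(answer_context, c))
    return ranked[0], ranked[-1]
```

Any sentence-embedding similarity could replace `cosine` without changing the pairing logic.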

Answers Collection.
After identifying valid (question, context) pairs, we collect multiple (question, context, answer) triplets and annotate them via crowdsourcing.
Annotators are asked to judge if the candidate sentence (i) is about the same context as the question, (ii) answers the question, and (iii) is factually correct. We consider an answer correct if all conditions are true. We collect sentences from two sources:
• in-page, where we select the top-k candidates from the document in focus.
• off-page, where given the question and the context, i.e., title and first-paragraph, we use the ODQA system to collect candidate answers.We perform a basic contextual retrieval by querying the index using the concatenation of the question and the title.
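The basic contextual retrieval described for off-page candidates, querying with the concatenation of question and title, can be sketched with a small in-memory BM25 scorer. This is a simplified illustration and not the index used in the paper: the whitespace tokenization, the parameter values `k1` and `b`, the smoothed IDF variant, and the toy documents are all assumptions.

```python
import math
from collections import Counter
from typing import List, Sequence


def contextual_query(question: str, title: str) -> str:
    """Concatenate the question with the title of the page in focus."""
    return f"{question} {title}"


def bm25_scores(
    query: str, docs: Sequence[str], k1: float = 1.2, b: float = 0.75
) -> List[float]:
    """Score each document against the query with a basic BM25 formula."""
    if not docs:
        raise ValueError("need at least one document")
    tokenized = [d.lower().split() for d in docs]
    n = len(tokenized)
    avgdl = sum(len(d) for d in tokenized) / n
    # Document frequency: number of documents containing each term.
    df = Counter(t for d in tokenized for t in set(d))
    scores = []
    for d in tokenized:
        tf = Counter(d)
        s = 0.0
        for t in set(query.lower().split()):
            if t not in tf:
                continue
            idf = math.log(1 + (n - df[t] + 0.5) / (df[t] + 0.5))
            s += idf * tf[t] * (k1 + 1) / (tf[t] + k1 * (1 - b + b * len(d) / avgdl))
        scores.append(s)
    return scores


docs = [
    "17-mile drive is a scenic road in california",
    "emery county is a county in utah",
    "the premier league is a football league",
]
scores = bm25_scores(contextual_query("where is the scenic road", "17-Mile Drive"), docs)
```

Adding the title to the query biases a purely lexical retriever toward pages about the entity in focus, which is the only contextual signal this basic setup uses.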
For the set of self-standing questions, we collected candidates in the same way. In addition, we used candidates from the answer page. Instead of re-annotating the pairs, we rely on the annotation from ASNQ and we labelled all the other pairs as negative. Considering that these questions are not context-dependent, we can assume that retrieved candidates are most likely coming from unrelated documents. More details can be found in Appendix A.3.

Data Analysis

Dataset Statistics
The final dataset contains 6,036 questions, which we paired with 12,453 contexts in focus. We obtained 12,165 unique (question, context) pairs, for which we collected a total of 109,940 answers. Among these, 89,148 are new human annotations, 6,666 are from ASNQ, and 14,126 are automatically labelled as negative. We split the data into train/dev/test by randomly sampling unique questions from the set, excluding all the self-standing questions. The self-standing questions are used only at test time, to evaluate the model's ability to distinguish when the context is not necessary. We refer to this subset as self-standing, i.e., questions that do not require the context to be answered, in contrast to contextual.
Table 2 shows the statistics of our collected dataset. In the dataset, 81.4% of the (question, context) pairs are answerable. Among these, in Table 3 we can observe that both in-page and off-page are valuable sources to find the correct answers. This is reasonable, as questions are asked in an information-seeking scenario and users can ask follow-on questions based on the context. Finally, it is worth mentioning that our approach for automatically creating (question, context) pairs reduces annotation cost by up to 31%. More details can be found in Appendix A.6.

Questions Based on Additional Knowledge
By design, our dataset introduces an interesting feature that elicits more realistic questions: some of the questions imply topic knowledge on the part of the questioner. We refer to them as questions based on additional knowledge. When collecting questions manually, it is very difficult to obtain these questions, since annotators might not have specific expertise about the topic. To demonstrate this, we randomly sampled 100 contexts from our dataset. For each of them, we asked annotators to formulate natural and well-formed questions that are not answered by the context. Consider the example in Table 4, where (P) is the automatically paired question and (M) is the one asked while looking at the context. (P) is a question that can only be asked if one knows when "Brewster joined Premier League", because that is not specified in the context. Conversely, (M) questions are firmly grounded in the context and do not reveal any background knowledge besides general world knowledge (e.g., England has a national team).
To measure the topic expertise of paired and manual questions, we compute the percentage of entities in the questions that are not mentioned in the context. As shown in the example in Table 4, mentioning a new entity is a reliable proxy for domain-specific knowledge. We observe that only 6% of the entities in (M) questions are not mentioned in the context, while for (P) it goes up to 59%. Additionally, we compute the semantic similarity between question and context vectors, which we obtained using Sentence-BERT, a state-of-the-art model for sentence embedding (Reimers and Gurevych, 2019). The average similarity between (M)/(P) questions and their context is 0.51/0.37. This result confirms that our dataset includes questions whose questioner has more knowledge of the topic than what can be found in the context.
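The new-entity measure can be sketched as below. The example assumes the question entities have already been extracted by some named-entity recognizer, which the text does not specify, and it uses case-insensitive substring matching as a simplification; the function name and the example context are hypothetical.

```python
from typing import Sequence


def new_entity_rate(question_entities: Sequence[str], context: str) -> float:
    """Fraction of question entities that are not mentioned in the context."""
    if not question_entities:
        return 0.0
    ctx = context.lower()
    new = [e for e in question_entities if e.lower() not in ctx]
    return len(new) / len(question_entities)


# Hypothetical context and pre-extracted entities, for illustration only.
context = "Brewster is a striker who has played for the England national team."
rate = new_entity_rate(["Brewster", "Premier League"], context)
```

Here one of the two entities is absent from the context, so the rate is 0.5; averaging this quantity over a set of questions gives a figure comparable to the percentages reported above.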

Differences with Other QA Datasets
In this section we discuss the difference between FOCUSQA and Machine Reading Comprehension (MRC) datasets, where data are also in the form of (question, context) pairs. First, FOCUSQA is intended to simulate user interactions with a screen-based QA system, which is a real-world application not modeled by current MRC datasets. The latter are created by providing a text to the annotators and asking them to generate questions. As a consequence, and as discussed above, the generated questions are strictly grounded in the text, i.e., related to text entities, and so these questions have a high lexical overlap and similarity with the text (context). To obtain more realistic data, FOCUSQA takes a novel approach where questions are linked to a valid text (i.e., the context) that contains the correct answer to the questions. The approach is general and it can be applied to any collection of questions, avoiding the problem of having only questions specific to the context. Second, FOCUSQA contains correct answers that are extracted from multiple documents (Table 5), whereas in MRC datasets (e.g., NewsQA (Trischler et al., 2017), SQuAD (Rajpurkar et al., 2016) and NATURALQUESTIONS (Kwiatkowski et al., 2019)), answers are extracted only from the documents used to generate the question. Finally, in MRC-style datasets (e.g., SQuAD and NewsQA), the referring text is closely related to the question and it might even contain the answer (e.g., SQuAD) or it is needed to answer the question (e.g., NewsQA). By contrast, FOCUSQA models the situation where a user reads a context and asks questions that may or may not be related to the context (topic switching). This is essential to model user behavior when interacting with screen-based QA systems. To be successful in FOCUSQA, QA systems must learn when and how to exploit the context in focus.

Contextual Models
Contextual Sentence Selection (CS2) models are Transformer models with multiple token-type (sentence) embeddings (Lauriola and Moschitti, 2021) that take as input a single textual sequence of question, answer, and context. They are well suited to tasks where the context is required to find the answer. To adapt such models to the context available in FOCUSQA, we introduce two new input sequences, namely, READOUT and Cross-Context, which embed question/answer pairs along with text from the answer page and the textual context from the screen, respectively.
Question Rewriting. In order to include contextual information in AS2 models, we performed Question Rewriting (QR) to reformulate the questions into a more self-contained form. The rewritten questions are obtained with a generative model for question rewriting (Anantha et al., 2021), using the concatenation of the question with the context as input.
Transfer and Adapt with Context. We trained our models adopting a Transfer and Adapt strategy (Garg et al., 2020): first (Transfer), we fine-tune a pre-trained ELECTRA-base model, available from HuggingFace, on the large ASNQ dataset (20M q/a pairs). Then (Adapt), we further fine-tune on FOCUSQA. More details can be found in Appendix B.
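The idea of feeding a CS2 model a single sequence with one token-type id per segment can be sketched as follows. This is a schematic illustration, not the authors' tokenizer: whitespace splitting, the `[CLS]` and `[SEP]` markers, the segment order, and the function name are assumptions.

```python
from typing import List, Tuple


def build_cs2_input(
    question: str, answer: str, context: str, max_len: int = 128
) -> Tuple[List[str], List[int]]:
    """Build one token sequence with a token-type id per segment.

    Segment 0 is the question, 1 the candidate answer, 2 the context.
    """
    tokens, type_ids = ["[CLS]"], [0]
    for seg_id, text in enumerate([question, answer, context]):
        for tok in text.split() + ["[SEP]"]:
            tokens.append(tok)
            type_ids.append(seg_id)
    # Naive right truncation; a real tokenizer may truncate differently.
    return tokens[:max_len], type_ids[:max_len]
```

The token-type ids are what allow the model to learn a separate segment embedding for the context, instead of treating it as part of the question or the answer.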

Results
Results of the proposed models and baselines are reported in Table 6. We use standard evaluation metrics for AS2: Precision@1 (P@1), Mean Reciprocal Rank (MRR), and hit rate at k (HIT@3). We can observe that models that do not mix question context and answer context perform the worst. This result is expected, as the answer can be context-dependent. We note that: (i) READOUT$_A$ and READOUT$_Q$ have the lowest performance, with a P@1 of 25.77% and 26.58%, respectively. This is expected, as they rely heavily on only one of the two contexts; (ii) when no context is used at all, performance is better (29.37%) and it goes up to 33.56% with Local, thanks to the information coming from the text surrounding the answer; (iii) the highest performance is achieved by the Cross-Context Titles (50.73%) and Cross-Context QA (50.03%) models, which exploit information from both question and answer contexts.
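For reference, the three ranking metrics can be computed as in the sketch below. Each inner list holds the 0/1 correctness labels of one question's candidates, already sorted by model score in descending order; this input format and the choice to average over all questions, including those with no correct candidate, are assumptions of the example.

```python
from typing import List, Sequence


def precision_at_1(rankings: Sequence[List[int]]) -> float:
    """Share of questions whose top-ranked candidate is correct."""
    return sum(1 for r in rankings if r and r[0] == 1) / len(rankings)


def mean_reciprocal_rank(rankings: Sequence[List[int]]) -> float:
    """Average of 1/rank of the first correct candidate (0 if none)."""
    total = 0.0
    for r in rankings:
        for rank, label in enumerate(r, start=1):
            if label == 1:
                total += 1.0 / rank
                break
    return total / len(rankings)


def hit_at_k(rankings: Sequence[List[int]], k: int = 3) -> float:
    """Share of questions with a correct candidate in the top k."""
    return sum(1 for r in rankings if any(l == 1 for l in r[:k])) / len(rankings)


rankings = [[1, 0, 0], [0, 0, 1], [0, 0, 0, 0]]
```

On this toy input, P@1 is 1/3, MRR is 4/9, and HIT@3 is 2/3.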
In the last two rows of Table 6, we used the rewritten questions instead of the original ones. This lets the standard AS2 model use the context, and lets us combine question rewriting with CS2 models to evaluate whether self-contained questions help. We observe that using rewritten questions yields an improvement of 4.13% when no context is provided. In contrast, in line with results in Del Tredici et al. (2021), we found that question rewriting is not useful if the context is already taken into account by other means.
To further investigate the behavior of contextual models, in Table 7 we split the dataset into questions answerable only with in-page candidates and questions answerable only with off-page candidates. No-Context and models with only the answer context perform poorly when the answer is in-page. This is because in-page candidates are extracted from the document in focus. In contrast, off-page candidates are retrieved by a lexical-based search engine, and they tend to have more overlap with the question. Thus, it is more challenging for the model to contextualize in-page answers and rank them at the top when they are correct. We can also observe that Cross-Context models are strongly tuned to answering questions with in-page candidates. However, they struggle when the context is different and, possibly, unrelated. This illustrates the challenges set by the task: (i) understanding when a question is related to the context and (ii) selecting answers that are contextually relevant.

Figure 2: Model performance on self-standing questions depending on the similarity between question and context.

Error Analysis
In this section we further analyse model behavior depending on the type of question and context.
Does model performance vary when context is not needed? Results from Table 7 highlight that models perform differently depending on the source of the answer. However, it is unclear how performance varies with self-standing questions that do not require the context to be answered.
To further investigate this, we focus on the self-standing questions in our test set. Given a question, we measure the similarity with its context, using the sentence embedding obtained with Sentence-BERT (Reimers and Gurevych, 2019). Then, based on the similarity score, we split the 1,000 (question, context) pairs into three groups by percentile: low, which contains the lowest 10%; high, which contains the highest 10%; and medium, with the remaining pairs. In Figure 2, not surprisingly, we can observe that models that do not rely on the context, i.e., No-Context and Local, are more robust to these questions. Instead, all the other models perform considerably worse. In particular, READOUT$_Q$ has the lowest performance, as it relies heavily on the question context. Finally, we observe that all the models have a lower precision when the question is less related to the context. These findings suggest that CS2 models perform poorly on questions that do not require the context, i.e., self-standing questions, and that a different modeling is required to make the best use of the context.
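The percentile split into low, medium, and high similarity groups can be sketched as follows. The function name and the handling of ties (stable sort order) are assumptions; the similarity scores would come from a sentence-embedding model, which is outside this example.

```python
from typing import Dict, List, Sequence


def split_by_similarity(
    scores: Sequence[float], tail_percent: int = 10
) -> Dict[str, List[int]]:
    """Group pair indices into the lowest/highest tail_percent and the rest."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    n_tail = len(scores) * tail_percent // 100
    return {
        "low": order[:n_tail],
        "medium": order[n_tail:len(order) - n_tail],
        "high": order[len(order) - n_tail:],
    }
```

With 1,000 pairs and a 10% tail, this yields 100 low, 800 medium, and 100 high pairs.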
How do models perform on questions based on additional knowledge? While having questions based on additional knowledge is more realistic, it is unclear what new challenges they introduce in the task. As described in Section 5.2, measuring the percentage of new entities in the questions compared to the context is a reliable proxy to spot questions based on additional knowledge. We considered all the questions in our test set that have more than one new entity, obtaining a set of 211 (question, context) pairs. Then, from the remaining set, we randomly sampled 250 grounded questions. Figure 3 shows model performance on these two sets of questions. We note that: (i) questions based on additional knowledge are more challenging for the retrieval, with 54% of them being answerable, versus 81% of the grounded questions; (ii) models have more difficulty in finding the correct answers, with a gap in P@1 with respect to grounded questions that goes from 8% with Cross-Context Titles to 16% when no context is used. Questions based on additional knowledge might be asked during information-seeking interactions, and the results show that such questions still represent a challenge for state-of-the-art models. This is a relevant finding from a practical point of view, since it provides a valuable indication for designing QA datasets that better capture realistic interactions.
How does Question Rewriting perform when using the context? Del Tredici et al. (2022) show that QR can fail when a large amount of information is required. From a manual inspection, we found this problem to be present also in QA with a context in focus. Table 8 shows some of these examples. We can note that QR not only fails by providing very long and convoluted rewrites (row 3), but also uses the context when it is not useful. For example, in rows 2 and 3 the question is self-contained, i.e., 17-Mile Drive is a scenic road in California, US, and adding emery county, utah invalidates the question. While QR helps AS2 models, our results seem to indicate that understanding when to rely on the context remains a crucial challenge in QA with a context in focus.

Conclusion
We introduced question answering with context in focus, a task where there is a context in focus and users can ask questions that can be answered from in-page sentences, i.e., from the document in focus, or from off-page sentences, i.e., from other documents. To study the task, we constructed FOCUSQA, a dataset with 12,165 unique (question, context) pairs and a total of 109,940 answers. In order to elicit questions that are more realistic than those annotators write when looking only at a given text, we proposed a new methodology that can take any existing question and automatically pair it with a context in focus. We also introduced new input sequences for CS2 models. Our experiments show their effectiveness in learning from our data, as they greatly outperform state-of-the-art models for AS2 that do not make use of context. FOCUSQA highlights challenges of modeling realistic information-seeking scenarios and invites further research into this area. For example, the retrieval struggles to extract contextually relevant passages from an open-domain index. Future research can be devoted to studying new approaches to model the context in focus in the retrieval stage. While we studied how to inject cross-contexts into QA models, learning when to use the context and when to ignore it is another future research direction to explore. Furthermore, context is limited to text (i.e., title and first paragraph) and future research may include extending the task to a multi-modal or multi-turn scenario.

Limitations
We acknowledge this work to be limited in three aspects. First, some of the candidates from ASNQ have been automatically labelled as negative. While we did apply several measures to mitigate this problem, there exists the possibility of having some false negatives among the candidates. Second, while the retrieval is not the core of this work, we did implement it as a part of the ODQA system that collects candidate sentences. However, our basic implementation of the retrieval struggles to extract contextually relevant passages. As a result, in off-page candidates there is a high percentage of negative samples. We opted for a sparse retrieval, i.e., BM25, because it is a common approach, robust on noisy web data, that can be used to efficiently index a large set of domains (e.g., Common Crawl). Nonetheless, using dense models (Shen et al., 2022) is an important direction for future work that we plan to explore to improve the efficiency of our methodology for building contextual datasets. Finally, even though our index greatly supports the scope of this work, a greater variety of websites (e.g., news websites) can help collect less entity-centric contexts.

Ethics statement
This work relies on publicly available datasets for Open-Domain Question Answering. The dataset proposed in this work can be helpful in advancing the research in QA and Conversational QA. Even with our best efforts to ensure the quality of the content, answers extracted from web pages may have a biased view, for example political opinions. The work does not propose models that can generate harmful or toxic content. Our models are fine-tuned based on checkpoints downloaded from HuggingFace (Wolf et al., 2020).

A Data Collection Details
A.1 ODQA System for AS2
We implemented a standard ODQA system for AS2 that computes answers in three phases: (i) text retrieval, which returns relevant documents for a question from a large text collection; (ii) text ranking, which reranks and decomposes text into answer candidates, e.g., sentences; and (iii) answer sentence selection (QA), which selects the final answer for a question from the list of candidates. In line with similar end-to-end systems for QA (Yang et al., 2019), we implemented text retrieval with a BM25 ranking function. To split the text into sentences, we used an off-the-shelf sentence splitter (Manning et al., 2014). Sentences are ranked using a state-of-the-art Transformer-based model for AS2 (Garg et al., 2020). Text retrieval is done on a standard index using Lucene/Elasticsearch. As a framework for text ranking, we used HuggingFace (Wolf et al., 2020). The tokenizer is set to truncate to a maximum length of 128 tokens, removing one token at a time from the longest sequence in the input.
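The truncation behavior described above, repeatedly removing a token from the longest sequence until the input fits, can be sketched in plain Python. This is an illustration of the strategy only: it ignores special tokens, works on pre-tokenized lists, and the function name is hypothetical.

```python
from typing import List, Sequence


def truncate_longest_first(
    seqs: Sequence[Sequence[str]], max_len: int = 128
) -> List[List[str]]:
    """Drop trailing tokens from the currently longest sequence until the
    total number of tokens is at most max_len."""
    out = [list(s) for s in seqs]
    while sum(len(s) for s in out) > max_len:
        longest = max(range(len(out)), key=lambda i: len(out[i]))
        out[longest].pop()
    return out
```

Under this strategy a short question is left intact while a long candidate or context absorbs the truncation.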

A.2 Document Collection
The index is built using a large collection of Web data, i.e., documents. This resource allows us to measure the impact of our work in an industry-scale ODQA setting. We selected English Web documents from the 5,000 most popular domains, including Wikipedia, from releases of Common Crawl in 2019 and 2020. This process produced a collection of ∼100M documents. Each document in the index contains a URL, the title of the page, and the content of the document, after removing all the HTML tags.

A.3 Pre-processing and Filtering
When collecting answer candidates, to optimize the annotation cost, we rely on the model score and we send for annotation only sentences with score > 0.25. We discarded questions where: (i) we have fewer than 3 contexts, i.e., we do not have at least three answers with score > 0.25 to use for context pairing; (ii) we have fewer than 3 candidate answers to annotate from in-page; and (iii) we have fewer than 3 candidate answers to annotate from cross-page. To obtain the first paragraph of the page, we split based on a double newline character. Then, we remove the title (if present) from the text and we truncate the text after 40 words. We discarded a context if it has fewer than 10 words. We always retrieved 100 documents from the index. To collect in-page candidates we used k = 3. For cross-page candidates, we use k = 10 for training examples, and k = 15 for testing examples. To sample question/context pairs from ASNQ, we filter out pairs with fewer than 10 or more than 25 candidates. In retrieval, we limited the index to only Wikipedia pages. In this way, we reduce the possibility of retrieving pages that are similar to the answer page. Because those candidates for ASNQ are automatically labelled as negative, there is still the possibility of having false negatives.
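The first-paragraph extraction can be sketched as below, following the steps in the text: split on a double newline, remove the title, truncate to 40 words, and discard contexts shorter than 10 words. Treating "remove the title" as stripping a leading occurrence is an assumption, as are the function and parameter names.

```python
from typing import Optional


def extract_context_paragraph(
    page_text: str, title: str, max_words: int = 40, min_words: int = 10
) -> Optional[str]:
    """Return the truncated first paragraph, or None if it is too short."""
    first = page_text.strip().split("\n\n", 1)[0]
    # Assumption: the title, when present, opens the first paragraph.
    if title and first.startswith(title):
        first = first[len(title):]
    words = first.split()[:max_words]
    if len(words) < min_words:
        return None
    return " ".join(words)
```

Returning `None` for short paragraphs makes the discard rule explicit to the caller, who can then skip the context.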

A.4 Dataset Composition
The dataset is built by aggregating publicly available data from existing repositories and QA datasets. Questions are sourced from NATURALQUESTIONS (Kwiatkowski et al., 2019) and QReCC (Anantha et al., 2021). Answers are extracted from web pages contained in the Common Crawl index. To create the subset of context-independent question/context pairs, we sample data from the ASNQ dataset (Garg et al., 2020). For each entry, we will release the annotation labels, the question and answer document ids, and an identifier for the question that allows matching it with the source dataset. We used the MD5 hashing function from the Python package hashlib.
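A possible way to derive such an identifier with hashlib is sketched below. The text does not state which string is hashed or whether it is normalized first; hashing the UTF-8 bytes of the raw question text, and the function name `question_id`, are assumptions.

```python
import hashlib


def question_id(question_text):
    """Hex MD5 digest of the question text (assumed input to the hash)."""
    return hashlib.md5(question_text.encode("utf-8")).hexdigest()


qid = question_id("when will the capsule be opened?")
```

The same question text always maps to the same 32-character hexadecimal string, so entries can be matched against the source dataset by recomputing the hash there.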

A.5 Annotation Guidelines and Tooling
All the annotation tasks in this paper were performed by a team of professional annotators with a project lead. Annotation was performed using a custom annotation interface based on the annotation guidelines (Figure 4).
Guidelines. We provided annotators with guidelines to follow during the process. The guidelines describe the two annotation steps: question/context validation and answer collection. For each step and substep, we provided extensive examples, covering edge cases as much as possible. When evaluating whether a question is context-connected, annotators also check whether the question/context pair presents grammatical inconsistencies (e.g., pronouns in the question that are inconsistent with the person of the entities in the context) and whether the question is about possible facts. The guidelines were designed iteratively over three pilots. For the pairs in each pilot, two authors of the paper conducted a separate annotation that was used to assess the quality of the pilot. We manually reviewed the annotations and discussed all the problems found in the process with the annotation project lead. We improved the guidelines according to their feedback and ran the annotation at scale once we obtained an agreement greater than 90% with our manual annotations.
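The agreement check used to decide when to scale up can be sketched as below. The text does not specify the agreement measure; raw percent agreement (the fraction of items with identical labels) is assumed here, and the labels in the example are toy values, not data from the pilots.

```python
def percent_agreement(labels_a, labels_b):
    """Fraction of items on which two annotations assign the same label."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both annotations must cover the same items")
    if not labels_a:
        raise ValueError("at least one item is required")
    matches = sum(a == b for a, b in zip(labels_a, labels_b))
    return matches / len(labels_a)


# Toy labels for four pairs: annotators vs. authors (illustrative only).
agreement = percent_agreement(
    ["valid", "invalid", "valid", "valid"],
    ["valid", "invalid", "invalid", "valid"],
)
ready_to_scale = agreement > 0.90
```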
Tooling. The frequent ambiguity makes the task hard even for humans; it is difficult to know whether a candidate sentence can answer the question without considering additional contextual information. For this reason, annotators are provided with the context and additional text from the document containing the answer, i.e., the title and the paragraph containing the answer (Figure 4). Annotators are trained on the specific task before starting the annotation. The process is divided into two conceptual stages: (i) question/context pairing, and (ii) answer collection. First, annotators have to validate a question/context pair. Then, only if the pair is valid, they are asked to annotate the candidate answers. At any moment in the annotation process, annotators are allowed to use a commercial search engine to clarify the content of questions, contexts, and answers.

Quality Control
To ensure high-quality annotations at scale, the process was constantly monitored by the annotation project lead. In addition, to support annotators during the process, the UI allowed them to write comments whenever a clarification was needed. Each comment was reviewed by the project lead, who reported to us any problem found during the process. We requested a reannotation of the pairs in case of errors.

A.6 Annotation Efficiency
We investigated the efficiency of this approach in terms of annotation cost. After sampling questions from existing datasets, we pair them with contexts that are retrieved using the procedure described in Section 4.1. This procedure can lead to invalid pairs, which are discarded through manual annotation. From the human annotations, we observed that: (i) 13% of the questions are discarded because they are context-independent; and (ii) 18% of the question/context pairs are discarded because they are context-connected or context-answered. However, based on the quote provided by our annotation provider for manually formulating questions, we found that the overall annotation cost is 31% lower when using our approach. This is because manual question sourcing is more expensive than validating question/context pairs.

B Model Implementation Details
Question Rewriting. Question rewriting is a popular approach in Conversational QA to directly encode relevant information into the question without explicitly using contextual models. To obtain the rewrites, we implemented the generative model proposed by Anantha et al. (2021), which rewrites using the concatenation of the question and the context. We use a T5-base model available in HuggingFace (Wolf et al., 2020) and we trained it on the QReCC dataset for 5 epochs, with a batch size of 4, 500 warm-up steps, and a learning rate of 5e-5.
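For reference, the rewriter settings can be collected in a small configuration object, together with a helper that builds the model input. The hyperparameter values are those reported above; the class and function names, the order of the concatenation, and the separator string are assumptions, since the text only states that the question and the context are concatenated.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RewriterConfig:
    """Hyperparameters reported for the question rewriter."""
    model_name: str = "t5-base"
    train_dataset: str = "QReCC"
    epochs: int = 5
    batch_size: int = 4
    warmup_steps: int = 500
    learning_rate: float = 5e-5


def build_rewriter_input(question, context, separator=" ||| "):
    """Concatenate question and context (order and separator are assumed)."""
    return f"{question.strip()}{separator}{context.strip()}"


config = RewriterConfig()
example = build_rewriter_input(
    "How was Santana experimenting?",
    "Fredo Santana. Derrick Coleman, known professionally as Fredo Santana.",
)
```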
AS2 and CS2 Model Training. Since ASNQ consists of open-domain questions, each associated with a single Wikipedia page, only answer-level contexts can be extracted (document title, document content, or local context). In the case of the No-Context, Local, and READOUT_A models, the same context can be tuned on ASNQ first and on FOCUSQA subsequently. The other contexts cannot be directly trained on ASNQ, as the dataset does not contain that type of contextual information. To alleviate this issue, we adopted models with conceptually similar contexts when adapting from ASNQ to the target domain. We adapted READOUT_Q from READOUT_A, Cross-Context Titles from Local, and Cross-Context QA from a standard non-contextualized model. We framed these contexts into the CS2 framework (https://github.com/alexa/wqa-contextual-qa) proposed in Lauriola and Moschitti (2021). Similarly to existing CS2 solutions, these models are implemented through a multi-sentence Transformer that uses multiple token-type embeddings for each encoded text.
During the two fine-tuning stages, models were trained with (i) binary cross-entropy loss, (ii) the Adam optimizer, (iii) a batch size of 768, (iv) a triangular learning-rate scheduler whose peak comes after 0.15 epochs, and (v) a maximum of 15 epochs. The development set was used to stop training early when P@1 decreased for two consecutive epochs, and to select the learning rate among {1e-6, 5e-6, 1e-5, 5e-5}. We set a maximum sequence length of 256 for No-Context and Local, 320 for READOUT, and 512 for Cross-Context.
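The scheduler and the early-stopping rule can be sketched as follows. The peak position (0.15 epochs), the maximum of 15 epochs, and the learning-rate grid come from the text; the exact shape of the schedule (linear warm-up from zero, then linear decay to zero at the last step) and the precise stopping criterion (each of the last two epochs scoring below the best earlier P@1) are assumptions, as are the function names.

```python
LEARNING_RATES = [1e-6, 5e-6, 1e-5, 5e-5]  # grid searched on the dev set


def triangular_lr(step, peak_lr, steps_per_epoch, peak_epoch=0.15, max_epochs=15):
    """Linear warm-up to peak_lr, then linear decay to zero (assumed shape)."""
    peak_step = peak_epoch * steps_per_epoch
    total_steps = max_epochs * steps_per_epoch
    if step <= peak_step:
        return peak_lr * step / peak_step
    return peak_lr * max(0.0, (total_steps - step) / (total_steps - peak_step))


def should_stop(p_at_1_history, patience=2):
    """True if each of the last `patience` epochs is below the earlier best."""
    if len(p_at_1_history) <= patience:
        return False
    best_before = max(p_at_1_history[:-patience])
    return all(value < best_before for value in p_at_1_history[-patience:])


# Toy dev-set P@1 values per epoch (illustrative only, not reported results).
stop_now = should_stop([0.50, 0.60, 0.59, 0.58])
```

With 100 steps per epoch, for example, the learning rate would peak at step 15 and reach zero at step 1500.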
Starting from checkpoints trained on ASNQ, we repeated the fine-tuning (and model selection) step on FOCUSQA 3 times with 3 different random seeds. Finally, we computed the average and standard deviation of P@1 and the other metrics. The training on ASNQ was done once per context type due to the size of the dataset and the associated training cost.
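As a small illustration of how the reported numbers can be obtained, the sketch below computes P@1 (the fraction of questions whose top-scored candidate is correct) and aggregates it over seeds. The data layout and function names are hypothetical, the values are toy numbers rather than results from this work, and the sample standard deviation is assumed because the text does not say which estimator was used.

```python
from statistics import mean, stdev


def precision_at_1(predictions):
    """predictions maps a question id to a list of (score, is_correct)."""
    hits = 0
    for candidates in predictions.values():
        _, top_label = max(candidates, key=lambda c: c[0])
        hits += int(top_label)
    return hits / len(predictions)


def summarize(values):
    """Mean and sample standard deviation across random seeds."""
    return mean(values), stdev(values)


toy_predictions = {
    "q1": [(0.9, 1), (0.2, 0)],
    "q2": [(0.4, 0), (0.7, 0), (0.1, 1)],
}
p_at_1 = precision_at_1(toy_predictions)
avg, std = summarize([0.5, 0.6, 0.7])  # toy P@1 values for three seeds
```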

Figure 1: Example of a free information-seeking interaction with a QA system.

Figure 3: Model performance on a set of grounded questions and questions based on additional knowledge.

Figure 4: Screenshot of the annotation tooling for (a) Question/Context validation and (b) answer collection.

Table 1: Example of a question with answers that should be chosen depending on the context.

Table 3: Sources of answerable question/context pairs. The percentage is computed with respect to the total number of answerable pairs.

Table 4: Questions with expertise: P are our questions; new entities are underlined; M are questions by human annotators.
Tony Berg. Anthony Rains "Tony" Berg (born October 21, 1954 in Connecticut) is an American musician, record producer, and A&R representative, in which role he has been described as an "industry guru".
Were there many casualties? What Countries Fought in World War II? | Reference.com. The countries that fought in World War II were Germany, Italy and Japan, which comprised the Axis Powers, and Britain, France, Australia, Canada, New Zealand, India, the Soviet Union, China and the United States of America, which comprised the Allies.
When will the capsule be opened? List of time capsules. This is a list of time capsules. The register of The International Time Capsule Society estimates there are between 10,000 and 15,000 time capsules worldwide. An active list of Time Capsules is maintained by the Not Forgotten Digital Preservation Library.
How was Santana experimenting? Fredo Santana. Derrick Coleman (July 4, 1990 - January 19, 2018), known professionally as Fredo Santana, was an American rapper from Chicago, Illinois.

Table 5: Example of an answerable question/context pair with answers found in different sources.
Note that Cross-Context models exploit extra information coming from both answer and question documents (cross-documents), whereas READOUT uses a single document depending on the specialization.
• READOUT_A, where the model encodes the document from which the candidate answer sentence is extracted and ignores the context of the question.
Cross-Context. Documents can be very long, and their ingestion into a Transformer model can dramatically increase latency in both training and inference. To overcome these limitations, we propose to model both the question and the answer context, i.e., the title and first paragraph from the screen and from the answer page, using more compact text.
• Cross-Context Titles, where titles from the context and the answer page are used. The information is encoded into a Transformer model as [CLS] question [SEP] answer [SEP] context title [SEP] answer document title [EOS].
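The Cross-Context Titles input layout can be sketched as below. The segment order and the special tokens follow the template given above; whitespace tokenization, the function name, and the assignment of one token-type id per segment (with each closing separator sharing the id of its segment) are simplifying assumptions, not the tokenizer or embedding scheme actually used.

```python
def encode_cross_context_titles(question, answer, context_title, answer_doc_title):
    """Return (tokens, token_type_ids) for the four-segment input."""
    segments = [question, answer, context_title, answer_doc_title]
    tokens, type_ids = ["[CLS]"], [0]
    for seg_id, segment in enumerate(segments):
        words = segment.split()  # whitespace tokens stand in for subwords
        tokens.extend(words)
        type_ids.extend([seg_id] * len(words))
        # The last segment is closed by [EOS], the others by [SEP].
        closing = "[EOS]" if seg_id == len(segments) - 1 else "[SEP]"
        tokens.append(closing)
        type_ids.append(seg_id)
    return tokens, type_ids


tokens, type_ids = encode_cross_context_titles(
    "Where was she born?",
    "She was born in New York.",
    "Lady Gaga",
    "Lady Gaga - Wikipedia",
)
encoded_text = " ".join(tokens)
```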

Table 8: Examples of errors when using the context to rewrite questions.