RobustQA: Benchmarking the Robustness of Domain Adaptation for Open-Domain Question Answering

Open-domain question answering (ODQA) is a crucial task in natural language processing. A typical ODQA system relies on a retriever module to select relevant contexts from a large corpus for a downstream reading comprehension model. Existing ODQA datasets consist mainly of Wikipedia corpus and are insufficient for studying models' generalizability across diverse domains, as models are trained and evaluated on the same genre of data. We propose RobustQA,1 a novel benchmark consisting of datasets from 8 different domains, which facilitates the evaluation of ODQA's domain robustness. To build RobustQA, we annotate QA pairs in retrieval datasets with rigorous quality control. We further examine improving QA performance by incorporating unsupervised learning methods with target-domain corpora and by adopting large generative language models. These methods can effectively improve model performance on RobustQA. However, experimental results demonstrate a significant gap from in-domain training, suggesting that RobustQA is a challenging benchmark for evaluating ODQA domain robustness.


Introduction
Open-domain question answering (ODQA) is a crucial and practical NLP task. Unlike the traditional reading comprehension task, where contexts are provided for a QA pair, in ODQA a retriever first needs to extract relevant passages from a large collection of documents; a QA model then provides answers based on these passages. Due to the magnitude of the corpus, it is computationally prohibitive for a QA model to read through all documents. ODQA has therefore become a popular research topic and is widely adopted in real-world applications.

1 Datasets and their processing code can be found here: https://github.com/rujunhan/RobustQA-data

FiQA-Finance
Question: Why do investors buy stock that had appreciated?
Document: Imagine how foolish the people that bought Apple at $100 must have felt. It was up tenfold for the $10 it traded at just years prior, how could it go any higher? Stocks have no memory. A stock's earnings may grow and justify the new higher price people are willing to pay. When FB came public, I remarked how I'd analyze the price and felt it was overvalued until its earnings came up. Just because it's gone down ever since, doesn't make it a buy, yet.

LoTTE-Lifestyle
Question: What techniques, tricks or otherwise have you used to get upgrades on flights?
Document: ... but I think the best way to get upgraded is to fly a lot with the airline. Generally when the flight's overbooked in one class, and they're trying to pick which person to upgrade, frequent flyer status is the first metric they use. The higher your status, the higher up the list you go! Having a high status with a partner airline can work too, high tiers with a partner airline usually comes below the airline's own frequent flyers, but above everyone else. Otherwise, if you're flying on your own that'll help in the event that there aren't enough frequent flyer to upgrade to free the required number of seats! Offering to pay may be an option too - if they're pretty full they may offer you a low price to upgrade.

Table 1: Examples in RobustQA. Highlighted text spans are the precise answers that our annotators need to identify. Different from NQ, our texts are diverse, with more challenging question-answer pairs.

The practicality of ODQA necessitates the evaluation of systems' out-of-domain (OOD) performance, because a real-world system needs to be robust when confronting domain drift. Moreover, existing state-of-the-art (SOTA) ODQA systems (Karpukhin et al., 2020; Santhanam et al., 2022) are based on neural networks, which are known to overfit training data and suffer from degradation when the domain changes. For example, Natural Questions (Kwiatkowski et al., 2019, NQ) is the most commonly used ODQA dataset, but a recent study shows that there is a significant amount of overlap between its train and test sets. This partially explains why neural models trained only on NQ can struggle in unseen domains (Lewis et al., 2021).
However, evaluating OOD performance for ODQA is currently not feasible in the research community due to the lack of a public multi-domain benchmark. Existing popular ODQA datasets such as NQ and TriviaQA (Joshi et al., 2017) rely on Wikipedia or Web documents. Multi-domain evaluation datasets exist separately for each component.

Open-Domain QA

We briefly review ODQA in this section. Following the extractive QA set-up in the DPR paper (Karpukhin et al., 2020), we denote a collection of documents as D. We split each document d_i ∈ D into passages of a fixed length of N tokens, and obtain a collection of M (≥ |D|) passages denoted as C = {p_1, p_2, ..., p_m, ..., p_M}. Denoting a token as w, a passage can be defined as p_m = {w_{m,n} | 0 ≤ n < N}, with p_m ∈ C.
We denote a question as q and a passage retriever as R. The task is to select the K most relevant passages for q from C; formally, R(q, C) → C_q. Upon receiving the K passages C_q, a QA model predicts the most probable text span A_q = {w_{m,s:t} | w_{m,s}, ..., w_{m,t} ∈ p_m, p_m ∈ C_q} that can answer the question. We also test generative models for our task, but it is crucial to note that we remain in the extractive QA setting, as generated texts are evaluated against ground-truth answers contained in the contexts.
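The retrieve-then-read pipeline described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the `Passage` type, the `retriever` callable R(q, C) → C_q, and the `span_scorer` (a stand-in for a trained reader that returns a best span (s, t) and its score for one passage) are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

@dataclass
class Passage:
    pid: int
    tokens: List[str]  # at most N tokens after document splitting

# A retriever maps (question, corpus, K) to the top-K passages: R(q, C) -> C_q
Retriever = Callable[[str, List[Passage], int], List[Passage]]
# A span scorer returns (start index s, end index t, score) for one passage
SpanScorer = Callable[[str, Passage], Tuple[int, int, float]]

def extract_answer(question: str, corpus: List[Passage],
                   retriever: Retriever, span_scorer: SpanScorer,
                   k: int = 5) -> str:
    """Select the K most relevant passages, then return the
    highest-scoring token span across them as the answer."""
    candidates = retriever(question, corpus, k)
    best = max((span_scorer(question, p) + (p,) for p in candidates),
               key=lambda t: t[2])
    s, t, _, passage = best
    return " ".join(passage.tokens[s:t + 1])
```

A toy word-overlap retriever and scorer are enough to exercise the pipeline; real systems plug in DPR/BM25/ColBERTv2 and a neural reader here.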
In real-world applications, a trained ODQA model may encounter domain changes including token distribution shifts in C, or type and length changes in questions and answers. Therefore, it is crucial to gauge domain robustness. However, a comprehensive evaluation benchmark does not exist, which we propose to address with RobustQA.

Data Creation
Existing public ODQA data mostly leverage Wikipedia (Kwiatkowski et al., 2019; Joshi et al., 2017) as a search corpus and focus on factoid questions. To benchmark model robustness across a wider range of text genres and question types, we 1) annotate 6 datasets in the finance, lifestyle, recreation, technology, science, and writing domains based on FiQA (Maia et al., 2018) and LoTTE (Santhanam et al., 2022); 2) adapt two publicly available ODQA datasets, SearchQA (Dunn et al., 2017) and BioASQ (Tsatsaronis et al., 2015). The newly annotated data contain a significant portion of challenging reasoning-type questions that cannot be answered with entities or short phrases. Our data samples can be found in Table 1, Tables 11-13, and Table 16 in the appendix.

Annotated Data
We describe the data annotation process for the new domains: finance, lifestyle, recreation, technology, science, and writing, based on FiQA and LoTTE, both of which are IR datasets with no precise answer spans annotated in the retrieved supporting documents.
As shown in Table 1, relevant documents are retrieved from the corresponding IR systems. We present a question and its relevant documents to the annotators, who need to identify up to 3 concise text spans (conciseness) from the passages that are most appropriate for answering the given question (validity). Note that we do not concatenate different answer spans. Rather, we treat each annotated span as an individual answer, similar to the practice in NQ and BioASQ. We also provide detailed guidelines on how to judge conciseness and validity (details in Appendix A.1). Next, we describe specific features of FiQA and LoTTE.

Figure 1: An illustration of our data quality control procedure, which starts with data experts sharing instructions and raw data with the annotators. Data experts constantly audit the annotations and conduct a final validation to ensure data quality.
FiQA contains the task "Opinion-based QA over financial data," which aims at answering finance-related questions from financial corpora such as microblogs, reports, and news. However, the answers provided in the original dataset are documents rather than precise text spans. Moreover, its test set does not include answer passages. Therefore, we have annotators examine the original training set with all of its relevant passages. After filtering out samples with no precise answer spans, we obtain 3,669 questions.

BioASQ's ODQA task (Task 1b) consists of four types of questions: 1) yes/no, 2) factoid, 3) list, and 4) summary. We consider only 2) and 3), as they are suitable for our extractive QA task. After discarding no-answer questions, we acquire 1,956 questions.

Quality Control
Fig. 1 illustrates our data annotation and quality control procedure, which starts with the data experts (including co-authors) sending the annotation guidelines and raw data to the annotators. The annotation guidelines can be found in Appendix A.1. Upon receiving annotated data, the data experts randomly select 10% of the annotations to audit. If the selected samples fall below the 90% validity requirement, they are sent back to the annotators for re-annotation. The process repeats until the randomly selected samples pass the 90% threshold.
To ensure annotator quality, we hire professional data providers. Based on the information shared with us, the data provider team consists of more than 20 data professionals, each paid more than 15 U.S. dollars per hour. The data expert team consists of co-authors and 10 additional internal data professionals. Note that due to the high cost of hiring data professionals, we were not able to provide multiple annotations per sample, and thus could not explicitly compute inter-annotator agreement. However, the 90% passing threshold we impose in the process guarantees a high annotation satisfaction rate, and thus ensures good data quality.

Data Statistics and Analysis
In this section, we compare data in different domains. Particularly, we want to highlight the drift of data distribution from NQ.
Passages. Following the DPR paper (Karpukhin et al., 2020) for passage pre-processing, we split documents into passages of at most 100 contiguous tokens.2 Passage numbers for each domain can be found in Table 2. We observe that BioASQ (biomedical) and SearchQA (Web search) contain numbers of passages on the same order of magnitude as NQ. The other newly annotated datasets all have relatively smaller collections of passages, which is common in many real-world applications.
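The fixed-length splitting described above can be sketched as follows; this is a minimal token-level illustration of the DPR-style pre-processing, ignoring tokenizer details.

```python
def split_into_passages(document_tokens, max_len=100):
    """Split a tokenized document into consecutive passages of at most
    max_len tokens, DPR-style. The last passage may be shorter."""
    return [document_tokens[i:i + max_len]
            for i in range(0, len(document_tokens), max_len)]
```

Applying this over every document in D yields the passage collection C with M ≥ |D| passages.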
Questions. Table 3 shows the length of questions, and Fig. 3 shows the types of questions across different domains. We can see that BioASQ is relatively similar to NQ: they share similar question lengths and contain mostly factoid questions. However, the other RobustQA datasets have longer questions than NQ, and they tend to ask reasoning-type questions such as "how to ...?", "how do ...?", and "why does ...?" (see details in Appendix A.4). These questions also concern long-tail topics that might not be covered by an entity-centric knowledge base like Wikipedia (Santhanam et al., 2022).

Answers. As mentioned above, except for SearchQA, the RobustQA datasets contain longer answers due to the nature of reasoning-type questions. Our questions also have more individual answers, a consequence of compiling answers from multiple relevant passages during data creation. For example, if a question has two supporting passages, each containing three unique answers, the question has six answers in total.

Passage Retrieval
In this section, we describe the retrievers adopted in this work. We benchmark five leading retrievers on RobustQA.
DPR adopts a bi-encoder architecture that encodes a question and a passage independently. The passage relevancy score is calculated with a vector similarity measure such as the dot product. We use the best checkpoint (trained on NQ) provided by the DPR paper (Karpukhin et al., 2020).
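The bi-encoder scoring step can be sketched with plain dot products over pre-computed embeddings. The embeddings themselves would come from the two encoders (omitted here); only the similarity/ranking logic is shown.

```python
import numpy as np

def dpr_rank(q_vec: np.ndarray, passage_matrix: np.ndarray, k: int = 5):
    """Bi-encoder relevance: dot product between one question embedding
    (d,) and independently pre-computed passage embeddings (M, d).
    Returns the indices of the top-k passages and all scores."""
    scores = passage_matrix @ q_vec          # (M,) relevancy scores
    top_k = np.argsort(-scores)[:k]
    return top_k, scores
```

Because passage embeddings are question-independent, they can be indexed once and reused for every query, which is what makes the bi-encoder design efficient at scale.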
BM25 is a widely used sparse retriever, which matches keywords efficiently with an inverted index and can be seen as representing the question and context as weighted, high-dimensional sparse vectors. We use the BM25 implementation provided by the BEIR paper (Thakur et al., 2021).
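For concreteness, a minimal Okapi BM25 scorer is sketched below. This is an illustrative from-scratch version, not the BEIR implementation the paper uses, and it scores documents directly rather than through an inverted index.

```python
import math
from collections import Counter

def bm25_scores(query_terms, corpus_tokens, k1=1.5, b=0.75):
    """Minimal Okapi BM25: one relevance score per tokenized document.
    Term weight = IDF * saturated term frequency, normalized by
    document length relative to the average length."""
    N = len(corpus_tokens)
    avgdl = sum(len(d) for d in corpus_tokens) / N
    df = Counter()                       # document frequency per term
    for d in corpus_tokens:
        df.update(set(d))
    scores = []
    for d in corpus_tokens:
        tf = Counter(d)
        s = 0.0
        for t in query_terms:
            if t not in tf:
                continue
            idf = math.log(1 + (N - df[t] + 0.5) / (df[t] + 0.5))
            s += idf * tf[t] * (k1 + 1) / (
                tf[t] + k1 * (1 - b + b * len(d) / avgdl))
        scores.append(s)
    return scores
```

In production, the same weights are stored in an inverted index so only documents containing a query term are touched, which is why BM25 remains so efficient.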
BM25+CE incorporates a cross-encoder (CE) architecture as a passage re-ranker over the BM25 results. However, Reimers and Gurevych (2019) point out that the CE architecture is computationally expensive, as it requires both the question and the passage to be fed into a language model for encoding. For this reason, in the BEIR paper the authors input only the top 100 passages returned by BM25 into the CE for re-ranking, and we follow this practice. The original CE model is trained on MS MARCO (Bajaj et al., 2016); we also re-train a CE from scratch using NQ only, and denote this model as BM25+CE_NQ.
ColBERTv2 (Santhanam et al., 2022). ColBERT was initially proposed by Khattab and Zaharia (2020), using late interaction to decompose relevance modeling into token-level computations, which improves the expressivity of query-document matching but drastically increases storage requirements. ColBERTv2 alleviates this issue with a residual compression mechanism and improves retriever quality by distilling from a cross-encoder with hard-negative mining. We use the best checkpoint (trained on MS MARCO) provided by the paper.
Atlas (Izacard et al., 2022b) jointly trains Contriever (Izacard et al., 2022a), a dense retriever with a bi-encoder architecture, and T5 (Raffel et al., 2019), a sequence-to-sequence language model, as the reader. Since we adopt Atlas as one of the open-domain QA baseline models (Sec. 5), we report its retriever performance here.

Open-domain QA
To anchor our QA models against a widely tested baseline, we adopt the extractive QA model architecture used in the DPR paper. We further investigate whether we can improve the extractive QA model's OOD generalization by pretraining base models on unlabeled target-domain corpora. As large language models (LLMs) are gaining research popularity, we also benchmark Flan-T5 (Chung et al., 2022) as a baseline. We additionally test a method that jointly trains LLMs with a dense retriever (Izacard et al., 2022b), which is expected to perform more strongly than standalone QA models.

Extractive QA Model
For the baseline extractive QA model, we strictly follow the training objective of the DPR paper:3

P_start,i(s) = softmax(P_i w_start)_s,
P_end,i(t) = softmax(P_i w_end)_t,
P_selected(i) = softmax(P̂ w_selected)_i,

where P_i denotes the last encoded hidden layer of the i-th passage from BERT (Devlin et al., 2019), P̂ stacks the [CLS] representations of all passages, and w_start, w_end, w_selected are learnable vectors. The training objectives consist of two scores: 1) the span score of the s-th to t-th tokens, computed as P_start,i(s) × P_end,i(t); 2) the passage selection score P_selected(i).
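At inference time, the span score above is maximized over valid (s, t) pairs. A brute-force sketch with hypothetical per-token probabilities:

```python
def best_span(start_probs, end_probs, max_span_len=16):
    """Return (s, t, score) maximizing P_start(s) * P_end(t) subject to
    s <= t and a maximum span length, as in DPR-style extractive reading.
    Inputs are per-token start/end probabilities for one passage."""
    best = (0, 0, -1.0)
    n = len(start_probs)
    for s in range(n):
        for t in range(s, min(s + max_span_len, n)):
            score = start_probs[s] * end_probs[t]
            if score > best[2]:
                best = (s, t, score)
    return best
```

Real implementations vectorize this search, but the constraint structure (s ≤ t, bounded length) is the same.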

Pretraining with Target Corpus
Pre-training with a target corpus/task has been shown to help language models (LMs) adapt more effectively to unseen domains (Lee et al., 2020; Liu et al., 2020; Han et al., 2021; Garg et al., 2019; Zhou et al., 2021). Here, we are interested in the setting where only a small number of passages is available for pre-training. Specifically, we conduct second-step pre-training on BERT for the extractive QA model mentioned above. The target corpus has no QA annotations and consists of a small fraction of all available target-domain text data. Finally, we fine-tune the target-pretrained LMs on in-domain data before testing them on the target-domain data in RobustQA.
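The unlabeled second-step pre-training above is masked-language-model training over target-domain passages. A minimal sketch of the masking step (the full BERT recipe's 80/10/10 replacement scheme is omitted for brevity):

```python
import random

def mask_tokens(tokens, mask_token="[MASK]", mask_prob=0.15, seed=0):
    """BERT-style masking over an unlabeled target-domain passage: each
    selected position becomes a prediction target for MLM pre-training.
    Returns the masked sequence and per-position labels (None = no loss)."""
    rng = random.Random(seed)
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            masked.append(mask_token)
            labels.append(tok)        # model must recover the original token
        else:
            masked.append(tok)
            labels.append(None)       # position excluded from the loss
    return masked, labels
```

Feeding such (masked, labels) pairs from the sampled target corpus to BERT, then fine-tuning on NQ, is the PT→FT recipe evaluated in the experiments.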
Besides target-corpus pretraining, we also experimented with other unsupervised domain adaptation methods such as contrastive loss based on Long et al. (2022). Since its improvements are relatively marginal, we briefly describe it and report results in Appendix A.5.

Prompt Finetuning with LLMs
Recently, LLMs have shown impressive performance on a variety of NLP tasks. Here, we also test LLMs' ability on ODQA by fine-tuning one of the SOTA open-source LLMs, Flan-T5-xxl with 11B parameters, on the same open-domain NQ dataset used to train the extractive QA model. The prompt template used during fine-tuning and inference is shown in Table 4. During fine-tuning, the Instruction is fixed for all training samples. Question and Answer pairs are from the NQ data. Passages 1-5 are retrieved by DPR. During inference, the same template is used for each test domain, but the passages are retrieved by ColBERTv2 (see more details in Sec. 6).

3 https://github.com/facebookresearch/DPR
Instruction: provide an answer to the question in the given passages.
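The template can be assembled as in the sketch below. Only the instruction string is quoted from Table 4 above; the exact field ordering and labels are assumptions for illustration.

```python
def build_prompt(question, passages,
                 instruction="provide an answer to the question in the given passages."):
    """Assemble a Flan-T5 style ODQA prompt: a fixed instruction, up to
    five retrieved passages, then the question (field layout assumed;
    only the instruction text comes from Table 4)."""
    lines = [f"Instruction: {instruction}"]
    lines += [f"Passage {i + 1}: {p}" for i, p in enumerate(passages[:5])]
    lines.append(f"Question: {question}")
    lines.append("Answer:")
    return "\n".join(lines)
```

At inference, the same builder is reused per domain with ColBERTv2's retrieved passages substituted in.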

Joint Training LLM with Retriever
Jointly training LLMs with dense retrievers has been shown to be an effective method for retrieval-based tasks (Izacard and Grave, 2021; Lewis et al., 2020). One of the most recent efforts, Atlas (Izacard et al., 2022b), improves upon previous work by jointly pre-training T5-based models with Contriever on a large corpus using various objective functions. Atlas achieves impressive zero-/few-shot learning performance on open-domain QA. Therefore, we report its results as a strong modeling baseline on RobustQA. As shown in Table 5, we use HIT@5 as the primary reporting metric.4
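For reference, the HIT@K retrieval metric used throughout can be computed as below; here "gold" passages are assumed to be the answer-bearing passages for a question.

```python
def hit_at_k(retrieved_ids, gold_ids, k=5):
    """HIT@K for one question: 1.0 if any of the top-k retrieved passage
    ids is a gold (answer-bearing) passage, else 0.0."""
    return float(any(pid in gold_ids for pid in retrieved_ids[:k]))

def mean_hit_at_k(all_retrieved, all_gold, k=5):
    """Average HIT@K over a set of questions."""
    return sum(hit_at_k(r, g, k) for r, g in zip(all_retrieved, all_gold)) / len(all_gold)
```

HIT@20 and HIT@100 (reported in the appendix tables) are the same computation with a larger k.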

Open-domain QA
Following the single-dataset setting of the QA model in the DPR paper, we use NQ as the in-domain training data, with passages retrieved from the best DPR checkpoint. This choice is justified for two reasons: 1) we want a training setting fully comparable to the baseline QA model; 2) according to Table 5 and Tables 14-15 in the appendix, DPR achieves the overall best in-domain retrieval results, suggesting that DPR retrieves good-quality passages for in-domain training.

Table 5: Passage retrieval performance based on HIT@5. Atlas uses the Contriever (Izacard et al., 2022a), which has a fixed model size for both Atlas-base and Atlas-xxl; "base" and "xxl" refer to the size of the reader model (T5). Neural retrievers' model sizes and training data are summarized in Table 9 in the appendix.
For evaluation, we test the trained QA models on RobustQA with passages retrieved by ColBERTv2, as it provides the best retrieval results according to Table 5. Using DPR's retrieved passages is a reasonable alternative, as it may reduce the gap between train and test time; as a benchmark paper, we leave a more rigorous investigation of this option to future research.
Unsupervised Corpus. Note that the PT→FT method requires unsupervised source/target corpora. We construct them by randomly sampling a subset of passages (Table 2) with no more than 20 million combined tokens (∼200K passages). We ensure that positive contexts for the test questions are excluded.
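The corpus construction described above can be sketched as follows; the passage dictionary fields (`id`, `tokens`) are hypothetical names for illustration.

```python
import random

def sample_unsupervised_corpus(passages, positive_ids,
                               max_tokens=20_000_000, seed=0):
    """Randomly sample passages for unsupervised pre-training: exclude
    positive contexts of test questions, shuffle, then take passages
    until a combined token budget is reached."""
    pool = [p for p in passages if p["id"] not in positive_ids]
    random.Random(seed).shuffle(pool)
    sampled, total = [], 0
    for p in pool:
        n = len(p["tokens"])
        if total + n > max_tokens:
            break
        sampled.append(p)
        total += n
    return sampled
```

Excluding test positives prevents the pre-training corpus from leaking evaluation evidence into the adapted model.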

Results and Analysis
In this section, we show our benchmark results on RobustQA for both passage retrieval and end-to-end question answering. Table 5 shows the results for passage retrieval. Consistent with previous findings (Thakur et al., 2021), DPR achieves the best in-domain performance on NQ but generalizes poorly to RobustQA. In contrast, BM25's performance across all datasets is stable, and it outperforms DPR by 8.21 percentage points on HIT@5 over RobustQA.

Passage Retrieval
We find that both BM25+CE and ColBERTv2 work well across all domains. Their RobustQA averages outperform BM25 by 13.56 and 14.73 percentage points on HIT@5, respectively. ColBERTv2 appears to be the most robust passage retriever in our results; Tables 14 and 15 in the appendix show that ColBERTv2 gains wider margins over BM25+CE on HIT@20 and HIT@100. Here, both CE and ColBERTv2 are trained on MS MARCO.

Impact of Training Data. As shown in Table 5, with the same base model, training the CE on NQ improves its in-domain performance by 6.22 points but hurts its results on RobustQA by 6.73 points. Since MS MARCO is larger and potentially more diverse than NQ, this suggests that increasing the quantity and diversity of training data can benefit domain robustness.
Model Size. As Table 9 in the appendix shows, model sizes for the neural retrievers are comparable, except for ColBERTv2, which uses a much smaller model but leverages distillation to obtain knowledge from larger CE models. Its efficiency and remarkable performance make it an excellent retriever at inference time.

Table 6: End-to-end QA performance based on F1 score. All readers are trained on NQ. Except for the Atlas models, ColBERTv2 is used to retrieve up to 100 passages to be consumed by the reader during inference. *Atlas-xxl's F1 score on NQ is lower than the number reported in the original paper because we use DPR's passage pool, which does not contain infobox data.
Atlas' superior OOD performance shows the effectiveness of jointly pre-training the retriever and reader on a large text corpus before fine-tuning on NQ. The signals from the QA model likely help correct errors in the retrieval stage, and these signals become stronger as we adopt larger language models, i.e., from "base" to "xxl."

NQ vs. RobustQA. We observe that the best RobustQA average is 9.45 percentage points below the best NQ HIT@5 (72.24%). Except for SearchQA and Writing, which have the closest token distributions to NQ (Fig. 2), there is significant performance degradation on the RobustQA datasets, suggesting that our new benchmark provides much more challenging contexts for passage retrieval than the commonly used Wikipedia corpus.

Open-domain QA
For end-to-end QA performance, Table 6 shows that simply applying an extractive QA model fine-tuned on NQ to RobustQA results in a drastic performance drop of 30.56 percentage points in F1. Compared with the performance declines in passage retrieval, this implies that domain drift strongly impacts both the retriever and reader modules.
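The F1 measure used in Table 6 is the standard SQuAD-style token-level F1 between predicted and ground-truth answers; a sketch is below (answer normalization such as lowercasing and article removal is omitted for brevity).

```python
from collections import Counter

def token_f1(prediction, ground_truth):
    """Token-level F1 between a predicted answer string and one
    ground-truth answer: harmonic mean of token precision and recall
    over the multiset overlap of whitespace tokens."""
    pred, gold = prediction.split(), ground_truth.split()
    common = Counter(pred) & Counter(gold)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(gold)
    return 2 * precision * recall / (precision + recall)
```

When a question has multiple gold answers (common in RobustQA, as noted in Sec. 3), the maximum F1 over the gold set is typically taken per question.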
Pretraining with Domain Corpus. We observe that, compared with FT only, PT→FT improves RobustQA by 2.23 percentage points. The best PT→FT F1 score on RobustQA (21.04%) is still more than 28 points below the in-domain performance on NQ. These results again confirm that RobustQA is a challenging benchmark, but they point to a promising direction: leveraging unlabeled corpora to help close the gap between in-domain and OOD data.
Generative vs. Extractive Approach. Compared to FT (on extractive readers), the generative models Atlas-xxl and Flan-T5-xxl improve the F1 score on NQ by 13.07 and 8.23 percentage points, while gaining 18.5 and 16.68 points on RobustQA, respectively. These results demonstrate the superior performance of large language models. However, the in-domain/OOD gap is still wide, and LLMs may not be suitable for compute- or latency-sensitive applications. We therefore test a smaller generative model, Atlas-base, whose reader has a model size similar to the extractive QA model. We observe a lift of 9.48 points against FT, which suggests that the generative approach can help the extractive ODQA task.
Analysis of challenging domains. We observe in Table 6 that FiQA and Technology are the two most challenging domains, which can largely be attributed to the statistical differences in the data (Sec. 3.4). Table 3 shows that FiQA has the longest answer spans (9.4 vs. 2.3 for NQ). Since the QA model is trained on NQ to predict short spans, it likely has the lowest token recall on FiQA, and thus the lowest F1 scores. On the other hand, Technology's poor performance may be related to its having the largest token distribution drift from NQ, as illustrated by Fig. 2. Moreover, Fig. 3 shows that Technology has the largest share of reasoning-type questions, indicating the largest question-type drift. Both factors make it harder to adapt a model trained on NQ to Technology in a zero-shot manner.
Error Analysis. A common issue for ODQA is that when a retriever returns a mixture of relevant and irrelevant passages as QA inputs, the irrelevant ones can mislead the reader into extracting answers from incorrect contexts. These issues can potentially be resolved by building stronger retrievers and by leveraging retrieval results, such as scores, to help readers rank answer candidates. We leave this to future research efforts.
Here we focus on analyzing errors in reading comprehension only, by selecting samples where a reader correctly picks relevant passages to extract answers. As Table 16 in the appendix shows, both the PT→FT model and Atlas-xxl, when fine-tuned on NQ, tend to predict either entities or short phrases on FiQA, whereas the ground-truth answers tend to be longer, complete phrases that fully answer reasoning-type questions (Examples 1 and 2 in Table 16) or complicated factoid questions (Example 3). This again suggests that training on NQ alone may not be sufficient to solve RobustQA, and that a more diverse ODQA dataset is crucial for training robust ODQA systems.

Related Work

Passage Retrieval. At the core of most ODQA systems is a passage retrieval system. While we only benchmark several strong baselines in this work, numerous other systems have been studied in the past (Thakur et al., 2021). These systems can be broadly divided into sparse retrievers (where the similarity between the query and a passage is calculated via an inverted index), dense retrievers (where the similarity is calculated with dense vectors from neural encoders), or a combination of both.

Conclusion
We propose RobustQA, a benchmark consisting of samples across 8 different domains that better evaluates ODQA systems' robustness under domain adaptation. Even after adopting SOTA ODQA systems enhanced with unsupervised learning methods and LLMs, there remains a significant performance gap between RobustQA and the commonly used NQ dataset, which suggests that RobustQA is a more reliable and challenging benchmark for evaluating ODQA systems' cross-domain performance.

Limitations
We discuss some limitations of this work for future research. The range of domains could be more comprehensive, e.g., covering social media and law. The experiments could also cover more models: as we mention in Sec. 8, there are more comparable retrievers and QA readers, and it would be useful to benchmark them on RobustQA in the future. Finally, due to the complexity of the raw IR data, collecting our datasets is costly. This is manifested not only in monetary costs, but also in the human effort needed to create guidelines, coach annotators, and manually audit and validate annotations. In the future, it could be beneficial to leverage large language models with in-context learning to assist human labor.

A.1 Annotation Guidelines
We summarize our annotation guidelines here. Before judging answer validity and conciseness, annotators should first determine whether a passage can be used to answer a given question. Though the passages associated with the question are labeled "relevant" in the original IR data, we found they are not always appropriate for our annotation because of: 1) annotation errors in the original dataset, i.e., passages that are in fact not relevant; 2) ambiguous question intent, which makes it hard to determine whether an associated passage is relevant; 3) the absence of a precise answer span in a passage. In any of these cases, we instruct the annotators to discard the passage.

Answer Validity
• The answer span is exactly the same as in the passage. Typos/misspellings are acceptable as long as they do not prevent us from understanding the answer.
• The answer span does not combine two different excerpts/parts of the passage to create a single answer span.
• The answer span does not include leading or trailing punctuation marks, unless they are part of the answer span (e.g., Yahoo!).
• The answer span does not contain a URL and is not a URL link itself.
• The answer span does not correspond to the entire passage.

Answer Conciseness
• The answer span should be as short as possible while still conveying the intended meaning.
• The answer span does not contain more than 16 words, though adding 1-2 words to make the span complete is allowed (and should be considered very carefully).
• The answer span does not contain unnecessary rhetorical expressions such as subject/object, time, and location around a core concept.
• The answer span does not contain unnecessary explanation of a core concept.
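Several of the guideline constraints above are mechanically checkable, which is useful for auditing. The sketch below validates a candidate span against the substring, whole-passage, URL, punctuation, and 16-word rules; it is an illustration only, and the "punctuation as part of the answer" exception (e.g., Yahoo!) is deliberately not modeled.

```python
def valid_answer_span(span, passage, max_words=16):
    """Check a candidate answer span against the guideline constraints:
    verbatim substring of the passage, not the entire passage, no URL,
    no leading/trailing punctuation (the 'Yahoo!' exception is not
    modeled here), and at most max_words words."""
    s = span.strip()
    if not s or s == passage.strip():
        return False
    if s not in passage:                      # must appear verbatim
        return False
    if "http://" in s or "https://" in s or "www." in s:
        return False
    if s[0] in ",.;:!?" or s[-1] in ",.;:!?":
        return False
    return len(s.split()) <= max_words
```

A checker like this can pre-flag spans before the 10% expert audit, though validity and conciseness judgments ultimately remain human decisions.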
A.2 Additional RobustQA Annotations

Table 7 shows the data statistics for our ODQA annotations on the dev split of LoTTE. We did not benchmark these data, but will release them for future model development purposes.

A.4 Question Types
In Figure 3, we categorize questions into two types: factoid and reasoning. Here, we detail the string matching used for the categorization. Though these rules are not perfect, they largely capture question types, based on our careful manual examination. Consider an example: "magic-making mickey mouse movie of 1940". The typical form of this question would be "what is the magic-making mickey mouse movie of 1940?" We do not handle such statement-type questions separately, and leave their impact on domain robustness to future research.
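A minimal version of such string matching is sketched below; the prefix list is an illustrative subset inspired by the reasoning-question patterns mentioned in Sec. 3 ("how to ...?", "how do ...?", "why does ...?"), not the paper's exact rule set.

```python
def question_type(question):
    """Classify a question as 'reasoning' or 'factoid' by prefix matching.
    The prefix list here is an illustrative subset, not the full rules."""
    q = question.lower().strip()
    reasoning_prefixes = ("how to", "how do", "how can", "how does",
                          "why do", "why does", "why did", "why is")
    return "reasoning" if q.startswith(reasoning_prefixes) else "factoid"
```

Statement-type questions such as the Mickey Mouse example above fall through to "factoid" under this scheme.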

A.5 Domain Classification-based Contrastive Learning
Our contrastive learning method is based on Long et al. (2022). The core idea is that while training models to perform well on the key task (e.g., ODQA), we also encourage models to learn domain-invariant representations using unlabeled source/target corpora. In this way, the model can potentially adapt more effectively to the same task in a different domain. This goal is accomplished by first introducing a domain classifier f(x, l), where x is the text representation and l is the text's domain label. We then attempt to learn the optimal perturbation δ to x that brings the representation close to domain-invariant. Adopting the virtual adversarial loss formulation (Miyato et al., 2017),

L_d = min_θ [ L(f(x, l); θ) + α_adv max_{∥δ∥_2 ≤ ϵ} L(f(x + δ, l); θ) ],

where θ is the classifier parameter to be learned and α_adv controls the balance between optimizing the classifier and learning the perturbation. ϵ is the l_2-norm bound for δ, which can be learned through Projected Gradient Descent (PGD) (Madry et al., 2018; Zhu et al., 2020) under the additional assumption that the loss function is locally linear.
To approximate the perturbation δ, we can run one iteration of the following update,

δ ← Π_{∥δ∥≤ϵ} ( δ + η ∇_δ L / ∥∇_δ L∥ ),

where Π_{∥δ∥≤ϵ} performs a projection onto the ϵ-ball and η is the step size. Finally, the contrastive loss is computed as

L_c = -(1/N) Σ_{i=1}^{N} log [ exp(s(z_i, z'_i)/τ) / Σ_{k=1}^{N} 1_{k≠i} exp(s(z_i, z_k)/τ) ],

where z = g(f(x)) and z' = g(f(x + δ)); g is a projection function and s denotes the cosine similarity between two vectors. τ is a constant temperature parameter. The indicator function 1_{k≠i} excludes the target sample i from the normalization term, and N is the batch size. Intuitively, this contrastive loss brings the original representation z closer to its domain-invariant representation z' while pushing it away from the representations of other negative samples in the batch. Combining all of the above, the final training objective is

L = L_QA + w_d L_d + w_c L_c,   (2)

where w_d and w_c are the weights for the component losses.
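The contrastive term can be sketched numerically as below. This is an illustrative numpy version under stated assumptions: the perturbed views z' serve as positives, and the unperturbed representations of other batch samples serve as negatives, matching the description above.

```python
import numpy as np

def contrastive_loss(z, z_prime, tau=0.1):
    """In-batch contrastive loss sketch: pull each representation z_i
    toward its perturbed (domain-invariant) view z'_i and push it away
    from the other batch samples, using cosine similarity s(.,.) with
    temperature tau. z, z_prime: arrays of shape (N, d)."""
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    N = len(z)
    loss = 0.0
    for i in range(N):
        pos = np.exp(cos(z[i], z_prime[i]) / tau)
        # negatives: other (unperturbed) samples in the batch, k != i
        neg = sum(np.exp(cos(z[i], z[k]) / tau) for k in range(N) if k != i)
        loss += -np.log(pos / (pos + neg))
    return loss / N
```

When each z'_i aligns with its z_i, the loss approaches zero; misaligned positives drive it up, which is the behavior the domain-invariance objective relies on.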

A.6 Additional Implementation Details
Computing Resources. We run all experiments on 8 Nvidia A100 GPUs. A model is typically trained for 10-15 epochs, which takes 10 to 20 hours depending on the model size and algorithm complexity. All neural models are implemented with the PyTorch and Huggingface libraries. All benchmarked models are publicly available; please refer to their code repositories for software details.
Hyper-parameters. All hyper-parameters for the QA models are selected based on the best NQ dev-set performance. We use EM over the top 50 passages to be consistent with the DPR paper. For the pretraining method, we save checkpoints every 8,000 training steps and report the best model per the criteria mentioned above. Similarly, for the FT + CL experiments, the best hyper-parameters are picked based on the same criteria. We tuned 4 hyper-parameters, as shown in Table 8.

A.7 Passage Retriever Models

Table 9 shows the base models, numbers of parameters, and training data for the passage retrievers.

A.9 Contrastive Loss Results
As shown in Table 10, after fine-tuning with the contrastive loss on NQ data, we observe noticeable gains over the FT-only method on the BioASQ and SearchQA datasets, with very little trade-off on the in-domain NQ result. However, this method does not work well on the other RobustQA datasets and requires extensive hyper-parameter tuning. We leave a more rigorous study of this method to future research.
A.10 Error Analysis
Table 16 shows error cases on FiQA.
Lifestyle: Example 1
Question: Why is international first class much more expensive than international economy class? Passage: Your question is (I think): why does (US) domestic first class exceed the cost of economy by 50%, whereas international first class is many times the cost of economy. The answer to that is that the comparative services are vastly different. In (US) domestic first class you get a little more legroom, and a seat roughly 50% more wide. In international first class, you will often get far more than a fully reclining seat - a private bed is not unknown. Comparing the plane's floor-plate take is a crude measure, but it would not surprise me if a first class seat took 10 times the amount of space as an economy seat on an international flight. Look at the floor plans here (image not copied for copyright reasons) and you can see the six seats in the back half of the first class cabin of a 747 take about the same space as 4 rows (40 seats) of economy.
Lifestyle: Example 2
Question: Why did the metro bus stop at each railway crossing, despite no warning indicating a train was coming? Passage: Other people have cited the relevant laws. The laws exist because warning signals sometimes malfunction. It's probably fine for a passenger car to take that risk, but a bus carrying passengers has a higher standard of care to adhere to.

Recreation: Example 1
Question: Why did you have to blow into an NES cartridge to make it work? Passage: My brother and I did this all the time with our old nes. Beyond blowing out the dust there seemed to be some connection with moisture. If I blew it out and then breathed on it to make it damp it seemed to work better. The question about how everyone knew to do it? I agree with the instinct theory. My brother and I worked out this system by ourselves. You look at it, and for some reason the first thing you think of is either to stick your finger in it or to blow in it, LOL.

Recreation: Example 2
Question: How do you prevent Sims from aging? Passage: Sort of. There is the Ambrosia, which you can make yourself, although it is a little tricky. Here is a quote from another helpful player There is the Ambrosia recipe that you can buy from the book store. It's a level 10 cooking recipe and when you make it it resets your current life meter. So if you're 20 days into the Adult stage of life, it sets it back to 0. Also, if you have a sim ghost eat it, they will be brought back to life. Of course, to be able to make it requires some items that are not easily come by. Noamely you need a life fruit which can only be grown from a special seed found somewhere on the ground and grown by a high level gardener, and you need a Deathfish, which I believe can only be found in the graveyard pond after midnight and probably only fished by a higher level fisherman. I hope that helps.

Technology: Example 1
Question: How can I disown a running process and associate it to a new screen shell? Passage: Using GNU screen is your best bet. Start screen running when you first login - I run screen -D -R, run your command, and either disconnect or suspend it with CTRL-Z and then disconnect from screen by pressing CTRL-A then D. When you login to the machine again, reconnect by running screen -D -R. You will be in the same shell as before. You can run jobs to see the suspended process if you did so, and run %1 (or the respective job #) to foreground it again.

Technology: Example 2
Question: How can I reduce a videos size with ffmpeg?
Passage: This answer was written in 2009. Since 2013 a video format much better than H.264 is widely available, namely H.265 (better in that it compresses more for the same quality, or gives higher quality for the same size). To use it, replace the libx264 codec with libx265, and push the compression lever further by increasing the CRF value - add, say, 4 or 6, since a reasonable range for H.265 may be 24 to 30. Note that lower CRF values correspond to higher bitrates, and hence produce higher quality videos. ffmpeg -i input.mp4 -vcodec libx265 -crf 28 output.mp4 To see this technique applied using the older H.264 format, see this answer, quoted below for convenience: Calculate the bitrate you need by dividing your target size (in bits) by the video length (in seconds). For example for a target size of 1 GB (one gigabyte, which is 8 gigabits) and 10 000 seconds of video (2 h 46 min 40 s), use a bitrate of 800 000 bit/s (800 kbit/s): ffmpeg -i input.mp4 -b 800k output.mp4 Additional options that might be worth considering is setting the Constant Rate Factor, which lowers the average bit rate, but retains better quality...