Multi-Task Retrieval for Knowledge-Intensive Tasks

Retrieving relevant contexts from a large corpus is a crucial step for tasks such as open-domain question answering and fact checking. Although neural retrieval outperforms traditional methods like tf-idf and BM25, its performance degrades considerably when applied to out-of-domain data. Driven by the question of whether a neural retrieval model can be _universal_ and perform robustly on a wide variety of problems, we propose a multi-task trained model. Our approach not only outperforms previous methods in the few-shot setting, but also rivals specialised neural retrievers, even when in-domain training data is abundant. With the help of our retriever, we improve existing models for downstream tasks and closely match or improve the state of the art on multiple benchmarks.


Introduction
Knowledge-intensive tasks is the common designation for a class of real-world NLP problems which, by their nature, require large amounts of knowledge about the world (Petroni et al., 2021).
For example, open-domain question answering requires producing answers to general factoid questions, and fact checking involves determining the veracity of claims based on a database of trusted evidence. Practical solutions to these tasks usually involve an efficient retrieval component that, given an input query, selects a limited subset of relevant information from a large knowledge source. Sophisticated downstream models then consider the input only in the context of the retrieved information, and perform the final task. While large pre-trained neural models have been shown to incorporate real-world knowledge in their parameters and may thus skip retrieval (Petroni et al., 2019), they still have limited capacity and suffer from a lack of explainability.
The standard retrieval component in many systems (e.g., Thorne et al., 2018; Wang et al., 2018; Chen et al., 2017) has long relied on term-matching methods such as tf-idf or BM25 (Robertson and Zaragoza, 2009). These methods rely on efficient algorithms and usually perform reasonably well regardless of the problem. In contrast, recent neural retrieval models, such as ICT (Lee et al., 2019), DPR (Karpukhin et al., 2020) and RAG (Lewis et al., 2020b), achieve better results by learning directly from task-specific training data and going beyond simple keyword matching. While task specialisation results in improved task performance, researchers have observed that a retriever trained for one specific domain will typically achieve low out-of-domain performance, and even lower performance on entirely different tasks. This has two implications. First, unlike tf-idf or BM25, neural retrieval models are unsuitable for low-data regimes such as few- and zero-shot settings. Second, task-specific retrievers complicate practical applications where multiple knowledge-intensive tasks may need to be performed using the same supporting database or over the same input text. It may not be practical to deploy multiple separate specialised models due to computational performance or memory concerns.
In this work, we ask the following question: can we develop a universal neural retriever? Namely, we target a retriever which can perform well on a wide variety of problems without domain-specific training, but which, if additional in-domain labelled data is available, can be further fine-tuned to improve its performance. We perform a large experimental study to attempt to build such a universal retrieval model. We find that, by jointly training on an extensive selection of retrieval tasks, we obtain a model which is not only more robust than previous approaches, but can also lead to better performance on downstream knowledge-intensive tasks when plugged into an existing system. Our approach combines the benefits of IR-based models with those of task-specific neural retrievers, namely good performance when no (or not enough) training data is available, and high task performance due to its ability to learn highly specialised representations.
Our contributions can be summarised as follows.
• We propose a single general-purpose "universal" retrieval model, able to perform comparably to or better than specialised retriever approaches in both zero-shot (leave-one-out) and few-shot retrieval. We investigate several model variants, shedding light on which aspects of the architecture affect its performance.

• We show that, with in-domain training, our model's gains in retrieval directly translate into performance gains on a variety of downstream knowledge-intensive tasks.

• We will share the implementation as well as our best model. This comes in the form of a readily available BERT checkpoint which, as we will show, can be used by NLP practitioners as a strong out-of-the-box retrieval system, and can also undergo further in-domain training for even higher performance.

Background
In this section, we first give an overview of retrieval methods based on sparse and dense representations. We then discuss a wide range of knowledge-intensive NLP tasks in which retrieval plays a crucial role.

Retrieval methods
Given a large collection of unstructured text passages, information retrieval (IR) can be broadly defined as finding a small set of passages that satisfies an information need, often presented in the form of a short text query (Manning et al., 2008). Traditional IR methods, such as tf-idf and BM25 (Robertson and Zaragoza, 2009), match keywords efficiently with an inverted index. Such methods can be seen as representing queries and passages as high-dimensional, sparse vectors, where each dimension corresponds to a term in the vocabulary and the weight indicates its importance. In contrast to tf-idf and BM25, dense retrieval methods encode text as a latent semantic vector of a fixed, much smaller dimensionality. Whether a passage is relevant to a given query is determined by the distance between their vectors (Deerwester et al., 1990). Although dense representations do not encode tokens explicitly and can potentially map paraphrases with completely different tokens to close vectors, the performance of early dense retrieval methods was often inferior to term-matching approaches, except when large labelled datasets were available (e.g., Gao et al., 2011; Huang et al., 2013). Thanks to the success of large pre-trained models (e.g., Devlin et al., 2019; Liu et al., 2019b), however, recent dense retrieval methods have been shown to outperform their sparse counterparts when fine-tuned on a small set of in-domain labelled data (e.g., Lewis et al., 2020b; Xiong et al., 2020). Efficient indexing and search of dense vectors are made possible by maximum inner product search (MIPS) algorithms (e.g., Shrivastava and Li, 2014; Guo et al., 2016), as well as tools like FAISS (Johnson et al., 2019).
Our work is built upon the Dense Passage Retriever (DPR) architecture of Karpukhin et al. (2020), which was initially proposed for the task of open-domain question answering. DPR is a neural bi-encoder model which embeds queries with an encoder f(·) and passages with a separate encoder g(·). Given an input query x and a target passage y, the model assigns the probability

p(y | x) ∝ exp(sim(x, y)),

where the similarity score sim(x, y) is defined as the inner product of the embeddings of its arguments, f(x) · g(y). Given a query at inference time, calculating its similarity with every possible passage would be prohibitive for large knowledge sources. Therefore, DPR makes use of the FAISS library (Johnson et al., 2019) to perform fast approximate nearest neighbour search in sub-linear time.
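As an illustrative sketch of this retrieval step, the following uses brute-force numpy search in place of FAISS, and toy hand-written vectors in place of trained encoder outputs:

```python
import numpy as np

def retrieve_top_k(query_vec, passage_matrix, k=3):
    """Score a query against all passage embeddings by inner product,
    i.e. sim(x, y) = f(x) . g(y) in the bi-encoder formulation, and
    return the indices of the k highest-scoring passages."""
    scores = passage_matrix @ query_vec   # one inner product per passage
    return np.argsort(-scores)[:k]        # indices sorted by descending score

# Toy example: 4 "passages" embedded in 3 dimensions.
passages = np.array([[1.0, 0.0, 0.0],
                     [0.0, 1.0, 0.0],
                     [0.9, 0.1, 0.0],
                     [0.0, 0.0, 1.0]])
query = np.array([1.0, 0.0, 0.0])
print(retrieve_top_k(query, passages, k=2))  # passages 0 and 2 score highest
```

In practice the passage matrix holds millions of vectors, which is why DPR replaces this exhaustive scan with FAISS's approximate MIPS index.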
Training of DPR is based on a contrastive loss. Given a query x, a relevant passage y, and a set of n irrelevant passages y_i^-, we train the model by optimising the following negative log likelihood:

L(x, y) = − log [ exp(sim(x, y)) / ( exp(sim(x, y)) + Σ_{i=1}^{n} exp(sim(x, y_i^-)) ) ].
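A minimal numpy sketch of this objective with in-batch negatives (where, as described below, the positives of the other queries in the batch act as the irrelevant passages; this is an illustration, not the actual training code):

```python
import numpy as np

def dpr_loss(q_vecs, p_vecs):
    """In-batch contrastive loss: for batch item i, p_vecs[i] is the
    relevant passage and every other passage in the batch is a negative.
    Returns the mean negative log likelihood over the batch."""
    sims = q_vecs @ p_vecs.T                  # sims[i, j] = sim(x_i, y_j)
    sims = sims - sims.max(axis=1, keepdims=True)  # numerical stability
    log_softmax = sims - np.log(np.exp(sims).sum(axis=1, keepdims=True))
    return -log_softmax.diagonal().mean()     # positives sit on the diagonal

q = np.eye(3) * 10.0   # toy query embeddings, one per batch item
p = np.eye(3) * 10.0   # toy passage embeddings, aligned with the queries
print(dpr_loss(q, p))  # close to 0: each query ranks its own passage on top
```

When query and passage embeddings are misaligned, the diagonal similarities shrink relative to the off-diagonal ones and the loss grows, which is exactly the pressure that pulls relevant pairs together during training.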
As the set of irrelevant passages, we use the relevant passages for other queries within the same batch, as well as a specially selected "hard" confounder. This is a passage which has high lexical overlap with the query (a high BM25 score), but is not among the set of relevant passages for the given data point. Karpukhin et al. (2020) have shown that the inclusion of such "hard" confounders leads to substantially improved training results. This training process is illustrated in Figure 1.

Figure 1: Training of DPR (Karpukhin et al., 2020), a bi-encoder model for open-domain question answering. Queries and passages are encoded as vectors, and retrieval is performed as a maximum inner product search.

Knowledge-intensive Tasks
For the training and evaluation of all models in this paper we make use of KILT, a benchmark and library of datasets (Petroni et al., 2021). KILT consists of a selection of datasets spanning five varied classes of knowledge-intensive tasks (question answering, slot filling, fact checking, dialogue, and entity linking), with the aim of covering many different ways of seeking knowledge. Input queries can vary wildly from one task to the other: they include classic examples of open-domain retrieval tasks, such as natural language questions and claims to be verified, as well as more unusual examples like conversation fragments and long chunks of annotated text. Crucially, all datasets distributed in KILT have been re-aligned such that they are all grounded in the same snapshot of Wikipedia, which the authors distribute. The knowledge required to answer any of the queries in the library of tasks can thus be found within the same unified knowledge source.
To illustrate the variety of ways in which the input queries for different tasks can be formulated, we provide a few simple examples in Table 1. In spite of the differences between query formulations, all these tasks share one crucial aspect: they all require a retriever to fetch the relevant passages from the knowledge source, in order to support the final downstream task.

Universal retrieval
Using task-specific models to tackle our collection of retrieval tasks would involve completely separate models, one per dataset. Following the definitions of §2.1, for a family of tasks i = 1, ..., n this would require n query encoders f_1, ..., f_n, and n corresponding passage encoders g_1, ..., g_n. As illustrated in Figure 2, this would lead to a proliferation of models and data, down to separate indexed copies of the knowledge source itself. This fully specialised setup will form one of our baselines.

Multi-task training has been successfully used to allow models to leverage cross-task data, as well as to provide a regularisation effect leading to better generalisation ability (Liu et al., 2019a). We apply this concept to neural retrievers, with the aim of improving performance by jointly leveraging multiple different retrieval datasets.

Our base setup is illustrated in Figure 3b and involves using, across all tasks, a shared passage encoder g (so that a single index of encoded passages can be used) as well as a shared query encoder f. In essence, in this setup a single DPR model is used to perform all retrieval tasks.

Model variants
Due to the complexity of training and evaluating retrievers (which involves training the model, embedding all of Wikipedia, and indexing it), our main experiments are all based on the configuration of Figure 3b, which was found to work well.
We did, however, also investigate other more complex model variants in a set of preliminary experiments. As these were not found to be beneficial, we leave them to the appendix, but describe their architectures here for completeness:

• Task-specific query encoders. A different query encoder f_i is used for each family of tasks; for example, all question answering tasks use the same query encoder. This is meant to allow for potentially different needs in processing queries, given the fundamentally diverse nature of the tasks at hand. This configuration is illustrated in Figure 3a.

• Task markers. This is a variant of the base model in which specialised tokens are inserted at the beginning of each query, to help the model distinguish between the different tasks. We use one task marker for each of the five task classes of KILT, such that all question answering tasks share the same marker.
Experimental results comparing these variants to the base model can be found in Appendix B.
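The task-marker mechanism amounts to a small preprocessing step before query encoding. In this sketch the marker strings are illustrative placeholders, not the exact special tokens used in our experiments:

```python
# Hypothetical marker tokens, one per KILT task class.
TASK_MARKERS = {
    "qa": "[QA]", "slot_filling": "[SF]", "fact_checking": "[FC]",
    "dialogue": "[DIAL]", "entity_linking": "[EL]",
}

def mark_query(query: str, task_class: str) -> str:
    """Prepend the task-class marker so a shared query encoder can
    condition on which kind of task the query comes from."""
    return f"{TASK_MARKERS[task_class]} {query}"

print(mark_query("who wrote hamlet", "qa"))  # "[QA] who wrote hamlet"
```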

Datasets We train and evaluate on the eight KILT datasets listed in Table 2, which cover all five task classes and each include a training split, a validation split, and a held-out test split.
Preprocessing Starting from the KILT data, we split each Wikipedia article into disjoint 100-token chunks which form our basic retrieval units, following Wang et al. (2019) and Karpukhin et al. (2020). To maintain the terminology introduced in §3, we will simply call these chunks passages.
This preprocessing results in a knowledge source of 36 million passages. In order to harmonise all datasets to the same knowledge source, KILT used a mapping strategy based on the BLEU metric to map relevant passages in the original versions of its datasets to passages in its own shared knowledge source (Petroni et al., 2021). Entries in the KILT training sets with a mapping BLEU score below 0.5 are likely to be noise, and we exclude them from training and validation (resulting in an 18% reduction on average for the validation sets).
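The chunking step can be sketched as follows. Whitespace tokenisation is a simplification here; the actual preprocessing operates on the tokenised KILT Wikipedia snapshot:

```python
def split_into_passages(article: str, chunk_size: int = 100):
    """Split an article into disjoint chunks of at most chunk_size tokens,
    which serve as the basic retrieval units ("passages")."""
    tokens = article.split()
    return [" ".join(tokens[i:i + chunk_size])
            for i in range(0, len(tokens), chunk_size)]

# A 250-token toy article yields two full passages and one remainder.
doc = " ".join(f"tok{i}" for i in range(250))
chunks = split_into_passages(doc)
print([len(c.split()) for c in chunks])  # [100, 100, 50]
```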
Multi-tasking Training is performed on the union of all training data. Since two of the training sets are vastly larger than the others, we downsample them to the same order of magnitude as the rest. Preliminary experiments with more complex sampling methods, such as resampling all datasets so that each epoch sees an equal number of samples from each, showed no measurable benefit over this simpler approach.
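The simple pooling-with-a-cap scheme can be sketched as follows (the cap value and seed are illustrative, not the ones used in our experiments):

```python
import random

def downsample_and_pool(datasets, cap, seed=0):
    """Cap each training set at `cap` examples so the largest sets end up
    on the same order of magnitude as the rest, then pool everything into
    a single multi-task training set."""
    rng = random.Random(seed)
    pooled = []
    for examples in datasets:
        if len(examples) > cap:
            examples = rng.sample(examples, cap)
        pooled.extend(examples)
    return pooled

small = list(range(10))     # a small dataset, kept whole
big = list(range(1000))     # a vastly larger dataset, downsampled to `cap`
pooled = downsample_and_pool([small, big], cap=100)
print(len(pooled))  # 110
```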
Encoders Our query and passage encoders are initialised as two distinct BERT-base uncased encoders (Devlin et al., 2019), trained separately. As the pooling mechanism, we find it effective to simply take the representation of the [CLS] token at the topmost layer.
Training We train our models for up to 80 epochs. To select the best checkpoint, we evaluate retrieval performance on the validation set at regular intervals. We optimise with Adam (Kingma and Ba, 2015) using a learning rate of 2·10^-5 with warmup and a linear decay schedule, and a dropout rate of 0.1. The batch size is set to 128 samples; in preliminary experiments we found no benefit in increasing it further. We use one additional "hard" confounder per batch, selected by BM25 score as in Karpukhin et al. (2020).
Downstream evaluation When evaluating our retriever within a larger architecture that performs a knowledge-intensive task, we replicate a setup analogous to the DPR + BART system of Petroni et al. (2021). Our multi-task model retrieves the top 3 passages, which are prepended to the query; a task-specific fine-tuned BART model then processes this input to generate the final answer for the end task.
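The way the generator input is assembled in such a retrieve-then-generate setup can be sketched as follows. The separator string is an illustrative choice, not necessarily the exact formatting used with BART:

```python
def build_generator_input(query, passages, k=3, sep=" // "):
    """Prepend the top-k retrieved passages to the query, producing the
    single input string consumed by the downstream generator."""
    context = sep.join(passages[:k])
    return f"{context}{sep}{query}"

print(build_generator_input(
    "Who wrote Hamlet?",
    ["Hamlet is a tragedy written between 1599 and 1601.",
     "William Shakespeare was an English playwright.",
     "The play is set in Denmark."]))
```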
Baselines For our retrieval experiments, we include as baselines a BM25 model as well as a task-specific DPR model for each of the training datasets. For the downstream evaluations, we compare against three strong representative models trained by Petroni et al. (2021): a task-specific DPR model combined with BART (Lewis et al., 2020a), RAG (Lewis et al., 2020b), and T5 (Raffel et al., 2020).

Universal retrieval
The evaluations reported in Petroni et al. (2021) show that retrievers trained for question answering perform poorly outside of their domain. We would like to understand whether it is possible to design a single model which can accurately satisfy the information needs of a wide variety of knowledge-intensive tasks. In short: can a neural retriever be universal?
We perform a comprehensive evaluation of several models on the 8 tasks of Table 2. We evaluate 8 task-specific models (one trained on each of the 8 datasets), for which we measure both in-domain and out-of-domain performance, and a BM25 baseline. Additionally, we include a multi-task trained model, as described in §3.1, with the hope that it can learn to perform all tasks satisfactorily. This amounts to 10 models evaluated on 8 tasks each, for a total of 80 evaluations. To measure retrieval performance, we adopt the main metric used for the KILT benchmark, R-precision. This is calculated as r/R, where R is the total number of relevant passages for a given query, and r is the number of relevant passages returned among the top-R retrieval results. For the case of R = 1, this is equivalent to precision@1.
This experiment is of a very large scale, amounting to 10 models evaluated on 8 tasks, each repeated at the page and passage level, for a total of 160 figures to report. Due to this complexity, we report the results in Table 3 via a heatmap showing, for each evaluation task, the difference in R-precision between a given model and the task-specific model trained on the relevant task only. This highlights how each approach stacks up against a specialised model.
While the KILT evaluation focuses on retrieval at the level of Wikipedia pages (thereby marking as "hits" any results that lie within the correct page), we are also interested in performing an evaluation at a more fine-grained level. We therefore also evaluate our models at the passage level, using a modified version of the official KILT evaluation scripts. These are shown at the right side of Table 3. For full context, we also provide the full absolute results in Appendix A.
We immediately notice that task-specific models tend to achieve high performance on their respective tasks, often taking one of the top two spots. Interestingly, we also note that these neural retrievers consistently outperform the BM25 baseline, showing that the result achieved by Karpukhin et al. (2020) for open-domain question answering also holds for other knowledge-intensive tasks.

Table 3: Difference in retrieval R-precision (at page- and passage-level) with respect to a task-specific model, on KILT validation data. The rows show our proposed multi-task retriever, the BM25 baseline, and a series of task-specific models trained on each of the tasks. For the AIDA-YAGO2 dataset, due to the nature of the task, page- and passage-level results coincide.
The results reveal a strong performance for the multi-task model, confirming the hypothesis that a single model can be trained to perform well on a wide variety of retrieval tasks. With the exception of one dataset, the multi-task model achieves the best retrieval performance or is within a few points of the top score. We note that the one exception, the Zero-shot RE task (Levy et al., 2017), is a trivial task in which the query will always contain the title of the page to be retrieved. Indeed, the model specific to this task achieves a near-perfect score (see full results in Appendix A).
Another task which stands out for its markedly different formulation is AIDA-YAGO2 (Hoffart et al., 2011). As shown in Table 3, models that were not trained on AIDA-YAGO2 do very poorly on it. Entity linking is normally better performed by models explicitly designed for it (De Cao et al., 2020). We nevertheless include it to showcase the ability of neural retrievers to adapt to a variety of tasks, and note how well the multi-task retriever performs on it in spite of its unusual nature.

Downstream performance
We saw that our proposed approach achieves strong performance across a variety of retrieval tasks. However, our interest in neural retrievers stems from their use as components within larger systems that perform tasks such as question answering. Our next experimental question is therefore: can a universal retriever lead to better downstream performance on knowledge-intensive tasks?
We perform a downstream evaluation of our approach used in conjunction with BART (Lewis et al., 2020a) as the generative component, adopting a setup identical to that of Petroni et al. (2021). The results are reported in Table 4, with bold and underline marking the best and second-best scores respectively.
The DPR + BART line refers to a setup similar to ours, but with the simpler retriever of Karpukhin et al. (2020) as trained in Petroni et al. (2021), which lacks the multi-task aspect. Comparing against its performance therefore gives a clear indication of the contribution of multi-task training to overall performance on knowledge-intensive tasks.
Our proposed model achieves significantly better performance than this baseline on AY2, zsRE and HoPo, while for the other tasks the discrepancy is always below two points.
This fact is reflected in the last column, showing that on average multi-task training leads to better downstream performance. The model also compares favourably to RAG (Lewis et al., 2020b), a more advanced system in which the query encoder is fine-tuned on the end task.

Table 4: KILT test scores on the downstream evaluation. Results in the bottom section are as reported in Petroni et al. (2021). The score metrics are accuracy for fact checking, entity linking and slot filling; exact match for QA; and F1 score for dialogue.

Zero-and few-shot performance
Task-specific neural retrievers can achieve higher performance than IR-based methods, but they are not suitable for cases where no (or not enough) training data is available; in those cases, tf-idf and BM25 are the better choice. To evaluate whether a multi-task retriever is a suitable replacement in this scenario, we run a series of experiments in the low-data regimes (few-shot and zero-shot). We start by training a set of multi-task retrievers (with the base setup) in the leave-one-out setting for each dataset, in order to see how a neural retriever performs when trained on all domains except the one it is evaluated on.
The results of these zero-shot experiments are reported in the second line of Table 5 (again, bold and underline indicate the best and second-best overall performance, respectively). They show that, even in the zero-shot setting, the multi-task neural retriever achieves performance competitive with BM25, with retrieval being 10 points higher at the page level and 5 points lower at the passage level on average.
The advantage of neural retrievers over BM25 lies in their ability to improve with training. We therefore look at few-shot training for each task, creating two smaller copies of each original training set with random samples of 128 and 1,024 examples respectively. In order to evaluate the suitability of a multi-task trained retriever as a starting checkpoint for few-shot training, we take the various leave-one-out models and fine-tune them on our few-shot training sets. To check whether multi-task pre-training is effective, we also compare these to DPR models that are simply initialised with BERT weights and then fine-tuned on the same data.
The bottom two sections of Table 5 report the results. The most dramatic gains from fine-tuning are seen for AY2, an "outlier" task whose formulation differs from that of the other tasks, and which seems to benefit the most from being trained on in-domain data. The zsRE performance does not seem to improve from fine-tuning on the smaller dataset, but sees a very big jump when switching to the larger dataset. As a reminder, in this trivial task the title of the page to be retrieved always appears at the start of the query. It is therefore not surprising that models specifically fine-tuned on it can achieve near-perfect scores, as long as enough training data is provided.
In spite of the fine-tuning, we note that both DPR and the multi-task model fail to improve on their performance for T-REx, suggesting that large amounts of training data are required to learn this task. Nevertheless, the multi-task model proves itself more robust, and achieves the top performance on it.
Finally, we note that for 2 of the 8 tasks, namely zsRE and WoW, DPR achieves lower page-level retrieval scores than the multi-task model, but performs better at the passage level. This shows that fine-grained and coarse-grained retrieval performance are not always perfectly correlated.
Overall, the experiments show strong results for the multi-task model, with the average zero-shot performance being competitive with BM25, and the average few-shot performance being markedly better than the alternatives. The discrepancy in performance between a vanilla DPR model and the leave-one-out multi-task model is especially noticeable when using the smaller of the two datasets, in which case the average performance of the latter is more than double that of vanilla DPR.

Table 5: Page- and passage-level R-precision in the zero-shot setting and with additional fine-tuning on 128 and 1,024 examples. We also compare to a BM25 retriever and a DPR model initialised with BERT weights.

Related work
The approach most closely related to ours is DPR (Karpukhin et al., 2020), upon which we built all our retrievers; it is covered in detail, along with historical context, in §2.1. Another closely related approach is the Retrieval-Augmented Generation (RAG) model of Lewis et al. (2020b). In its base configuration, it augments DPR with a generative reader and trains the query encoder end-to-end, differing from traditional retriever-reader architectures, which treat the two steps as disjoint. A natural extension of our work would be to combine RAG with the multi-task learning approach, to study whether it can lead to further gains in performance or robustness.

A number of promising techniques to boost retrieval performance have been proposed recently. These are orthogonal to our work, and as such could be combined with it. Amongst these, pre-training methods form one class. The Inverse Cloze Task (Lee et al., 2019) and its extensions (Chang et al., 2020) are self-supervised pre-training methods designed for retrieval in open-domain question answering. Whether such specific pre-training is beneficial to tasks other than question answering remains an open question. CERT (Fang et al., 2020) is an alternative pre-training approach, inspired by recent advances in computer vision. While to our knowledge it has not been applied to retrieval problems, we believe it might be promising due to its focus on sentence-level semantics (as opposed to standard masked language modelling pre-training, which focuses on the token level).
Another class of orthogonal improvements to dense retrieval involves models which embed passages into multiple fixed-size vectors; ColBERT (Khattab and Zaharia, 2020) and ME-BERT (Luan et al., 2020) are two representative examples. A further approach is ColBERT-QA, which additionally uses a data augmentation strategy closely related to our own approach described in Appendix D.
Retrieval does not strictly have to be performed with a model which contains an explicit memory. Large-scale pre-trained models have been shown to store knowledge directly into their parameters. A model which demonstrates this ability is T5 (Raffel et al., 2020) -which we used as a baseline in § 4.
Regarding the multi-task aspect of our approach, a related strategy has been demonstrated by Aghajanyan et al. (2021). In this recent work, the authors multi-task train a pre-trained model on around 50 datasets, before performing the final fine-tuning. While they do not focus on retrieval, their results are consistent with ours and show that multi-task training leads to improved performance and increased sample efficiency.
On the topic of question answering, Lewis et al. (2021) show in a recent notable paper that, for several popular QA datasets, a portion of questions in the test set has near-duplicates in the training sets, and the same holds true for an even larger set of answers. To our knowledge, similar analyses have yet to be performed on the other KILT tasks.
Finally, two entity linkers are worth mentioning: GENRE (De Cao et al., 2020) and BLINK (Wu et al., 2020). Being trained specifically for entity linking, these models will generally outperform retrieval-based approaches on that task. While they are not directly comparable to retrieval models and are not generally applicable to information retrieval tasks, we cite them here to give readers a fuller picture of the existing literature on related tasks.

Conclusions
We have conducted a large-scale experimental study of knowledge-intensive tasks, and of how the retrieval models that tackle them seek the required information from knowledge bases such as Wikipedia.
The study started with the question of whether the way in which information is embedded for retrieval purposes is universal. §4.2 provided evidence that to a large extent it is, with a single "universal" retriever, trained jointly on 8 datasets, often performing comparably to task-specific models.
Armed with this knowledge, in §4.3 we plugged our single model in a larger pipeline, in order to see its contribution to the downstream performance on a wide range of knowledge-intensive tasks. This led to an overall improvement in downstream performance, setting new top results for a number of tasks in the KILT benchmark.
Next, in §4.4, we evaluated the model's performance in the zero-shot and few-shot settings. By evaluating on a wide range of tasks, we were able to show that our proposed approach performs comparably to BM25 in the zero-shot setting, and quickly overtakes it even with minimal in-domain training.
In the appendices, readers interested in getting a fuller picture will find further experiments. Namely, in Appendix B we test two more complex variants of the model involving task specialisation, but fail to see clear performance improvements. In Appendix D we show how a simple iterative approach to data augmentation, easily applied to our base approach, can lead to better performance throughout.
We provide a pre-trained snapshot of our best-performing model, in the form of a BERT checkpoint. As shown, this model will be useful in zero-shot and few-shot settings as a better-performing alternative to both IR-based approaches such as BM25 and task-specific models. The multi-task training approach demonstrated here can also be useful in industry settings where several retrieval operations may need to be performed on the same piece of content, and where the deployment of multiple task-specific models might not be possible due to space or computational performance concerns.

A Full retrieval results
The heatmap in Table 3 showed a full comparison of task-specific models to our multi-task model and the BM25 baseline for the experiments of §4.2. In order to aid in the interpretation of a very large set of results, the heatmap showed, for each task, the difference in R-precision with respect to the corresponding task-specific model. Here, for full context, we also provide in Table 6 the full set of absolute R-precision scores for those experiments.

B Model variants
We compare our base multi-task model with the two variants described in § 3.2. Due to the high memory consumption of the "task-specific encoders" variant (requiring one full query encoder per task family, in addition to the passage encoder), it was only possible to perform these evaluations in a restricted setting of three datasets. The results in Table 7 do not reveal a clear winner, suggesting that the base architecture might be the better choice due to its simplicity and generally good performance. Not included in this table and in any other experiments, due to very poor performance in preliminary evaluations, are two further variants: a base model with a single encoder for both queries and passages, and a base model trained from scratch without BERT pre-training.

C Task learning curve
One of the initial studies we conducted involved computing the learning curve of the multi-task model for each task, using the full validation metrics. This is particularly expensive, as it involves embedding the whole of Wikipedia for each evaluation, indexing it, and performing a full retrieval. Figure 4 shows this for one of our preliminary models, trained on six tasks (excluding the abnormally large T-REx and the outlier AY2). We note the unstable behaviour of zsRE, whose unusual nature was already remarked upon in §4.2.

D Adversarial confounder selection
We saw in §2.1 how "hard" confounder passages are collected using a BM25 baseline, following the standard approach in DPR. However, any other retriever can be used to select such confounders, including the very retriever being trained, leading to an iterative, self-adversarial training scheme. Concretely, this amounts to the following steps: (1) a first version of the retriever is trained with BM25 confounders; (2) new confounders are selected with the trained model, by retrieving high-ranking passages which are not among the set of relevant ones; (3) a second version of the model is trained using these additional confounders.
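Step (2) of this scheme can be sketched as follows, again with brute-force numpy scoring standing in for the indexed search used in practice:

```python
import numpy as np

def select_hard_confounders(q_vecs, passage_vecs, relevant_ids, n_hard=1):
    """For each query, use the current retriever's embeddings to find the
    highest-scoring passages that are NOT among the relevant ones; these
    serve as confounders for the next round of training."""
    confounders = []
    for q, rel in zip(q_vecs, relevant_ids):
        scores = passage_vecs @ q
        ranked = np.argsort(-scores)                    # descending by score
        hard = [int(i) for i in ranked if int(i) not in rel][:n_hard]
        confounders.append(hard)
    return confounders

# Toy setup: 3 passages, 1 query whose only relevant passage is id 0.
queries = np.array([[1.0, 0.5, 0.0]])
passages = np.eye(3)
print(select_hard_confounders(queries, passages, [{0}]))  # [[1]]
```

Passage 1 scores highest among the non-relevant passages, so it becomes the "hard" confounder for that query in the next training round.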
Intuitively, this approach should yield higher-quality confounders than those selected by BM25's simple keyword matching. Based on our own experience as well as the relevant literature, this adversarial approach has been shown to work well for question answering.
As a way of further pushing the performance of the model, we experiment with adversarial confounder selection on two datasets, Natural Questions (Kwiatkowski et al., 2019) and TriviaQA (Joshi et al., 2017). We selected these two datasets because, of all the tasks we consider, they offer an easy way of checking whether a given passage is relevant to a given query: namely, checking whether the answer is present in the passage. This enabled us to automatically build sets of confounders while ensuring relevant passages would be excluded. The performance of this approach is reported in Table 8, showing an overall improvement across multiple tasks. While this approach is demonstrated here on our multi-task model, it is in fact orthogonal to it, and could be applied to any other neural retriever trained with a contrastive loss.