Task-aware Retrieval with Instructions

We study the problem of retrieval with instructions, where users of a retrieval system explicitly describe their intent along with their queries. We aim to develop a general-purpose task-aware retrieval system using multi-task instruction tuning that can follow human-written instructions to find the best documents for a given query. We introduce BERRI, the first large-scale collection of approximately 40 retrieval datasets with instructions, and present TART, a multi-task retrieval system trained on BERRI with instructions. TART shows strong capabilities to adapt to a new retrieval task via instructions and advances the state of the art on two zero-shot retrieval benchmarks, BEIR and LOTTE, outperforming models up to three times larger. We further introduce a new evaluation setup, X²-Retrieval, to better reflect real-world scenarios where diverse domains and tasks are pooled and a system needs to find documents that align with users' intents. In this setup, TART significantly outperforms competitive baselines, further demonstrating the effectiveness of guiding retrieval with instructions.


Introduction
Information retrieval (IR) is the task of finding relevant documents from a large collection of texts to fulfill a user's information need, typically expressed in the form of a textual query (Singhal et al., 2001). The notion of relevance from the user's perspective (i.e., intent) can be amorphous (Mizzaro, 1998), and a query alone may not fully capture user information needs (Ruthven and Lalmas, 2003; Taylor, 1962). As illustrated in Figure 1 (top), given the same query, "implementing batch normalization," users' intents can be diverse (e.g., find code snippets or paragraph-length answers).
Most existing work tries to learn those implicit intents from labeled data (e.g., pairs of queries and relevant documents), yielding separate models for different intents, as shown in the bottom left of Figure 1. These approaches usually require a vast number of annotated examples to train a model to capture the task-specific notion of relevance, even though they could benefit from the abundance of data available from related tasks. Additionally, having separate models leads to complicated pipelines. This paper advocates a new problem formulation, retrieval with instructions (Figure 1 bottom right), which explicitly models a user's intent by providing a natural language description of the search task (a.k.a. instruction). The goal of retrieval systems is then to retrieve documents that are both relevant to the query and well-suited to the instruction (task-aware). Explicitly defining the user intent with natural language instructions provides additional flexibility and enables unifying diverse retrieval tasks during training. (Code and models are available at https://github.com/facebookresearch/tart.)
Despite active research in language models (LMs), instruction following has not been systematically explored in retrieval, partly due to the lack of annotated resources. To facilitate research in retrieval with instructions, we introduce BERRI (Bank of Explicit RetRieval Instructions), a collection of approximately 40 retrieval datasets with diverse instructions in a unified format, covering 10 diverse domains. Each task has on average 3.5 diverse instructions annotated by experts, following our novel instruction schema for retrieval tasks.
We showcase the benefit of BERRI by training TART (Task-aware ReTriever), a multi-task retrieval system that learns to follow instructions to perform diverse tasks. We employ two widely explored architectures: TART-dual, a dense dual-encoder that retrieves documents based on the similarity between independently encoded documents and the query concatenated with the instruction, and TART-full, a cross-encoder that estimates the probability of a document being relevant to the query according to the instruction. We train TART leveraging hard negative samples and new instruction-unfollowing negative samples.
The TART models, particularly TART-full, yield state-of-the-art results on two popular zero-shot retrieval benchmarks, BEIR (Thakur et al., 2021) and LOTTE-pooled (Santhanam et al., 2022), outperforming systems with three times more parameters (Nogueira et al., 2020; Ni et al., 2021; Muennighoff, 2022) as well as task-specific retrievers trained on millions of automatically generated examples (Dai et al., 2022; Wang et al., 2022a).
We further introduce a new evaluation setup, X²-Retrieval (Cross-task Cross-domain Retrieval), where a system needs to handle queries with diverse intents and find relevant documents from a large-scale, cross-domain pooled corpus, simulating challenges in real-world retrieval applications. TART outperforms other state-of-the-art methods, demonstrating its effectiveness in this under-explored setting by leveraging explicit textual intents. In summary, our contributions are as follows:
• Retrieval with instructions, a new formulation to model users' intents explicitly (Section 3.1).
• BERRI, a new collection of about 40 retrieval datasets with instructions (Section 3.3).
• TART, a task-aware retriever trained on BERRI that advances the state of the art on zero-shot and cross-task retrieval (Section 4).
Related Work

Neural retrieval systems show large performance gains over term-based retrievers (e.g., BM25; Robertson and Zaragoza 2009) across domains when training data is abundant (Luo et al., 2022; Asai et al., 2021; Petroni et al., 2021). Due to the high annotation cost, improving neural retrievers in zero-shot settings is an active area of study. Pre-training neural retrievers (Izacard et al., 2022) and training a single retriever on large-scale supervised datasets such as MS MARCO (Bajaj et al., 2016) improve transfer to related retrieval tasks (Khattab and Zaharia, 2020; Nogueira et al., 2020; Chen et al., 2022), but such models often struggle with tasks unlike those used for training (Dai et al., 2022). To address this, several works (Wang et al., 2022a; Dai et al., 2022) train customized retrievers for each task using unlabeled corpora, leveraging another model to automatically generate training data (Wang et al., 2022a). This often requires running massive LMs and training separate retrievers, resulting in slow and costly adaptation. Concurrent to our work, Su et al. (2022) train a single dual-encoder model on embedding tasks, including retrieval tasks, with instructions.

Table 1: Examples of instructions in BERRI.
NQ: Retrieve a Wikipedia paragraph that answers this question.
QReCC: Find a dialogue response from dialogue history to answer the user's question.
Arguana: Retrieve a paragraph from an argument website that argues against the following argument.
SciFact: Find a sentence from a scientific paper to check if the statement is correct or not.
MultiLexSum: I want to find the one-sentence summary of this legal case.

Problem Formulation
This work introduces a new problem formulation, retrieval with instructions (Figure 1, bottom right). We are given a large collection of N documents D = {d_1, ..., d_N}, a search task instruction t, and a query q. The problem of retrieval with instructions is to find a document d ∈ D that is relevant to q according to the instruction t. Compared to the standard retrieval setting (e.g., Figure 1, bottom left), the difference is the explicit definition of relevance in the instruction t, given as additional input to the system; a retrieval system needs to be task-aware, changing its relevance measure by attending to the instruction. This new formulation brings both new research challenges and opportunities. For instance, a retriever is now required to modify its search behavior according to the instructions. On the plus side, different datasets can be naturally grouped to train a single retriever, yielding benefits from cross-task interdependence. Instructions provide extra flexibility and enable zero-shot transfer via natural language, unlike training with fixed task tags (Maillard et al., 2021). A single task-aware retriever also obviates the need to host multiple task-specific retrievers.
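To make the formulation concrete, the sketch below ranks a corpus with an arbitrary task-aware scoring function s(t, q, d); the function name and signature are illustrative rather than part of the paper.

```python
from typing import Callable, List, Tuple

def retrieve_with_instructions(
    instruction: str,
    query: str,
    corpus: List[str],
    score: Callable[[str, str, str], float],  # any task-aware relevance s(t, q, d)
    k: int = 10,
) -> List[Tuple[float, str]]:
    """Rank documents by s(t, q, d): the same query can yield different
    rankings under different instructions."""
    scored = [(score(instruction, query, doc), doc) for doc in corpus]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)[:k]
```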
Multi-task training with instructions has not been studied in retrieval due to the lack of resources and dedicated models. To facilitate research on retrieval with instructions, we introduce BERRI, a large-scale retrieval benchmark with expert-written annotations (Section 3.3) in a unified format (Section 3.2), and subsequently train multi-task instruction-following retrievers (Section 4).

Unified Task and Instructions Schema
Task format. Each task T in BERRI consists of a corpus D, queries Q = {q_1, ..., q_K}, and an instruction t, where K is the number of queries included in the task. An instance of each task includes a query q, gold (relevant) documents d^+, and negative (irrelevant) documents d^-. For each task, an explicit intent t is given.
Instruction schema for retrieval. We introduce a novel schema to define informative instructions for retrieval tasks, which have not been studied in prior instruction-following literature. An instruction that sufficiently describes an arbitrary retrieval task should include intent, domain, and unit. Specifically, intent describes how the retrieved text relates to the query, such as whether the text answers a question in the query or paraphrases it. Domain is the expected source or type of the retrieved text, such as Wikipedia or PubMed articles. Unit defines the text block to retrieve, such as a sentence or a paragraph. Table 1 shows examples of instructions, and Appendix A.5 shows the full list.
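As an illustration of the schema (not code from BERRI), the three fields can be represented and verbalized as follows; the dataclass and its `render` template are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class RetrievalInstruction:
    intent: str  # how the retrieved text relates to the query
    domain: str  # expected source or type of the retrieved text
    unit: str    # granularity of the text block to retrieve

    def render(self) -> str:
        # One possible verbalization; BERRI instructions are free-form,
        # expert-written sentences that cover these three elements.
        return f"Retrieve a {self.domain} {self.unit} that {self.intent}."

nq = RetrievalInstruction(intent="answers this question",
                          domain="Wikipedia", unit="paragraph")
print(nq.render())  # Retrieve a Wikipedia paragraph that answers this question.
```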

Dataset: BERRI
Dataset selection and unification. We manually collect datasets from (1) KILT (Petroni et al., 2021), (2) the Sentence-Transformers Training Data for Text Embedding Models, and (3) manual searches of the ACL Anthology and huggingface datasets, to cover diverse tasks and domains. Except for a few domains (e.g., Wikipedia), most domains do not have retrieval datasets, while they do have datasets for other NLP tasks that can be cast as retrieval (e.g., sentence paraphrasing). Re-purposing those non-retrieval tasks as retrieval tasks increases the diversity of the domains as well as of the instructions in BERRI. From an initial collection of more than 60 datasets, we conduct manual dataset inspection and select 37 datasets (Figure 2) covering diverse domains (e.g., Wikipedia, scientific papers) and tasks (e.g., fact verification, dialogue response retrieval, QA). See Appendix A.1 for more details.

Negative document selection. Negative samples are crucial for training retrieval systems (Zhan et al., 2021; Qu et al., 2021). In addition to randomly sampled negatives (random negative documents), we introduce two types of challenging negative samples: denoised hard negative documents d^HD and instruction-unfollowing negative documents d^UF. Figure 3 shows examples of gold documents and these negative samples.
For hard negatives d^HD, we run Contriever (Izacard et al., 2022) and then filter out false negative documents by running an off-the-shelf reranker and keeping passages with low scores (smaller than 0.1). We further introduce a new negative sampling strategy, instruction-unfollowing negative samples d^UF, to make systems learn to retrieve documents that are well-suited to the instructions. As shown in Figure 3, given the instruction "find an informative dialogue response", a system should not retrieve a Wikipedia paragraph about armadillos, even if it is highly relevant to the query. To obtain such negative documents, we retrieve documents from a different task's target corpus using Contriever and consider all of those documents to be negatives, since they do not satisfy the instruction. Details are in Appendix Section C.3.
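A rough sketch of the two mining procedures is shown below; `retriever.search` and `reranker.score` are assumed interfaces standing in for Contriever and the off-the-shelf reranker, and only the 0.1 score threshold comes from the text above.

```python
from typing import Dict, List, Set

def mine_hard_negatives(query: str, gold_ids: Set[str],
                        retriever, reranker, corpus: Dict[str, str],
                        k: int = 100, max_score: float = 0.1) -> List[str]:
    """Denoised hard negatives d_HD: top retrieved documents that are not gold
    and that the reranker scores below max_score (filtering false negatives)."""
    negatives = []
    for doc_id in retriever.search(query, corpus, k):      # assumed interface
        if doc_id in gold_ids:
            continue
        if reranker.score(query, corpus[doc_id]) < max_score:
            negatives.append(doc_id)
    return negatives

def mine_unfollowing_negatives(query: str, retriever,
                               other_task_corpus: Dict[str, str],
                               k: int = 20) -> List[str]:
    """Instruction-unfollowing negatives d_UF: documents retrieved from another
    task's corpus, possibly relevant to the query but violating the instruction."""
    return retriever.search(query, other_task_corpus, k)
```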

TART: Multi-task Instructed Retriever
We now present TART (TAsk-aware ReTriever) trained on BERRI via multi-task instruction-tuning, leveraging our unified task-aware schema.

Model Architecture
TART-dual. TART-dual adopts a dual-encoder architecture that independently encodes instruction-prefixed queries and documents, and uses maximum inner product search (MIPS) over the embeddings (Karpukhin et al., 2020). The similarity between a query q and a document d, given an instruction t, is calculated as

s(t, q, d) = E([t; q]) · E(d),

where E(·) is the embedding function and [t; q] is the concatenation of the instruction and the query. For this model, document embeddings can be computed offline, improving inference efficiency at the cost of storage space (Yamada et al., 2021).
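A minimal sketch of this scoring with a Contriever-style encoder is shown below; the checkpoint name and mean pooling follow the public Contriever-MS MARCO model that TART-dual is initialized from, but the exact way the instruction is joined to the query (here a plain space-separated prefix) is an assumption, not the released TART code.

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("facebook/contriever-msmarco")
encoder = AutoModel.from_pretrained("facebook/contriever-msmarco")

def embed(texts):
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = encoder(**batch).last_hidden_state
    mask = batch["attention_mask"].unsqueeze(-1).float()
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1)  # mean pooling over tokens

instruction = "Retrieve a Wikipedia paragraph that answers this question."
query = "implementing batch normalization"
docs = ["Batch normalization is a technique that normalizes layer inputs ...",
        "def batch_norm(x, gamma, beta): ..."]

q_emb = embed([f"{instruction} {query}"])  # instruction prepended to the query
d_emb = embed(docs)                        # documents encoded independently (offline)
scores = q_emb @ d_emb.T                   # inner-product similarity used for MIPS
```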
TART-full. The dual-encoder architecture is known to be less expressive due to its limited query-document interactions (Khattab and Zaharia, 2020). To address this issue, we also explore a cross-encoder architecture (Nogueira and Cho, 2019), which computes the relevance between a query and each document by jointly encoding them with cross-attention. A cross-encoder is often prohibitively expensive to scale to millions of documents, so we first run a lightweight off-the-shelf dual-encoder retriever to retrieve the top documents. For each of these documents, TART-full computes the relevance score as

s(t, q, d) = FFN(Enc([t; q; d])),

where Enc is the encoder and FFN is an additional feed-forward network that predicts whether the document follows the instruction and is relevant to the query. We initialize TART-full with the encoders of T5-based instruction-following pretrained models, namely T0-3B (Sanh et al., 2022) and FLAN-T5-3B (Chung et al., 2022), for their empirical competitiveness, as found in prior work (Sachan et al., 2022). Following the EncT5 approach (Liu et al., 2021), we prepend each sequence with a start-of-sequence token, whose representation is then fed to a newly initialized feed-forward network. Unlike MonoT5 (Nogueira et al., 2020), we use only the encoders for parameter efficiency, halving the number of parameters.
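The sketch below mirrors the encoder-only (EncT5-style) cross-encoder described above, with a small T5 checkpoint standing in for the T0-3B / FLAN-T5 encoders; the input format, separator, and classification head are assumptions rather than the released TART-full implementation.

```python
import torch
from torch import nn
from transformers import AutoTokenizer, T5EncoderModel

class CrossEncoderSketch(nn.Module):
    def __init__(self, name: str = "google/t5-v1_1-base"):
        super().__init__()
        self.encoder = T5EncoderModel.from_pretrained(name)    # encoder only
        self.head = nn.Linear(self.encoder.config.d_model, 1)  # newly initialized FFN

    def forward(self, input_ids, attention_mask):
        hidden = self.encoder(input_ids=input_ids,
                              attention_mask=attention_mask).last_hidden_state
        return self.head(hidden[:, 0])  # read the first-token representation

tok = AutoTokenizer.from_pretrained("google/t5-v1_1-base")
model = CrossEncoderSketch()
text = ("Retrieve a Wikipedia paragraph that answers this question. "
        "implementing batch normalization </s> Batch normalization is a technique ...")
batch = tok(text, return_tensors="pt", truncation=True, max_length=512)
prob = torch.sigmoid(model(**batch))  # probability that d follows t and matches q
```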

Training TART
We train TART-dual and TART-full using the positive documents and the three types of negative documents in BERRI, together with instructions (Figure 3).

Training TART-dual. We train TART-dual using the annotated positive and negative documents in BERRI as well as in-batch negatives, minimizing the contrastive loss

L(t, q, d^+) = -log [ exp(s(t, q, d^+)) / Σ_{d ∈ B} exp(s(t, q, d)) ],

where B denotes all documents in the same mini-batch (Karpukhin et al., 2020).
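A sketch of this objective for a batch of query-document pairs is given below (in-batch negatives only; the annotated hard and instruction-unfollowing negatives would be appended as extra score columns). The temperature value mirrors the hyperparameter reported in Appendix C.1.

```python
import torch
import torch.nn.functional as F

def in_batch_contrastive_loss(q_emb: torch.Tensor, d_emb: torch.Tensor,
                              temperature: float = 0.05) -> torch.Tensor:
    """q_emb[i]: embedding of instruction+query i; d_emb[i]: its gold document.
    Every other document in the batch serves as a negative."""
    scores = (q_emb @ d_emb.T) / temperature             # (B, B) similarity matrix
    targets = torch.arange(q_emb.size(0), device=q_emb.device)
    return F.cross_entropy(scores, targets)              # -log softmax of the diagonal
```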
Training TART-full. Following prior work (Nogueira and Cho, 2019), TART-full is trained with a binary cross-entropy loss over relevant and irrelevant query-document pairs,

L = -[ y log p(rel | t, q, d) + (1 - y) log (1 - p(rel | t, q, d)) ],

where y indicates whether d is a gold document for q under instruction t.

Knowledge distillation from TART-full to TART-dual. The default hard negatives in BERRI rely on off-the-shelf models fine-tuned on MS MARCO; for some domains, the hard negatives mined by those models can be less reliable. For a smaller dual-encoder model, such false positive and negative samples can diminish performance (Qu et al., 2021). We therefore apply hard knowledge distillation with TART-full (Qu et al., 2021): we first train TART-full on the annotated gold and negative documents in BERRI, and then use it, with instructions, to update the hard negative and positive documents as in Section 3.3.
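A rough sketch of this re-mining step is below; `cross_encoder.score` is an assumed interface for the trained TART-full model, and the thresholds are illustrative, not values from the paper.

```python
def refresh_training_pairs(cross_encoder, instruction, query, candidate_docs,
                           pos_threshold=0.9, neg_threshold=0.1):
    """Re-label retrieval candidates with TART-full's instruction-aware scores,
    replacing hard negatives mined by MS MARCO-tuned off-the-shelf models."""
    positives, negatives = [], []
    for doc in candidate_docs:
        p = cross_encoder.score(instruction, query, doc)  # assumed interface
        if p >= pos_threshold:
            positives.append(doc)
        elif p <= neg_threshold:
            negatives.append(doc)
    return positives, negatives
```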

Experiments
We evaluate TART on zero-shot retrieval (Section 5.1) and on our new, more challenging evaluation setup, X²-Retrieval (Section 5.2).

Zero-shot Retrieval Evaluations
We run experiments on two popular zero-shot retrieval benchmarks: BEIR (Thakur et al., 2021) and LOTTE (Santhanam et al., 2022). None of the evaluation datasets overlap with BERRI.
BEIR is a collection of diverse retrieval tasks in multiple domains, where the retrieval target is restricted to a single-domain corpus. We use the publicly available datasets. LOTTE-Search samples GooAQ (Khashabi et al., 2021) questions whose answers come from certain StackExchange forums. We evaluate our model in the pooled setup, where documents come from forums in diverse domains (e.g., cooking, technical). GooAQ is not included in our training set. In LOTTE, our instructions specify which forum our system should retrieve evidence from (e.g., "Retrieve a cooking StackExchange forum post").

Metrics. Following Thakur et al. (2021), we use NDCG@10 as our primary metric on BEIR. For LOTTE-pooled, we use Success@5 (= Recall@5) as our primary metric, as in the original paper (Santhanam et al., 2022).
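For reference, minimal implementations of the two metrics are sketched below; the NDCG sketch uses the common linear-gain formulation, whereas the official BEIR evaluation relies on its own toolkit.

```python
import math
from typing import Dict, List, Set

def ndcg_at_k(ranked: List[str], gains: Dict[str, float], k: int = 10) -> float:
    """NDCG@k for one query; `gains` maps doc id -> graded relevance."""
    dcg = sum(gains.get(doc, 0.0) / math.log2(rank + 2)
              for rank, doc in enumerate(ranked[:k]))
    ideal = sorted(gains.values(), reverse=True)[:k]
    idcg = sum(g / math.log2(rank + 2) for rank, g in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0

def success_at_k(ranked: List[str], gold: Set[str], k: int = 5) -> float:
    """Success@k: 1.0 if any gold document appears in the top k, else 0.0."""
    return float(any(doc in gold for doc in ranked[:k]))
```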

X²-Retrieval Evaluation
Users' intents can be diverse, requiring searching in an open-domain environment (Piktus et al., 2021), which is currently under-explored.
We introduce a more realistic evaluation setup, X²-Retrieval (Cross-task Cross-domain Retrieval), where several retrieval tasks with different intents are pooled to form a single retrieval target containing diverse documents. This requires a system not only to adapt to new tasks in a zero-shot manner but also to model users' intents expressed in natural language to meet their information needs.

Tasks and queries. Our X²-Retrieval evaluation covers six datasets across three domains: Wikipedia, Science, and Technical (Table 2). The key challenge is that the datasets carry different search intents that may not always be obvious from the queries alone.
A pooled corpus. For the primary pooled setup, we combine all documents from the different tasks with the BEIR NQ Wikipedia corpus to form a single retrieval corpus of approximately 3.7 million documents. We also report performance in a simplified closed setup as an oracle, where a system retrieves only from the original task corpus.
Metrics. We report NDCG@10 in both the pooled and closed setups for each task. In addition, we measure the performance gap between the closed and pooled setups and refer to it as robustness. A smaller gap means that the model is less distracted by documents from undesirable corpora.

Baselines
We compare TART with various state-of-the-art methods, grouped as in Table 3: models trained with no labeled data, models trained on MS MARCO without task-specific customization, and specialized retrievers trained for each task on automatically generated data. Among the last group, Promptagator (Dai et al., 2022) generates a large amount of in-domain data using FLAN (Wei et al., 2022a), and GPL (Wang et al., 2022a) does so using DocT5Query (Nogueira et al., 2019). We also compare TART with its counterparts trained on BERRI and evaluated without instructions, TART-dual w/o I and TART-full w/o I.
We sample positive and negative passages with a 1:4 ratio. We initialize TART-dual from Contriever-MS MARCO (Izacard et al., 2022), which is based on BERT-base. The per-GPU batch size is 16, and for each positive document we sample 5 negative passages in total, where 90% are randomly sampled from D and 10% are sampled from d^HD and d^UF. We use the top 100 Contriever-MS MARCO results as the initial candidates for TART-full. Table 9 shows the instructions used for evaluation. More details are in Appendix C.1.
Results and Analysis

Results
Zero-shot evaluation. As shown in Table 3, TART-full and TART-dual largely outperform their counterparts trained and tested without instructions, demonstrating the effectiveness of instruction tuning for better zero-shot retrieval. TART-full significantly outperforms larger models and customized models trained on millions of synthetically generated in-domain examples, advancing the state of the art on BEIR and LOTTE. Unlike prior methods that require additional data generation, TART only requires a single human-written instruction to adapt to a new task. Compared to other methods using cross-encoder-based reranking models (e.g., BM25+MonoT5), TART-full re-ranks a much smaller number of paragraphs, which significantly reduces reranking latency at test time. The large performance gains from Contriever (MS) to TART-dual on six of the nine BEIR tasks (e.g., SciFact, Arguana) show the effectiveness of instructions and knowledge distillation. However, on the other three datasets (e.g., Touche-2020), TART-dual shows large performance deterioration. We hypothesize that model capacity (i.e., BERT-base) and the limited interactions between query and document embeddings are the major bottlenecks. Prior work on instruction tuning of large LMs has shown that smaller models often benefit less than larger ones from instructions and increased dataset size, possibly due to their limited capacity (Chung et al., 2022). Su et al. (2022) also observe a more significant gain from instruction tuning when using larger encoder models (i.e., GTR-base vs. GTR-XL), reporting performance deterioration on retrieval tasks when instruction-tuning a 335-million-parameter base model. Future work can investigate efficient architectures that enable richer interactions between instruction-augmented queries and documents.
X²-Retrieval evaluation. Table 4 shows the models' X²-Retrieval performance. Contriever and Contriever+CE show competitive performance in the closed setup, as on BEIR, but they struggle in the pooled setup due to their inability to handle human instructions. In particular, Contriever+CE shows a large performance drop on AmbigQA-pooled by retrieving documents instead of queries, due to biases from fine-tuning only on a QA dataset (i.e., MS MARCO).
TART-full shows the best closed and pooled performance, indicating its strong zero-shot adaptation and cross-task abilities. We find that the model can flexibly change its behavior based on the instructions, as shown in Table 11. TART-dual shows strong performance in the pooled setup, indicating that smaller models can also be guided by explicit instructions.

Analysis
Ablating instructions. We compare TART-full with three variants: (a) train without instructions, test with instructions, which prepends instructions at test time only, to test whether the models merely exploit keyword matching at test time; (b) train with instructions, test without instructions, which uses TART-full without instructions at test time; and (c) train without instructions, test without instructions, which does not use instructions at all during training or test time. Figure 4 shows the performance of those baselines. On all benchmarks, ablating instructions during training or at test time causes a notable performance drop. We also see that a model trained with instructions but given no instruction at test time still yields small improvements over the model trained completely without instructions, indicating the effectiveness of multi-task instruction tuning.
Robustness toward instructions. Figure 5 shows the performance variance given multiple different instructions. Instructions significantly improve performance over the no-instruction baseline (the blue circles). Although different instructions yield small performance variance, TART often outperforms other baselines when informative instructions are given. See Table 15 for the individual instructions.
Dataset scale. Following prior work on instruction tuning for LMs (Wang et al., 2022b; Wei et al., 2022a), we conduct a dataset ablation, where we reduce the number of training datasets. Figure 6a shows the average BEIR performance of TART-full trained on 5, 10, and 20 randomly sampled datasets.
Increasing the number of training datasets helps TART perform better. In addition to domain and task diversity, the diversity of the instructions observed during training may also improve performance, as shown in Appendix Section E.3.
Effects of negative sampling. We analyze the effectiveness of negative samples by ablating them during training. Figure 7 shows the performance of the models trained without certain negative samples on BEIR and X²-Retrieval. Adding more challenging negative documents (i.e., d^HD and d^UF) during training largely improves performance on BEIR. Moreover, the model trained without instruction-unfollowing samples (w/o d^UF) yields lower X²-Retrieval performance, although it performs on par with the original TART-full on BEIR. This indicates that our new instruction-unfollowing negative documents largely contribute to the ability to distinguish instructions and are thus crucial for building a robust task-aware retrieval system.

Model scale. Instruction-following abilities have been shown to improve with model scale (Wei et al., 2022b). We investigate how model scale affects the ability to generalize to new tasks and follow instructions. For a fair comparison, we train TART-full using T5 LM-Adapt models of different sizes (base, large, and XL) and evaluate them by reranking the top 100 Contriever results.
Figure 6b shows TART-full's average performance across different model scales. We observe clear performance improvements with increasing model size, as observed in prior work on large LMs.

Conclusion
This paper lays the foundation for building a general-purpose task-aware retriever that can follow natural language instructions. We introduced a new setup, retrieval with instructions, to model users' intents explicitly. We presented BERRI, the first large-scale retrieval dataset with expert-written instructions. Building upon BERRI, we trained the first instruction-following retrieval system via large-scale multi-task instruction tuning. TART advances the state of the art on two zero-shot retrieval benchmarks, BEIR and LOTTE, as well as on our newly introduced, more challenging evaluation setup.

Limitations
Although our TART-full model shows the effectiveness of instruction tuning for retrieval, on some datasets TART-dual shows large performance degradation compared to its non-instruction-following counterpart. We hypothesize that the smaller model size (i.e., 110 million parameters) and the limited interactions between query and document embeddings are the main factors. We conducted preliminary experiments training larger dual-encoder models such as SGPT (Muennighoff, 2022) on BERRI, but still observed notable performance drops on some datasets, which indicates that scaling up encoders alone may not significantly improve instruction-following retrieval systems. Future work can study better approaches to training larger dual-encoder models, as well as architectures that enable rich interactions while remaining more efficient than a cross-encoder, such as ColBERT-v2 (Santhanam et al., 2022).
Retrieval tasks have been excluded from prior work on instruction-following LLMs. This work is the first to explore instruction tuning for retrieval: we annotate more than 100 instructions for approximately 40 tasks and demonstrate the effect of dataset scale in retrieval. Yet recent work (Wang et al., 2022b; Chung et al., 2022) shows that scaling up the number of training datasets improves LLMs' ability to adapt to new tasks via instructions, and the current dataset scale might not be optimal. We open-source our instruction data and call for community efforts to collect more retrieval tasks and human-written instructions, as in instruction-following for LMs (Wang et al., 2022b; Bach et al., 2022), to investigate whether further increasing the number of datasets leads to improvements.

Ethical Considerations
Although instruction tuning on many datasets enables better zero-shot transfer, TART does not always retrieve documents that perfectly align with users' expectations. Applying TART to safety-critical domains requires extra attention. BERRI includes approximately 40 tasks covering diverse domains. Although the data has been automatically filtered and we have examined it, there may still be harmful or privacy-sensitive content. We will release all of the data and preprocessing scripts so that follow-up work can inspect these dataset issues and their effects.

A Details of BERRI

Unification and instruction annotations. For retrieval datasets such as MS MARCO, we use the annotated gold documents as positive documents d^+ for a given query q. For non-retrieval tasks, we use the original input sequence as the query q and the original output or given context as d^+.
For instance, given a summarization dataset, we use the source text and the summary as the query and the gold document, respectively. More details about the dataset unification are available in Section A.2. For datasets without preprocessed retrieval targets, we gather all positive and negative documents provided by the original dataset to build a single task-specific retrieval corpus D.

A.2 Details of Dataset Unification
As shown in Table 5, some datasets were not originally retrieval datasets (e.g., summarization datasets). We describe how we convert these into the unified retrieval task format.
QA. For QA datasets, where each instance consists of a query, a gold context, and answers, we use the original gold context as the gold document (positive sample) during training.
(Footnote 9: For example, finding a corresponding review text for the review title "I love this!" is under-specified.)
(Footnote 10: Prior work has shown that MS MARCO can be beneficial to many downstream retrieval tasks (Izacard et al., 2022).)
(Footnote 11: For example, KILT datasets such as FEVER or NQ use the unified Wikipedia corpus.)

For some datasets, we performed additional preprocessing. We found that ReCoRD instances are occasionally self-contained due to the nature of cloze-style QA; therefore, for ReCoRD, we replace the original placeholder with the gold answer and use this question with the answer filled in as the query and the original context as the gold document. For MedMCQA, we use the source exam question as the query and the answer evidence as the positive document.
Summarization. For summarization datasets, we use the target summaries as the gold documents and the source texts as the queries.

Text simplification. For text simplification datasets, we use the source (often more complex) sentences as the queries and the simplified sentences as the gold documents.
Code search. We use the source comment as the query and the corresponding implementation as the gold document. We exclude the Python subset from BERRI, as we use it for X²-Retrieval.
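A compact sketch of these conversions is shown below; the field names are hypothetical, and the released BERRI preprocessing scripts may differ.

```python
def unify_example(task_type: str, example: dict) -> dict:
    """Cast a non-retrieval instance into the unified (query, positive) format."""
    if task_type == "qa":
        return {"query": example["question"], "positive": example["gold_context"]}
    if task_type == "summarization":
        return {"query": example["source_text"], "positive": example["summary"]}
    if task_type == "simplification":
        return {"query": example["complex_sentence"], "positive": example["simple_sentence"]}
    if task_type == "code_search":
        return {"query": example["comment"], "positive": example["code"]}
    raise ValueError(f"unknown task type: {task_type}")
```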

A.3 BERRI Statistics
We conduct analyses on BERRI to understand its domain and intent diversity.
Intents. Open-ended intents are diverse and hard to classify into a fixed set of categories. As a proxy for intents, Figure 8 shows the distribution of the source task categories. QA is the most common category, while summarization and duplicate question detection are also common due to their abundance in large-scale datasets. On the other hand, around 50% of the tasks do not belong to these top three categories (e.g., code search or caption generation), which contributes to the diversity of BERRI. We also find that traditionally non-retrieval tasks, such as sentence simplification or dialogue, can be repurposed as retrieval tasks.
Domains. Our dataset covers diverse domains. Figure 9 shows that Wikipedia (e.g., NQ), the web (e.g., MS MARCO), community QA (e.g., Quora), and news (e.g., CNN/Daily Mail) dominate, while we also include expert domains (e.g., medical, legal, technical). Although many expert-domain datasets are smaller than those in general domains like Wikipedia, adding these high-quality expert-domain datasets helps the system learn to adapt to those domains, or to unseen expert domains with a similar writing style (e.g., scientific papers).

A.4 Dataset List
Table 5 shows all datasets used in BERRI. Table 6 provides references for these datasets.

A.5 Instructions for BERRI
Table 7 shows the full list of the instructions in BERRI. Note that we present only one instruction per dataset. A full list of the instructions will be released in our repository.
B Further Details about X²-Retrieval

Query and corpus creation. For AmbigQA, we use the official development split, including 1,172 queries, as the official test split annotations are not publicly available. We use all paraphrased questions from the train and development sets to form the retrieval corpus. For WIKIQA, we combine the development and test splits available on huggingface datasets, and use the question-answer sentence pairs labeled as 1 as the evaluation queries and the answer sentences as the gold documents. As the retrieval target, we use all sentences available in the WIKIQA dataset, including sentences labeled as 0. For LinkSO, we use the original dataset's test split for the Python domain and sample 1,000 queries. We take questions labeled as duplicates and use their corpus as our retrieval target. For GooAQ-technical, we sample 1,000 GooAQ questions whose answers come from stackoverflow.com. As 20% of the sampled GooAQ tech queries share the same answer posts, we remove duplicated paragraphs. For CodeSearchNet-Python, we use the comments describing the code as queries and the corresponding Python code as positive documents. We sample 1,000 queries from the test split.
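The pooled-corpus construction can be sketched as follows; the identifiers and id-prefixing scheme are illustrative, not the exact preprocessing used for X²-Retrieval.

```python
from typing import Dict, Iterable, List, Tuple

def build_pooled_corpus(task_corpora: Dict[str, Dict[str, str]]) -> Dict[str, str]:
    """Merge per-task corpora (plus, e.g., the BEIR NQ Wikipedia corpus)
    into one retrieval target, prefixing ids with the task name."""
    pooled = {}
    for task, corpus in task_corpora.items():
        for doc_id, text in corpus.items():
            pooled[f"{task}:{doc_id}"] = text
    return pooled

def wikiqa_eval_pairs(rows: Iterable[Tuple[str, str, int]]) -> List[Tuple[str, str]]:
    """Keep WIKIQA question-sentence pairs labeled 1 as (query, gold answer)."""
    return [(question, sentence) for question, sentence, label in rows if label == 1]
```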
Examples. Examples of X²-Retrieval are shown in Table 8. As shown, queries alone often do not fully indicate the users' intents. By specifying users' intents as explicit textual instructions, our model can effectively perform multi-task retrieval over a single pooled corpus.
Human evaluation of quality. To assess the possibility of false negative passages, we run an off-the-shelf retrieval system to retrieve the top 10 documents for 20 randomly sampled questions per task and check whether any of the negative passages, especially those from non-target corpora, are in fact positive. We find that the false negative ratio is less than 10%.

C Modeling Details
C.1 Hyperparameters of TART

TART-dual. We set the learning rate to 1 × 10⁻⁵ and the warm-up steps to 1,000. The softmax temperature is set to 0.05 and the batch size to 1,024. We use 7 negative samples per instance; 10% of the time we use hard negatives or instruction-unfollowing negatives, and 90% of the time we use negative documents randomly sampled from the same target corpus. The maximum document length is set to 256.
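The reported TART-dual hyperparameters can be summarized in a single configuration dictionary (a convenience sketch; the key names follow no particular training framework):

```python
TART_DUAL_CONFIG = {
    "learning_rate": 1e-5,
    "warmup_steps": 1000,
    "softmax_temperature": 0.05,
    "batch_size": 1024,
    "negatives_per_instance": 7,
    "hard_or_unfollowing_negative_ratio": 0.10,  # remaining 90%: random in-corpus negatives
    "max_document_length": 256,
}
```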
TART-full. To train a cross-encoder using the T0-3B encoder, we set the maximum sequence length to 512 and the batch size to 1, increasing the gradient accumulation steps to 8. We set the dropout rate to 0.1 and the learning rate to 1 × 10⁻⁵.

Instructions for the BERRI datasets (excerpt of Table 7):
Yahoo! Answers: Retrieve the most voted answer for this question from Yahoo Answers.
MSMARCO: I want to know the answer to the question. Can you find good evidence on the web?
ELI5: You have to answer a why / how question from users. Retrieve a Wikipedia paragraph that provides a piece of good evidence for the answer.
WikiHow: Find a detailed paragraph from WikiHow that explains how to achieve ...
SearchQA: Pick up the top web search result snippets for the following question.
AGNews: Find a news summary sentence corresponding to the following header.
NPR: Given a news article headline published at npr.org, find a corresponding summary of the news.
CodeSearchNet (Java): Match the following natural language instruction to Java codes.
CodeSearchNet (Ruby): Retrieve Ruby codes from GitHub commit history that implement this feature.
CodeSearchNet (JavaScript): Find a JavaScript code implementation on GitHub for the following natural language instructions.
CodeSearchNet (Go): Can you find a Go implementation of this?
WoW: Find a Wikipedia paragraph related to the following conversation topic.
WoW-Response: Find a meaningful dialogue response to answer the user's question.
Medical Simplification: Please retrieve a medical paper summary that is written in simple language so that my patient can understand.
SciTLDR: Find a sentence-length summary of this paper.
PubMedQA: Help me to find a highly related PubMed paper to answer this question.
MedMCQA: Find the explanation for the correct answer of this medical question.
Gigaword: Retrieve an extremely short summary of the following Gigaword article.
ReCoRD: Find a news article to verify the following sentence.
MultiLexSum: Map this legal case summary to a sentence-long summary.
QReCC: You need to find a good response from a collection of previous responses and help users to know this topic more.
OQA: Find a question that is a paraphrase of this one.
SQuAD: Find a Wikipedia paragraph that answers the question.

D More Experimental Details
[...] in addition to the in-batch negative documents. We use 8 GPUs to train TART-full and 64 GPUs to train TART-dual. We train TART-full for up to 10k steps and TART-dual for up to 30k steps, and take the checkpoint with the best development performance.

E Further Results and Analyses
E.1 Qualitative Results on X²-Retrieval

Table 11 shows qualitative examples given different instructions on X²-Retrieval, and Table 12 compares TART-full with Contriever-MS MARCO.

E.2 Analysis of Instruction Effectiveness
Full results of instruction ablations. Table 13 shows the full BEIR results of ablating instructions, and Table 14 shows the results on LOTTE and X²-Retrieval. On all of the benchmarks, removing instructions at training or test time largely hurts performance, indicating the effectiveness of instructions.

Examples of prompts with performance. Table 15 shows the instructions and TART-full performance on three BEIR datasets. We also provide a comparison of model performance when uninformative instructions are given in Table 16. We see that more informative and relevant instructions often result in strong performance, while irrelevant instructions degrade it.

E.3 Analysis on Model and Dataset Scale
Task diversity. As shown in Figure 10, task diversity is key to improving models' zero-shot transfer performance. The model trained only on QA datasets (QA-only) struggles on Arguana, where the task significantly differs from QA.

Domain diversity.
Figure 10 shows that having more diversity in the training datasets' domains is also crucial, especially when the target datasets are in non-general domains. For instance, a model trained only on Wikipedia-based datasets struggles on Touche-2020 and SciFact, whose documents come from argument websites and scientific papers, respectively.
Per-dataset performance breakdown. Table 17 shows NDCG@10 across different model scales. We compare TART-full initialized with different pretrained models' encoders (Table 19).

Figure 1:
User intents are not fully captured by the query q alone (top). Conventional approaches (bottom left) take a query and retrieve documents from a closed corpus using a task-specific retriever. Retrieval with instructions (bottom right) additionally takes an explicit intent.

Figure 2:
Examples of datasets included in BERRI. Table 5 shows the full dataset list.

Figure 3:
Examples of documents that are considered gold documents d^+ and two types of negative documents d^-: hard negatives d^HD and instruction-unfollowing negatives d^UF, for two different query and instruction pairs.

Figure 6:
Analysis of dataset and model scale.
Figure 7:
Performance of models trained without certain negative samples on BEIR and X²-Retrieval.

Figure 8:
The task distribution of the datasets included in BERRI.
Figure 9:
The domain distribution of the datasets included in BERRI.

Figure 10:
Dataset ablation results. Wikipedia-only denotes TART-full trained on Wikipedia-based datasets only; QA-only denotes the model trained on QA datasets only.


Table 2:
The X²-Retrieval evaluation. Example pairs of queries and documents are shown in Table 8. In addition to the corpora listed above, we add the Natural Questions corpus from BEIR (Thakur et al., 2021).

Table 3:
Zero-shot retrieval results on BEIR and LOTTE-Search. † indicates models using cross-encoder-based reranking. The first group of models uses no labeled data during training. The second group uses MS MARCO at training time but no customized task-specific data. The third group trains individual retrieval systems using automatically generated data. TREC, NFC, FQA, ARG, TOU, DBP, SCD, CLI, SCF indicate TREC-COVID, NF Corpus, FIQA, Arguana, Touche-2020, DBPedia, SciDocs, Climate-FEVER, and SciFact, respectively. "×9" for GPL and Promptagator means that those models train customized models for each dataset.
Table 9 lists the instructions used for the BEIR and X²-Retrieval evaluations.

Table 6:
References for the datasets used in BERRI and in the evaluations. We use the preprocessed versions available on the Sentence-Transformers (Reimers and Gurevych, 2019) embedding data page for datasets marked with *, and the preprocessed versions from KILT (Petroni et al., 2021) for datasets marked with †.
PAQ: Can you answer my question by finding an article on the web?
Sentence Compression: You have to match this long sentence to a shorter compressed one.
CNN/Daily Mail: The following sentences are the summaries of a news article. Find the source news article.
XSUM: Retrieve a news article that is summarized as following.
Coco Captions: Can you find an image caption talking about the same image as ...
Quora Dup. Questions: Check if a Quora question is duplicated with this question.

Table 7:
Full list of the instructions for the BERRI datasets. We present one instruction per dataset. All of the instructions are available in our GitHub repository.

Table 10 :
The list of the combinations of the dataset and corresponding instruction-unfollowing corpora to mine instruction-unfollowing negative documents.

Table 12:
We compare TART-full outputs with Contriever-MS MARCO (Izacard et al., 2022) predictions on X²-Retrieval. We show the top prediction for the first four examples and the top three predictions for the bottom examples. ✓ means that the document follows the instruction, while ✗ means that it does not.

Table 15 :
Performance on SciFact, Climate-FEVER and Touche-2020 with different instructions.

Table 16:
Full list of the instructions used for evaluations. [NULL] means that no instruction is given to TART-full at inference time. ✓ marks a correct instruction, while ✗ marks an incorrect instruction.

Table 19:
Zero-shot retrieval results for TART-full initialized with different pretrained models' encoders on BEIR. TREC, NFC, FQA, ARG, TOU, SCD, CLI, SCF indicate TREC-COVID, NF Corpus, FIQA, Arguana, Touche-2020, SciDocs, Climate-FEVER, and SciFact, respectively.