Meta-training with Demonstration Retrieval for Efficient Few-shot Learning

Large language models show impressive results on few-shot NLP tasks. However, these models are memory- and computation-intensive. Meta-training allows one to leverage smaller models for few-shot generalization in a domain-general and task-agnostic manner; however, these methods alone result in models that may not have sufficient parameterization or knowledge to adapt quickly to a large variety of tasks. To overcome this issue, we propose meta-training with demonstration retrieval, where we use a dense passage retriever to retrieve semantically similar labeled demonstrations for each example, yielding more varied supervision. By separating external knowledge from model parameters, we can use meta-training to train parameter-efficient models that generalize well on a larger variety of tasks. We construct a meta-training set from UnifiedQA and CrossFit, and propose a demonstration bank based on UnifiedQA tasks. To our knowledge, our work is the first to combine retrieval with meta-training, to use DPR models to retrieve demonstrations, and to leverage demonstrations from many tasks simultaneously, rather than randomly sampling demonstrations from the training set of the target task. Our approach outperforms a variety of targeted parameter-efficient and retrieval-augmented few-shot methods on QA, NLI, and text classification tasks (including SQuAD, QNLI, and TREC). Our approach can be meta-trained and fine-tuned quickly on a single GPU.


Introduction
Large language models (LLMs) have become increasingly popular due to their impressive few-shot performance on many NLP tasks and domains (Brown et al., 2020; Chowdhery et al., 2022). This has resulted in many few-shot learning methods based on LLMs that require ever-larger GPUs and increasing computation. Methods requiring no parameter updates such as in-context learning (Brown et al., 2020) and parameter-efficient methods like Adapters (Houlsby et al., 2019) partially mitigate these downsides, but ultimately, larger computation budgets are increasingly necessary to achieve state-of-the-art few-shot performance, even to simply load models and perform inference.

Figure 1: Our approach. Given an input x from one of many possible QA tasks, we use a dense passage retriever to retrieve K semantically similar demonstrations Z = {z_k}_{1,...,K} from a memory bank composed of labeled examples. We meta-train BART, supervising it to generate the (question and) answer y given x and Z across a diverse collection of QA tasks.
Meta-learning (Vilalta and Drissi, 2002; Finn et al., 2017) and meta-training (Min et al., 2022a) are methods that make smaller language models capable of quicker and more robust few-shot performance across multiple tasks and domains. However, smaller models may not be able to store enough knowledge for effective generalization in many domains and tasks simultaneously. Retrieval is one way to overcome this: by separating parametric knowledge in the language model from external knowledge (stored as retrievable text), one can leverage much more information than could be stored in the parameters of a language model. For example, retrieval-augmented generation (RAG; Lewis et al., 2020) and retrieval-enhanced transformers (RETRO; Borgeaud et al., 2022) retrieve natural language passages to improve performance on knowledge-intensive NLP tasks, although they do not perform meta-learning or meta-training and only evaluate on high-resource knowledge-intensive tasks.
We thus propose meta-training with demonstration retrieval as a more parameter-efficient way to leverage demonstrations for few-shot learning. We retrieve semantically similar labeled demonstrations for each training and test example during meta-training and fine-tuning. On a relatively small sequence-to-sequence model (BART-large, 440M parameters), we show our proposed approach is capable of generalizing quickly and well on a variety of downstream tasks (Table 1). Inspired by retrieval-augmented generation (RAG) models (Lewis et al., 2020), we use a dense passage retriever (DPR; Karpukhin et al., 2020) to retrieve demonstrations instead of Wikipedia passages. We retrieve semantically similar demonstrations from a large and diverse bank (§3.3) that is compiled from many existing question answering tasks (App. A), rather than randomly sampling demonstrations from the training set of the target task like most contemporary work (Min et al., 2022a; Brown et al., 2020; Gao et al., 2021).
Our experiments show that our method (§3) outperforms tailored efficient few-shot baselines and other retrieval-augmented models on various tasks, including natural language inference (NLI), paraphrase detection, and extractive question answering (§5). To our knowledge, our work is the first to combine retrieval with meta-training (or multi-task training more broadly), to use DPR models to retrieve demonstrations, and to leverage demonstrations from many tasks simultaneously, rather than retrieving random or k-nearest demonstrations from the training set of the target task.
Our code is available on GitHub.

Related Work
Meta-learning (Vilalta and Drissi, 2002; Finn et al., 2017) is a class of methods that supervise a model on how to learn; the goal is to leverage a collection of meta-training tasks to learn a better learning algorithm that generalizes to held-out tasks. Inspired by meta-learning, some recent studies have attempted to induce specific abilities in language models in a task- and domain-agnostic manner via meta-training; this entails directly supervising a model on labeled examples from various tasks (sometimes using some controlled format or template (Chen et al., 2022; Wei et al., 2022)) to directly induce specific abilities or better inductive biases that improve generalization. Meta-training is typically accomplished via some form of controlled multi-task learning, as in Min et al. (2022a). Many studies have explored multi-task and multi-domain learning (Khashabi et al., 2020; Zhong et al., 2021; Aghajanyan et al., 2021; Ye et al., 2021; Wei et al., 2022), but these studies often leverage tasks that improve a model's abilities for some specific (set of) downstream tasks. In meta-training, we aim to directly improve the learning algorithm via controlled supervision, which should improve out-of-distribution generalization by teaching a model some helpful ability, such as in-context learning, that can result in gains on various downstream tasks (Min et al., 2022a). We focus on meta-training with examples from QA datasets.
Few-shot learning is a common setting in which a model is supervised on only a few labeled examples. Many methods for improving few-shot performance are based on scaling model and data size (Brown et al., 2020; Chowdhery et al., 2022). Our goal is to improve few-shot performance across tasks in a computation- and memory-efficient manner, so we focus on smaller models that can be trained efficiently on a single GPU. Some parameter-efficient few-shot methods have been proposed, including cloze-style prompting (Schick and Schütze, 2021b), fine-tuning with manually tuned (Schick and Schütze, 2021a) and automatically tuned prompts and demonstrations (Gao et al., 2021), and meta-learning (Yu et al., 2018; Bansal et al., 2020; Bao et al., 2020). One advantage of our approach is that it does not require significant prompt tuning: rather, we standardize all of our tasks into a single format, similar to Chada and Natarajan (2021). This saves human time and computational resources.
Crucially, these approaches compare probabilities of single tokens or small pre-selected label sets; thus, they cannot be used for open-domain tasks like question answering. Some work has proposed generative few-shot methods for open-domain tasks: this includes reformatting the input data to match a model's pre-training format (Chada and Natarajan, 2021), pre-training models to select relevant spans from context passages (Ram et al., 2021), and running a secondary pre-training step on labeled classification data (Mueller et al., 2022). Our model should be effective on many tasks, even when the label space is large and differs across examples; thus, our method is based on a generative sequence-to-sequence model.
In-context learning (ICL; Brown et al., 2020) is increasingly used in few-shot methods; here, labeled demonstrations are concatenated to the same context as a test example to teach a model how to perform a task without additional gradient updates.
Studies have analyzed what kinds of demonstrations are most effective (Liu et al., 2022), as well as what makes demonstrations effective (Min et al., 2022b; Xie et al., 2022). Our demonstration retrieval approach is most similar to Liu et al. (2022), who encode demonstrations and test examples into a sentence embedding space and retrieve the k-nearest demonstrations. Our method differs in multiple ways: we use dense passage retrievers instead of sentence embeddings; we use demonstrations from many training sets instead of the training set of the target task; and we perform gradient updates with demonstrations, which is more feasible on our relatively small BART-large-based model. Wei et al. (2022) find that very large LMs (>68B parameters) are required for ICL to be effective, but Min et al. (2022a) find that meta-training can be used to make a much smaller model (GPT2-large, 774M parameters) capable of leveraging demonstrations. Here, we make BART-large (440M parameters) better at leveraging demonstrations through meta-training with demonstrations, like Min et al. (2022a); however, their method is designed for zero-shot generalization, and it selects from a constrained set of pre-defined labels. Our method is designed for few-shot settings and can be applied to open-domain tasks.
Retrieval-augmented generation models consist of two components: generators and retrievers. The generator is typically a decoder-only LM (Guu et al., 2020) or sequence-to-sequence (seq2seq) model (Lewis et al., 2020; Izacard and Grave, 2021); we use seq2seq models. The retriever is most often a dense passage retrieval (DPR; Karpukhin et al., 2020) model based on BERT-base. RAG models are typically evaluated on knowledge-intensive tasks like abstractive QA and fact verification. Thus, the memory bank typically consists of Wikipedia passages, which augment the model with additional factual knowledge separate from the generator's parameters. Izacard et al. (2022) adapt this architecture for few-shot knowledge-intensive tasks using a very large generator (T5-X(X)L) and a Contriever-based (Izacard et al., 2021) retriever. However, we are interested in more general-purpose methods, as well as more parameter- and memory-efficient methods that train or fine-tune quickly on a single GPU. Thus, we propose a task-agnostic and domain-general method to improve smaller generative models for few-shot settings: specifically, a retrieval-augmented meta-training step and a memory bank of labeled QA demonstrations instead of Wikipedia passages.

Retrieval-augmented Generation
As we wish to retrieve similar labeled examples for every input, our architecture takes inspiration from retrieval-augmented generation (RAG) models (Lewis et al., 2020), which consist of a pre-trained sequence-to-sequence component (we use BART-large) and a pre-trained dense passage retriever (DPR) component. Given an input x, the DPR component retrieves the K most semantically similar memory entries {z_k}_{1,...,K} from the memory bank. Retrieval is performed by encoding x with a BERT-based input encoder E_I and each memory entry z with a BERT-based demonstration encoder E_D, then running maximum inner product search over the memory:

top-K( E_I(x)^T E_D(z) )    (1)

The DPR component also returns the inner products themselves as document scores p_η(z_k|x).
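The retrieval step above can be sketched as follows. This is a simplified illustration: the toy vectors stand in for the outputs of the BERT-based encoders E_I and E_D, and the function names are ours, not the paper's implementation.

```python
# Minimal sketch of DPR-style maximum inner product search (MIPS).
# In practice the encoders are BERT models and search uses an ANN index;
# here we use plain lists of floats and exhaustive search.

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def retrieve_top_k(query_vec, memory_vecs, k):
    """Score every memory entry against the encoded input and keep the K
    highest-scoring indices with their inner products; the returned scores
    play the role of the document scores p_eta(z_k | x)."""
    scored = [(dot(query_vec, m), i) for i, m in enumerate(memory_vecs)]
    scored.sort(reverse=True)
    return [(i, s) for s, i in scored[:k]]
```

A real system would replace the exhaustive loop with an approximate index (e.g., FAISS), but the scoring rule is the same.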
The input and retrieved entries are then passed to a pre-trained sequence-to-sequence model, BART-large, for autoregressive generation. At each timestep, we marginalize over the retrieved demonstrations by creating K separate input contexts, each consisting of the input x and one retrieved entry z_k. We then sum over BART's token probabilities p_θ given each context, weighted by z_k's document score:

p(y|x) ≈ ∏_i Σ_{k=1}^{K} p_η(z_k|x) p_θ(y_i | x, z_k, y_{1:i−1})    (2)
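The per-timestep marginalization in Eq. (2) can be sketched as follows. This is an illustrative stand-in, not the actual BART/DPR code: `p_theta` is any callable returning a token probability given a context, and the document scores are normalized into weights.

```python
# Sketch of RAG-Token-style marginalization over K retrieved demonstrations
# for a single generation timestep.

def marginal_token_prob(token, contexts, doc_scores, p_theta):
    """Weight each context's token probability by its normalized document
    score and sum: sum_k p_eta(z_k|x) * p_theta(token | x, z_k, y_<i)."""
    total = sum(doc_scores)
    weights = [s / total for s in doc_scores]
    return sum(w * p_theta(token, c) for w, c in zip(weights, contexts))
```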

Meta-training
To adapt a sequence-to-sequence model for general-purpose demonstration retrieval and answer generation, we perform a meta-training step by supervising the model with demonstrations on a collection of 18 QA tasks (Table 7). We update the parameters of the BART component of our model during meta-training by supervising BART (using its normal cross-entropy loss) to generate the question and its answer given the question and a set of retrieved demonstrations. We use QA tasks due to the semantic diversity of their inputs and labels; compare this to text classification tasks, where the label space is much smaller and labels are often less informative. We modify and use the QA meta-training task collection from Min et al. (2022a). This consists of various extractive, multiple-choice, and abstractive QA tasks from CROSSFIT and a subsample of UNIFIEDQA (Khashabi et al., 2020, 2022), including NaturalQuestions, MCTest, BIOMRC, inter alia. We modify the meta-training collection by (1) removing our evaluation sets if they are present, and (2) standardizing the format of each task. Our final meta-training collection contains 32 tasks, which we subsample to 18 tasks based on semantic similarity to our evaluation tasks; see Appendix A for a full list of tasks and details on our semantic subsampling procedure, and §5.2 for a description of the downstream effect of semantic subsampling.
Following Chada and Natarajan (2021), we standardize each input in the meta-training data to a "question: ...\n answer: [MASK] \n context: ..." format. The output sequence then consists of both the question and answer sequences, which aligns with BART's pre-training objective of reconstructing the entire input sequence (not just masked spans). Like Chada and Natarajan (2021), we find that aligning the input/output format with BART's pre-training objective makes a positive difference for downstream performance. For QASC, which is a multiple-choice QA task, we put all of the answer options in the context field before the two context sentences and generate the full answer string; this outperformed all other formats we tried by a significant margin (Table 9). For classification tasks, we use the same question/answer/context format. For our single-sentence classification task (TREC), we place the input in the question field and present all of the possible labels in the context field using a similar format as for QASC. For sentence-pair classification tasks (MRPC, MNLI(-mm), QNLI), we place the first sentence or hypothesis in the question field and place the second sentence or premise in the context field. As with QA tasks, we generate both the question and answer fields in the target sequence, but only evaluate F1 on answer sequences.
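The template above can be sketched with two small helpers. The exact whitespace around the mask follows the format string quoted in the text; the helper names are our own illustration.

```python
# Sketch of the question/answer/context template used for meta-training
# and fine-tuning inputs and targets.

MASK = "[MASK]"

def format_input(question, context):
    """Model input: the answer slot is masked."""
    return f"question: {question}\n answer: {MASK} \n context: {context}"

def format_target(question, answer):
    """Target sequence: both question and answer, matching BART's
    objective of reconstructing the full input sequence."""
    return f"question: {question}\n answer: {answer}"
```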

Demonstration Memory
For the demonstration memory bank, we use training sets from UNIFIEDQA, excluding our evaluation tasks; the memory contains examples from 16 tasks. UnifiedQA has approximately 40% overlap with the QA meta-training collection, and no overlap with the non-QA collection. See Table 8 in Appendix A for a full list of tasks in our demonstration memory bank.
We format each demonstration in the memory bank in the same question/answer/context format as described above, except that demonstrations have the ground-truth label after the "answer:" header instead of a [MASK] token. Note that memory entries consist of a text passage (the demonstration) and a title; for the title, we simply use the answer to the question.
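A memory-bank entry as described above can be sketched as follows; the dictionary layout (a "title" and a "text" field, as in DPR indexes) is our assumption about a reasonable representation, not the paper's code.

```python
# Sketch of building one demonstration memory entry: the gold answer
# fills the answer slot, and the answer doubles as the DPR "title".

def make_memory_entry(question, answer, context):
    text = f"question: {question}\n answer: {answer} \n context: {context}"
    return {"title": answer, "text": text}
```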

Experimental Setup
We evaluate on a variety of QA and classification tasks (Table 1). We select open-domain QA tasks from the MRQA shared task (Fisch et al., 2019) to reflect a variety of extractive QA formats, including a standard QA benchmark (SQuAD), a domain-specific challenging benchmark (BioASQ), and two knowledge-intensive QA benchmarks (TriviaQA and TextbookQA). Our few-shot QA splits of size {16, 32, 64, 128} for these tasks are from Ram et al. (2021), which are themselves derived from MRQA (Fisch et al., 2019). We also generate few-shot splits for QASC, which is a multiple-choice QA task; we evaluate on QASC to determine whether our model is also effective in dealing with much shorter contexts, and to ensure that it is not overfitting to more typical MRQA-style extractive tasks.
Our few-shot classification task splits are from Gao et al. (2021). We evaluate on sentence-pair classification tasks which are not contained in our meta-training or demonstration tasks; sentence-pair classification tasks like natural language inference (NLI) and paraphrase classification can be easily reformatted to our question/answer/context format. We also evaluate on TREC, which is a single-sentence text classification task where the model must guess the category of the answer to a question (e.g., human, location, number), rather than the answer itself.
For each task and few-shot split size, we average scores across 5 random few-shot samples.

Baselines
We compare against strong efficient few-shot methods, as well as similar models that help isolate why our method performs better. Note that our approach is generative, unlike iPET and LM-BFF; thus, it is usable on a wider variety of tasks.
FewshotQA (Chada and Natarajan, 2021). A few-shot question answering method. We compare to the FewshotBART-L model, which is based on BART-large like our model and is the best-performing variant. We use the same few-shot splits such that we can directly compare to the numbers reported in that paper. We also try meta-training this non-retrieval-augmented model, which is essentially our method without retrieval; we call this baseline FewshotQA-m.
Splinter (Ram et al., 2021). A few-shot question answering model pre-trained to select salient spans from context passages.
RAG (Lewis et al., 2020). The original RAG-Token model with a memory of Wikipedia passages. We use the released model fine-tuned on NaturalQuestions (NQ), as this was the best-performing RAG model on our tasks. To see whether our demonstration memory is more effective than Wikipedia passages when meta-training, we also try meta-training the RAG model with its Wikipedia memory; we call this baseline RAG-m.
iPET (Schick and Schütze, 2021b). A manual prompt-tuning approach that induces better few-shot performance than GPT3 with much smaller LMs. We tune the best-performing ALBERT-xxlarge (Lan et al., 2020) model on our tasks.
LM-BFF (Gao et al., 2021). An automatic prompt-tuning approach based on RoBERTa-large (Liu et al., 2019). It requires no unlabeled text data to work well, unlike iPET. This model and iPET compare token probabilities to perform classification, so we cannot use them for open-domain tasks like question answering. Thus, we only compare to these models on classification.

Hyperparameters
For meta-training, we use hyperparameters from Min et al. (2022a) where possible: initial LR 1 × 10^-5, effective batch size 8, training for a maximum of 30,000 steps. We checkpoint every 2,000 steps and select the checkpoint with the lowest mean loss on our 16-shot QA training sets. Meta-training finishes in ≈14 hours on 1 A100 GPU (40GB). For fine-tuning, we use hyperparameters from Chada and Natarajan (2021) where possible: initial LR 2 × 10^-5, batch size 4, fine-tuning for a maximum of 1,000 steps or 35 epochs (whichever is larger). We checkpoint every 2 epochs and select the checkpoint with the highest exact match on the training set. Fine-tuning finishes in 30-60 minutes on 1 A100 GPU (40GB).
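The checkpoint-selection rule described above can be sketched as a one-liner; checkpoints here are hypothetical (step, metric) pairs rather than the paper's actual training state.

```python
# Sketch of meta-training checkpoint selection: keep the checkpoint with
# the lowest mean loss on the 16-shot QA training sets.

def select_checkpoint(checkpoints):
    """checkpoints: list of (step, mean_16shot_loss); returns the step of
    the checkpoint with the lowest loss."""
    return min(checkpoints, key=lambda c: c[1])[0]
```

For fine-tuning, the same helper applies with the sign flipped (highest exact match on the training set instead of lowest loss).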
For each meta-training and fine-tuning input, we retrieve 5 demonstrations from the memory.

Results
Our model's F1 scores for extractive question answering (Figure 2) are higher than those of models with similar parameterizations, including similar models that have been meta-trained using the same training data. Our model also outperforms strong classification approaches on TREC, MNLI, and QNLI (Table 2). Thus, meta-training with semantically similar demonstrations induces a more general-purpose system that can perform well across a variety of low-resource downstream tasks. Contrast this with RAG, which often performs worst of the models we test across tasks. Thus, the architecture itself is not inherently strong in few-shot settings, suggesting that meta-training makes a significant contribution to increased performance. This is also supported by the increased performance we observe with FewshotQA and RAG after meta-training, though note that meta-training does not help FewshotQA to the same extent it helps retrieval-augmented models. Also note that FewshotQA does not perform well on classification tasks, whereas our method achieves performance exceeding or close to the strongest baselines. This means that the combination of meta-training and retrieval enables a more general-purpose model than either of these components separately. With meta-training, RAG-m obtains performance much closer to our model. This tells us that meta-training is responsible for much of the performance gains we observe, though the demonstration memory bank also improves performance to a lesser extent. On MRPC, RAG-m outperforms our model, indicating that there exist some non-knowledge-intensive tasks where Wikipedia passages are more helpful than QA demonstrations.

Knowledge-intensive QA
We also evaluate on few-shot knowledge-intensive QA tasks (Figure 3): here, TriviaQA and TextbookQA, using the few-shot splits from the MRQA shared task. While these are also technically extractive QA tasks, their contexts have an average length of 677 and 581 words, respectively, meaning that BART will likely struggle more to synthesize all of the information in these tasks (even with retrieval). We find that FewshotQA outperforms our method on both of these tasks, and that even Splinter outperforms our method at larger split sizes for TextbookQA. This means that demonstration retrieval may be actively harmful for these tasks. Thus, our meta-training method is optimizing RAG architectures for non-knowledge-intensive tasks, but not for knowledge-intensive tasks. Wikipedia passages are more effective than demonstrations in the memory bank for TriviaQA as well, as indicated by RAG-m outperforming our approach. However, meta-training with or without the memory bank still induces far better performance than the base RAG model, which performs worse than all baselines except Splinter. Thus, our method still improves over RAG, making this model more versatile and better able to handle such tasks even if it is not the optimal approach.

Ablations
Here, we perform further analyses to understand the contribution of individual model components and (meta-)training decisions.
Memory bank. We find that performance is generally higher for question answering and classification when retrieving demonstrations instead of Wikipedia passages, as shown in Figure 2 and Table 2. This raises two related questions: how much could the memory bank impact downstream performance in the best-case scenario? And what is the upper bound on performance for our model given the best possible demonstration memory bank?
To obtain an estimate, we create an oracle memory consisting of labeled test examples from our evaluation data. We find that scores significantly improve over our method and others in this setting, indicating that this architecture has significant potential to achieve further gains if the memory bank is improved.
Number of retrieved demonstrations. Is retrieving more demonstrations always better? We compare performance when retrieving K = {0, 1, 5, 10, 25, 50} demonstrations during fine-tuning and evaluation on non-knowledge-intensive QA (SQuAD) and sentence-pair classification (MNLI). Our results (Figure 4) show that F1 scores begin to saturate at 5-10 demonstrations for both tasks. However, using more demonstrations generally does not harm performance; the model is able to handle less helpful demonstrations without performance decreasing significantly.
Why is retrieval helpful? Is the model abstracting semantic content from the retrieved demonstrations for improved performance, or is it simply learning to copy token sequences from them? As an initial test, we correlate the frequency of the ground-truth answer sequence in the retrieved documents with F1 scores on our QA tasks. Our results (Figure 5) suggest that the model is indeed learning to retrieve certain text strings from the demonstrations. This provides one possible path forward for improving the memory bank: higher semantic overlap with one's evaluation task increases the likelihood of these overlaps, so future work could focus on collecting (or perhaps generating) more semantically similar demonstrations that feature more lexical overlaps. However, this does not explain how retrieval improves performance on classification tasks, where the label space is small and labels are less informative. For NLI, the label space includes "entailment"/"neutral"/"contradiction", which we would not expect to see often in our demonstrations and which do not carry significant semantic content. Yet retrieval-augmented models outperform FewshotQA by a large margin on MNLI(-mm), so what is helping our model? There could exist some QA demonstrations which semantically prime our model toward correct completions, though sentence embedding similarity may not capture this helpfulness. Future work could ablate over specific features in the demonstrations.
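The lexical-overlap statistic behind this analysis (and behind Table 4) can be sketched as follows; the function name and data layout are our own illustration.

```python
# Sketch of the answer-overlap measure: the fraction of test examples for
# which at least one retrieved demonstration contains the ground-truth
# answer as a substring.

def answer_hit_rate(examples):
    """examples: list of (gold_answer, retrieved_demo_texts) pairs."""
    hits = sum(
        1 for answer, demos in examples if any(answer in d for d in demos)
    )
    return hits / len(examples)
```

Correlating this rate (or a per-example count) with F1 is then a standard correlation computation over evaluation examples.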
What type of retriever is best? For our experiments thus far, we have used the DPR component of the RAG-Token (NQ) model, which is pre-trained on Wikipedia and fine-tuned on NaturalQuestions. Is this an optimal starting point, or would some other retriever be better? We compare against a DPR model pre-trained on the Probably Asked Questions (PAQ; Lewis et al., 2021) dataset, as well as the Contriever model (Izacard et al., 2021). Contrievers are unsupervised, whereas DPR models receive explicit supervision during pre-training. DPR tends to perform better when the downstream task is similar to the pre-training or fine-tuning data; however, in our case, demonstration retrieval is dissimilar from Wikipedia passage retrieval, and Contriever may handle larger train-test shifts better (Izacard et al., 2021). We evaluate both the relevance of the retrieved demonstrations (Table 4) and downstream F1 (Table 5) on our QA tasks. We find that DPR (PAQ) and Contriever are both better at retrieving similar demonstrations, as measured by the frequency with which they retrieve examples that contain the answer. For BioASQ, only Contriever retrieves more relevant demonstrations than a random retriever.
However, retrieving more relevant demonstrations does not translate into increased downstream performance: DPR (Wiki) consistently outperforms the others. Why? Through qualitative analysis, we find that DPR (Wiki) retrieves more semantically diverse demonstrations, whereas DPR (PAQ) and Contriever retrieve demonstrations that are technically more similar to the test example, but also less diverse across test examples. Thus, there should be a balance between diversity and relevance: completely random retrieval is not effective (as indicated by our random retrieval baseline scoring worst), but neither is the more constrained demonstration set retrieved by an arguably more optimal retriever.
Meta-training data. Is meta-training helpful because of the variety of tasks included in our setup (the "more is better" hypothesis), or would it be better to select meta-training data in a more principled way (the "similar datasets are better" hypothesis)? We compare downstream performance when meta-training on all QA tasks from MetaICL versus the top tasks by mean instance-level semantic similarity to our evaluation tasks (Table 6). To compute semantic similarity, we use the stsb-roberta-base-v2 model from SentenceTransformers (Reimers and Gurevych, 2019) and compute the mean pairwise cosine similarity between the 16-shot training examples in our evaluation tasks and all examples in a meta-training task. We then select the top tasks by similarity until we have over 240,000 examples (enough for 30,000 training steps using batch size 8). See Appendix A for a list of meta-training tasks before and after subsampling.
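The greedy subsampling procedure above can be sketched as follows. The similarity scores stand in for the mean pairwise cosine similarities computed with stsb-roberta-base-v2; the function name and dictionary layout are our own.

```python
# Sketch of semantic subsampling: rank candidate meta-training tasks by
# mean similarity to the few-shot evaluation examples, then greedily take
# the most similar tasks until the example budget is reached.

def subsample_tasks(task_sims, task_sizes, budget):
    """task_sims: {task: mean similarity to evaluation examples};
    task_sizes: {task: number of training examples};
    returns the most-similar tasks whose combined size first meets budget."""
    chosen, total = [], 0
    for task in sorted(task_sims, key=task_sims.get, reverse=True):
        chosen.append(task)
        total += task_sizes[task]
        if total >= budget:
            break
    return chosen
```

With a budget of 240,000 examples, this reproduces the paper's described selection of 18 of 32 tasks (given the actual similarity scores).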
We find that selecting meta-training data based on semantic similarity to our evaluation tasks is helpful for both our QA and non-QA tasks: F1 increases across tasks when only meta-training on the most similar data. This contrasts with the findings of Min et al. (2022a), who find that more meta-training tasks are generally better.

Conclusions
We have proposed a meta-training method (§3.2) that retrieves (§3.1) semantically similar demonstrations from a diverse demonstration bank (§3.3). Our method achieves higher performance on average across many tasks than other strong parameter-efficient few-shot baselines (§5). In future work, one could explore a mixture of demonstration retrieval and passage retrieval for improved performance on a wider variety of tasks, including knowledge-intensive tasks.

Limitations
Our method requires access to a large set of labeled examples for the memory bank, ideally with some relevance to the evaluation tasks. This limits the languages and tasks that are optimal for this method: there does not exist a large variety of training examples for low-resource language varieties, nor for certain much more specific tasks, as in, for example, industry applications with domain-specific customer data. And while multilingual models could leverage cross-lingual transfer, it is unclear how well this model would generalize to low-resource languages when (for example) using multilingual BART.
When using the full demonstration memory, meta-training does not run on a 16GB GPU using our current implementation. While this does exclude more common GPUs, our approach can still run quickly on a 32GB GPU in a few hours, thus costing far less than pre-training a language model of comparable few-shot performance from scratch.

Table 9 :
The formats we try for QASC and 16-shot F1 scores from BART-large (no retrieval) after fine-tuning on each format. We find that generating the answer is better than just generating the letter label, that including the options in the context is helpful, and that excluding the options from the context is harmful to performance. "⇒" separates the input from the output sequence, and "\n" indicates a newline.

Figure 2 :
Figure 2: F1 scores at each few-shot split size for extractive and multiple-choice question answering evaluation tasks. Scores are averaged across 5 random few-shot samples. Our model outperforms or maintains similar performance to the strongest baselines on each task and split size. Performance gains on SQuAD are especially large: up to 3.9 F1 (4.9% improvement). FewshotQA and Splinter scores are from Chada and Natarajan (2021).

Figure 3: F1 scores for each few-shot split size for knowledge-intensive question answering tasks. Our model is outperformed by a strong few-shot QA baseline, though meta-training still greatly improves performance.

Figure 5: F1 scores for non-knowledge-intensive (SQuAD, left) and knowledge-intensive (TriviaQA, right) QA tasks, grouped by the frequency of the true answer string in the retrieved demonstrations. While the relationship is not monotonic, there is a clear correlation between these variables, indicating that lexical features may be responsible for much of retrieval's contribution to performance.
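The frequency statistic underlying Figure 5 can be sketched as follows. This is a minimal illustration, not the paper's code: the function name, the dict keys, and the case-insensitive counting are all assumptions.

```python
from collections import defaultdict


def f1_by_answer_frequency(examples):
    """Group examples by how often the gold answer string appears
    (case-insensitively) across the retrieved demonstrations, then
    average the per-example F1 within each frequency bucket.

    `examples` is a list of dicts with keys:
      "answer": gold answer string,
      "demonstrations": list of retrieved demonstration strings,
      "f1": per-example F1 of the model's prediction.
    """
    buckets = defaultdict(list)
    for ex in examples:
        ans = ex["answer"].lower()
        freq = sum(demo.lower().count(ans) for demo in ex["demonstrations"])
        buckets[freq].append(ex["f1"])
    # Mean F1 per frequency bucket, as plotted on the y-axis of Figure 5.
    return {freq: sum(scores) / len(scores) for freq, scores in buckets.items()}
```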

Table 1: Evaluation sets used in this study. L: mean number of words in the question/context or input sentence(s). For more straightforward comparison to prior few-shot question answering and classification methods, we use Ram et al. (2021)'s few-shot splits of SQuAD and BioASQ derived from MRQA, as well as Gao et al. (

Table 2: Results on Gao et al. (2021)'s classification tasks, averaged across 5 random few-shot samples (std. dev. in subscript). All datasets are well-balanced except MRPC; thus, we report accuracy for all tasks except MRPC, where we report macro-F1. LM-BFF and RoBERTa scores are from Gao et al. (2021). * indicates that p < .05 in a t-test between our model's score and the marked score.
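One plausible form of the significance test reported in Table 2 is sketched below. The caption does not specify the test variant, so this assumes a paired two-tailed t-test over the 5 matched few-shot splits; the function name and the hard-coded critical value are assumptions for illustration.

```python
import math


def paired_t_significant(scores_a, scores_b, t_crit=2.776):
    """Paired two-tailed t-test across matched few-shot splits.

    t_crit = 2.776 is the critical t value for p < .05 with df = 4
    (i.e., 5 paired samples). Assumes nonzero variance in the
    per-split score differences. Returns (t_statistic, significant).
    """
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    mean = sum(diffs) / n
    var = sum((d - mean) ** 2 for d in diffs) / (n - 1)  # sample variance
    t = mean / math.sqrt(var / n)
    return t, abs(t) > t_crit
```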

Table 3: F1 scores on QA tasks for our strongest baselines, our approach, and our approach where the memory has been replaced with labeled test examples (oracle). The oracle approach establishes an approximate upper bound for our model. Large gaps between our approach and the oracle indicate room for improvement in what constitutes our memory bank.

Table 4: The proportion of test examples for which each retriever retrieves at least 1 demonstration containing the ground-truth answer as a substring.

Table 6: 16-shot F1 scores on QA tasks after meta-training on either all QA tasks from MetaICL's QA meta-training collection, or QA tasks subsampled by semantic similarity to our evaluation tasks. A full list of meta-training tasks can be found in Appendix A.
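The hit-rate metric in Table 4 reduces to a simple substring check per test example. The sketch below is an assumption about how such a metric could be computed (including the case-insensitive matching), not the paper's implementation:

```python
def retrieval_hit_rate(examples):
    """Fraction of test examples for which at least one retrieved
    demonstration contains the ground-truth answer as a substring
    (case-insensitive).

    `examples` is a list of dicts with keys:
      "answer": ground-truth answer string,
      "demonstrations": list of retrieved demonstration strings.
    """
    hits = sum(
        any(ex["answer"].lower() in demo.lower() for demo in ex["demonstrations"])
        for ex in examples
    )
    return hits / len(examples)
```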
The four formats (rows of Table 9), where q is the question, {a_A, …, a_H} are the answer options, c_1 and c_2 are context sentences, i is the letter label, and a is the answer text:

- question: q? {a_A, …, a_H} \n answer: [MASK] \n context: c_1. c_2. ⇒ question: q? \n answer: i
- question: q? {a_A, …, a_H} \n answer: [MASK] \n context: c_1. c_2. ⇒ question: q? \n answer: a
- Example: question: What does sunlight do for a plant? (A) during the day (B) Kills it (C) it can be seen (D) Helps it survive (E) Helps it drink water (F) It gets heated up (G) adding heat (H) Makes the color darker \n answer: [MASK] \n context: A plant requires food for survival. All plants require sunlight to make their food. ⇒ question: … \n answer: Helps it survive
- question: q? \n answer: [MASK] \n context: {a_A, …, a_H}. c_1. c_2. ⇒ question: q? \n answer: a
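The best-performing format from Table 9 (options included in the input, full answer text generated) could be rendered with a small helper like the one below. This is a hypothetical sketch: the function name and argument structure are assumptions, and "\n" from the table is emitted as a literal newline, as the caption specifies.

```python
def build_qasc_example(question, options, context_sentences, answer):
    """Render one QASC example in the best-performing Table 9 format:
    answer options in the input, full answer text as the target.
    Hypothetical helper for illustration only."""
    letters = "ABCDEFGH"
    # "(A) opt1 (B) opt2 ..." as in the QASC multiple-choice format.
    opts = " ".join(f"({letter}) {opt}" for letter, opt in zip(letters, options))
    source = (
        f"question: {question} {opts}\n"
        f"answer: [MASK]\n"
        f"context: {' '.join(context_sentences)}"
    )
    target = f"question: {question}\nanswer: {answer}"
    return source, target
```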