CrossFit: A Few-shot Learning Challenge for Cross-task Generalization in NLP

Humans can learn a new language task efficiently with only a few examples, by leveraging their knowledge obtained when learning prior tasks. In this paper, we explore whether and how such cross-task generalization ability can be acquired, and further applied to build better few-shot learners across diverse NLP tasks. We introduce CrossFit, a problem setup for studying cross-task generalization ability, which standardizes seen/unseen task partitions, data access during different learning stages, and the evaluation protocols. To instantiate different seen/unseen task partitions in CrossFit and facilitate in-depth analysis, we present the NLP Few-shot Gym, a repository of 160 diverse few-shot NLP tasks created from open-access NLP datasets and converted to a unified text-to-text format. Our analysis reveals that the few-shot learning ability on unseen tasks can be improved via an upstream learning stage using a set of seen tasks. We also observe that the selection of upstream learning tasks can significantly influence few-shot performance on unseen tasks, inviting further analysis on task similarity and transferability.


Introduction
Pre-trained language models fine-tuned with abundant task-specific data have become the predominant recipe for state-of-the-art results in NLP. However, these approaches are heavily dependent on large-scale labeled datasets that are expensive to create, and the resulting models still generalize poorly to out-of-distribution inputs created with small, harmless perturbations (Ribeiro et al., 2020). In retrospect, researchers have advocated for building more human-like, general linguistic intelligence that can "reuse previously acquired knowledge about a language and adapt to a new task quickly" (Yogatama et al., 2019; Linzen, 2020).

1 Our code is at https://github.com/INK-USC/CrossFit.

Figure 1: We present the CROSSFIT Challenge to study cross-task generalization in a diverse task distribution.
To support this problem setting, we introduce the NLP Few-shot Gym, a repository of 160 diverse few-shot, text-to-text tasks in NLP.
Existing work has approached this problem via better few-shot fine-tuning: by reformulating target tasks into cloze questions that resemble the pre-training objective (Schick and Schütze, 2020a,b), or by generating prompts and using demonstrations (Gao et al., 2020). Such progress primarily focuses on improving instance-level generalization, i.e., how to better generalize from a few labeled instances to make predictions about new instances, within the scope of one individual task. From a broader perspective, human-like learning ability also benefits from task-level generalization, or cross-task generalization, i.e., how to learn a new task efficiently given experience of learning previous tasks.
Such ability has been widely studied in the computer vision and robotics communities (Yu et al., 2020; Triantafillou et al., 2020), but is relatively underexplored in NLP. Pruksachatkun et al. (2020) and Vu et al. (2020) study transferability between one intermediate task and a given target task, while it is possible to further improve performance with multiple intermediate tasks. Han et al. (2018) and Bansal et al. (2020a) focus on cross-task generalization within the scope of classification tasks, whereas humans can generalize across different task formats (classification, multiple choice, generation, etc.), goals (question answering, fact checking, etc.) and domains (biomedical, social media, etc.).
Towards developing general linguistic intelligence, we present CROSSFIT, a few-shot learning challenge to acquire, evaluate and analyze cross-task generalization in a realistic setting, with a standardized training pipeline, data access and evaluation protocol. The CROSSFIT challenge requires a model to first learn from a set of seen tasks in an upstream learning stage, and then perform few-shot learning on a set of unseen tasks, as illustrated in Fig. 1. Alongside it, we introduce the NLP Few-shot Gym, a repository of 160 few-shot NLP tasks gathered from open-access resources, covering a wide range of capabilities and goals, and formulated into a unified text-to-text format. To analyze the capabilities and limitations of existing approaches to the CROSSFIT challenge, we design eight specific seen/unseen task partitions.
With the CROSSFIT Challenge and the NLP Few-shot Gym, we aim to investigate the following research questions:
• Q1. Can we teach cross-task generalization ability to pre-trained models with existing methods?
• Q2. During upstream learning, is it better to be "well-rounded" (learning from diverse tasks) or "specialized and targeted" (learning from tasks in the same category as the unseen tasks)?
• Q3. Does it help if we have more labelled data for seen tasks during upstream learning?
To address the above questions, we empirically analyze the performance of multi-task learning and three meta-learning algorithms (MAML (Finn et al., 2017), first-order MAML and Reptile (Nichol et al., 2018)). We observe that these approaches can indeed lead to better few-shot performance on unseen tasks. Interestingly, simple multi-task learning outperforms existing meta-learning methods in many cases, encouraging future research on identifying the reasons and developing improved meta-learning methods. For Q2, we observe that the performance of individual unseen tasks varies with different selections of seen tasks, calling for a more thorough investigation of the relationship between task similarity and transferability. As for Q3, we find that enlarging the upstream data does not necessarily lead to better cross-task generalization ability. We envision cross-task generalization to be an integral component of general linguistic intelligence, and we hope CROSSFIT serves as a useful testbed for driving related progress.

Related Work
Few-shot Fine-tuning. Few-shot learning refers to teaching models a new task with a small number of annotated examples. Large-scale pre-trained language models (e.g., BERT (Devlin et al., 2019)) have demonstrated a great ability to learn new tasks efficiently via fine-tuning. Schick and Schütze (2020a,b) proposed pattern-exploiting training (PET), which formulates text classification and NLI tasks into cloze questions (or "prompts") that resemble masked language modeling. PET can be further improved by generating prompts automatically and incorporating demonstrations into the input (Gao et al., 2020), and by densifying the supervision signal with label conditioning (Tam et al., 2021). While successful, these approaches learn each downstream task in isolation. Our work aims to boost few-shot learning ability on unseen tasks by acquiring cross-task generalization ability from diverse seen tasks.
Meta-learning in NLP. Recent works have explored meta-learning methods for relation classification (Han et al., 2018; Gao et al., 2019), general text classification (Dou et al., 2019; Bansal et al., 2020a,b), low-resource machine translation (Gu et al., 2018), and cross-lingual NLI/QA (Nooralahzadeh et al., 2020). In general, these works apply meta-learning algorithms to a set of sub-tasks; however, the sub-tasks are either synthetic (e.g., classifying a new set of five relations is a new sub-task) or drawn from a rather narrow distribution (e.g., QA in one language is a sub-task). In our work, we explore a more realistic setting: learning from a set of NLP tasks with diverse goals, such as classification, question answering and conditional generation. This setting is rapidly attracting attention in the NLP community and is also explored in very recent work (Zhong et al., 2021; Mishra et al., 2021; Bragg et al., 2021; Wei et al., 2021).
Unifying NLP Task Formats. Researchers have explored unifying the formats of different tasks in order to better enable knowledge transfer, e.g., DecaNLP (McCann et al., 2018), UFO-Entail (Yin et al., 2020) and EFL (Wang et al., 2021). Following T5 (Raffel et al., 2020), we adopt a unified text-to-text format that subsumes all text-based tasks of interest. Related to our work, UnifiedQA (Khashabi et al., 2020) examines the feasibility of training a general cross-format QA model with multi-task learning. Our work extends these ideas, and we significantly enlarge the task repository to 160 tasks to broaden the coverage, in hopes of building a general-purpose few-shot learner.

The CROSSFIT Challenge
In this section, we present the CROSSFIT Challenge, a problem setting for acquiring and evaluating cross-task generalization. Ideally, a strong CROSSFIT system can capture cross-task generalization ability from a set of seen tasks and thus adapt to new unseen tasks efficiently.

Preliminaries
The meaning of "task" is overloaded: "tasks" can be categorized at different granularities (e.g., text classification vs. QA, yes/no QA vs. machine reading comprehension) and from different aspects (e.g., domain, label space). Herein we take a general formulation by defining a "task" with its training and testing examples. We define a task T as a tuple (D train, D dev, D test). Each set D is a set of annotated examples {(x i, y i)} in text-to-text format. In the few-shot setting, the sizes of D train and D dev are required to be small (e.g., 16 examples per class for classification tasks).
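The task tuple above can be written down concretely. Below is a minimal Python sketch; the `FewShotTask` class and the sample data are illustrative, not part of the released codebase:

```python
from dataclasses import dataclass, field
from typing import List, Tuple

# A labeled example in text-to-text format: (input_text, target_text).
Example = Tuple[str, str]

@dataclass
class FewShotTask:
    """A task T = (D_train, D_dev, D_test).

    In the few-shot setting, D_train and D_dev are required to be small
    (e.g., 16 examples per class for classification tasks).
    """
    name: str
    train: List[Example] = field(default_factory=list)
    dev: List[Example] = field(default_factory=list)
    test: List[Example] = field(default_factory=list)

task = FewShotTask(
    name="glue-sst2",  # hypothetical task name
    train=[("sentence: a gripping movie", "positive"),
           ("sentence: a dull slog", "negative")],
)
```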
Existing work mostly focuses on improving instance-level generalization for individual tasks by using task-specific templates, and performance on individual tasks is used as the measure of success. For the CROSSFIT Challenge, we aim to acquire cross-task generalization and build better general-purpose few-shot learners, which calls for a different problem setting with a distinct training procedure and evaluation protocol.

Problem Setting
Tasks and Data. To acquire and evaluate cross-task generalization, we first gather a large repository of few-shot tasks T and partition them into three non-overlapping sets T train, T dev, T test. To examine the capabilities and limitations of an approach in different settings, and to answer our research questions, we design multiple task partitions with different focuses. Details of the repository and partitions, or as we name them, the NLP Few-shot Gym, are deferred to §4.
Learning Stages. A CROSSFIT method may learn from T train and perform necessary tuning with T dev in the upstream learning stage; it is then evaluated with few-shot tasks in T test:
• Upstream learning stage. Here, the algorithm has access to D train and D dev for each training task in T train, while D test is unavailable. The algorithm also has access to all data in T dev, but for validation purposes only (i.e., it is not allowed to use T dev to update model weights).
• Few-shot learning stage. In this stage, T test becomes available. Models resulting from the upstream learning stage are required to learn from D train via a particular few-shot learning method (e.g., direct fine-tuning). The final few-shot learning performance is evaluated on D test.
Evaluation Metric. Evaluating the performance of a model on a diverse collection of NLP tasks is inherently challenging, as different tasks use different metrics. It is thus not reasonable to simply aggregate the performance of classification tasks (e.g., accuracy, F1) and generation tasks (e.g., ROUGE, BLEU) by taking the average. To address this problem, we first narrow down to a collection of 7 evaluation metrics: classification F1, accuracy, QA F1, exact match (EM), ROUGE-L, Matthews correlation, and Pearson correlation, which cover all tasks in our experiments. Then, we define Average Relative Gain (ARG), a metric that computes the relative performance change before and after the upstream learning stage for each test task, and finally takes the average across all test tasks.
For example, suppose we have T test = {T A, T B}. If an upstream learning algorithm helps improve the few-shot learning performance from 50% F1 score to 70% on task T A (i.e., a 40% relative improvement), and from 40% accuracy to 30% on task T B (i.e., a −25% relative improvement), the final ARG on T test would be computed as (40% + (−25%)) / 2 = 7.5%.
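This worked example can be checked in a few lines. A minimal sketch, where `average_relative_gain` is an illustrative helper name (each score is in that task's own metric):

```python
def average_relative_gain(before, after):
    """Average Relative Gain (ARG): the mean relative change of each
    test task's score after upstream learning, relative to the direct
    fine-tuning baseline. Metric-agnostic, since each task is
    normalized by its own baseline score."""
    gains = [(after[t] - before[t]) / before[t] for t in before]
    return sum(gains) / len(gains)

# The worked example from the text: T_A improves 50 -> 70 (F1),
# T_B degrades 40 -> 30 (accuracy).
arg = average_relative_gain({"T_A": 50.0, "T_B": 40.0},
                            {"T_A": 70.0, "T_B": 30.0})
# (0.40 + (-0.25)) / 2 = 0.075, i.e., 7.5%
```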
The ARG metric reflects the overall performance gain on all tasks in T test, no matter what specific metric each task uses. We use ARG for a high-level comparison, and we still analyze the performance for each task (e.g., absolute performance metrics, performance growth with "more shots", sensitivity to different selections of T train) in our in-depth analysis.

NLP Few-shot Gym
Towards learning to generalize across tasks in the CROSSFIT challenge, we need a resource that contains a sufficient number of tasks, covers a wide range of NLP applications, and is presented in a unified text-to-text format. Herein, we introduce the NLP Few-shot Gym, a repository of 160 few-shot tasks gathered from existing open-access datasets.

Dataset Selection
We choose to use Huggingface Datasets (Lhoest et al., 2021) as the pool of our candidate tasks. We filter these datasets on a case-by-case basis, mainly using the following criteria: (1) We focus on English monolingual datasets. (2) We exclude datasets that require information retrieval, as they require a separate retriever. (3) We exclude sequence labeling tasks (e.g., dependency parsing, NER), which are highly dependent on tokenization and are hard to evaluate in text-to-text format. (4) We exclude datasets dealing with extremely long documents (e.g., a scientific paper) as input, as most pre-trained models cannot process such long input sequences. We finalize our selection with 160 datasets, which are detailed in Appendix A.

A Unified Text-to-Text Format
We follow Raffel et al. (2020) and convert all of our datasets into a unified text-to-text format. For example, the task of natural language inference (originally a sentence-pair classification problem) becomes: premise: <premise> hypothesis: <hypothesis>, and the target sequence is either the word entailment, contradiction or neutral. As for machine reading comprehension tasks, the input format is question: <question> context: <context> and the target sequence is the correct answer span. We also reference the format for QA tasks from UnifiedQA (Khashabi et al., 2020).
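The conversions above can be sketched as simple string templates. The helper names below are illustrative, and only the two formats described in the text are shown:

```python
def nli_to_text2text(premise, hypothesis, label):
    """Convert an NLI example (a sentence-pair classification problem)
    into the unified text-to-text format: the source encodes the
    premise and hypothesis, the target is the label word."""
    source = f"premise: {premise} hypothesis: {hypothesis}"
    return source, label  # label is "entailment", "contradiction" or "neutral"

def mrc_to_text2text(question, context, answer):
    """Convert a machine reading comprehension example: the source
    encodes the question and context, the target is the answer span."""
    source = f"question: {question} context: {context}"
    return source, answer

src, tgt = nli_to_text2text("A man is sleeping.", "A man is awake.",
                            "contradiction")
```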

Formulating Few-shot Tasks
We mainly follow the practice in Gao et al. (2020) for few-shot sampling. For classification and regression tasks, we include 16 training examples per class in D train. For other types of tasks, we include 32 examples in D train. In conformity with real-world situations where labeled data are scarce, we assume a development set D dev of the same size as D train.
We sample the D train and D dev splits from each dataset's original train set with 5 different random seeds. This helps us reduce variance during few-shot evaluation, and also enlarges the number of few-shot tasks used for learning. Consequently, the "effective size" of our NLP Few-shot Gym is 160 × 5 = 800, while we use the number 160 throughout the paper to avoid possible confusion.
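A sketch of this sampling procedure, under the assumption that a dataset is a list of (input, target) pairs; `sample_few_shot` and the toy data are illustrative, not the released sampling script:

```python
import random
from collections import defaultdict

def sample_few_shot(examples, classification=True, k=16, k_other=32, seed=0):
    """Sample one few-shot (D_train, D_dev) split from a dataset's
    original train set. For classification tasks we take k examples
    per class; for other tasks, k_other examples in total. D_dev is
    sampled to the same size as D_train, disjoint from it."""
    rng = random.Random(seed)
    pool = list(examples)
    rng.shuffle(pool)
    if classification:
        train, rest, per_class = [], [], defaultdict(int)
        for ex in pool:
            label = ex[1]
            if per_class[label] < k:
                train.append(ex)
                per_class[label] += 1
            else:
                rest.append(ex)
        dev = rest[:len(train)]
    else:
        train, dev = pool[:k_other], pool[k_other:2 * k_other]
    return train, dev

# 5 seeds per dataset -> 5 few-shot instantiations of the same task.
data = [(f"sentence {i}", "pos" if i % 2 else "neg") for i in range(200)]
splits = [sample_few_shot(data, k=16, seed=s) for s in range(5)]
```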
We use the original development set for each dataset as D test , or withhold 20% of the dataset when the official development split is not available. The held-out test examples are sampled once before sampling D train and D dev .

Task Ontology and Partitions
As mentioned in §3.2, a CROSSFIT method is expected to first acquire cross-task generalization on a set of T train and evaluate such ability on T test . To comprehensively analyze to what extent a trained model can generalize, and how its behavior differs in different scenarios, we need to build different partitions of (T train , T dev , T test ).
Towards this goal, we first manually classify the 160 tasks and form a task ontology with categories and sub-categories, as shown in Fig. 2. The first-level categories include classification, question answering, conditional generation, and others. 4 Further, we design eight different partitions of (T train, T dev, T test). We illustrate four partitions in Fig. 3 and provide more details in Table 1.
Our Partition 1 randomly splits all 160 few-shot tasks into the three sets, where |T train| = 120 and |T dev| = |T test| = 20. The design of Partition 1 mimics a real-world language learning environment where the goal is to build a general-purpose few-shot learner, and a set of diverse tasks (T train) is used to train the learner. Partitions 2.1-2.3 withhold 10 classification tasks for development and 10 more for testing. T train is controlled to contain either 100% classification tasks, 100% non-classification tasks, or half-and-half. These three partitions help us understand the influence of different task distributions in T train. The remaining four partitions still focus on crossing task boundaries, but at a finer granularity: seen and unseen tasks are in the same category, but not the same sub-category. For example, Partition 3.1 has 57 non-NLI classification tasks as T train and 8 NLI tasks as T test. These partitions help us understand whether cross-task generalization at this finer granularity is easier for models to acquire.

Methods to CROSSFIT
We mainly use BART-Base (Lewis et al., 2020) as the text-to-text transformer for our analysis in the CROSSFIT setup. We leave confirmatory experiments with the T5-v1.1-Base and BART-Large models to Appendix C.
Direct Fine-tuning on Test Tasks. This serves as the basic baseline method for the CROSSFIT challenge, which does not make use of T train or T dev , or go through the upstream learning stage. For each task T ∈ T test , we directly fine-tune the text-to-text model with its D train , tune the hyperparameters with D dev , and assess its performance with the test set D test . We use the performance of direct fine-tuning as the base for computing ARG scores of other CROSSFIT approaches. We expect a model trained with upstream learning would capture cross-task generalization ability and thus have better ARG scores.
Multi-task Learning (MTL). A straightforward yet effective method is to combine the data 5 in the training tasks to learn a multi-task model, before fine-tuning it on each test task. Specifically, we gather source-target examples for all tasks in T train and fine-tune the text-to-text model with these examples. Then we use the resulting checkpoint as initialization and perform the same procedure as in "direct fine-tuning" for each test task in T test. The performance gain over direct fine-tuning is used for computing its overall ARG score.

4 We later discuss the limitation of this design in §6-Q2.
5 Both D train and D dev are used, as D dev is used for gradient updates in meta-learning algorithms. We do so to make sure that the data access for the two methods is fair.

Figure 3: We evaluate a CROSSFIT approach on different task partitions to examine its generalization ability in different scenarios. Full details in Table 1. The locations and distances in this figure are hypothetical and for illustrative purposes only.
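A minimal sketch of how the upstream examples could be pooled for multi-task learning (both the D train and D dev of each seen task are combined, as described above); the helper name and toy tasks are illustrative:

```python
import random

def build_multitask_mixture(tasks, seed=0):
    """Pool the source-target examples of all upstream tasks into one
    shuffled training stream for multi-task learning. `tasks` maps a
    task name to its list of (source, target) pairs."""
    rng = random.Random(seed)
    mixture = [ex for examples in tasks.values() for ex in examples]
    rng.shuffle(mixture)
    return mixture  # fed to a text-to-text model in one fine-tuning run

# Hypothetical upstream tasks with their few-shot examples.
mixture = build_multitask_mixture({
    "mnli": [("premise: ... hypothesis: ...", "entailment")] * 48,
    "squad": [("question: ... context: ...", "an answer span")] * 32,
})
```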

Model-Agnostic Meta-Learning (MAML). Cross-task generalization ability closely aligns with the concept of learning to learn. Hence, we use MAML (Finn et al., 2017), a representative meta-learning approach, during upstream learning. The core idea of MAML is to learn a set of initialization weights from which the model adapts quickly to a new task within a few gradient updates. In MAML training, we iterate through tasks in T train to update the model. For each training task (D train, D dev), we first sample a support batch B support from D train and a query batch B query from D dev. We use f θ to denote the text-to-text model with parameters θ. Using B support, we first compute the updated parameters θ′ with gradient descent (i.e., the inner loop). Due to the large size of pre-trained text-to-text models, we use one gradient update in the inner loop, i.e., θ′ = θ − α∇ θ L(f θ, B support). Then we apply the updated text-to-text model f θ′ to B query and do one step of meta-optimization (i.e., the outer loop): θ ← θ − β∇ θ L(f θ′, B query).

First-order MAML. First-order MAML (Finn et al., 2017) avoids second-order optimization and improves training stability using a first-order approximation: it differentiates with respect to the fast weights θ′ instead of the original parameters θ for the gradient ∇ θ L(f θ′, B query).

Reptile. Reptile (Nichol et al., 2018) is another memory-efficient, first-order meta-learning algorithm that first makes multiple gradient updates in the inner loop, then directly uses θ′ − θ to approximate ∇ θ L(f θ′, B query), i.e., θ ← θ + β(θ′ − θ).

Table 1: (T train, T dev, T test) partitions used in the study (full lists in Appendix B), and their ARG scores when upstream learning methods are applied. "cls." stands for "classification", "Para. Iden." for "paraphrase identification", "MRC" for "machine reading comprehension" and "MCQA" for "multiple-choice QA".
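The update rules above can be illustrated on a toy model. The following is only a sketch of the Reptile inner/outer loops under simplified assumptions (a single scalar weight, squared loss), not the paper's BART-based implementation:

```python
import random

def sgd_steps(theta, batch, lr, steps):
    """Inner loop: a few SGD steps on the squared loss of a toy
    one-parameter linear model y = w * x, starting from theta.
    Returns the fast weights theta'."""
    w = theta
    for _ in range(steps):
        grad = sum(2 * (w * x - y) * x for x, y in batch) / len(batch)
        w -= lr * grad
    return w

def reptile(theta, tasks, inner_lr=0.05, outer_lr=0.5, epochs=200, seed=0):
    """Reptile outer loop: theta <- theta + beta * (theta' - theta),
    where theta' are the fast weights after the inner-loop updates
    on a sampled upstream task."""
    rng = random.Random(seed)
    for _ in range(epochs):
        support = rng.choice(tasks)  # sample one upstream task's batch
        theta_prime = sgd_steps(theta, support, inner_lr, steps=5)
        theta = theta + outer_lr * (theta_prime - theta)
    return theta

# Two toy "tasks": y = 2x and y = 4x. Reptile settles between the two
# task optima, i.e., somewhere in the interval (2, 4).
tasks = [[(1.0, 2.0), (2.0, 4.0)], [(1.0, 4.0), (2.0, 8.0)]]
w = reptile(0.0, tasks)
```

MAML differs from this sketch only in the outer loop, which backpropagates through the inner update (a second-order step); first-order MAML drops that second-order term.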

Empirical Analysis
In this section we look to interpret the results and answer our research questions. We summarize the ARG scores in Table 1 and plot the performance of each test task (for each partition) in Fig. 4-5.
Q1. Can we teach pre-trained LMs to generalize across tasks with existing methods?
Overall Performance. From Table 1, we observe that, on average, the tested upstream learning methods indeed improve cross-task generalization: their ARG scores are positive, meaning that they are better than direct fine-tuning (ARG = 0%). Further, by aggregating results from all upstream learning methods and task partitions, we find that the performance on 51.47% of test tasks is significantly improved (> 5% relative improvement compared to direct fine-tuning); 35.93% of tasks are relatively unaffected (within ±5%); and 12.60% of tasks suffer worse performance (< −5%).
Correlated Performance Gains. The performance gains obtained with different upstream learning methods are correlated with each other, i.e., tasks that benefit from multi-task learning are likely to also benefit from meta-learning. For the Random partition, the Spearman correlation between the relative improvements brought by MTL and MAML is 0.66, with a p-value of 0.0015. This suggests that different upstream learning methods, while using different optimization objectives, capture similar inductive biases from T train.
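The rank correlation used here can be computed as follows. `spearman` is an illustrative tie-free implementation of the standard formula ρ = 1 − 6Σd²/(n(n²−1)), not the paper's exact analysis code:

```python
def spearman(xs, ys):
    """Spearman rank correlation between two lists of per-task
    relative gains (assumes no ties)."""
    def ranks(vs):
        order = sorted(range(len(vs)), key=lambda i: vs[i])
        r = [0] * len(vs)
        for rank, i in enumerate(order):
            r[i] = rank
        return r
    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n * n - 1))

# Hypothetical per-task relative gains under MTL vs. MAML: identical
# rankings give a perfect correlation of 1.0.
rho = spearman([0.40, -0.25, 0.10, 0.05], [0.35, -0.10, 0.12, 0.02])
```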
MTL is a strong baseline. Surprisingly, the most straightforward multi-task learning method is hard to beat. This could be counter-intuitive, as meta-learning methods are specifically designed for rapid generalization to unseen tasks, sharing the same goal as our CROSSFIT challenge. We think there are three possible reasons: (1) Due to memory constraints, we limit the number of inner-loop updates to one, which may be insufficient. Also, meta-learning methods are highly sensitive to hyper-parameters and even random seeds (Antoniou et al., 2019), which we do not tune exhaustively for practical reasons. (2) Text-to-text transformers have much more complex architectures, while most meta-learning methods are typically applied to small feed-forward/convolutional networks. (3) The CROSSFIT challenge has a highly diverse set of upstream tasks, which may introduce under-explored difficulties. That being said, we believe it is important to identify the true cause, and to develop improved meta-learning methods for the CROSSFIT challenge as future work.

Characterizing negative transfer (Wu et al., 2020) and selecting source tasks to avoid negative transfer (Vu et al., 2020) are also growing research topics. In this work we refrain from further investigation; however, we believe combating negative transfer, and thus improving CROSSFIT performance, is a promising future direction.
Q2. Well-rounded or specialized? Which is a better strategy for upstream learning? "Learning to be well-rounded vs. learning to be specialized" is a common dilemma that human learners struggle with. For the CROSSFIT challenge, the former refers to learning from a set of diverse tasks during upstream learning; the latter refers to learning from a set of tasks close to the target few-shot tasks. To study this research question, we want to find out which option works better in upstream learning. Put differently, we aim to analyze the influence of upstream task selection for a fixed set of downstream tasks.
Setup. We first conduct controlled experiments with Partitions 2.1-2.3, where T test is a fixed set of classification tasks and T train varies. In Partition 2.1, all tasks in T train are classification tasks (i.e., "specialized and targeted"); in the other two partitions, half or all of these are replaced with non-classification tasks (i.e., increasingly "well-rounded").

Figure 5: Relative performance gain (%) on each held-out classification task after upstream learning with 45 classification tasks, 23 classification + 22 non-classification tasks, or 45 non-classification tasks. Under multitask learning, the three settings achieve similar average gains (11.68%, 11.82%, 11.91%).
Analysis and Discussion. It is surprising at first that non-classification tasks and classification tasks are equally helpful in terms of ARG scores (see Fig. 5). On second thought, this observation is encouraging, as it demonstrates that acquiring cross-task generalization is feasible and promising even when T train and T test are drastically different. It also suggests that our categorization of tasks (§4.4) may not align with how models learn transferable skills: selecting T train tasks that have the same format and goal as the test task may not lead to optimal transfer. In retrospect, we acknowledge that our design of the ontology and partitions based on task format and goal is flawed: this is merely one aspect of "task similarity". However, understanding the complex relationship between tasks is another challenging and under-explored problem. We consider our ontology a starting point rather than a fixed final one. We use the current ontology to guide our experiments and analysis, and we hope future analysis can help build a more informative ontology.
Case Studies. We further look at cases where a test task appears in the T test of multiple partitions. For example, AI2_ARC and Race-High are in the T test of both the Random partition and the Held-out-MCQA partition. We present the results in Table 2. In general, the performance of these tasks varies with different choices of T train.

Q3. Does it help if we have more labelled data for upstream tasks?
As described in §4.3, we limit our upstream tasks to be also few-shot: classification tasks have 16 examples per class, and non-classification tasks have 32 examples. This decision is empirically determined following prior work (Schick and Schütze, 2020a,b; Gao et al., 2020) and makes our extensive analysis practical and efficient. It is possible that using more data for each upstream task can significantly improve cross-task generalization. To investigate this, we conduct a set of controlled experiments where the number of examples in upstream tasks is changed to {2, 4, 8} times the original size. We use the Held-out-Para Partition and multi-task learning for the experiments, and present the results in Fig. 6. Surprisingly, we find that the effect of using more upstream data is inconsistent across target tasks. The overall ARG for all sizes is close: even 8x larger upstream data leads to only a 4% improvement in ARG. We conclude that enlarging the size of data during upstream learning does not necessarily lead to better cross-task generalization ability. This also justifies our decision to keep upstream tasks few-shot.
Q4-Q6. Additional Analysis

Due to the space limit, we summarize our other findings below and defer the details to Appendix C.
Few-Shot → More-Shot (Q4). In practice, users may continue to collect data over time. We wonder if cross-task generalization ability is still helpful for medium/high-resource target tasks. We find that the performance gain from upstream learning is still evident when 1024 shots are available. The performance gap diminishes with millions of training examples.
Using Different Base Models (Q5). We extend our analysis on BART-Base (139M) to larger pre-trained text-to-text Transformers: BART-Large (406M) and T5-v1.1-Base (248M). Generally, performance grows with model size, with only a few exceptions, which suggests that the upstream learning methods we use are model-agnostic and can be applied to larger models to further improve few-shot performance.
Integration with PET Training (Q6). Pattern-exploiting training (PET) (Schick and Schütze, 2020a,b) was originally proposed for classification tasks and encoder language models. We test a few variants of PET training with BART-Base and try applying PET training after upstream learning. In general, we observe deteriorated performance compared to direct fine-tuning. We hypothesize that PET methods are not directly applicable to the encoder-decoder language models used in our study.

Conclusion and Future Work
In this paper, we study the problem of building better few-shot learners via acquiring cross-task generalization ability from diverse NLP tasks. Towards this goal, we introduce the CROSSFIT Challenge, a task setup that standardizes the training pipeline, data access and evaluation protocol. We also present the NLP Few-shot Gym, a repository of 160 diverse few-shot NLP tasks, to support CROSSFIT learning in different scenarios. We empirically demonstrate that cross-task generalization can be acquired via multi-task learning and meta-learning, and confirm that the selection of seen tasks influences few-shot performance on unseen tasks.
We have highlighted several unexpected or undesired observations in our analysis, for which we invite future work on understanding and combating the related issues. In addition, we envision the CROSSFIT Challenge and the NLP Few-shot Gym serving as a testbed for many interesting "meta-problems", such as (1) learning to generate prompts for diverse task formats to further improve learning efficiency (Shin et al., 2020; Gao et al., 2020); (2) learning to select appropriate source tasks to learn from during upstream learning (Zamir et al., 2018; Standley et al., 2020), potentially with task2vec methods (Achille et al., 2019; Vu et al., 2020); (3)

C Additional Results and Analysis
Q4. Does the improved cross-task generalization ability go beyond few-shot settings?
In real-world applications, annotated data usually grow for a few-shot task over time. Is upstream learning still helpful when a target task has more shots? To study this question, we use CommonsenseQA (in the Held-out-Multiple-Choice Partition), ROPES (in the Held-out-MRC Partition), and MNLI (in the Held-out-NLI Partition) as target tasks in medium- and high-resource scenarios. We take their corresponding checkpoints after upstream learning and conduct experiments in these scenarios. That is, we randomly sample {32, 64, . . . , 4096} examples from the three datasets and use them as D train. Then, we sample a D dev of the same size as D train, or of size 1024 if |D train| > 1024. We also try fine-tuning with the full dataset. 6 The performance of these settings is shown in Fig. 7.
From Fig. 7, we see that the benefits brought by upstream learning methods extend into medium-resource cases with up to 2048 training examples. For CommonsenseQA, checkpoints from upstream learning outperform direct fine-tuning significantly, even with the full dataset. This finding encourages the use of upstream learning before task-specific fine-tuning when the target task has limited annotation. On the other hand, for resource-rich tasks (e.g., MNLI), the improvement brought by upstream learning diminishes. This aligns with the findings of Wang et al. (2020), who discuss the benefits of pre-training on resource-rich tasks.
Q5. Can we further improve few-shot performance by using different/larger pre-trained models?
We have mainly used BART-Base (139M parameters) as the main network, while it is possible to further push the limits of few-shot learning by scaling up to larger models or using different model architectures. Previous work has shown that scaling up model size leads to better performance (Raffel et al., 2020; Brown et al., 2020). Moreover, since meta-learning algorithms are naturally unstable, it is important to verify whether they function as expected with larger models. In Q5, we experiment with the T5-v1.1-Base (248M)7 and BART-Large (406M) models on the Held-out-Para Partition to verify these assumptions. We only consider first-order methods, as second-order optimization with these larger models is impossible with our available computation.

6 We do five random samples of 1024 examples as D_dev and use the remaining examples in the original train set as D_train. We use the original dev set for testing.
Our results are plotted in Fig. 8. In Fig. 8(a) we compare the few-shot performance of direct fine-tuning on these three pre-trained models. On average, few-shot performance grows with model size, with a few exceptions such as QQP+T5-v1.1-Base and MRPC+BART-Large. In Fig. 8(b-c) we plot the effect brought by upstream learning methods for larger models. Except for FoMAML+T5-v1.1-Base8, upstream learning methods consistently improve few-shot performance on T_test, which verifies that the upstream learning methods we use are model-agnostic and can be applied to larger models to further improve few-shot performance.
Q6. Can we use pattern-exploiting training to replace direct fine-tuning to achieve even better performance?
Pattern-exploiting training (PET) is a method that formulates a target task as cloze-style questions (Schick and Schütze, 2020a,b; Gao et al., 2020). This approach narrows the gap between the masked language modeling objective used during pre-training and downstream task fine-tuning, and therefore leads to more efficient transfer. PET has been demonstrated to be effective with encoder models (e.g., RoBERTa); however, to the best of our knowledge, whether it is applicable to text-to-text models with auto-regressive decoders is underexplored. In Q6, we study whether applying PET-style methods to text-to-text models is feasible, and whether combining the two methods further pushes few-shot performance.
To align with the experiment settings in Schick and Schütze (2020a,b) and Gao et al. (2020), we introduce a new task partition, "Held-out-GLUE", which uses non-GLUE classification tasks as T_train and GLUE tasks as T_test. We use the top 3 patterns in (Gao et al., 2020) and use an ensemble of the three resulting models to produce the final prediction.
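The three-model ensemble can be illustrated with a simple majority vote. This is a sketch, not our exact aggregation code; `predict_fns` stands in for the three pattern-specific models and is a hypothetical name.

```python
from collections import Counter

def ensemble_predict(predict_fns, example):
    """Aggregate label predictions from one model per pattern by
    majority vote; ties are broken by the order of predict_fns."""
    votes = [fn(example) for fn in predict_fns]
    return Counter(votes).most_common(1)[0][0]
```

For example, if two of the three pattern models predict "positive" and one predicts "negative", the ensemble outputs "positive".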
Since pattern-exploiting training was originally designed for encoder models (e.g., BERT/RoBERTa), we first tried two variants that adapt it to our auto-regressive transformer models. The first variant generates the complete sentence, e.g., generating "The movie is great. A wonderful piece" from "The movie is great. A <mask> piece" for sentiment classification. The second variant generates only the word "wonderful" from "The movie is great. A <mask> piece". Though the first variant is more similar to the denoising pre-training objective of BART, we find that the second variant performs better.
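The two decoding variants can be made concrete with the example above. This is a minimal sketch; the pattern string and helper names are illustrative assumptions, not our training code.

```python
MASK = "<mask>"

def build_pet_input(sentence, pattern=" A {} piece"):
    # Cloze-style source: append the pattern with a mask token.
    return sentence + pattern.format(MASK)

def target_variant1(sentence, verbalizer, pattern=" A {} piece"):
    # Variant 1: the target is the complete filled-in sentence
    # (closer to BART's denoising pre-training objective).
    return sentence + pattern.format(verbalizer)

def target_variant2(verbalizer):
    # Variant 2: the target is only the verbalizer word
    # (the variant we find to perform better).
    return verbalizer
```

Here `build_pet_input("The movie is great.")` yields the cloze source, while the two `target_variant*` helpers produce the respective decoder targets.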
We then launch pattern-exploiting training using the second variant with the original BART-Base model. We observe a negative effect on average performance (leftmost blue bar in Fig. 9). Performance improves on CoLA and MRPC, but not on the remaining GLUE tasks. We further run experiments with and without pattern-exploiting training, starting from our upstream learning checkpoints. Still, pattern-exploiting training leads to deteriorated performance on average.
We stop further investigation since this is out of the scope of our study. Still, we believe it is important to identify the reasons and to develop pattern-exploiting methods for auto-regressive models.

D Reproducibility
Implementation. All our experiments are implemented with Huggingface Transformers9 (Wolf et al., 2020). For higher-order optimization in the meta-learning approaches, we use the higher library10. Our code has been uploaded as supplementary material, and is also open-sourced at https://github.com/INK-USC/CrossFit.
Hyper-parameters. We mainly follow the practice in (Gao et al., 2020). During few-shot fine-tuning, we select the learning rate from {1e-5, 2e-5, 5e-5} and the batch size from {2, 4, 8}, based on D_dev performance. We set the total number of updates to 1000 and the number of warmup updates to 100. We evaluate the model on D_dev every 100 steps.
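The hyper-parameter selection above amounts to a small grid search over learning rate and batch size. The sketch below is schematic: `train_and_eval` is a hypothetical stand-in for one fine-tuning run that returns D_dev performance, not a function from our codebase.

```python
from itertools import product

LEARNING_RATES = [1e-5, 2e-5, 5e-5]
BATCH_SIZES = [2, 4, 8]

def select_hyperparams(train_and_eval):
    """Return the (lr, batch_size) pair with the best D_dev score.
    Each run uses 1000 total updates with 100 warmup updates and
    evaluates every 100 steps (handled inside train_and_eval)."""
    best_cfg, best_score = None, float("-inf")
    for lr, bs in product(LEARNING_RATES, BATCH_SIZES):
        score = train_and_eval(lr=lr, batch_size=bs,
                               total_updates=1000, warmup_updates=100,
                               eval_every=100)
        if score > best_score:
            best_cfg, best_score = (lr, bs), score
    return best_cfg, best_score
```

The grid has only 9 configurations, so exhaustive search is cheap relative to a single fine-tuning run.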
Infrastructure and Runtime. Upstream learning is done with a single Quadro RTX 8000 (48GB). Upstream learning jobs finish within 3 hours on average. Fine-tuning experiments are all done with a single GPU, using either an NVIDIA Quadro GP100, NVIDIA Quadro RTX 8000, NVIDIA Quadro RTX 6000, NVIDIA GeForce GTX 1080 Ti, or NVIDIA GeForce RTX 2080 Ti, based on availability. Fine-tuning on one few-shot
[Fig. 9 x-axis labels: glue_rte, glue_sst2, glue_qnli, glue_mnli, glue_qqp, glue_cola, glue_mrpc, average]