Learn Continually, Generalize Rapidly: Lifelong Knowledge Accumulation for Few-shot Learning

The ability to continuously expand knowledge over time and utilize it to rapidly generalize to new tasks is a key feature of human linguistic intelligence. Existing models that pursue rapid generalization to new tasks (e.g., few-shot learning methods), however, are mostly trained in a single shot on fixed datasets and are unable to dynamically expand their knowledge, while continual learning algorithms are not specifically designed for rapid generalization. We present a new learning setup, Continual Learning of Few-Shot Learners (CLIF), to address the challenges of both learning settings in a unified setup. CLIF assumes a model learns from a sequence of diverse NLP tasks arriving sequentially, accumulating knowledge for improved generalization to new tasks while also retaining performance on the tasks learned earlier. We examine how the generalization ability is affected in the continual learning setup, evaluate a number of continual learning algorithms, and propose a novel regularized adapter generation approach. We find that catastrophic forgetting affects generalization ability to a lesser degree than performance on seen tasks, and that continual learning algorithms can still bring considerable benefit to the generalization ability.


Introduction
The ability to recall acquired knowledge for learning new tasks quickly and efficiently over time has been seen as a crucial metric of general linguistic intelligence (Yogatama et al., 2019). Progress on this research problem has led to remarkable improvements in recent works on few-shot learning (Brown et al., 2020; Gao et al., 2021). However, these methods have primarily focused on learning from a static set of tasks (datasets) in an offline manner, without dynamically expanding the acquired knowledge over time. (Code and data are publicly available at https://github.com/INK-USC/CLIF.) This training scheme is in contrast with the way humans process natural language (Chomsky, 2002; Montague, 1970): humans are able to process novel meanings by retaining past knowledge, combining/decomposing chunks of language into previously learned language components, and avoiding learning from scratch.
Motivated by this observation, we study whether NLP models can accumulate generalizable knowledge continuously over a sequence of tasks and learn to generalize to new tasks rapidly (i.e., with few examples). This problem has not been investigated in existing works. A related line of efforts that looks to learn from sequentially arriving tasks, known as continual learning (CL) or lifelong learning (Robins, 1995; Sun et al., 2020; de Masson d'Autume et al., 2019), mainly focuses on retaining the performance on seen tasks when the model is continuously updated on new tasks (i.e., on overcoming the catastrophic forgetting issue).
To study this ability, we propose the Continual LearnIng of Few-shot Learners (CLIF) setup (illustrated in Figure 1) to simulate the challenge: in CLIF, the model learns over a sequence of NLP tasks (arriving one by one, without revisiting) and is then evaluated in terms of (i) generalization to new (few-shot learning) tasks and (ii) preserving its performance on seen tasks. We train and evaluate over a diverse set of NLP tasks, spanning entity typing, sentiment analysis, natural language inference, and other classification tasks.
With the CLIF setup, we conduct a series of experiments on existing models, in order to understand the relationship between continuous knowledge accumulation and few-shot generalization. Our first analysis is to understand how the generalization ability evolves during continual training, and whether catastrophic forgetting affects the acquisition of generalization ability. We find a negative effect of catastrophic forgetting on the generalization ability, and a stronger negative effect on the performance over the seen tasks.
In a follow-up analysis, we find most existing CL methods hardly benefit models' generalization ability, even though they are shown to alleviate catastrophic forgetting. This implies nontrivial challenges in accumulating knowledge that can help model generalization. Inspired by recent research on hypernetworks for few-shot learning (Requeima et al., 2019) and the continual learning approach using hypernetworks (von Oswald et al., 2020), we propose Bi-level Hypernetworks for Adapters with Regularization to address the challenges of CLIF. We evaluate these approaches extensively by varying the number of training examples and the orders of tasks at training.
To summarize, the main contribution of this work is threefold: (1) we propose the CLIF setup, its data streams, and protocols to comprehensively evaluate lifelong knowledge accumulation in NLP; (2) we compare existing algorithms to demonstrate their weaknesses; and (3) we propose Bi-level Hypernetworks for Adapters with Regularization as a solution to inspire future work.

The CLIF Problem
We assume there is an NLP model $f$ trained continually on different tasks over time (i.e., continual learning) that must then rapidly generalize to many unseen tasks with few-shot examples (i.e., few-shot adaptation). In the continual learning stage, the model encounters an ordered list of $N_u$ upstream tasks $[T_u^1, \ldots, T_u^{N_u}]$, where each task has its own training and test sets. To test the few-shot learning ability of the sequentially trained model $f$, we then adapt it to each of a set of $N_v$ few-shot tasks individually, where only a few training examples are available for each unseen task. We name this learning setting CLIF, which stands for Continual Learning of Few-shot Learners. In addition to the traditional objective in CL of preserving performance on seen tasks, in CLIF it is also crucial to retain generalizable knowledge to achieve better few-shot learning performance at the end of training.
Evaluation Protocol As illustrated in Figure 2, there are three major aspects for evaluating a method in the CLIF setting: few-shot performance, instant performance, and final performance. 1) Few-shot Performance. First, we evaluate the continually trained model $f$ on a set of unseen tasks by fine-tuning it for each task $T_v^i$ individually with a few annotated examples, after the training over the upstream tasks $T_u^1..T_u^{N_u}$ ends. Thus, we can assess the few-shot generalization ability. We note the few-shot accuracy for a task $T_v^i$ as $s_{FS}^i = F(Y_v^i, \hat{Y}_v^i)$, where $\hat{Y}_v^i$ is the set of model predictions, $Y_v^i$ is the set of ground-truth labels, and $F$ is the metric function (e.g., accuracy). We report $s_{FS}$ averaged over all few-shot tasks, i.e., $s_{FS} = \frac{1}{N_v}\sum_{i=1}^{N_v} s_{FS}^i$. We also compute a relative improvement $\Delta_{FS} = \frac{s_{FS} - s'_{FS}}{s'_{FS}}$ over the performance $s'_{FS}$ of models separately trained on each few-shot task.
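The few-shot metrics above can be sketched as follows (a minimal illustration; the function and variable names are our own, not from the paper):

```python
def few_shot_metrics(per_task_acc, single_task_acc):
    """Compute s_FS and the relative improvement Delta_FS.

    per_task_acc: few-shot accuracies s_FS^i of the continually trained model
    single_task_acc: accuracies of models trained on each few-shot task alone
    """
    s_fs = sum(per_task_acc) / len(per_task_acc)            # average over N_v tasks
    s_single = sum(single_task_acc) / len(single_task_acc)  # reference s'_FS
    delta_fs = (s_fs - s_single) / s_single                 # relative improvement
    return s_fs, delta_fs
```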
2) Instant Performance. We evaluate the performance on an upstream task $T_u^i$ right after the model $f$ finishes learning it. We note the set of model predictions on the test set of task $T_u^i$ right after the model $f$ learns task $j$ as $\hat{Y}_u^{i,j}$. The instant performance on task $T_u^i$ is then defined as $s_{inst}^i = F(Y_u^i, \hat{Y}_u^{i,i})$. For example, we evaluate the performance of $f$ on $T_u^2$ after the model $f$ is trained on the data of $T_u^1$ and $T_u^2$, before further training it on $T_u^3$. The performance of $f$ on $T_u^2$ can thus tell us how well the model transfers its knowledge from learning $T_u^1$ to learning $T_u^2$, using the performance when $f$ is trained only on $T_u^2$ as a reference. We compute the average instant performance over all upstream tasks, $s_{inst} = \frac{1}{N_u}\sum_{i=1}^{N_u} s_{inst}^i$. We additionally compute a relative improvement over the performance $s'_{inst}$ of models separately trained on each upstream task to indicate the benefit of upstream learning.

[Table 1 (excerpt), few-shot tasks ($T_v$) in the learning stage: SuperGLUE-CB, Dbpedia-14, Wiki-QA, emo, Yelp-Polarity, ethos-religion, tab-fact, financialphrasebank, ANLI, ethos-race]
3) Final Performance. We also evaluate the performance of $f$ at the end of continual learning over the upstream tasks, to measure how much the model $f$ forgets about a task after it learns to solve more tasks. The final accuracy $s_{final}^i$ of a task $T_u^i$ is defined as $F(Y_u^i, \hat{Y}_u^{i,N_u})$. Similarly, we report the averaged final accuracy over all tasks, noted as $s_{final} = \frac{1}{N_u}\sum_{i=1}^{N_u} s_{final}^i$. For a single model, forgetting can be quantified as $s_{inst} - s_{final}$.
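Given a matrix `acc[i][j]` of accuracies on upstream task $i$ after training on the first $j{+}1$ tasks, the instant accuracy, final accuracy, and forgetting can be computed as in this sketch (names are ours):

```python
def cl_metrics(acc):
    """acc[i][j]: accuracy on upstream task i after learning tasks 0..j.

    Returns average instant accuracy, average final accuracy, and forgetting.
    """
    n = len(acc)
    s_inst = sum(acc[i][i] for i in range(n)) / n        # right after learning task i
    s_final = sum(acc[i][n - 1] for i in range(n)) / n   # after the whole sequence
    return s_inst, s_final, s_inst - s_final
```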
Challenges The CLIF setting is particularly challenging for existing few-shot learning methods. Most few-shot learning methods assume that the upstream training datasets for all tasks are always available and there is no temporal order for learning. Hence, the upstream tasks can be learned jointly in a multi-task learning setting. However, the CLIF problem follows a continual learning setup, where the tasks are visited sequentially without revisiting. Thus, methods relying on random sampling from a task distribution are not applicable.

Tasks and Data Streams
To push the CLIF challenge toward a more practical setup, we consider a diverse set of NLP tasks for the CL and few-shot learning stages. We consider two dataset combinations, referred to as CLIF-26 and CLIF-55, summarized in Table 1. In the first combination, following prior work, we use the GLUE benchmark (Wang et al., 2019a) as our upstream tasks for the CL stage, which consists of $N_u = 9$ tasks. We then evaluate the few-shot learning ability over $N_v = 17$ DivFSL tasks, spanning diverse NLP tasks including sentiment analysis, entity typing, and natural language inference. In CLIF-55, we train and test the model over $N_u = 45$ and $N_v = 10$ tasks selected from the Huggingface datasets library. The selected datasets span a broad family of NLP tasks, including natural language inference, emotion classification, topic classification, fact checking, hate speech detection, paraphrasing, and others.
To adapt it for our learning setting, we specify an order of the tasks presented to the model for CLIF-26 and CLIF-55 (details in Appendix A). We also consider alternative task orders in our experiments. The model sequentially visits each task during training. We limit the number of training examples in each GLUE task in CLIF-26 to 10,000 to avoid overly imbalanced datasets. For CLIF-55, we use 90 examples per class for continual learning. We use $k = 16$ examples per class in few-shot learning tasks for both CLIF-26 and CLIF-55 if not specified, and include more setups of $k$ in the experiments. As the test labels for GLUE are not publicly available, we report performance on validation sets. We convert regression tasks (e.g., STS-B) to binary classification tasks by setting the threshold at the midpoint between the maximum and minimum regression scores.
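The regression-to-classification conversion can be sketched as follows (the label strings are illustrative, not the paper's):

```python
def regression_to_binary(scores, min_score, max_score):
    """Binarize regression targets at the midpoint of the score range."""
    threshold = (min_score + max_score) / 2
    return ["high" if s >= threshold else "low" for s in scores]
```

For STS-B, whose scores range from 0 to 5, the threshold would fall at 2.5.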
All examples are converted into a sequence-to-sequence question-answering format following (McCann et al., 2018) to allow a single model to solve all tasks. We consider an exact match between the generated answer span and the ground-truth span as a correct prediction. For both the upstream tasks and the few-shot tasks in CLIF-26 and CLIF-55, we use prediction accuracy as the metric function.
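The exact-match accuracy used as the metric function can be sketched as follows (the whitespace normalization is our assumption):

```python
def exact_match_accuracy(predictions, references):
    """Fraction of generated answer spans that exactly match the ground truth."""
    matches = sum(p.strip() == r.strip() for p, r in zip(predictions, references))
    return matches / len(references)
```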

Method
This section presents baseline methods that set lower bounds for the CLIF problem, and approaches to improve the performance. We characterize each approach by its base model and its learning algorithm. We first introduce the base models in our study (Sec. 3.1); then, we introduce a few existing methods for continual learning and continual meta-learning (Sec. 3.2). Finally, we present a novel regularized bi-level adapter generation framework to better address the CLIF problem (Sec. 3.3).

Base NLP Models
BART and BART-Adapter. As we formulate the NLP tasks in the CLIF problem in a unified text-to-text format, we use pre-trained language models (LMs) as the architecture of the model $f$ and fine-tune the entire model during training. We mainly use the BART-base model (Lewis et al., 2020). In the adapter-based variant, we insert adapter layers between the transformer layers of BART, so that the hidden representation $h^{(\ell)}$ at layer $\ell$ is updated as $h^{(\ell)} \leftarrow f_a^{(\ell)}(h^{(\ell)})$, where $f_a^{(\ell)}$ is the adapter layer at layer $\ell$. Only the adapters are learned during training, while the BART model is frozen. We note the two approaches as BART and BART-Adapter respectively.
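A bottleneck adapter layer of the kind inserted between frozen transformer layers can be sketched as follows (a toy NumPy version; the residual form and ReLU nonlinearity are common choices in the adapter literature, not necessarily the exact ones used here):

```python
import numpy as np

class Adapter:
    """Bottleneck adapter: h + W_up @ relu(W_down @ h)."""
    def __init__(self, d_model, d_bottleneck, rng):
        self.w_down = rng.standard_normal((d_bottleneck, d_model)) * 0.01
        self.w_up = rng.standard_normal((d_model, d_bottleneck)) * 0.01

    def __call__(self, h):
        z = np.maximum(self.w_down @ h, 0.0)  # down-project + ReLU
        return h + self.w_up @ z              # up-project + residual connection
```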
Hyper-Networks for Adapter Generation. In addition to BART and BART-Adapter, we also consider a HyperNetwork (HNet) architecture. The hypernetwork, noted as $g$, takes a task representation $z$ as input and generates the model parameters of another prediction model, noted as $f$, to solve the task. In few-shot learning, $z$ is usually computed as the average representation of the training examples of the task, $z = \frac{1}{|D_{tr}^i|}\sum_{(x,y)\in D_{tr}^i} f_e(x, y)$, where $D_{tr}^i$ is the training set of the task $T^i$ and $f_e$ is an encoder model. In our case, we use a BART model as $f_e$ and feed it the concatenation of $x$ and the label $y$ in text format to obtain the task representation $z$. As the model allows flexible control of model parameters with training examples, it is broadly applied for few-shot learning (Requeima et al., 2019; Gidaris and Komodakis, 2018); besides, $z$ can also be randomly initialized and learned end-to-end (Ha et al., 2017). As the parameter space of large-scale PTLMs like BART is huge, following (Ye and Ren, 2021), we generate model parameters only for the adapters.
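The hypernetwork idea can be sketched as follows (our own toy implementation; the actual model generates adapter weights for every BART layer, and the MLP shape here is an assumption):

```python
import numpy as np

class HyperNetwork:
    """Two-layer MLP g: task representation z -> flat vector of adapter weights."""
    def __init__(self, d_task, d_hidden, n_adapter_params, rng):
        self.w1 = rng.standard_normal((d_hidden, d_task)) * 0.01
        self.w2 = rng.standard_normal((n_adapter_params, d_hidden)) * 0.01

    def __call__(self, z):
        return self.w2 @ np.tanh(self.w1 @ z)

def task_representation(example_encodings):
    """z: average encoder representation over the task's training examples."""
    return np.mean(example_encodings, axis=0)
```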
In summary, we consider BART fine-tuning, BART-Adapter learning, and HNet for adapter generation as three base NLP models. In Section 3.2, we introduce algorithms to learn these models in the CLIF setting.

Baseline Learning Algorithms
Single Task Learning To understand the reference performance of a base model on an upstream task without any knowledge transfer, we apply the single task learning (STL) method, which trains and tests a model $f$ on the dataset of each task in isolation. In this case, we ignore the sequential nature of the CLIF problem, so we can use this STL performance to assess the effectiveness of different continual learning methods (introduced below). Ideally, a valid CL algorithm should have better few-shot accuracy than STL, meaning that it accumulates knowledge and effectively transfers it for learning. Similarly, to know the reference performance on the few-shot tasks, we learn a model $f$ for each few-shot task on the given examples, without any upstream training, so that we can use this performance to assess how well a CLIF method improves the generalization ability.
Continual Learning Algorithms As a straightforward baseline method, we use Vanilla to denote simply training the model $f$ sequentially on the upstream tasks. Specifically, it trains the model $f$ on $T_u^i$ until its performance converges and then continues training $f$ on the data of $T_u^{i+1}$.
Note that access to the data of previous tasks is not allowed in CL. We also consider CL algorithms such as EWC (Kirkpatrick et al., 2017), MbPA++ (de Masson d'Autume et al., 2019), and meta-MbPA (Wang et al., 2020) in our experiments. We use an online variant of EWC (Schwarz et al., 2018). EWC regularizes the change of important model parameters during training. The MbPA++ method performs test-time adaptation over a few training examples stored in the memory. The meta-MbPA method includes a meta-learning objective to adapt fast.
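The EWC regularizer penalizes changes to parameters that were important for earlier tasks, weighting each squared parameter change by (an online estimate of) its Fisher information. A minimal sketch with hypothetical names:

```python
def ewc_penalty(params, anchor_params, fisher, lam):
    """lam * sum_k F_k * (theta_k - theta*_k)^2 over all parameters.

    params: current parameter values
    anchor_params: parameter values after learning earlier tasks
    fisher: per-parameter importance estimates (Fisher information)
    lam: regularization strength
    """
    return lam * sum(f * (p - a) ** 2
                     for p, a, f in zip(params, anchor_params, fisher))
```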
As a comparator that does not suffer from forgetting, we also report the results of multi-task learning over upstream tasks (MTL) for reference.
Hyper-Networks for CL. von Oswald et al. (2020) proposed a hypernetwork-based continual learning algorithm, where the high-level idea for mitigating catastrophic forgetting is to penalize the hypernetwork for changes in the generated model weights of previous tasks when it learns a new task. While the original work generates the entire parameters of a model, we adapt it to PTLMs by generating the weights of adapters only. We note the approach as HNet-Reg.
Specifically, when the model has just finished learning the task $T_u^{i-1}$ and right before learning the task $T_u^i$ in the continual learning stage, we compute the adapter weights generated by the current hypernetwork for all prior tasks $T_u^1..T_u^{i-1}$, i.e., $\theta_j^{i-1} = g(z_j)$ for $j < i$, where the generation is controlled by applying the hypernetwork $g$ to the stored task representations $z_j$ of previous tasks. Here, the task representation $z_i$ for task $T_u^i$ is randomly initialized before learning the task and optimized jointly while learning the task. Then, in each step of learning $T_u^i$, we randomly sample a prior task $T_u^j$ ($j < i$) to regularize the hypernetwork learning. It penalizes the $\ell_2$ distance between the adapter weights $\theta_j$ generated at the current step and the pre-computed ones, i.e., $\|\theta_j - \theta_j^{i-1}\|_2^2$. Therefore, we prevent the hypernetwork $g$ from changing its output for a prior task too much during the continual learning stage, so that knowledge accumulation is better preserved in the learned model.
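The regularization step can be sketched as follows (names are hypothetical; in the actual method the $\theta$ are full adapter weight vectors generated per task):

```python
def hnet_reg_term(hypernet, stored_task_reps, cached_weights, j):
    """Squared L2 distance between the hypernetwork's current output for prior
    task j and the adapter weights cached before learning the new task."""
    theta_now = hypernet(stored_task_reps[j])  # current generation for task j
    theta_old = cached_weights[j]              # pre-computed before the new task
    return sum((a - b) ** 2 for a, b in zip(theta_now, theta_old))
```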
Limitations. EWC and HNet-Reg are not well-designed for the CLIF problem, which additionally aims to improve few-shot generalization on unseen tasks after continual learning. While the test-time adaptation in MbPA++ and meta-MbPA may benefit few-shot learning, this ability is not studied in those works. Besides, as these two algorithms store real examples of previous training tasks, they are not applicable in privacy-sensitive applications where data from earlier tasks is no longer accessible, which is a typical scenario in continual learning.

Our Extension: Bi-level Hypernetworks for Adapters with Regularization
Inspired by hypernetwork approaches for few-shot learning and continual learning, we extend the hypernetwork-based CL methods for CLIF. We present a novel method, Bi-level Hypernetwork for Adapters with Regularization (BiHNet+Reg), which learns to use bi-level task representations to generate adapter weights for learning a fast-adapting model over a sequence of tasks, while mitigating the forgetting effect via regularization. As shown in Figure 3, the proposed method consists of three components: (1) a context predictor to generate bi-level task representations (i.e., high-resource and few-shot representations) from training examples, (2) a hypernetwork to generate weights of adapters given the task representations, and (3) a regularization term that discourages weight changes for seen tasks to avoid forgetting, following (von Oswald et al., 2020). We discuss each individual component below.
Context Predictor. We propose to generate two task representations for each task $t$, modeling it in the high-resource and few-shot cases respectively, denoted as $z_h^t$ and $z_f^t$, with a frozen BART model. The high-resource representations are used to encourage knowledge transfer during continual learning; the few-shot task representations help us mimic few-shot tasks in the few-shot learning stage for better generalization, similar to meta-learning. Specifically, we use an LM (e.g., BART) as the context representation model $R$ for encoding an example $(x, y)$: we feed $x$ and $y$ to the encoder and the decoder of the model $R$, and use the last-layer activation as the latent representation. The high-resource task representation is then computed as the average of all examples' representations in task $t$, noted as $z_h^t = \frac{1}{|D_t|}\sum_{(x_i, y_i)\in D_t} R(x_i, y_i)$, while the few-shot task representation $z_f^t$ uses the average over a limited number (say, $K$) of sampled examples. Note that the high-resource representations of upstream tasks are stored in a memory module over time during continual learning. In the few-shot learning stage, we set $K$ to the number of given examples, so $z_h = z_f$ for all tasks.
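The bi-level representations can be sketched as follows (a pure-Python toy where example representations are precomputed vectors; the sampling scheme is our assumption):

```python
import random

def bilevel_task_reps(example_reps, k, seed=0):
    """z_h: mean over all example representations; z_f: mean over k sampled ones."""
    def mean(vectors):
        return [sum(col) / len(vectors) for col in zip(*vectors)]

    z_h = mean(example_reps)                               # high-resource rep
    z_f = mean(random.Random(seed).sample(example_reps, k))  # few-shot rep
    return z_h, z_f
```

When `k` equals the full dataset size, the two representations coincide, matching the few-shot stage described above.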
Adapter-Wise Hypernetworks. Following the practice introduced in Sec. 3.1, we use a hypernetwork $g$ to generate the weights of adapters between the layers of the frozen BART model $f$. During training, we use the high-resource and sampled task representations $z_h^t$ and $z_f^t$ to generate adapter weights separately, noted as $\theta_t^h$ and $\theta_t^f$. We optimize the prediction loss with both adapters.
Regularization. Given that the hypernetwork is the only trainable part of our model, we impose regularization on the generated adapters to mitigate forgetting, following HNet-Reg introduced in Sec. 3.2. While our BiHNet is trained to generate adapters from both high-resource and low-resource task representations, we find it sufficient to only store and regularize the outputs from high-resource task representations.

Summary and Highlights
To sum up, our proposed method first generates bi-level task representations for training adapter-wise hypernetworks, with a regularization term dedicated to avoiding forgetting over time. Unlike replay-memory-based CL approaches (e.g., MbPA++ (de Masson d'Autume et al., 2019)), our method does not store any real training examples. Instead, it stores task representations in memory, which allows the method to be applied in privacy-sensitive scenarios.

Results and Analysis
We address our two major research questions in this section: (1) how models accumulate generalizable knowledge over time in a CL setup compared to offline setups given potential catastrophic forgetting, and (2) whether continual learning approaches reduce catastrophic forgetting of both seen-task performance and generalizable knowledge. We experiment with various combinations of the model architectures in Sec. 3.1 and the learning algorithms in Sec. 3.2. We note a method by its model architecture and the CL algorithm applied, e.g., BART-Vanilla, BiHNet-EWC. We include implementation details in Appendix A.

Examining Knowledge Accumulation
In this section, we present an analysis of the model's ability to acquire generalizable knowledge in the offline and CL setups. We note that BiHNet methods, which correspond to learning to generate adapters, should be compared with BiHNet-Single and BART-Adapter-Single, zero-knowledge baselines that learn to generate or learn adapters from random initialization; similarly, BART methods should be compared with BART-Single. We focus on identifying challenges in CLIF, and leave discussions of methodology to the next subsection.
Q1: Is knowledge from upstream tasks helpful for a model's few-shot generalization in offline and continual learning setups? To answer this question, we compare the performance of MTL with learning separate models per few-shot task without learning upstream tasks. Table 2 summarizes the results. On both the CLIF-26 and CLIF-55 datasets, we see BiHNet-MTL outperforms the zero-knowledge baselines in few-shot accuracy by 0.4% and 1.0%, which implies upstream tasks are helpful for few-shot generalization in standard offline learning setups. For BART models, we notice BART-MTL improves over BART-Single on the CLIF-55 dataset by 2.5%. However, we notice the opposite on CLIF-26. Given that all BART parameters are optimized in these models, we hypothesize that BART-MTL may have suffered from forgetting of knowledge in the pre-trained BART model itself, while in the adapter and BiHNet models the BART model is frozen. Therefore, in the rest of the section, we focus more on the BiHNet approaches.

Q2: How does the model's generalization ability evolve over time? We focus on BiHNet-Vanilla and BART-Vanilla approaches and answer three sub-questions.
Is the knowledge being monotonically accumulated over upstream tasks? In comparison to the two zero-knowledge baselines, we notice BiHNet-Vanilla generally improves both instant accuracy (4.2% on CLIF-26 and 6.8% on CLIF-55) and few-shot accuracy (0.8% on CLIF-55), except in few-shot accuracy on CLIF-26 (-0.4%). The results confirm positive knowledge accumulation to some extent. In Figure 4, we plot the few-shot accuracy on CLIF-26 as the model sequentially visits each upstream training task. We note the few-shot accuracy of BiHNet-Vanilla does not monotonically increase, which implies interference between the upstream learning tasks or forgetting of generalizable knowledge.

Does the order of the tasks matter? Figure 5 presents the performance of methods under different orders of tasks on CLIF-26. We order the tasks by increasing and decreasing relevance to the few-shot learning tasks, where relevance is defined as the few-shot accuracy when the model transfers from a single upstream task. The results show that in both orders BiHNet-Vanilla is less competitive than BART-Adapter-Single. This implies that in continual learning, knowledge accumulation is less robust without CL algorithms.

Q3: Does the model's catastrophic forgetting hinder its knowledge accumulation? In Table 2, we see clear differences between the final accuracy of the Vanilla and MTL approaches (around 20 points), which verifies catastrophic forgetting of seen-task performance when training examples are not i.i.d. However, we find the gap between MTL and Vanilla training is small for few-shot learning performance, where BART-Vanilla is even better than BART-MTL, which can be a positive outcome of adequate forgetting alleviating overfitting (Wang et al., 2020). This indicates that catastrophic forgetting influences generalization ability to a lesser degree than it affects seen-task performance.

Effect of Continual Learning Algorithms
With the insights obtained from the earlier questions, we now analyze whether baseline continual learning algorithms and the proposed approach help knowledge accumulation and improve models' (few-shot) generalization ability.

Q2: Does mitigating catastrophic forgetting better retain generalization ability? On CLIF-26, comparing BiHNet-Vanilla and BiHNet-Reg, we notice relative improvements of 2.3% and 0.4% in few-shot accuracy and instant accuracy respectively. We see a similar trend on CLIF-55. From Figure 5, we see BiHNet-Reg outperforms BiHNet-Vanilla in the default and decreasing-relevance orders, while we observe an outlier among the BiHNet-Reg runs in the increasing-relevance order. From Figure 4, we see the few-shot learning accuracy improves more stably as BiHNet-Reg learns more upstream tasks.
Q3: Does BiHNet-Reg improve over HNet-Reg? The major differences of BiHNet-Reg compared to HNet-Reg (von Oswald et al., 2020) are (1) few-shot task representations and (2) inferring task representations with context predictors instead of learning them as trainable embeddings. As an ablation study, we progressively replace the two components of BiHNet, as shown in Table 3. We see that removing the few-shot task representations causes the few-shot accuracy to drop on both datasets, by 1.08 and 0.33 points. Using trainable task embeddings in place of task encoders causes the few-shot accuracy and final accuracy to drop on CLIF-55; however, on CLIF-26, we see a slight performance improvement. This implies task encoders are useful when the number of sequential training tasks is large (as in CLIF-55), while trainable task embeddings can be an alternative otherwise. In additional experiments varying the number of training examples on CLIF-26 and CLIF-55, we observe that BiHNet-Reg always achieves the best performance, and the improvement is generally more significant when the training sets are smaller.

Discussion.
Our results indicate that BiHNet-Reg can effectively improve knowledge accumulation over time compared to similar adapter learning frameworks (BiHNet-Single and BART-Adapter-Single). However, BiHNet-Reg does not rival BART-Single in terms of few-shot learning accuracy. We believe this is due to the restricted model capacity of adapters compared to fine-tuning the entire transformer. This opens up future work on improving continual learning algorithms that are compatible with PTLM fine-tuning.
Continual Meta-Learning

There exists literature that studies continual meta-learning outside NLP applications, with various definitions of the problem. Some prior works (Xu et al., 2019; de Masson d'Autume et al., 2019; Wang et al., 2020) aim to develop algorithms that allow fast recovery of previous performance when a few training examples of an early task become available again at test time. Caccia et al. (2020) proposed a setup where models visit a sequence of potentially reoccurring tasks and measured online cumulative performance as the metric. Antoniou et al. (2020) assume the model visits a sequence of few-shot classification tasks, while the test tasks consist of classes seen at training. The problem setup of Jerfel et al. (2019) is most related to ours in learning to perform few-shot learning on new tasks better, but it is only studied for image classification with a much smaller number of tasks. To the best of our knowledge, our work is the first to study continual knowledge accumulation for few-shot learning across diverse NLP tasks with large-scale transformer models.

Conclusion
We present the Continual Learning of Few-Shot Learners (CLIF) challenge to simulate the scenario where a learner continually accumulates (generalizable) knowledge over a sequence of NLP tasks while retaining its performance on seen tasks. We propose evaluation protocols to study the performance of existing continual learning algorithms, and present our method BiHNet-Reg. We demonstrate the potential of building an NLP system that, through continual training, can perform more tasks while also becoming more efficient at mastering new tasks. Future work includes extending our work to task-agnostic scenarios where the distribution of data may shift continuously, and studying algorithms for continual refinement of large-scale pre-trained models with emerging unlabeled data.

A Implementation Details
We tune hyperparameters, except the number of steps of few-shot training, on the validation sets of the upstream continual learning tasks. We tune the hyperparameters on CLIF-26 and apply the same values to CLIF-55 for the same approaches. We tune learning rates by enumerating over [3e-4, 1e-4, 3e-5, 1e-5], and finally use a learning rate of 3e-5 for all MTL approaches and fine-tuned BART approaches (e.g., BART-EWC, BART-Vanilla), and a learning rate of 1e-4 for BiHNet, HNet, and BART-Adapter-Single. We use a batch size of 64 across experiments. We train the model for at most 100 epochs on each training task with a patience of 3 epochs without validation performance improvement. Before training on a new task, we revert the model to the checkpoint with the best validation performance on the previous task. In the few-shot learning stage, we use the same learning rate and train the model for 400 epochs, assuming no validation sets to perform early stopping. The number of training steps is decided based on the performance of BiHNet-Vanilla on the airline, conll, and disaster tasks. We set the hidden size of the adapters inserted between layers of the BART transformer to 256 and that of the classification head to 64. The weight generator in BiHNet is implemented as a two-layer MLP with a hidden size of 32. For replay-based approaches (MbPA++ and meta-MbPA), we store all examples following these works and randomly draw minibatches to replay every 100 training steps. For BiHNet, HNet, and EWC, we set the regularization strength (the coefficient of the regularization loss term) to 0.01 without further tuning. We use a sample size of 64 to compute the few-shot task representation on CLIF-26 and 10 for CLIF-55 at training. Experiments are run on Nvidia Quadro 6000 or Quadro 8000 GPUs with CUDA version 10.1 installed. Throughout the experiments (including the hyperparameter search), we run each method with three random seeds.
Details of Datasets. For CLIF-26, we use the train, validation, and test splits from prior work. For a continually trained model, we evaluate its few-shot ability over 5 different partitions of train-test splits of a single few-shot task. For CLIF-55, we use the train, validation, and test splits provided in the datasets library. The few-shot training and validation sets are random samples of the official train and validation splits, while we do not subsample the test split. Similarly, we evaluate few-shot learning ability over 5 different samples of training and validation examples.
Details of Task Orders. Table 7 summarizes the list of 45 upstream training tasks and 10 few-shot training tasks. Table 6 further shows the order of the continual learning tasks.

B Parameter Efficiency
We show the statistics of trainable and total parameters for each compared architecture on CLIF-26 in Table 4. In our default settings, BiHNet has twice as many trainable parameters as BART and more than three times as many as BART-Adapter. However, we can significantly reduce the number of parameters by setting the hidden size $d$ of the hypernetwork smaller than the number of tasks. We reduce $d$ to 4 and summarize the results in Table 5. We notice the approach achieves instant accuracy and few-shot accuracy on par with BiHNet-Reg in the standard setup. It achieves lower final accuracy compared to the default setup, but the score is still more competitive than baselines such as BART-MbPA++, BART-meta-MbPA, and BiHNet-Vanilla.