CITB: A Benchmark for Continual Instruction Tuning

Continual learning (CL) is a paradigm that aims to replicate the human ability to continually learn and accumulate knowledge without forgetting previously acquired knowledge, and to transfer it to new tasks. Recent instruction tuning (IT) fine-tunes models to make them more adaptable to solving NLP tasks in general. However, it remains unclear how instruction tuning works in the context of CL. We formulate this challenging yet practical problem as Continual Instruction Tuning (CIT). In this work, we establish a CIT benchmark consisting of learning and evaluation protocols. We curate two long dialogue task streams of different types, InstrDialog and InstrDialog++, to study various CL methods systematically. Our experiments show that existing CL methods do not effectively leverage the rich natural language instructions, and that fine-tuning an instruction-tuned model sequentially can yield similar or better results. We further explore different aspects that might affect the learning of CIT. We hope this benchmark will facilitate more research in this direction.


Introduction
Recent studies have shown that multi-task instruction tuning (IT) makes language models better zero-shot learners (Wei et al., 2022; Sanh et al., 2022; Wang et al., 2022; Chung et al., 2022; Longpre et al., 2023). IT fine-tunes pre-trained language models (PLMs) on various tasks with natural language instructions (Fig. 1) and can achieve remarkably good generalization to unseen tasks.
Despite their impressive performance, these instruction-tuned PLMs still fall short on domain-specific tasks due to limited exposure to relevant knowledge and vocabulary in the training corpus (Luo et al., 2023). Moreover, PLMs are static after deployment, and there is no mechanism to update them or adapt to a changing environment (Zhang et al., 2023; Bubeck et al., 2023). Continual learning (CL) aims to enable information systems to learn from a continuous data stream across time (Biesialska et al., 2020). Therefore, it is promising to leverage CL for instruction-tuned PLMs to continually adapt to new domains and tasks without costly re-training. Despite its importance, it is non-trivial to alleviate catastrophic forgetting, a phenomenon in which previously learned knowledge or abilities are degraded due to overwritten parameters (McCloskey and Cohen, 1989). Moreover, enabling knowledge transfer is also essential since many tasks are similar and share common knowledge (Ke et al., 2021). Code and data are available at https://github.com/hyintell/CITB.
Unfortunately, there is little work on applying CL to IT, and it has only been explored in rather specific settings. Scialom et al. (2022) continually fine-tune a T0 model (Sanh et al., 2022) on eight new tasks with memory replay to avoid forgetting. Despite its effectiveness, this approach needs to store a large number of instances per task in memory, which is too costly when scaling to a larger number of tasks.
In addition, they do not study knowledge transfer between tasks. Yin et al. (2022) propose to use the instructions of history tasks to reduce forgetting and enable knowledge transfer. However, they do not compare with commonly adopted CL methods, so the effectiveness of those methods remains unknown. Moreover, they only evaluate the model on the newly learned tasks while largely ignoring tasks learned during the multi-task training stage (Fig. 1). They also overlook the intrinsic ability of the instruction-tuned model on unseen tasks. Lastly, the two works use different evaluation metrics and setups, which creates an obstacle to comparing techniques and hinders the development of this field.
To this end, we first formulate this practical yet under-explored problem as Continual Instruction Tuning (CIT). Then, we propose a first-ever benchmark suite to study CIT systematically. Our benchmark, CITB, consists of both learning and evaluation protocols and is built on top of the recently proposed SuperNI dataset (Wang et al., 2022). We create two CIT task streams: the InstrDialog stream, which consists of 19 dialogue-related tasks spanning three categories, and the InstrDialog++ stream, which includes all the tasks in the InstrDialog stream plus 19 additional tasks selected from broad categories and domains. Using the two long task streams, we implement various CL methods to study forgetting and knowledge transfer under the setup of CIT. We find that directly fine-tuning an instruction-tuned model sequentially yields performance competitive with existing CL methods. With further investigation, we find that rich natural language instructions enable knowledge transfer and reduce forgetting, which is barely leveraged by current CL methods. We conduct comprehensive experiments to explore what affects the learning of CIT. We hope our CITB benchmark will serve as a helpful starting point and encourage substantial progress by the community in this practical setting. To summarize, our main contributions are:
• We formulate the problem of CIT and establish a benchmark suite consisting of learning and evaluation protocols.

Related Work
Instruction Tuning. Much effort has been made recently to use natural language instructions to solve multiple tasks concurrently or to align with human preferences (Touvron et al., 2023; Zhou et al., 2023; OpenAI, 2023). Unlike simple and short prompts (Liu et al., 2021), natural language instructions (Fig. 2) can be more comprehensive, including components such as a task definition, in-context examples (Brown et al., 2020), and explanations. Through IT, PLMs learn to complete tasks by following instructions, which enables them to solve new tasks without further training (i.e., generalization ability). Ideally, we expect an instruction-tuned model to understand any given task instruction, so that an end user can directly leverage the model to solve the task without annotating a large dataset and training on it. Unfortunately, although instruction-tuned models such as FLAN (Wei et al., 2022; Longpre et al., 2023), T0 (Sanh et al., 2022), and Tk-Instruct (Wang et al., 2022) show strong generalization on their evaluation tasks, there is still a sizeable gap compared with supervised training, which limits the usage of these models. From a practical point of view, a desirable instruction-tuned model should be able to extend its ability by continually learning those under-performing tasks or any new task, while not forgetting the old ones.
Continual Learning.In contrast to multi-task learning, continually fine-tuning a model on tasks might lead to catastrophic forgetting (McCloskey and Cohen, 1989), where the model forgets previously acquired knowledge after learning new tasks.
In CL literature, approaches to overcoming catastrophic forgetting can be grouped into three categories (Biesialska et al., 2020;Ke and Liu, 2023).
Regularization-based methods use an additional loss to prevent important parameters of previous tasks from being updated (Kirkpatrick et al., 2017; De Lange et al., 2019). Replay-based methods store and replay a small subset of training data from previous tasks to prevent forgetting (Rebuffi et al., 2017; Scialom et al., 2022). Architecture-based methods introduce task-specific components for new tasks and isolate parameters of old tasks (Madotto et al., 2021; Zhu et al., 2022). However, the effectiveness of these CL methods for CIT remains unexplored.

Figure 2 shows an example natural language instruction in the SuperNI format:

Definition: "In this task, you are given a sentence and a question, you would be asked to create the answer which is contained in the sentence provided."

Positive Example 1 - Input: "Sentence: Heat from the sun causes the most evaporation of water from a lake. Question: Which of these causes the MOST evaporation of water from a lake?" Output: "Heat from the Sun" Explanation: "The output is correct as … it as the 'Heat from the sun'"

Input: "Sentence: The lens of the eye is a(n) convex shape. Question: What shape is the lens of the eye?" Expected Output: "convex"

Preliminaries
Instruction Tuning (IT). Following previous studies (Wei et al., 2022; Sanh et al., 2022; Wang et al., 2022), each task t ∈ T consists of its natural language instruction I^t and a set of N input-output instances D^t = {(x^t_i, y^t_i)}_{i=1}^{N} ⊆ X^t × Y^t, which can be split into training D^t_train, validation D^t_dev, and test D^t_test sets. Each instance is filled into an instruction template so that different tasks can be transformed into a unified text-to-text format (Fig. 2). IT aims to learn a model f : I^t × X^t → Y^t that predicts the output y^t_i given the task instruction I^t and an input x^t_i. In general, the model is first trained on a mixture of tasks (T_seen) and then evaluated for its zero-shot generalization ability on held-out tasks (T_unseen), where T_seen ∩ T_unseen = ∅. The model is expected to learn to follow instructions via the training tasks and then solve new tasks with only the help of task instructions.
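As a concrete illustration, the text-to-text transformation above can be sketched as follows. This is a minimal sketch, not the benchmark's code: the exact template wording and the names `instruction`, `pos_examples`, and `task_input` are assumptions, though SuperNI-style prompts do concatenate the task definition, a few positive examples, and the instance input.

```python
def format_instance(instruction, pos_examples, task_input):
    """Fill one instance into a SuperNI-style text-to-text template.

    instruction  -- the task definition (a natural language string)
    pos_examples -- list of (input, output) pairs used as positive examples
    task_input   -- the input of the instance to be solved
    """
    parts = [f"Definition: {instruction}"]
    for i, (ex_in, ex_out) in enumerate(pos_examples, 1):
        parts.append(f"Positive Example {i}-\nInput: {ex_in}\nOutput: {ex_out}")
    # The model is asked to generate the text that follows "Output:".
    parts.append(f"Now complete the following example-\nInput: {task_input}\nOutput:")
    return "\n\n".join(parts)
```

Every task, whether generation or classification, is rendered this way, which is what allows a single text-to-text model to learn all of them.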

Continual Instruction Tuning Benchmark
In this section, we first formalize the CIT problem (Fig. 1). Then, we present the learning and evaluation protocols of our framework CITB. Lastly, we describe the data for creating the benchmark.

Continual Instruction Tuning
In contrast to static IT, which only learns a fixed set of tasks (T_seen), the model should be able to keep learning new tasks without catastrophically forgetting previously learned knowledge, and should facilitate knowledge transfer if possible. Let us expand the definition such that we have a set of T tasks T_seq = {t_1, ..., t_T} that arrive sequentially. Note that the tasks in the stream can be of any type and are not restricted to specific categories or domains. Similarly, each task t_j ∈ T_seq has a natural language instruction I^{t_j} and training D^{t_j}_train, validation D^{t_j}_dev, and test D^{t_j}_test sets. Like traditional CL, the goal of CIT is to learn a single model f from T_seq sequentially.
CIT vs. Traditional CL. While sharing similar desiderata with traditional CL, CIT differs in that: (1) it pays more attention to effectively leveraging the rich natural language instructions to prevent catastrophic forgetting and encourage knowledge transfer; (2) because of the multi-task nature of instructions, all tasks can be formatted in the unified text-to-text format, so CIT can learn any task or domain instead of a few specific ones; (3) after learning a few tasks, the model should have learned how to follow instructions to complete tasks; therefore, we expect fewer training instances to be required and higher knowledge transfer for future tasks.

Learning Protocol of CIT Benchmark
A non-instruction-tuned model (e.g., T5; Raffel et al. 2020) may struggle to understand instructions if trained on only one task at a time. Starting from such a model is also against our motivation of extending the ability of an already instruction-tuned model. Therefore, we separate the learning process into two stages.
Stage 1: Initial Multi-task Fine-tuning. To teach a model a better understanding of task instructions, we first fine-tune it on instruction data. Suppose we have another group of M tasks T_init, each equipped with natural language instructions, where T_init ∩ T_seq = ∅. We fine-tune a base pre-trained model on the training sets of the mixed M tasks (i.e., D^init_train = ∪_{i=1}^{M} D^{t_i}_train) to obtain an instruction-tuned model, denoted f_init. After training, most of the training data D^init_train is unavailable for subsequent sequential learning, but a memory M_init (|M_init| ≪ |D^init_train|) that stores a small portion of training instances is accessible. We use this model as the starting point for the subsequent learning.
Stage 2: Sequential Fine-tuning. To keep extending the knowledge of the instruction-tuned f_init, we fine-tune it on the training set D^{t_j}_train of each task t_j in the stream T_seq. Similarly, when learning task t_j, the training data of previous tasks in the stream (i.e., D^seq_train = ∪_{i=1}^{j-1} D^{t_i}_train) is unavailable, but a small memory M_seq can be used for training.
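The two-stage protocol can be sketched as the following skeleton. This is a hypothetical sketch, not the benchmark's implementation: `fine_tune(model, data)` stands in for an actual gradient-based trainer, tasks are represented as dicts with a `"train"` split, and `mem_size` is the per-task memory budget.

```python
import random

def continual_instruction_tuning(model, init_tasks, stream_tasks,
                                 fine_tune, mem_size=10, seed=0):
    """Sketch of the two-stage CIT learning protocol."""
    rng = random.Random(seed)

    # Stage 1: multi-task fine-tuning on the mixed initial tasks T_init.
    init_data = [ex for task in init_tasks for ex in task["train"]]
    model = fine_tune(model, init_data)
    # Afterwards only a small memory M_init of stage-1 data stays accessible.
    mem_init = rng.sample(init_data,
                          min(mem_size * len(init_tasks), len(init_data)))

    # Stage 2: sequential fine-tuning on the stream T_seq; past stream tasks
    # are only visible through the small memory M_seq.
    mem_seq = []
    for task in stream_tasks:
        model = fine_tune(model, task["train"] + mem_init + mem_seq)
        mem_seq += rng.sample(task["train"],
                              min(mem_size, len(task["train"])))
    return model
```

Methods that do not use replay simply pass `mem_size=0`; replay-based baselines correspond to a non-empty memory.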

Evaluation Protocol of CIT Benchmark
Evaluation Process. After learning each task t_j (1 ≤ j ≤ T) in stream T_seq, we consider three sets of data to measure the model's performance. (1) Similar to standard CL, we evaluate the model on the test sets of all previously learned tasks in the stream, the test set of the current task, and the test set of the next task, denoted as D^seq_test. This helps us measure whether the model forgets previous knowledge and whether it helps to learn future tasks. (2) We evaluate the model on the test sets of the M tasks used in stage 1 to teach the model how to follow instructions, denoted as D^init_test. This is where CIT differs from conventional CL: in CL, previous works only evaluate downstream tasks in the stream, but not the tasks from the pre-training phase, because such data is generally not accessible to end-users (Ke et al., 2022). (3) Since multi-task instruction-tuned models have shown strong zero-shot generalization to unseen tasks (Wang et al., 2022), our initial model trained in stage 1 might also generalize zero-shot to some unseen tasks T_unseen, where T_init, T_seq, and T_unseen are pairwise disjoint. Let D^unseen_test be the test sets of all tasks in T_unseen. We evaluate the model on D^unseen_test if it is available. To sum up, once a new task is learned, the model will be evaluated on D^seq_test, D^init_test, and D^unseen_test.
Evaluation Metrics. Due to the diversity of the tasks in CIT and the open-ended generation nature of the text-to-text format, we follow Wang et al. (2022) to use ROUGE-L (Lin, 2004) to measure the aggregated performance of each task. They have shown that ROUGE-L generally works well for both generation and classification tasks.
Following Lopez-Paz and Ranzato (2017) and Biesialska et al. (2020), we also use CL-related metrics to measure the learning procedure. Let a_{j,i} be the ROUGE-L score of the model on the test set of task t_i right after training on task t_j. We define the following. Average ROUGE-L (AR) measures the average performance of the model on all tasks after the final task t_T is learned: AR = (1/T) ∑_{i=1}^{T} a_{T,i}. We use Final ROUGE-L (FR) to measure the performance of the model on D^init_test and D^unseen_test, respectively, after the final task t_T is learned.
Forward Transfer (FWT) measures how much previously acquired knowledge helps to learn the new task; it also tests the model's zero-shot generalization to new tasks: FWT = (1/(T-1)) ∑_{j=2}^{T} a_{j-1,j}. Backward Transfer (BWT) measures the impact that continually learning subsequent tasks has on previous tasks: BWT = (1/(T-1)) ∑_{i=1}^{T-1} (a_{T,i} - a_{i,i}). Notably, positive BWT indicates that subsequent tasks can improve the performance on previous tasks, while a negative value implies knowledge forgetting.
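Given the score matrix a_{j,i}, the metrics can be computed as sketched below. Note that the FWT here averages the zero-shot score a_{j-1,j} on each upcoming task, matching the description above; Lopez-Paz and Ranzato (2017) additionally subtract a random-initialization baseline, which this sketch omits.

```python
import numpy as np

def cl_metrics(a):
    """CL metrics from a score matrix `a`, where a[j][i] is the ROUGE-L
    on task i's test set right after training on task j (0-indexed)."""
    a = np.asarray(a, dtype=float)
    T = a.shape[0]
    # AR: mean score over all tasks after the final task is learned.
    ar = a[T - 1].mean()
    # BWT: change on each earlier task between "just learned" and "final".
    bwt = np.mean([a[T - 1, i] - a[i, i] for i in range(T - 1)])
    # FWT: zero-shot score on each task just before it is trained on.
    fwt = np.mean([a[i - 1, i] for i in range(1, T)])
    return ar, bwt, fwt
```

For example, on a toy three-task matrix, a model whose final-row scores drop below the diagonal yields a negative BWT, signalling forgetting.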

Data Curation
In this work, we adopt the recently proposed SuperNI dataset (Wang et al., 2022) to create the benchmark. The training tasks that remain after forming the two task streams can be used for stage 1 initial multi-task fine-tuning ( §4.2). In summary, the number of initial fine-tuning tasks available is M = |T_init| = 718, and we use the official 119 test task sets as T_unseen to evaluate whether performance deteriorates on unseen tasks after learning new tasks. For all tasks, we fill instances into a natural language instruction template and transform them into a unified text-to-text format ( §3). Unless otherwise specified, we use the instruction template consisting of the task definition and two positive examples for all tasks, because it generally yields the best performance (Wang et al., 2022). See an example of natural language instructions in Fig. 2 and the selected tasks in Table 5. We study the effect of the instruction template in §6.3.

Experiments
Using our CITB benchmark, we conduct experiments on various popular CL methods of different kinds.We describe our experiment setups and compared methods in this section.

Setup
Model. We use the LM-adapted version of T5-small (Raffel et al., 2020), which is further trained with a language modeling objective. We initialize a T5 model from HuggingFace. Since it is costly to fine-tune on all 718 tasks, we randomly select 100 tasks from T_init and fine-tune T5 to obtain an instruction-tuned model f_init ( §4.2), which has learned to understand some instructions and can act as a good starting point for the subsequent learning. Note that the 100 randomly selected training tasks do not overlap with InstrDialog and InstrDialog++, but their task categories might overlap with the categories in InstrDialog++.
Train/Dev/Test Splits. Since the number of instances in each task is imbalanced and a large number of training instances does not help generalization in IT (Wang et al., 2022), we use a fixed size of 500/50/100 instances per task as the train/dev/test sets for the InstrDialog stream. For InstrDialog++, since it has a longer task sequence, we use 100/50/100 instances per task instead to save computational cost. For T_init, we use 100/50/100 instances per task, and 100 instances per task for T_unseen. We study the effect of different numbers of training instances in §6.3.

Baselines and Compared Methods
We implement commonly used CL methods from three categories (Biesialska et al., 2020) to benchmark CIT.
Regularization-based methods rely on a fixed model capacity with an additional loss term to consolidate previously gained knowledge while learning subsequent tasks. We use L2 regularization and EWC (Kirkpatrick et al., 2017), which uses a Fisher information matrix to reduce forgetting by regularizing the loss to penalize changes made to parameters important for previous tasks.
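As a rough sketch of the idea (not the benchmark's implementation), the EWC penalty adds a quadratic term weighted by a diagonal Fisher estimate; the flat parameter vectors and the strength `lam` are illustrative assumptions.

```python
import numpy as np

def diagonal_fisher(grads):
    """Estimate the diagonal Fisher information from per-example gradients
    of the log-likelihood (each row is one example's gradient)."""
    g = np.asarray(grads, dtype=float)
    return (g ** 2).mean(axis=0)

def ewc_penalty(theta, theta_star, fisher, lam=1.0):
    """EWC regularizer (Kirkpatrick et al., 2017).

    theta      -- current flat parameter vector
    theta_star -- parameters saved after the previous task
    fisher     -- diagonal Fisher estimate (per-parameter importance)
    lam        -- regularization strength (a hyperparameter)
    """
    return 0.5 * lam * np.sum(fisher * (theta - theta_star) ** 2)
```

During training, this penalty is added to the new task's loss, so parameters with high Fisher values (important for old tasks) are kept close to their previous values, while unimportant parameters remain free to adapt.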
Replay-based methods store a small subset of training instances from previous tasks in a memory. The data are replayed later to reduce forgetting. We adopt Replay, which saves random instances from each task in a memory and then jointly trains the model on the new task data and the old data in the memory, and AGEM (Chaudhry et al., 2019), which adds a constraint to prevent the parameter update from increasing the loss on previous tasks. The loss on previous tasks is calculated using the instances stored in the memory.
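The core of A-GEM can be sketched as a single gradient projection: if the proposed update conflicts with a reference gradient computed on memory instances, it is projected so that the memory loss does not increase. A minimal sketch with flat gradient vectors (an illustration, not the paper's code):

```python
import numpy as np

def agem_project(grad, ref_grad):
    """A-GEM gradient projection (Chaudhry et al., 2019).

    grad     -- gradient of the current task's loss
    ref_grad -- gradient of the average loss on the replay memory
    If the dot product is negative, the update would increase the memory
    loss, so `grad` is projected onto the half-space where it does not.
    """
    dot = np.dot(grad, ref_grad)
    if dot >= 0:
        return grad                      # no interference; keep as-is
    return grad - (dot / np.dot(ref_grad, ref_grad)) * ref_grad
```

The projected gradient always has a non-negative dot product with the reference gradient, which is why the number of stored instances matters less for A-GEM than for plain Replay: only the single averaged constraint changes.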
Architecture-based methods introduce task-specific parameters to the base model to prevent subsequent tasks from interfering with previously learned parameters. We adopt AdapterCL (Madotto et al., 2021), which freezes the pre-trained model and trains a residual Adapter (Houlsby et al., 2019) for each task independently.
Apart from the three categories of CL methods, we also implement instruction-based baselines, because the above CL methods were not designed for CIT. No prior work has tried to fine-tune an instruction-tuned model sequentially without any mechanism for preventing forgetting or encouraging knowledge transfer; such sequential fine-tuning is commonly considered a performance lower bound in the CL literature. To this end, we propose to continually fine-tune the initial instruction-tuned model ( §5.1) on subsequent tasks, named FT-init. As a direct comparison, we initialize a new T5 model, which is not tuned on any instruction data, and continually fine-tune it (FT-no-init). In addition, we report the performance of the initial instruction-tuned model (Init), which is the starting point before subsequent learning. Lastly, we jointly fine-tune a T5 model using all the data, including the training data used in stage 1 and the training data of all subsequent tasks in the stream (Multi). This is often regarded as the performance upper bound in CL and is affected by neither catastrophic forgetting nor knowledge transfer.

Implementation Details
For both the InstrDialog and InstrDialog++ streams, we conduct continual instruction tuning from the same initial instruction-tuned model for all methods except FT-no-init, AdapterCL, and Multi, for which we initialize a new T5 model. For AGEM and Replay, we experiment with memory sizes of 10 and 50, i.e., we set the memory size M_init to 10 and 50, the same as M_seq. For these two methods, we jointly train on the data in M_init, M_seq, and the new task data. For AdapterCL, we use a bottleneck size of 100. Due to limited computing resources, we randomly permute each task stream once and run all experiments using three random seeds. We refer to this permutation as task order 1. We study the effect of task orders in §6.3. The selected tasks for the InstrDialog and InstrDialog++ streams are listed in Table 5. The task orders are listed in Table 6. More details are in Appendix A.

Results and Analysis
In this section, we report the performance of various baselines discussed in §5.2 on our benchmark.

Results on InstrDialog Stream
Table 1 shows each method's overall performance and resource requirements after continually learning the InstrDialog stream. We have the following observations. First, all methods except AdapterCL improve AR compared to the zero-shot performance (22.5) of the starting-point model Init. This shows that CIT can extend a model's knowledge. In contrast, although AdapterCL is parameter-efficient and does not rely on memory, it performs even worse than Init. We conjecture that AdapterCL fails to learn instructions effectively because it is initialized from a non-instruction-tuned model (T5) and its few tunable parameters restrict it from learning complex instructions.
Second, among all baselines, Replay generally has the best performance. All methods except Replay (50) have negative BWT, meaning that they all suffer from catastrophic forgetting. Furthermore, forgetting on T_init and T_unseen is even worse, which demonstrates that the ability of the initial instruction-tuned model Init has deteriorated after learning new dialogue tasks. We also find that storing more examples in memory improves Replay but does not significantly help AGEM, possibly because the constraints added to the loss are roughly the same no matter how many instances are stored. Despite using additional parameters, regularization-based L2 and EWC perform similarly to the other baselines. Multi performs well overall, with the highest AR and improved FR on T_init; however, it also forgets tasks in T_unseen. Replay (50) has a higher FR on T_init than Multi because the 5,000 instances stored in M_init are jointly trained multiple times when learning subsequent tasks ( §5.3), leading to better data fitting.
Third, FT-init performs surprisingly well on all metrics and is competitive with L2, EWC, and AGEM. This finding contradicts the common belief in CL that simply fine-tuning a model sequentially leads to catastrophic forgetting because all parameters can be freely updated when learning new tasks (Kirkpatrick et al., 2017). Even FT-no-init, which is not tuned on any instruction data, shows increased AR after learning the 19 dialogue tasks. This raises a question: do various CL methods truly mitigate forgetting and promote knowledge transfer in CIT? We hypothesize that the rich natural language instructions lead to the remarkable performance of the baselines ( §6.3).

Results on InstrDialog++ Stream
The performance of all methods after learning the InstrDialog++ stream is shown in Table 2. We observe mostly the same findings as in §6.1, with the following exceptions. First, Init has a higher zero-shot performance on this long stream (30.5 vs. 22.5) than on InstrDialog, as in Table 1. We attribute this to the categories of the 100 selected training tasks ( §5.1) overlapping with the categories in the stream, which enables more knowledge transfer between tasks of the same category because of their similar natural language instructions. For example, both sets have tasks from sentiment analysis and toxic language detection. In contrast, Init did not learn dialogue tasks, thus showing lower generalization on the InstrDialog stream. Second, we see improved performance for almost all methods compared to Table 1, especially on T_init and T_unseen. For FT-init and FT-no-init, the improvements in FWT and BWT are particularly significant, reaching the best among all CL methods.
Combining the results on the two streams from Table 1 and 2, we find that catastrophic forgetting exists in CIT.However, learning a longer task stream and diverse tasks of different types leads to better knowledge transfer and lower forgetting.

Ablation Studies
In this section, we investigate why the instruction-based baselines (FT-init and FT-no-init) perform as well as or even better than conventional CL methods ( §6.1 & §6.2). We also explore different aspects that might affect CIT.
Rich instructions enable knowledge transfer and reduce forgetting in CIT. We use the same setup as in Table 2, except with different instruction templates. Results in Table 3 show that richer instruction templates lead to better performance (especially FWT and BWT). For the model that is not fine-tuned on any instruction data (FT-no-init), we find it performs worse than FT-init, showing the benefit of the initial multi-task training on instruction data. Similar observations hold on T_init and T_unseen in Table 4, where the model catastrophically forgets its initial abilities after learning a long stream of tasks. Providing descriptive task definitions significantly boosts task performance as well as facilitates knowledge transfer. Moreover, it also maintains the model's generalization ability on unseen tasks. Combining the results in Tables 1, 2, 3, and 4, we find that conventional CL methods do not fully leverage the instructions to reduce forgetting and facilitate knowledge transfer while learning continuously, because the naive FT-init and FT-no-init can achieve the same. This calls for novel CL methods designed for CIT.
Task types and learning order matter to CIT. To explore how task types and orders in a stream affect CIT, we randomly permute the InstrDialog stream to get two new task orders and conduct the same learning as in Table 1. We present the intermediate learning trends of all 19 tasks in Fig. 3, Fig. 4, and Fig. 5. One can see from the plots that all baselines are highly affected by task orders, fluctuating dramatically depending on which tasks are learned first. We argue that this is because task difficulties and similarities vary a lot. Learned tasks transfer knowledge through the instructions to new tasks of the same type, thereby facilitating their learning. For example, the last task in order 1 (Fig. 3) is a type of dialogue generation, which is the dominant task type in the stream (11/19, §4.4), so all baselines improve. However, all baselines fall below Multi after learning all 19 tasks, demonstrating that knowledge forgetting will eventually appear if the task stream is long enough.
A large number of training instances does not help knowledge transfer. We vary the number of instances per task used for learning the InstrDialog stream among {10, 20, 100, 200, 500}. As shown in Fig. 6 and Fig. 7, FWT and BWT gradually decrease as the number of training instances grows. This aligns with the findings of Wang et al. (2022) in standard IT that a large number of instances does not help generalization to unseen tasks; we find this also holds in CIT. Additionally, we find that the instruction-tuned model (FT-init) has better generalization to new tasks (FWT) than the model not fine-tuned on any instruction data (FT-no-init). This shows that, after learning a few tasks, the model has learned how to follow instructions to complete tasks, and thus fewer training instances are required for new tasks.

Conclusion
In this work, we establish a benchmark for continual instruction tuning, with two long task streams of 19 and 38 tasks to be learned sequentially. We implement and compare various continual learning methods of different types using the benchmark to study their effectiveness in this new setting. We conduct extensive ablation studies to analyze the shortcomings of current practices and propose future directions.

Limitations
We identify our limitations as follows. First, due to limited resources, all experiments in this work use T5-small (LM-adapted) as the backbone, which might not entirely reflect continual instruction tuning in general. As Wang et al. (2022) point out, there is a sizable gap between smaller models and 3B or 11B models in generalizing to new tasks. Second, when creating the two CIT task streams, we only use English tasks from the SuperNI dataset (Wang et al., 2022). In the future, the benchmark can be extended to multilingual task streams to study cross-language continual instruction tuning. Third, we follow the SuperNI dataset in using ROUGE-L as an aggregated metric to evaluate all tasks. Although it acts as a good proxy for the model's overall performance, it might not be an effective measurement for some specific tasks. Fourth, while we selected diverse tasks to form the InstrDialog and InstrDialog++ task streams, we did not analyse the characteristics of these tasks (Kim et al., 2023). In the future, we will consider selecting better source tasks and studying how source tasks affect CIT.

Figure 1 :
Figure 1: Illustration of the proposed continual instruction tuning (CIT). Unlike previous works, we evaluate the instruction-tuned model on the initial training, unseen, and newly learned tasks.

Figure 3 :
Figure 3: AR of each method during learning the InstrDialog stream (task order 1).

Figure 4 :
Figure 4: AR of each method during learning the InstrDialog stream (task order 2).

Figure 5 :
Figure 5: AR of each method during learning the InstrDialog stream (task order 3).
In CIT, it is more critical for the instruction-tuned model to maintain its existing abilities than to learn new ones, because it can solve multiple tasks by following instructions. Otherwise, if it forgets many tasks, there is no point in using such a model over a task-specific one. Therefore, it is essential to evaluate on D^init_test and D^unseen_test.

SuperNI consists of more than 1,600 NLP tasks, spanning a diverse variety of 76 broad task types, such as language generation, classification, question answering, and translation. Moreover, each task is equipped with an instruction and a set of instances, and all instances can be transformed into the text-to-text format. Therefore, the dataset is suitable for studying CIT. The official training set of SuperNI consists of 756 English tasks spanning 60 broad NLP categories, while 119 tasks from 12 categories are used for zero-shot evaluation. We keep the official 119 evaluation tasks untouched and create two CIT task streams from the 756 training tasks.

InstrDialog stream. To investigate how a model learns new dialogue data under the setup of CIT (cf. Madotto et al., 2021), we carefully curate all dialogue-related tasks from the training set of SuperNI to form this stream. Specifically, we use 4 tasks from dialogue state tracking, 11 tasks from dialogue generation, and 4 tasks from intent identification, resulting in a total of 19 dialogue tasks, i.e., |T_seq| = 19. We remove tasks that are excluded by the official task splits.

InstrDialog++ stream. Starting from the InstrDialog stream, we manually select another 19 tasks from the remaining training tasks, intentionally chosen from broad categories, including sentence ordering, style transfer, toxic language detection, and others. In total, we have 38 tasks from 18 categories (3 categories from InstrDialog and 15 categories from the new 19 tasks), i.e., |T_seq| = 38.

Table 1 :
Performance of different methods on the InstrDialog stream. Means and standard deviations are reported. † means zero-shot performance. "Mem." is the number of instances stored in memory for each task; T is the total number of tasks in the stream and M is the number of tasks used for initial training. "+P" is the percentage of additional parameters added for each task, measured against the total parameters of the base model; "Tun" is the portion of tunable parameters during training. "Time" is the average number of hours for each method to complete the task stream. Best numbers are in bold.

Table 2 :
Performance of different methods on the InstrDialog++ stream.† means zero-shot performance.

Table 4 :
Effect of instruction templates on T init and T unseen .

Table 6 :
Task orders for three runs of the InstrDialog Stream.