Exclusive Supermask Subnetwork Training for Continual Learning

Continual Learning (CL) methods focus on accumulating knowledge over time while avoiding catastrophic forgetting. Recently, Wortsman et al. (2020) proposed a CL method, SupSup, which uses a randomly initialized, fixed base network (model) and finds a supermask for each new task that selectively keeps or removes each weight to produce a subnetwork. They prevent forgetting because the network weights are never updated. Although there is no forgetting, the performance of SupSup is sub-optimal because fixed weights restrict its representational power. Furthermore, no knowledge is accumulated or transferred inside the model as new tasks are learned. Hence, we propose ExSSNeT (Exclusive Supermask SubNEtwork Training), which performs exclusive and non-overlapping subnetwork weight training. This avoids conflicting updates to the shared weights by subsequent tasks, improving performance while still preventing forgetting. Furthermore, we propose a novel KNN-based Knowledge Transfer (KKT) module that utilizes previously acquired knowledge to learn new tasks better and faster. We demonstrate that ExSSNeT outperforms strong previous methods in both the NLP and vision domains while preventing forgetting. Moreover, ExSSNeT is particularly advantageous for sparse masks that activate 2-10% of the model parameters, resulting in an average improvement of 8.3% over SupSup. Furthermore, ExSSNeT scales to a large number of tasks (100). Our code is available at https://github.com/prateeky2806/exessnet.


Introduction
Artificial intelligence aims to develop agents that can learn to accomplish a set of tasks. Continual Learning (CL) (Ring, 1998; Thrun, 1998) is crucial for this, but when a model is sequentially trained on different tasks with different data distributions, it can lose its ability to perform well on previous tasks, a phenomenon known as catastrophic forgetting (CF) (McCloskey and Cohen, 1989; Zhao and Schmidhuber, 1996; Thrun, 1998). This is caused by the lack of access to data from previous tasks, as well as conflicting updates to shared model parameters when sequentially learning multiple tasks, which is called parameter interference (McCloskey and Cohen, 1989).
Recently, some CL methods avoid parameter interference by taking inspiration from the Lottery Ticket Hypothesis (Frankle and Carbin, 2018) and Supermasks (Zhou et al., 2019) to exploit the expressive power of sparse subnetworks. Given that a network contains a combinatorial number of sparse subnetworks, Zhou et al. (2019) noted that even within randomly weighted neural networks, there exist certain subnetworks, known as supermasks, that achieve good performance. A supermask is a sparse binary mask that selectively keeps or removes each connection in a fixed and randomly initialized network to produce a subnetwork with good performance on a given task. We call this subnetwork a supermask subnetwork; it is shown in Figure 1, highlighted in red. Building upon this idea, Wortsman et al. (2020) proposed a CL method, SupSup, which initializes a network with fixed and random weights and then learns a different supermask for each new task. This allows them to prevent catastrophic forgetting (CF) as there is no parameter interference (because the model weights are fixed).
Although SupSup (Wortsman et al., 2020) prevents CF, there are some problems with using supermasks for CL: (1) The fixed random model weights in SupSup limit the supermask subnetwork's representational power, resulting in sub-optimal performance. (2) When learning a task, there is no mechanism for transferring learned knowledge from previous tasks to better learn the current task. Moreover, the model does not accumulate knowledge over time, as the weights are never updated.

Figure 1: EXSSNET diagram. We start with random weights W^(0). For task 1, we first learn a supermask M1 (the corresponding subnetwork is marked in red, column 2 row 1) and then train the weights corresponding to M1, resulting in weights W^(1) (bold red lines, column 1 row 2). For task 2, we learn the mask M2 over the fixed weights W^(1). If the mask M2 overlaps with M1 (marked by bold dashed green lines in column 3 row 1), then only the non-overlapping weights (solid green lines) of the task 2 subnetwork are updated (as shown by bold, solid green lines in column 3 row 2). These already trained weights (bold lines) are not updated by any subsequent task. Finally, for task 3, we learn the mask M3 (blue lines) and update the solid blue weights.
To overcome the aforementioned issues, we propose our method, EXSSNET (Exclusive Supermask SubNEtwork Training), pronounced 'excess-net', which first learns a mask for a task and then selectively trains a subset of weights from the supermask subnetwork. We train the weights of this subnetwork via exclusion, avoiding updates to parameters of the current subnetwork that have already been updated by any previous task. We illustrate EXSSNET in Figure 1; this exclusion also helps us prevent forgetting. Training the supermask subnetwork's weights increases its representational power and allows EXSSNET to encode task-specific knowledge inside the subnetwork (see Figure 2). This solves the first problem and allows EXSSNET to perform comparably to a fully trained network on individual tasks; when learning multiple tasks, the exclusive subnetwork training improves the performance of each task while still preventing forgetting (see Figure 3).
To address the second problem of knowledge transfer, we propose a k-nearest neighbors-based knowledge transfer (KKT) module that utilizes relevant information from previously learned tasks to improve performance on new tasks while learning them faster. Our KKT module uses KNN classification to select a subnetwork from the previously learned tasks that has better-than-random predictive power for the current task and uses it as a starting point for learning the new task.
Next, we show our method's advantage by experimenting with both natural language and vision tasks. For natural language, we evaluate on WebNLP classification tasks (de Masson d'Autume et al., 2019) and GLUE benchmark tasks (Wang et al., 2018), whereas for vision, we evaluate on the SplitMNIST (Zenke et al., 2017), SplitCIFAR100 (De Lange and Tuytelaars, 2021), and SplitTinyImageNet (Buzzega et al., 2020) datasets. We show that for both the language and vision domains, EXSSNET outperforms multiple strong and recent continual learning methods based on replay, regularization, distillation, and parameter isolation. For the vision domain, EXSSNET outperforms the strongest baseline by 4.8% and 1.4% on the SplitCIFAR and SplitTinyImageNet datasets respectively, surpassing the multitask model and bridging the gap to training individual models for each task. In addition, on the GLUE datasets, EXSSNET is 2% better than the strongest baseline methods and surpasses the performance of multitask learning that uses all the data at once. Moreover, EXSSNET obtains an average improvement of 8.3% over SupSup for sparse masks with 2-10% of the model parameters and scales to a large number of tasks (100). Furthermore, EXSSNET with the KKT module learns new tasks in as few as 30 epochs compared to 100 epochs without it, while achieving 3.2% higher accuracy on the SplitCIFAR100 dataset. In summary, our contributions are listed below:

• We propose a simple and novel method to improve mask learning by combining it with exclusive subnetwork weight training to improve CL performance while preventing CF.

• We propose a KNN-based knowledge transfer (KKT) module that dynamically identifies previous tasks to transfer knowledge from, to learn new tasks better and faster.

• Extensive experiments on NLP and vision tasks show that EXSSNET outperforms strong baselines and is comparable to the multitask model for NLP tasks while surpassing it for vision tasks. Moreover, EXSSNET works well for sparse masks and scales to a large number of tasks.

Motivation
Using sparsity for CL is an effective technique for learning multiple tasks, i.e., by encoding them in different subnetworks inside a single model. SupSup (Wortsman et al., 2020) is an instantiation of this idea that initializes the network weights randomly and then learns a separate supermask for each task (see Figure 7). It prevents CF because the weights of the network are fixed and never updated. However, this gives rise to crucial problems, as discussed below.
Problem 1 - Sub-Optimal Performance of Supermasks: Although the fixed network weights in SupSup prevent CF, they also restrict the representational capacity, leading to worse performance compared to a fully trained network. In Figure 2, we report the test accuracy with respect to the fraction of network parameters selected by the mask, i.e., the mask density, for an underlying ResNet18 model on a single 100-way classification task on the CIFAR100 dataset. In Figure 3, we report the average test accuracy versus the fraction of overlapping parameters between the masks of different tasks, i.e., the sparse overlap (see Equation 2), for five different 20-way classification tasks from the SplitCIFAR100 dataset with a ResNet18 model. We observe that SSNET (a naive variant that trains all the weights of each task's supermask subnetwork) outperforms SupSup for lower sparse overlap, but as the sparse overlap increases, the performance declines because the supermask subnetworks for different tasks share more overlapping (common) weights (bold dashed lines in Figure 1). This leads to higher parameter interference, resulting in increased forgetting, which suppresses the gains from subnetwork weight training.
Our final proposal, EXSSNET, resolves both of these problems by selectively training a subset of the weights in the supermask subnetwork to prevent parameter interference. When learning multiple tasks, this prevents CF, resulting in strictly better performance than SupSup (Figure 3) while having the representational power to bridge the gap with fully trained models (Figure 2).

Method
As shown in Figure 1, when learning a new task t_i, EXSSNET follows three steps: (1) we learn a supermask M_i for the task; (2) we use all the previous tasks' masks M_1, ..., M_{i-1} to create a free parameter mask M_i^{free} that contains the parameters selected by the mask M_i that were not selected by any of the previous masks; (3) we update the weights corresponding to the mask M_i^{free}, as this avoids parameter interference. Next, we formally describe all the steps of our method EXSSNET (Exclusive Supermask SubNEtwork Training) for a multi-layer perceptron (MLP).
Notation: During training, we can treat each layer l of an MLP network separately. An intermediate layer l has n_l nodes denoted by V^{(l)} = {v_1, ..., v_{n_l}}. For a node v in layer l, let I_v denote its input and Z_v = σ(I_v) denote its output, where σ(.) is the activation function. Given this notation, I_v can be written as $I_v = \sum_{u \in V^{(l-1)}} w_{uv} Z_u$, where w_{uv} is the network weight connecting node u to node v. The complete network weights for the MLP are denoted by W. When training task t_i, we have access to the supermasks from all previous tasks $\{M_j\}_{j=1}^{i-1}$ and the model weights W^{(i-1)} obtained after learning task t_{i-1}.
Mask Learning: Following Wortsman et al. (2020), we use the algorithm of Ramanujan et al. (2019) to learn a supermask M_i for the current task t_i. The supermask M_i is learned with respect to the underlying model weights W^{(i-1)}, and the mask selects a fraction of weights that lead to good performance on the task without training the weights. To achieve this, we learn a score s_{uv} for each weight w_{uv}; once trained, these scores are thresholded to obtain the mask. Here, the input to a node v is $I_v = \sum_{u \in V^{(l-1)}} w_{uv} Z_u m_{uv}$, where $m_{uv} = h(s_{uv})$ is the binary mask value and h(.) is a function which outputs 1 for the top-k% of the scores in the layer, with k being the mask density.
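The top-k% thresholding h(.) can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's implementation, and the function name is ours:

```python
import numpy as np

def supermask_from_scores(scores: np.ndarray, density: float) -> np.ndarray:
    """Binary mask keeping the top-k% of scores in a layer, mimicking h(.)."""
    k = max(1, int(round(density * scores.size)))
    flat = scores.ravel()
    # indices of the k largest scores in this layer
    top_idx = np.argpartition(flat, -k)[-k:]
    mask = np.zeros_like(flat)
    mask[top_idx] = 1.0
    return mask.reshape(scores.shape)

scores = np.array([[0.9, 0.1],
                   [0.4, 0.8]])
mask = supermask_from_scores(scores, density=0.5)  # keeps the two highest scores
```

Because h(.) is applied per layer, the mask density is enforced layer-wise rather than globally.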
Next, we use a straight-through gradient estimator (Bengio et al., 2013) and iterate over the current task's data samples to update the scores for the corresponding supermask M_i as $\tilde{s}_{uv} = s_{uv} - \alpha \frac{\partial \mathcal{L}}{\partial I_v} w_{uv} Z_u$, where $\mathcal{L}$ is the loss and $\alpha$ is the learning rate.

Finding Exclusive Mask Parameters: Given a learned mask M_i, we use all the previous tasks' masks M_1, ..., M_{i-1} to create a free parameter mask M_i^{free} that contains the parameters selected by M_i that were not selected by any of the previous masks. We do this by (1) creating a mask $M_{1:i-1} = \vee_{j=1}^{i-1} M_j$ containing all the parameters already updated by any previous task, i.e., the union of all previous masks via a logical or operation; and (2) obtaining the free mask $M_i^{free} = M_i \wedge \neg M_{1:i-1}$, i.e., intersecting the current task mask M_i with the set of parameters not used by any previous task (the negation of M_{1:i-1}) via a logical and operation. We then use this mask M_i^{free} for exclusive supermask subnetwork weight training.
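The two logical-mask steps above can be sketched directly with boolean arrays; a minimal NumPy version (the function name is ours):

```python
import numpy as np

def free_mask(current_mask: np.ndarray, previous_masks: list) -> np.ndarray:
    """M_i^free = M_i AND NOT (M_1 OR ... OR M_{i-1})."""
    current = current_mask.astype(bool)
    if not previous_masks:
        return current.copy()
    used = np.zeros_like(current)
    for m in previous_masks:
        used |= m.astype(bool)          # union of all previous masks
    return current & ~used              # keep only never-trained parameters

m1 = np.array([1, 1, 0, 0], dtype=bool)
m2 = np.array([0, 1, 1, 0], dtype=bool)
m3 = np.array([1, 0, 1, 1], dtype=bool)
free = free_mask(m3, [m1, m2])  # only the last parameter is untouched by tasks 1-2
```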

Exclusive Supermask Subnetwork Weight Training: For training the subnetwork parameters for task t_i given the free parameter mask M_i^{free}, we perform the forward pass on the model as $\text{model}(x, W^{(i-1)} \odot \widehat{M}_i)$, where $\widehat{M}_i = M_i^{free} + \text{stopgrad}(M_i \wedge \neg M_i^{free})$ and ⊙ is the elementwise multiplication. Hence, $\widehat{M}_i$ allows us to use all the connections in M_i during the forward pass of training, but during the backward pass only the parameters in M_i^{free} are updated, because the gradient value is 0 for all weights w_{uv} with $m^{free}_{uv} = 0$. During inference on task t_i we use the mask M_i. In contrast, SSNET uses the task mask M_i both during training and inference, as model(x, W^{(i-1)} ⊙ M_i). This updates all the parameters in the mask, including parameters already updated by previous tasks, which results in CF. Therefore, in cases where the sparse overlap is high, EXSSNET is preferred over SSNET. To summarize, EXSSNET circumvents the CF issue of SSNET while benefiting from subnetwork training to improve overall performance, as shown in Figure 3.
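The effect of this forward/backward asymmetry can be illustrated with a single linear layer and a manually derived gradient under squared-error loss; in practice this is implemented via automatic differentiation with a stop-gradient, so the sketch below (names ours) is only illustrative:

```python
import numpy as np

def exclusive_update(W, x, y, task_mask, free_mask, lr=0.1):
    """One exclusive training step for a linear layer.

    The forward pass uses every weight selected by task_mask (M_i), but the
    gradient update is restricted to free_mask (M_i^free), so weights already
    trained by earlier tasks are never modified.
    """
    pred = (W * task_mask) @ x                  # forward with the full task mask
    grad = np.outer(pred - y, x) * task_mask    # dL/dW for L = 0.5 * ||pred - y||^2
    return W - lr * grad * free_mask            # update only the exclusive weights

W = np.ones((2, 2))
task_mask = np.ones((2, 2))                      # M_i selects every weight here
free_mask = np.array([[1.0, 0.0], [0.0, 0.0]])   # only w_00 is exclusive to this task
x = np.array([1.0, 1.0])
y = np.array([0.0, 0.0])
W_new = exclusive_update(W, x, y, task_mask, free_mask)
```

After one step, only `W_new[0, 0]` changes; the three weights outside M_i^{free} stay at their previous values, which is exactly what prevents parameter interference.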

KKT: KNN-Based Knowledge Transfer
When learning multiple tasks, it is desirable to transfer information learned by previous tasks in order to achieve better performance on new tasks and to learn them faster (Biesialska et al., 2020). Hence, we propose a K-Nearest Neighbours (KNN) based knowledge transfer (KKT) module that uses KNN classification to dynamically find the most relevant previous task (Veniat et al., 2021) to initialize the supermask for the current task. More specifically, before learning the mask M_i for the current task t_i, we randomly sample a small fraction of data from task t_i and split it into a train and a test set. Next, we use the trained subnetworks of each previous task t_1, ..., t_{i-1} to obtain features on this sampled data. Then we learn i-1 independent KNN classification models using these features and evaluate these i-1 models on the sampled test set to obtain accuracy scores that denote the predictive power of the features from each previous task for the current task. Finally, we select the previous task with the highest accuracy on the current task. If this accuracy is better than random, we use its mask to initialize the current task's supermask. This enables EXSSNET to transfer information from previous tasks to learn new tasks better and faster. We note that the KKT module is not limited to SupSup and can be applied to a broader category of CL methods that introduce additional parameters for new tasks.
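The selection procedure above can be sketched with a small Euclidean KNN in plain NumPy. This is a simplified illustration, not the paper's implementation: `feature_fns` stands in for the previous tasks' subnetworks used as feature extractors, and all names are ours:

```python
import numpy as np

def knn_accuracy(train_feats, train_y, test_feats, test_y, k=3):
    """Majority-vote KNN classification accuracy with Euclidean distance."""
    correct = 0
    for f, label in zip(test_feats, test_y):
        d = np.linalg.norm(train_feats - f, axis=1)
        votes = train_y[np.argsort(d)[:k]]          # labels of k nearest neighbors
        pred = np.bincount(votes).argmax()
        correct += int(pred == label)
    return correct / len(test_y)

def select_source_task(feature_fns, train_x, train_y, test_x, test_y, n_classes):
    """Pick the previous task whose subnetwork features best predict the
    current task; return its index, or None if no task beats chance."""
    chance = 1.0 / n_classes
    accs = [knn_accuracy(fn(train_x), train_y, fn(test_x), test_y)
            for fn in feature_fns]
    best = int(np.argmax(accs))
    return best if accs[best] > chance else None

# Toy example: task 0's features (identity) are informative, task 1's are not.
feature_fns = [lambda x: x, lambda x: np.zeros_like(x)]
train_x = np.array([[0.0], [0.1], [1.0], [1.1]])
train_y = np.array([0, 0, 1, 1])
test_x = np.array([[0.05], [1.05]])
test_y = np.array([0, 1])
best = select_source_task(feature_fns, train_x, train_y, test_x, test_y, n_classes=2)
```

The better-than-chance check is what prevents initializing from an unrelated task, which would otherwise hurt more than starting from scratch.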
Metrics: We follow Chaudhry et al. (2018) and evaluate our model after learning each task t on all the tasks, denoted by T. This gives us an accuracy matrix $A \in \mathbb{R}^{N \times N}$, where $a_{i,j}$ represents the classification accuracy on task j after learning task i. We want the model to perform well on all the tasks it has learned. This is measured by the average accuracy, $A(T) = \frac{1}{N} \sum_{k=1}^{N} a_{N,k}$, where N is the number of tasks. Next, we want the model to retain performance on previous tasks when learning multiple tasks. This is measured by the forgetting metric (Lopez-Paz and Ranzato, 2017), $F(T) = \frac{1}{N-1} \sum_{t=1}^{N-1} \big( \max_{k \in \{1,\dots,N-1\}} a_{k,t} - a_{N,t} \big)$, i.e., the average difference between the maximum accuracy obtained for task t and its final accuracy. Higher accuracy and lower forgetting are desired.
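Both metrics are simple reductions over the accuracy matrix A; a minimal sketch (0-indexed, function names ours):

```python
import numpy as np

def average_accuracy(A: np.ndarray) -> float:
    """Mean of the last row: accuracy on every task after learning the last one."""
    N = A.shape[0]
    return float(A[N - 1].mean())

def forgetting(A: np.ndarray) -> float:
    """Mean drop from each task's best accuracy to its final accuracy,
    averaged over the first N-1 tasks."""
    N = A.shape[0]
    return float(np.mean([A[:N - 1, t].max() - A[N - 1, t] for t in range(N - 1)]))

# Two tasks: task 0 reaches 0.9, then drops to 0.7 after learning task 1.
A = np.array([[0.9, 0.0],
              [0.7, 0.8]])
avg = average_accuracy(A)   # (0.7 + 0.8) / 2
forg = forgetting(A)        # 0.9 - 0.7
```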
Sparse Overlap to Quantify Parameter Interference: Next, we propose sparse overlap, a measure quantifying parameter interference for a task i, i.e., the fraction of the parameters in mask M_i that are already updated by some previous task. For a formal definition, refer to Appendix A.1.

Previous Methods and Baselines: For both vision and language (VL) tasks, we compare with: (VL.1) Naive Training (Yogatama et al., 2019), where all model parameters are sequentially trained/finetuned for each task; (VL.2) Experience Replay (ER) (de Masson d'Autume et al., 2019), where we replay previous tasks' examples when training new tasks; (VL.3) Multitask Learning (Crawshaw, 2020), where all the tasks are used jointly to train the model; (VL.4) Individual Models, where we train a separate model for each task, considered an upper bound for CL; and (VL.5) SupSup (Wortsman et al., 2020). For natural language (L), we further compare with: (L.6) Regularization (Huang et al., 2021), where, along with the Replay method, we regularize the hidden states of the BERT classifier with an L2 loss term; and three AdapterBERT (Houlsby et al., 2019) based methods. For vision (V), we further compare with two regularization methods, (V.6) Online EWC (Schwarz et al., 2018) and (V.7) Synaptic Intelligence (SI) (Zenke et al., 2017); one knowledge distillation method, (V.8) Learning without Forgetting (LwF) (Li and Hoiem, 2017); three additional experience replay methods, (V.9) AGEM (Chaudhry et al., 2018), (V.10) Dark Experience Replay (DER) (Buzzega et al., 2020), and (V.11) DER++ (Buzzega et al., 2020); and a parameter isolation method, (V.12) CGATE (Abati et al., 2020).

The average sparse overlap of EXSSNET is 19.4% across all three datasets, implying that there is a lot more capacity in the model. See appendix Table 11 for the sparse overlap of other methods and Appendix A.4.1 for results of the best-performing methods on the ImageNet dataset. Note that past methods require tricks like local adaptation in MBPA++ and experience replay in AGEM, DER, LAMOL, and ER. In contrast, EXSSNET is simple and does not require replay.
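As an informal illustration of the sparse overlap measure (the formal definition is in Appendix A.1), the fraction of a task's mask already claimed by earlier tasks can be computed as follows; the function name is ours:

```python
import numpy as np

def sparse_overlap(current_mask: np.ndarray, previous_masks: list) -> float:
    """Fraction of parameters in M_i already updated by some previous task."""
    current = current_mask.astype(bool)
    if not previous_masks or current.sum() == 0:
        return 0.0
    used = np.zeros_like(current)
    for m in previous_masks:
        used |= m.astype(bool)              # parameters trained by earlier tasks
    return float((current & used).sum() / current.sum())

m1 = np.array([1, 1, 0, 0], dtype=bool)
m2 = np.array([0, 1, 1, 0], dtype=bool)
overlap = sparse_overlap(m2, [m1])  # one of m2's two parameters was already used
```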

Q2. Can the KKT Knowledge Transfer Module Share Knowledge Effectively?

In Table 3, we show that adding the KKT module to EXSSNET, SSNET, and SupSup improves performance on the vision benchmarks. The experimental setting here is similar to Table 2. Across all methods and datasets, we observe that the KKT module improves average test accuracy. Specifically, for the SplitCIFAR100 dataset, the KKT module results in 5.0% and 3.2% improvements for SupSup and EXSSNET respectively, while for SplitTinyImageNet, EXSSNET + KKT outperforms the individual models. We observe a performance decline for SSNET when using KKT, because KKT promotes sharing of parameters across tasks, which can lead to worse performance for SSNET. Furthermore, EXSSNET + KKT outperforms all other methods on both the SplitCIFAR100 and SplitTinyImageNet datasets. For EXSSNET + KKT, the average sparse overlap is 49.6% across all three datasets (see appendix Table 11). These results suggest that combining weight training with the KKT module leads to further improvements.

Q3. Can the KKT Knowledge Transfer Module Improve the Learning Speed of Subsequent Tasks?

Next, we show that the KKT module enables us to learn new tasks faster. To demonstrate this, in Figure 4 we plot the running mean of the validation accuracy versus epochs for different tasks from the SplitCIFAR100 experiment in Table 3. We show curves for EXSSNET with and without the KKT module and omit the first task, as both methods are identical for Task 1 because there is no previous task to transfer knowledge from. For all subsequent tasks (Tasks 2-5), we observe that: (1) EXSSNET + KKT starts off with a much better initial performance compared to EXSSNET; (2) given a fixed number of training epochs, EXSSNET + KKT always learns the task better, achieving higher accuracy at all epochs; and (3) EXSSNET + KKT can achieve performance similar to EXSSNET in far fewer epochs, as shown by the green horizontal arrows. This clearly illustrates that the KKT knowledge-transfer module not only helps learn tasks better (see Table 3) but also faster. For an efficiency and robustness analysis of the KKT module, please refer to Appendix A.4.2.

Additional Results and Analysis

Q4. Effect of Mask Density on Performance

Next, we show the advantage of using EXSSNET when the mask density is low. In Figure 5, we show the average accuracy on the SplitCIFAR100 dataset as a function of mask density. We observe that EXSSNET obtains 7.9%, 18.4%, 8.4%, and 4.7% improvements over SupSup for mask density values of 0.02, 0.04, 0.06, and 0.08 respectively. This is an appealing property, as tasks select fewer parameters, which inherently reduces sparse overlap, allowing EXSSNET to learn a large number of tasks.

Q5. Can EXSSNET Learn a Large Number of Tasks?

SupSup showed that it can scale to a large number of tasks. We perform experiments learning 100 tasks created by splitting the TinyImageNet dataset. In Table 4, we show that this property is preserved by EXSSNET while resulting in a performance improvement over SupSup. We note that as the number of tasks increases, the sparse overlap between the masks also increases, leaving fewer trainable model weights. In the extreme case where there are no free weights, EXSSNET by design reduces to SupSup because there is no weight training. Moreover, with larger models there are more free parameters, leading to even greater improvement over SupSup.

Q6. Effect of Token Embedding Initialization for NLP

For our language experiments, we use a pretrained BERT model (Devlin et al., 2019) to obtain the initial token representations. We perform ablations on the token embedding initialization to understand its impact on CL methods. In Table 5, we present results on the S2 task-order sequence of the sampled version of the WebNLP dataset (see Section 4.1, Datasets). We initialize the token representations using FastText (Bojanowski et al., 2016), GloVe (Pennington et al., 2014), and BERT embeddings. From Table 5, we observe that: (1) the performance gap between EXSSNET and SupSup increases from 0.8% → 7.3% and 0.8% → 8.5% when moving from BERT to GloVe and FastText initializations respectively, implying that it is even more beneficial to use EXSSNET in the absence of good initial representations; and (2) the performance trend EXSSNET > SSNET > SupSup is consistent across initializations.

Related Work
Regularization-based methods estimate the importance of model components and add importance regularization terms to the loss function. Zenke et al. (2017) regularize based on the distance of weights from their initialization, whereas Kirkpatrick et al. (2017b) and Schwarz et al. (2018) use an approximation of the Fisher information matrix (Pascanu and Bengio, 2013) to regularize the parameters. In NLP, Han et al. (2020) and Wang et al. (2019) use regularization to constrain the relevant information from the huge amount of knowledge inside large language models (LLMs). Huang et al. (2021) first identify hidden spaces that need to be updated versus retained via information disentanglement (Fu et al., 2017; Li et al., 2020) and then regularize these hidden spaces separately.
Replay-based methods maintain a small memory buffer of data samples (De Lange et al., 2019; Yan et al., 2022) or their relevant proxies (Rebuffi et al., 2017) from previous tasks and retrain on them later to prevent CF. Chaudhry et al. (2018) use the buffer during optimization to constrain parameter gradients. Shin et al. (2017) and Kemker and Kanan (2018) perform generative replay, while Sun et al. (2019) train a language model to generate pseudo-samples for replay.
Architecture-based methods can be divided into two categories: (1) methods that add new modules over time (Li et al., 2019; Veniat et al., 2021; Douillard et al., 2022); and (2) methods that isolate the network's parameters for different tasks (Kirkpatrick et al., 2017a; Fernando et al., 2017; Mallya and Lazebnik, 2018). Rusu et al. (2016) introduce a new network for each task, while Schwarz et al. (2018) distill the new network after each task into the original one. Recent prompt-learning-based CL models for vision (Wang et al., 2022a,b) assume access to a pre-trained model to learn a set of prompts that can potentially be shared across tasks; this is orthogonal to our method, which trains from scratch. Mallya and Lazebnik (2018) allocate parameters to specific tasks and then train them in isolation, which limits the number of tasks that can be learned. In contrast, Mallya et al. (2018) use a frozen pretrained model and learn a new mask for each task, but a pretrained model is crucial for their method's good performance. Wortsman et al. (2020) remove the pretrained-model dependence and learn a mask for each task over a fixed, randomly initialized network. EXSSNET avoids the shortcomings of Mallya and Lazebnik (2018) and Mallya et al. (2018) and performs supermask subnetwork training to increase the representational capacity compared to Wortsman et al. (2020), while performing knowledge transfer and avoiding CF.

Conclusion
We introduced a novel continual learning method, EXSSNET (Exclusive Supermask SubNetwork Training), that delivers enhanced performance through exclusive, non-overlapping subnetwork weight training, overcoming the representational limitations of the prior SupSup method. By avoiding conflicting weight updates, EXSSNET improves performance while still preventing forgetting. Moreover, the KNN-based Knowledge Transfer (KKT) module utilizes previously acquired knowledge to expedite and enhance the learning of new tasks. The efficacy of EXSSNET is substantiated by its superior performance in both the NLP and vision domains, its particular advantage for sparse masks, and its scalability up to a hundred tasks.

Limitations
Firstly, we note that as the density of the mask increases, the performance improvement over SupSup begins to decrease. This is because denser subnetworks result in higher levels of sparse overlap, leaving fewer free parameters for new tasks to update. However, even with higher mask densities, all model weights are still trained by some task, improving performance on those tasks and making our proposed method's performance an upper bound on that of SupSup. Additionally, the model size and capacity can be increased to counterbalance the effect of higher mask density. Moreover, a sparse mask is generally preferred for most applications due to its efficiency.
Secondly, we have focused on the task incremental setting of continual learning for two main reasons: (1) in the domain of natural language processing, task identities are typically easy to obtain, and popular methods such as prompting and adaptors assume access to task identities.(2) the primary focus of our work is to improve the performance of supermasks for continual learning and to develop a more effective mechanism for reusing learned knowledge, which is orthogonal to the question of whether task identities are provided during test time.
Moreover, it is worth noting that, similar to the SupSup method, our proposed method can also be extended to situations where task identities are not provided during inference.The SupSup paper presents a method for doing this by minimizing entropy to select the best mask during inference, and this can also be directly applied to our proposed method, ExSSNeT, in situations where task identities are not provided during inference.This is orthogonal to the main questions of our study, however, we perform some experiments on Class Incremental Learning in the appendix A.4.3.

A.3 Experimental setup and hyperparameters
Unless otherwise specified, we obtain supermasks with a mask density of 0.1. In our CNN models, we use non-affine batch normalization to avoid storing the mean and variance parameters for all tasks (Wortsman et al., 2020). Following Wortsman et al. (2020), bias terms in our model are set to zero, and we randomly initialize the model parameters using the signed Kaiming constant (Ramanujan et al., 2019). We use the Adam optimizer (Kingma and Ba, 2014) along with cosine decay (Loshchilov and Hutter, 2016) and conduct our experiments on GPUs with 12GB of memory, using approximately 6 days of GPU runtime. For our main experiments, we perform three independent runs of each experiment and report the averages for all metrics. For natural language tasks, unless specified otherwise, we initialize the token embeddings for our methods using a frozen BERT-base-uncased (Devlin et al., 2018) model's representations via Huggingface (Wolf et al., 2020). We use a static CNN model from Kim (2014) as our text classifier over the BERT representations. The model employs 1D convolutions along with Tanh activation, and the total number of model parameters is ∼110M. Following Sun et al. (2019) and Huang et al. (2021), we evaluate our model on the various task sequences provided in Appendix Table 6, while limiting the maximum number of tokens to 256. Following Wortsman et al. (2020), we use LeNet (Lecun et al., 1998) for the SplitMNIST dataset, a ResNet-18 model with fewer channels (Wortsman et al., 2020) for the SplitCIFAR100 dataset, and a ResNet50 model (He et al., 2016) for the TinyImageNet dataset. Unless specified, we randomly split all vision datasets to obtain five tasks with disjoint classes. We use the codebase of DER (Buzzega et al., 2020) to obtain the vision baselines. In all our experiments, all methods perform an equal number of epochs over the datasets. We use the hyperparameters from Wortsman et al. (2020) for our vision experiments.
For the ablation experiment on natural language data, following Huang et al. (2021), we use a sampled version of the WebNLP datasets due to limited resources. The reduced dataset contains 2000 training and validation examples from each output class; the test set is the same as in the main experiments. The dataset statistics are summarized in Table 7. For the WebNLP datasets, we tune the learning rate on the validation set across the values {0.01, 0.001, 0.0001}; for the GLUE datasets, we use the default learning rate of the BERT model. For our vision experiments, we use the default learning rate for each dataset provided in the original implementations. For the TinyImageNet, SplitCIFAR100, and SplitMNIST datasets, we run for 30, 100, and 30 epochs respectively. We store 0.1% of our vision datasets for replay, while for our language experiments we use 0.01% of the data because of the large number of datasets available for them.

Table 11: We report the average sparse overlap for all method and dataset combinations reported in Table 3.

A.4.1 Results on Imagenet Dataset
In this experiment, we take the ImageNet dataset (Deng et al., 2009) with 1000 classes and divide it into 10 tasks, where each task is a 100-way classification problem. In Table 8, we report the results for EXSSNET and the strongest vision baseline method, SupSup. We omit other methods due to resource constraints. We observe a strong improvement of 6.7% for EXSSNET over SupSup, indicating that the improvements of our method hold for large-scale datasets as well.
A.4.2 Efficiency and Robustness of the KKT Module

First, regarding efficiency, the total runtime with the KKT module is 173 minutes, which is a very small difference. Second, there are two main hyperparameters in the KKT module: (1) k, for taking the majority vote of the top-k neighbors, and (2) the total number of batches used from the current task in this learning and prediction process. We present additional results on the SplitCIFAR100 dataset when changing these hyperparameters one at a time.
In Table 9, we use 10 batches for KKT with a batch size of 64, resulting in 640 samples from the current task used for estimation.We report the performance of EXSSNET when varying k.From this table, we observe that the performance increases with k and then starts to decrease but in general most values of k work well.
Next, in Table 10, we use a fixed k=10, vary the number of batches used for KKT with a batch size of 64, and report the performance of ExSSNeT. We observe that as the number of batches used for finding the best mask increases, prediction accuracy increases because of better mask selection. Moreover, as few as 5-10 batches work reasonably well in terms of average accuracy.
From both of these experiments, we observe that the KKT module is fairly robust to different values of these hyperparameters, but carefully selecting them can lead to slight improvements.
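As a concrete illustration of the majority-vote step in the KKT module, the sketch below shows how a KNN vote over stored features could pick the source task whose mask is reused for a new task. This is a minimal sketch under our own naming assumptions (`select_source_mask_knn`, the feature arrays), not the paper's released implementation.

```python
import numpy as np

def select_source_mask_knn(new_feats, stored_feats, stored_task_ids, k=10):
    """Pick a source task for a new task via KNN majority voting.

    new_feats:       (n, d) features from a few batches of the new task
    stored_feats:    (m, d) features stored from previously learned tasks
    stored_task_ids: (m,)   task id of each stored feature
    Returns the id of the previous task winning the overall vote.
    """
    votes = []
    for x in new_feats:
        # Euclidean distance from this new example to every stored example.
        dists = np.linalg.norm(stored_feats - x, axis=1)
        nearest = stored_task_ids[np.argsort(dists)[:k]]
        # Majority vote among the k nearest neighbours for this example.
        vals, counts = np.unique(nearest, return_counts=True)
        votes.append(vals[np.argmax(counts)])
    # Aggregate per-example votes into a single source-task decision.
    vals, counts = np.unique(votes, return_counts=True)
    return int(vals[np.argmax(counts)])
```

The winning task's mask would then initialize the new task's supermask, matching the intuition that similar tasks benefit from similar subnetworks.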

A.4.3 Class Incremental Learning
We performed Class Incremental Learning (CIL) experiments on the TinyImageNet dataset (10 tasks, 20 classes each) and used the One-Shot algorithm from SupSup (Wortsman et al., 2020) to select the mask for inference. Please refer to Section 3.3 and Equation 4 of the SupSup paper (Wortsman et al., 2020) for details. From Table 12, we observe that ExSSNeT outperforms all baseline methods that do not use Experience Replay by at least 2.75%. Moreover, ExSSNeT outperforms most ER-based methods, despite their need for a replay buffer, and is comparable to DER.
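To make the mask-selection idea concrete, the sketch below evaluates each task's supermask on a batch and picks the one with the lowest output entropy. Note this is a simplified stand-in: SupSup's actual One-Shot algorithm takes a single entropy gradient step on mixing coefficients of superposed masks (their Equation 4). `logits_fn` is an assumed helper that runs the fixed-weight network under a given mask.

```python
import numpy as np

def one_shot_task_inference(logits_fn, masks, x):
    """Return the index of the mask whose output distribution on batch x
    has the lowest mean entropy (i.e., the most confident prediction)."""
    best, best_ent = None, float("inf")
    for i, m in enumerate(masks):
        z = logits_fn(m, x)
        z = z - z.max(axis=-1, keepdims=True)            # numerically stable softmax
        p = np.exp(z) / np.exp(z).sum(axis=-1, keepdims=True)
        ent = -(p * np.log(p + 1e-12)).sum(axis=-1).mean()
        if ent < best_ent:
            best, best_ent = i, ent
    return best
```

The intuition is that the correct task's subnetwork produces a confident, low-entropy distribution, while mismatched masks yield near-uniform outputs.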

A.4.4 Sparse Overlap Numbers
In Table 11, we report the sparse overlap numbers for SupSup, SSNET, and ExSSNeT with and without the KKT knowledge transfer module. This table corresponds to the results in Table 3 of the main paper.

A.4.5 Average Accuracy Evolution
In Figure 6, we plot the average accuracy (1/t) Σ_{i≤t} A_{t,i} as a function of t, i.e., the average accuracy over all tasks seen so far as a function of the number of observed classes. This plot corresponds to the SplitCIFAR100 results provided in Table 2 of the main paper. We observe that the performance of SupSup and ExSSNeT does not degrade as new tasks are learned, leading to a very stable curve, whereas the performance of the other methods degrades with each new task, indicating some degree of forgetting.
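For clarity, the metric plotted in Figure 6 can be computed from the standard accuracy matrix as follows; this is a small illustrative snippet, not the paper's evaluation code:

```python
import numpy as np

def average_accuracy(A, t):
    """A[t, i] = accuracy on task i after training on task t (0-indexed).
    Returns the mean accuracy over all tasks seen up to and including t."""
    return float(np.mean(A[t, : t + 1]))
```

A method that forgets shows `average_accuracy(A, t)` decreasing in t, since earlier rows' entries `A[t, i]` for i < t shrink; mask-based methods keep those entries fixed.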
Algorithm 1 :
ExSSNeT training procedure.

A.4.6 Runtime Comparison across Methods
In this section, we compare the runtime of the various methods used in the paper. We ran each method on the sampled version of the WebNLP dataset for the S2 task order as defined in Table 6. We report the runtime of each method for four epochs over each dataset in Table 13. Note that the masking-based methods, SupSup, SSNET, and ExSSNeT, take much less time because they do not update the BERT parameters; they only find a mask over a much smaller CNN-based classification model using pretrained representations from BERT. This gives our method an inherent advantage in the natural language setting: it improves performance with significantly lower runtime, since it learns a mask over far fewer parameters.

A.4.7 Validation results
In Table 14, we provide the average validation accuracies for the main natural language results presented in Table 1. We do not provide the validation results of LAMOL (Sun et al., 2019) and MBPA++ (de Masson d'Autume et al., 2019), as we used the results provided in their original papers. For the vision domain, we did not use a validation set because no hyperparameter tuning was performed; we used the experimental settings and default parameters.

Figure 2 :
Figure 2: Test accuracy versus mask density for 100-way CIFAR100 classification. Averaged over 3 seeds.

Figure 4 :
Figure 4: We plot validation accuracy vs. epoch for ExSSNeT and ExSSNeT + KKT. We observe that KKT helps learn subsequent tasks faster and improves performance.

Figure 6 :
Figure 6: Average Accuracy of all seen tasks as a function of the number of learned classes for the Split-CIFAR100 dataset.

Input: Tasks T, a model M, mask sparsity k, exclusive=True
Output: Trained model
▷ Initialize model weights W(0)
initialize_model_weights(M)
forall i ∈ range(|T|) do
    ▷ Set the mask Mi corresponding to task ti for optimization
    mask_opt_params = Mi
    ▷ Learn the supermask Mi using edge-popup
    forall em ∈ mask_epochs do
        Mi = learn_supermask(model, mask_opt_params, ti)
    end
    ▷ Model weights at this point are the same as after the last iteration, W(i−1)
    if i > 1 and exclusive then
        ▷ Find the mask covering all weights used by previous tasks
        M1:i−1 = ∨_{j=1}^{i−1} Mj
        ▷ Restrict weight optimization to weights in Mi that are not in M1:i−1
        weight_opt_params = Mi ∧ ¬M1:i−1
    end
    ▷ Learn the free weights in the supermask Mi
    forall em ∈ weight_epochs do
        W(i) = update_weights(model, weight_opt_params, ti)
    end
end
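The core exclusivity step of the algorithm above, updating only weights that are selected by the current task's mask and untouched by earlier tasks, can be sketched in a few lines of NumPy. Names like `exclusive_weight_step` and `prev_union` are illustrative, not from the released code:

```python
import numpy as np

def exclusive_weight_step(W, grad, task_mask, prev_union, lr=0.1):
    """One gradient step of exclusive subnetwork weight training (a sketch):
    only weights selected by the current task's supermask AND not claimed
    by any earlier task are updated; everything else stays frozen, so
    earlier subnetworks are never overwritten."""
    free = task_mask.astype(bool) & ~prev_union.astype(bool)
    return W - lr * grad * free
```

After task i finishes, the union mask would be updated as `prev_union |= task_mask`, so subsequent tasks treat those weights as frozen, which is what prevents the conflicting shared-weight updates described in the introduction.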

Table 2 :
Average accuracy ↑ (forgetting metric ↓) on all tasks for vision. For our methods, results are averaged over three random seeds.

Table 3 :
Average test accuracies ↑ [and gains from KKT] when using the KKT knowledge sharing module.

Table 5 :
Ablation result for token embeddings.We report average accuracy ↑ [and gains over SupSup]

Table 7 :
Statistics for the sampled data from Huang et al. (2021) used for hyperparameter tuning. The validation set is the same size as the train set. "Class" is the number of output classes for the text classification task; "Type" is the domain of text classification.

Storing a supermask requires |W| * 1 bits in total, as in the worst case we need to store a bit for all |W| model weights.
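The |W|-bit worst case corresponds to one bit per weight; as a quick illustration (assuming a binary mask packed with NumPy, purely for exposition):

```python
import numpy as np

num_weights = 1000
mask = np.random.rand(num_weights) > 0.9   # a ~10%-density binary supermask
packed = np.packbits(mask)                 # 1 bit per weight when bit-packed
# Worst-case per-task storage: |W| bits, i.e. ceil(|W| / 8) bytes.
print(packed.nbytes)
```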

Table 8 :
Comparison between ExSSNeT and the best baseline, SupSup, on the ImageNet dataset.

Table 9 :
Effect of varying k while keeping the number of batches used for the KKT module fixed.

Table 10 :
Effect of varying the number of batches while keeping k for the top-k neighbours fixed in the KKT module.

Table 12 :
Results for the CIL setting.