Continual Training of Language Models for Few-Shot Learning

Recent work on applying large language models (LMs) achieves impressive performance in many NLP applications. Adapting or post-training an LM using an unlabeled domain corpus can produce even better performance for end-tasks in the domain. This paper proposes the problem of continually extending an LM by incrementally post-training it with a sequence of unlabeled domain corpora, expanding its knowledge without forgetting its previous skills. The goal is to improve few-shot end-task learning in these domains. The resulting system, called CPT (Continual Post-Training), is to our knowledge the first continual post-training system. Experimental results verify its effectiveness.


Introduction
Recent work has shown that large LMs can perform few-shot (or even zero-shot) learning well (Brown et al., 2020b; Rae et al., 2021; Smith et al., 2022). Post-training (a.k.a. domain-adaptive pre-training or pre-finetuning) an LM with a large unlabeled domain corpus before end-task fine-tuning in the domain achieves better results (Xu et al., 2019; Gururangan et al., 2020a) than directly fine-tuning the LM. This paper goes a step further to study the problem of improving an LM's ability to handle new and ever-emerging domains. For this, one needs to continually post-train the LM with a sequence of domains. A key issue associated with this problem is catastrophic forgetting (CF). This paper thus investigates how to continually extend the LM's knowledge without suffering from CF. From a broader perspective, since training a large LM from scratch is extremely expensive and computation-intensive, incrementally updating the LM with the latest language data, which reflects the ever-changing development of the language itself, social events, and the knowledge from different fields, is becoming more and more critical. As humans are very effective at incremental learning, if we can imitate this human capability with little or no forgetting, we will push AI research forward significantly.
The proposed system, called CPT, is a continual learning (CL) system for post-training. Starting from a pre-trained LM (e.g., RoBERTa (Liu et al., 2019b)), it incrementally post-trains the LM with a sequence of domains using their unlabeled corpora. Once a task (a domain in our case) is trained, its data is no longer accessible. At any time, the resulting continually post-trained LM can be used by end-tasks in the trained domains. This is the task-incremental learning (TIL) setting of CL, where the task id (domain id in our case) is provided when the learned model of a task needs to be used later (the use of domain ids is discussed in Sec. 2.1). This paper proposes an effective approach called CPT and focuses on the challenging and practical scenario of few-shot end-task learning after post-training on a sequence of domains.
Continual post-training is different from conventional CL (Chen and Liu, 2018). The key difference is that in conventional CL each task is an end-task, whereas in our case the end-task involves fine-tuning the continually post-trained LM (called the p-LM). This causes major forgetting, which we call the catastrophic butterfly effect (CBE) and which does not happen in conventional CL. Our proposed system, CPT, solves both CF and CBE based on a novel hard masking mechanism (Sec. 2.2) and achieves no forgetting. As shown in Sec. 3.3, naively applied existing CL systems cannot effectively prevent CF (even though some existing techniques have shown almost perfect CF prevention in conventional CL).
CPT is closely related to ELLE (Qin et al., 2022), which does continual pre-training. The key difference is that ELLE starts from random initialization, while CPT starts from a pre-trained LM. We tried to adapt ELLE for continual post-training by learning from a pre-trained RoBERTa, but it failed to converge. This also indicates that it is non-trivial to do well in our setting. Readers can refer to Appendix A for a full coverage of the related work.

Proposed CPT System
CPT continually post-trains RoBERTa (Liu et al., 2019b). This is achieved by two continual learning plug-in (called CL-plugin) modules inserted into each transformer layer of RoBERTa. The CL-plugin is inspired by adapters (Houlsby et al., 2019). While adapters can isolate different tasks, one needs to allocate a new adapter for each task, and no knowledge can be shared among different tasks' adapters. The CL-plugin, however, is a CL system that learns a sequence of tasks with adapters shared by all domains. Figure 1 gives the CPT architecture with two CL-plugins added to RoBERTa.
Sequential vs. Parallel CL-plugin. Instead of following the original sequential adapter (Houlsby et al., 2019), the CL-plugin adopts the parallel adapter idea of (He et al., 2021). The difference is that the former inserts an adapter after the FFN/attention layer, while the latter inserts it before the FFN/attention layer, running in parallel with it (see Fig. 1). We choose the parallel version as it performs better (see Sec. 3.3).
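To make the placement difference concrete, here is a minimal PyTorch sketch of the two options (our illustration, not the authors' code; layer norms and dropout are omitted):

```python
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Bottleneck adapter: down-project, non-linearity, up-project."""
    def __init__(self, d_model: int, bottleneck: int):
        super().__init__()
        self.down = nn.Linear(d_model, bottleneck)
        self.up = nn.Linear(bottleneck, d_model)

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        return self.up(torch.relu(self.down(h)))

def sequential_sublayer(h, ffn, adapter):
    # Houlsby et al. (2019): the adapter transforms the FFN *output*
    # (the adapter's own residual connection is folded in here).
    x = ffn(h)
    return h + x + adapter(x)

def parallel_sublayer(h, ffn, adapter):
    # He et al. (2021): the adapter reads the sublayer *input*, so it
    # runs as a branch parallel to the FFN rather than after it.
    return h + ffn(h) + adapter(h)
```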
In post-training, only the two CL-plugins are trained; the components of the original pre-trained RoBERTa are fixed. In end-task fine-tuning, all components are trainable. A CL-plugin is a two-layer fully connected network with a task mask mechanism. It takes two inputs: (1) the hidden states h^(t) from the feed-forward layer in a transformer layer, and (2) the task ID t needed by task-incremental learning (TIL). Inside a CL-plugin, task masks (TMs), which indicate task-specific neurons, are used to deal with CF. Since the TMs are differentiable, the whole CPT can be trained end-to-end.
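A minimal sketch of a CL-plugin under our reading of Figure 2 (names and the exact mask placement are assumptions):

```python
import torch
import torch.nn as nn

class CLPlugin(nn.Module):
    """Two fully connected layers, each gated by a per-task mask, plus a
    skip-connection (cf. Figure 2). One plugin is shared by all domains;
    only the task-ID embeddings differ across tasks."""
    def __init__(self, d_model: int, bottleneck: int, n_tasks: int):
        super().__init__()
        self.fc1 = nn.Linear(d_model, bottleneck)
        self.fc2 = nn.Linear(bottleneck, d_model)
        # Differentiable task-ID embeddings e_l^(t), one per masked layer.
        self.e1 = nn.Embedding(n_tasks, bottleneck)
        self.e2 = nn.Embedding(n_tasks, d_model)

    def mask(self, emb: nn.Embedding, t: torch.Tensor, tau: float) -> torch.Tensor:
        # Pseudo-binary "soft" mask from the task-ID embedding (Eq. 1 below).
        return torch.sigmoid(emb(t) / tau)

    def forward(self, h: torch.Tensor, t: torch.Tensor, tau: float) -> torch.Tensor:
        k1 = torch.relu(self.fc1(h)) * self.mask(self.e1, t, tau)  # Eq. 2
        k2 = self.fc2(k1) * self.mask(self.e2, t, tau)             # Eq. 2
        return h + k2  # skip-connection back into the transformer layer
```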

Task Masks (TMs)
In each layer of a CL-plugin, task masks are used to protect the neurons that are important for previous tasks in order to overcome CF. The masks essentially forbid gradient updates to those neurons during backpropagation when learning a new task. Note that a task is also a domain in our case.
Learning a new task/domain consists of two main steps: (1) applying the mask in each layer for each old task to block off the gradient flow and protect the model learned for the old task, and (2) learning domain t and its masks for future use. We present (2) first.
Learning Task Masks for Overcoming CF. In learning each task t, a mask m_l^(t) (a "soft" binary mask) is trained for the task at each layer l of the CL-plugin, indicating the neurons that are important for the task. We borrow the hard attention idea of (Serrà et al., 2018) and leverage the task ID embedding to train the mask. For a task ID t, its embedding e_l^(t) consists of differentiable deterministic parameters that are learned together with the other parts of the network. The task mask is generated by a sigmoid with temperature:

m_l^(t) = σ(e_l^(t) / τ), (1)

where τ is a temperature variable, linearly annealed from 1 to τ_min (a small positive value), so that the mask approaches a binary vector.
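For concreteness, a possible annealing schedule (the linear form is stated above; applying it per training step is our assumption):

```python
def annealed_tau(step: int, total_steps: int, tau_min: float = 0.0025) -> float:
    # Linearly anneal tau from 1 down to tau_min. As tau shrinks,
    # sigmoid(e / tau) approaches a step function, so the mask m_l^(t)
    # becomes pseudo-binary.
    frac = min(step / max(total_steps - 1, 1), 1.0)
    return 1.0 + frac * (tau_min - 1.0)
```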
In the forward pass, given the output k_l^(t) of each layer l, we element-wise multiply it by the mask:

o_l^(t) = m_l^(t) ⊗ k_l^(t). (2)

The masked output o_l^(t) of the last layer in the CL-plugin is fed to the next layer of RoBERTa with a skip-connection. After learning task t, the final m_l^(t) is saved and added to the set {m_l^(i)}.

Applying Task Masks. Before learning a new task t, we first accumulate the masks m_l^(i_prev) over all old tasks i_prev in each layer l and set them on the corresponding neurons, so that in backpropagation the gradient g_l^(t) for task t does not flow to these neurons. Since each m_l^(i_prev) is pseudo-binary, we use max-pooling to achieve the accumulation and condition the gradient:

g'_l^(t) = g_l^(t) ⊗ (1 − MaxPool({m_l^(i_prev)})). (3)

The gradient entries corresponding to the 1 entries in MaxPool({m_l^(i_prev)}) are set to 0 (to block off gradient flow) while the others remain unchanged. In this way, the neurons used by old tasks are protected. Note that we expand (copy) the accumulated mask vector to match the dimensions of g_l^(t).
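A sketch of the gradient conditioning in Eq. 3 using backward hooks (our illustration; it assumes nn.Linear's (out_features, in_features) weight layout and masks saved per output neuron):

```python
import torch
import torch.nn as nn

def accumulate_masks(prev_masks: list) -> torch.Tensor:
    # Element-wise max over the saved masks of all previous tasks:
    # a neuron is protected if ANY old task uses it.
    return torch.stack(prev_masks, dim=0).max(dim=0).values

def register_grad_blocking(linear: nn.Linear, acc_mask: torch.Tensor) -> None:
    # Eq. 3: zero the gradient rows (and bias entries) of protected output
    # neurons; the mask vector is expanded (copied) to the weight's shape.
    keep = 1.0 - acc_mask                    # 1 = free neuron, 0 = protected
    linear.weight.register_hook(lambda g: g * keep.view(-1, 1))
    linear.bias.register_hook(lambda g: g * keep)
```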

Catastrophic Butterfly Effect in Fine-tuning
To perform an end-task in a post-trained domain, we fine-tune the mask-protected model of the domain, which is indicated by the task/domain id. Fine-tuning uses the corresponding domain neurons for the specific end-task by setting τ = τ_min and conditioning the output via Eq. 2. With the masks, there should be no forgetting in continual post-training, and the end-task fine-tuning performance should be similar to post-training each domain separately. However, we found that this is not the case. Our investigation found that the problem is due to the pseudo-gate function in Eq. 1. No matter how small τ is, Eq. 1 can only give us a mask that is almost (but not exactly) 0 or 1. This causes the following: (1) During post-training, the gradients for used neurons in Eq. 3 are not exactly 0 but a very small number.
(2) During fine-tuning, we cannot exactly select the corresponding neurons for the specific end-task by simply setting τ = τ_min. The small change to the neurons of old domains during post-training caused by (1) is negligible in conventional CL, because in conventional CL we evaluate the model on test sets and no weight update is involved. However, in CPT, the end-task needs to fine-tune the continually post-trained LM (p-LM), which involves weight updating. A small change to the p-LM during continual post-training can result in a different initialization for the end-task fine-tuning and give totally different fine-tuning results. We call this the butterfly effect, inspired by the term indicating that a small state change in nature (e.g., the flap of a butterfly's wings in Brazil) can result in large differences in a later state (e.g., a tornado in Texas).
We propose a simple method to solve it, i.e., applying a threshold θ to m_l^(t) to make it a hard binary mask:

m̂_l^(t) = 1[m_l^(t) > θ], (4)

where 1[·] is the element-wise indicator function. We then apply the hard mask to Eq. 3 in gradient manipulation and to Eq. 2 in end-task fine-tuning. θ can be easily set (we use 0.5) since Eq. 1 already gives a pseudo-binary mask. Note that this has almost no effect on post-training, as the mask is used to block the backward-pass gradient flow during post-training and to select the corresponding neurons during fine-tuning.
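The fix itself is a one-liner (sketch):

```python
import torch

def harden(mask: torch.Tensor, theta: float = 0.5) -> torch.Tensor:
    # Eq. 4: threshold the pseudo-binary mask so that gating is by exactly
    # 0 or 1; the hard mask is used in Eq. 3 (gradient blocking) during
    # post-training and in Eq. 2 (neuron selection) during fine-tuning.
    return (mask > theta).float()
```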

Experiments
The proposed paradigm uses a different evaluation from that of conventional continual learning (CL). After unsupervised continual post-training of an LM (RoBERTa in our case) with a sequence of domains, the resulting p-LM is used to fine-tune a few-shot classification end-task from any post-trained domain. There is no CL during end-task fine-tuning; each fine-tuning task is done separately.

Datasets and Baselines
Datasets: We use 4 unlabeled domain datasets: Yelp Restaurant (Xu et al., 2019), AI Papers (Lo et al., 2020), ACL Papers (Lo et al., 2020) and AGNews (Zhang et al., 2015), together with their 4 corresponding end-task classification datasets.
Baselines. Since no existing method can perform our task, we use 6 non-CL and 7 adapted CL methods as our baselines. The non-CL baselines include (1) RoBERTa and (2) Adapter, where we directly fine-tune the pre-trained model or adapter (without any post-training); (3) RoBERTa-ONE, (4) Adapter-ONE and (5) Prompt-ONE, where we build a model for each task using a separate network (these have no knowledge transfer or CF); and (6) DEMIX (Gururangan et al., 2021), which trains a separate adapter for each task and initializes the adapter from its most similar previous task adapter. The 7 adapted CL baselines include (7) RoBERTa-NCL and (8) Adapter-NCL, where we post-train on the domains one by one with no mechanism to deal with CF or transfer. The others are state-of-the-art CL baselines that we adapt for continual post-training.

Implementation Details
Architecture. We adopt RoBERTa-BASE as our backbone LM. A masked language model head is applied for post-training. Fine-tuning follows the standard practice (Devlin et al., 2019), where we pass the final-layer </s> token representation to a task-specific feed-forward layer for prediction. The feed-forward layer with softmax output is used as the classification head, together with the categorical cross-entropy loss. Note that for the aspect sentiment classification (ASC) task (see Table 3), we adopt the ASC formulation of (Xu et al., 2019), where the aspect (e.g., "sound") and the review sentence (e.g., "The sound is great") are concatenated via </s>.
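For example, with a HuggingFace-style tokenizer (an illustration of the input format, not the authors' exact preprocessing code), the aspect/sentence pair can be built as:

```python
from transformers import RobertaTokenizer

tokenizer = RobertaTokenizer.from_pretrained("roberta-base")

def build_asc_input(aspect: str, sentence: str):
    # Passing a text pair makes the tokenizer join the two segments with
    # RoBERTa's </s> separator tokens, following the ASC formulation
    # of Xu et al. (2019).
    return tokenizer(aspect, sentence, truncation=True, max_length=164)

enc = build_asc_input("sound", "The sound is great")
```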
Hyperparameters. Unless otherwise stated, the same hyper-parameters are used in all experiments. We use 0.0025 for τ_min in Eq. 1, and θ is set to 0.5 in Eq. 4. As shown in Figure 1, there are two CL-plugins in each transformer layer (one at the bottom in parallel with the attention sublayer and one at the top in parallel with the FFN). We search the CL-plugin size within {128, 256, 512, 768, 1024} and adopt 512 for the bottom one and 768 for the top one based on validation experiments. The task id embeddings have the same size as the hidden layer dimension of the CL-plugin. The maximum input length is set to 164, which is long enough for all datasets. We use the Adam optimizer and set the learning rate to 1e-4 for post-training and 5e-5 for fine-tuning. The batch size is set to 48 for post-training and 20 for fine-tuning. Since each of our domain-specific datasets has a different size, we post-train CPT on each task/domain for 1 epoch, which is approximately 13K steps, following (Gururangan et al., 2020b; Xu et al., 2019). We train on the end-task fine-tuning datasets for 20 epochs and take the results of the last epoch, assuming no validation sets. We empirically found that 20 epochs give relatively stable results.
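For reference, the stated settings collected in one place (a convenience summary, not the authors' configuration file):

```python
HPARAMS = {
    "tau_min": 0.0025,                              # Eq. 1 temperature floor
    "theta": 0.5,                                   # Eq. 4 binarization threshold
    "plugin_size": {"bottom": 512, "top": 768},     # searched in {128, ..., 1024}
    "max_seq_len": 164,
    "optimizer": "Adam",
    "lr": {"post_train": 1e-4, "fine_tune": 5e-5},
    "batch_size": {"post_train": 48, "fine_tune": 20},
    "epochs": {"post_train": 1, "fine_tune": 20},   # ~13K post-training steps
}
```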

Evaluation Results and Analysis
We report the average results of the 4 different fine-tuning tasks (or datasets) in accuracy and Macro-F1 after post-training on all unlabeled domain datasets in Table 1. The forgetting rate (forget R.) (Liu et al., 2020a) is also reported: the higher the forgetting rate, the more forgetting there is, while negative rates indicate positive knowledge transfer.
Superiority of CPT. Clearly, CPT outperforms all baselines. Comparing with the CL baselines, CPT achieves no forgetting (its forgetting rate is 0), indicating the high effectiveness of the proposed approach. We also note that CPT is even slightly better than the ONE baselines, indicating some positive knowledge transfer in CPT.

Ablations
In Table 2, we give the ablation results. We are interested in the following: (1) Catastrophic butterfly effect (CBE). The third row ("w/o butterfly") shows the results without the hard binary mask mechanism of Eq. 4. Clearly, the results are worse and the model suffers from forgetting. This confirms the existence of CBE and shows that our fix is effective.
(2) Different architecture. CPT is based on the CL-plugin, which is inspired by adapters. Another popular way to use adapters is to make them sequential (Houlsby et al., 2019). The sequential adapter (first row) is clearly poorer than our parallel one, which conforms to the observation in (He et al., 2021).
Table 1 only reports the results of one fixed domain order (Restaurant → AI → ACL → AGNews). We are interested in how the order impacts CPT's results. We give the detailed results for all the baselines under the different domain orders in Appendix E. The results of CPT do not change much, and it still outperforms the other baselines. This indicates CPT's robustness to domain orders in post-training.

Conclusion
This paper proposed to continually post-train an LM with a sequence of domains using their unlabeled domain corpora, together with an effective method (CPT) for doing so. An end-task from any post-trained domain can then fine-tune the resulting LM. Experimental results demonstrate the effectiveness of CPT.

Limitations
We list two limitations of CPT. First, CPT adds CL-plugins for continual post-training with no change to the underlying LM during training. Although a CL-plugin is small compared to an LM, it is still interesting, and may be more effective, to explore updating the LM directly without any additional modules. Second, domain ids are needed in both training and testing for CPT. In some applications, it may be hard to provide a domain id for each fine-tuning end-task. We will explore these issues in our future work, as specializing and/or incrementally improving an LM is an important problem.

A Related Work
Our work is related to continual learning, post-training and few-shot learning.
Post-training is an effective approach to mitigate the discrepancies between the pre-training domains and the target domain. Researchers have applied post-training to many domains, e.g., reviews (Xu et al., 2019; Sun et al., 2019), and news and academic papers (Gururangan et al., 2020b), and shown improved end-task results. However, none of them consider the continual learning paradigm.
Few-shot learning (FL) aims to learn tasks with a few labeled examples. The main issue of FL is over-fitting due to the scarcity of labeled training data. Existing methods to overcome over-fitting fall into three main families: (i) model-based methods try to reduce the hypothesis space of the few-shot task (Triantafillou et al., 2017; Hu et al., 2018), (ii) data-based methods try to augment additional data to the few-shot set (Benaim and Wolf, 2018; Gao et al., 2020), and (iii) algorithm-based solutions try to improve strategies for searching for the best hypothesis. Recently, a new paradigm using prompts has achieved promising results for few-shot language learning, as shown in GPT-3 (Brown et al., 2020a), PET (Schick and Schütze, 2021) and LM-BFF (Gao et al., 2021). However, none of them does few-shot fine-tuning in continual post-training.
Continual few-shot learning. Several researchers have studied this problem recently (Antoniou et al., 2020; Qin and Joty, 2021; Jin et al., 2021; Xia et al., 2021; Yin et al., 2022). It continually learns a sequence of few-shot tasks. However, this is very different from our continual post-training, because our continual learning happens in the post-training stage instead of the end-task fine-tuning stage. We only evaluate the proposed CPT system after continual post-training, by conducting the few-shot learning tasks individually, fine-tuning the post-trained language model (p-LM) for each of the post-trained domains. No continual learning is involved in the few-shot learning.

C Dataset Statistics
Table 3 shows the statistics of the unlabeled domain datasets and the end-task classification datasets. Note that the full AGNews dataset is very large. We use only its author-provided training split as our unlabeled AGNews dataset for continual post-training. The remaining test set is used for the labeled end-task (AGNews-FT). The other three corresponding end-task datasets are SemEval-res (Xu et al., 2019), ACL-ARC (Jurgens et al., 2018), and SCIERC (Luan et al., 2018).

D Baseline Details

(1, 2) RoBERTa and Adapter (Liu et al., 2019b; Houlsby et al., 2019) use the original RoBERTa/Adapter for the end-task fine-tuning without any post-training; these are the only two baselines without any post-training. All the following baselines use the masked language model (MLM) loss for post-training.
(3) RoBERTa-ONE is the existing post-training method of (Gururangan et al., 2020b). To our knowledge, the existing post-training systems are all based on the MLM loss.
(4) Adapter-ONE (Madotto et al., 2020; Houlsby et al., 2019) adds small adapter layers between the layers of the Transformer for post-training. We follow the adapter design in (Madotto et al., 2020; Houlsby et al., 2019); an adapter is simply two fully connected layers. During post-training, the Transformer is fixed and only the added adapters are trainable. The bottleneck size (adapter size) is set to 128. During end-task fine-tuning, both RoBERTa and the adapters are trainable to ensure a fair comparison.
(5) Prompt-ONE (Lester et al., 2021) adds a sequence of real-vector tokens (called virtual tokens or prompt tokens) to the end of the original input sequence. In post-training, RoBERTa (the LM) is fixed and only the prompt tokens are trained. In end-task fine-tuning, both the LM and the trained prompt are trainable. We initialize 100 tokens and set the learning rate of the prompt tokens to 0.3 in post-training, following the setting in (Lester et al., 2021).
(6) DEMIX (Gururangan et al., 2021) is a recent model that adapts a pre-trained LM to new domains. It adds a new adapter once a new domain arrives (network expansion is needed) and initializes the new adapter with the parameters of the previously trained adapter nearest to the new domain data, using the perplexity on held-out samples to choose the most probable adapter.
(7) RoBERTa-NCL (naive continual learning) is a naive extension of (Gururangan et al., 2020b) that continually/incrementally post-trains the LM to learn all domains using the MLM loss, with no mechanism to deal with CF or encourage transfer.
(8) Adapter-NCL (Houlsby et al., 2019) is similar to the adapter-based systems above. The only difference is that the same set of adapters is shared across all domains, rather than using a new adapter for each new domain.
(9) HAT (hard attention to overcome forgetting) is derived from HAT (Serrà et al., 2018), the state-of-the-art parameter-isolation-based method with almost no forgetting. However, HAT suffers from forgetting in continual post-training due to the catastrophic butterfly effect.
(10) BCL (Ke et al., 2021c) is a continual learning model that can avoid forgetting and encourage knowledge transfer. It is similar to Adapter-NCL; the difference is that its adapters consist of two modules: a capsule network (a new capsule is added once a new domain arrives) to encourage transfer, and a HAT-like module to avoid forgetting. As with HAT, task/domain information is needed in end-task fine-tuning. We replace the backbone network from BERT with RoBERTa for a fair comparison.
(11) Knowledge distillation (KD) (Hinton et al., 2015) minimizes the deviation between the previously learned representations and the new representations during post-training. We compute the KL divergence between the representations (the outputs before the masked language model prediction head) of each token under the previous post-trained LM and the current LM as the distillation loss.
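A sketch of such a token-level distillation loss (treating the hidden dimensions as a distribution via softmax is our assumption; the text does not pin down the exact normalization):

```python
import torch
import torch.nn.functional as F

def kd_loss(prev_repr: torch.Tensor, curr_repr: torch.Tensor) -> torch.Tensor:
    # prev_repr, curr_repr: (batch, seq_len, hidden) token representations
    # taken before the MLM prediction head, from the frozen previous p-LM
    # and the current LM, respectively.
    log_p = F.log_softmax(curr_repr, dim=-1)
    q = F.softmax(prev_repr.detach(), dim=-1)
    return F.kl_div(log_p, q, reduction="batchmean")
```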
(12) EWC (Kirkpatrick et al., 2017) is a popular regularization-based continual learning method that adopts elastic weight consolidation, adding an L2 regularization term to penalize changes to parameters that were important for previous tasks.
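A minimal sketch of the standard EWC penalty (variable names and the diagonal-Fisher estimate are ours):

```python
import torch
import torch.nn as nn

def ewc_penalty(model: nn.Module, fisher: dict, old_params: dict,
                lam: float) -> torch.Tensor:
    # Quadratic penalty on drift from the previously learned parameters,
    # weighted by a (diagonal) Fisher-information importance estimate.
    loss = torch.zeros((), device=next(model.parameters()).device)
    for name, p in model.named_parameters():
        if name in fisher:
            loss = loss + (fisher[name] * (p - old_params[name]) ** 2).sum()
    return lam * loss
```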
(13) DER++ (Buzzega et al., 2020) is a recent replay method that uses distillation to regularize new-task training. We store 16.4K tokens for each learned domain as the replay memory, which is the largest memory we could use and still run the system.

E Results for Different Domain Orders
Table 1 in the main paper reported the results for the order Restaurant → AI → ACL → AGNews.
We now look at how the order affects the results. Table 6 shows the baselines' and CPT's results for 4 different orders. Note that the results for the Non-CL baselines are the same across the different orders (and the same as those in Table 1) because they are not affected by the domain order.

Figure 1: Architecture of CPT, which has two CL-plugins inserted into the transformer layers of RoBERTa in a parallel manner (FFN: feed-forward network). (A) CPT for continual post-training: a masked language model (MLM) head is used for unsupervised post-training of the CL-plugins only. (B) CPT for individual fine-tuning: CPT is evaluated by the corresponding individual end-task performance on all post-trained tasks. Each CL-plugin has numbers and colors indicating its masks, as illustrated in Appendix B.

B Illustration of CPT
Figure 2 illustrates the CPT architecture and the task mask learning. Note that fine-tuning is for evaluating the domain post-training and should not affect any parameters of the post-trained model. During continual post-training (Figure 2 (A)), after training domain/task 1, we obtain its useful neurons, indicated by the 1 entries. Before training domain/task 2, the neurons useful for domain 1 are first masked (the previous 1 entries are turned into 0's). After training domain 2, two neurons with 1 are used by that domain. When domain t arrives, all neurons used by domains 1 and 2 are masked before training, i.e., their entries are set to 0. After training domain t, we see that domains t and 1 share a neuron (the cell with two colors, red and green), which is used by both domains. After continual post-training, we evaluate CPT by individual fine-tuning. During fine-tuning (Figure 2 (B)), we only make use of the neurons that are useful for domain/task id t (red cells) and freeze all other neurons (grey cells).

Figure 2: Architecture of CPT, which has two CL-plugins inserted into the transformer layers of RoBERTa in a parallel manner. (A) CPT for continual post-training: a masked language model (MLM) head is used for unsupervised post-training of the plugins only. (B) CPT for individual fine-tuning: the performance of CPT is evaluated by the corresponding individual end-task performance on all post-trained tasks using the final post-trained model (with different masks). Each CL-plugin module (to the right of the transformer) has two fully connected layers and a skip-connection. On top of each fully connected layer there is a mask, computed from the task ID t, with the same size as the fully connected layer.

Table 1: End-task macro-F1 (MF1), accuracy, and forgetting rate results for all domains after continual post-training of all domains. The results are averages over 5 random seeds (the domain training order is as the domains appear in the first row). Due to space limits, the results for different domain orders and the standard deviations are reported in Appendix E and Appendix F, respectively. Non-CL baselines have no forgetting.

Table 3: Statistics of the unlabeled domain datasets and the end-task supervised classification datasets.

Table 4: Standard deviations of the corresponding metrics for the proposed CPT system and the baselines.

Table 5: Standard deviations of the corresponding metrics for the proposed CPT system and the ablations.