Towards Anytime Fine-tuning: Continually Pre-trained Language Models with Hypernetwork Prompts

Continual pre-training has become essential for adapting a pre-trained model to a multitude of domains and tasks in the fast-evolving world. In practice, a continually pre-trained model is expected to demonstrate not only greater capacity when fine-tuned on pre-trained domains but also non-decreasing performance on unseen ones. In this work, we first investigate the anytime fine-tuning effectiveness of existing continual pre-training approaches, concluding with unanimously decreased performance on unseen domains. To this end, we propose a prompt-guided continual pre-training method, in which we train a hypernetwork to generate domain-specific prompts with both an agreement and a disagreement loss. The agreement loss maximally preserves the generalization of a pre-trained model to new domains, and the disagreement loss guards the exclusiveness of the generated hidden states for each domain. Remarkably, prompts produced by the hypernetwork remove the need for a domain identity during fine-tuning and promote knowledge transfer across domains. Our method achieves improvements of 3.57% and 3.4% on two real-world datasets (covering domain shift and temporal shift, respectively), demonstrating its efficacy.


Introduction
Pre-trained language models (LMs), such as GPT-3 (Brown et al., 2020) and BERT (Devlin et al., 2019a), have revolutionized a wide spectrum of downstream natural language processing (NLP) tasks. Being initially pre-trained on a vast unlabeled corpus (e.g., C_0 in Fig. 1), unfortunately, they struggle to keep up to date with language evolution (e.g., emerging internet slang, the expanded meaning of "Omicron") and domain shift (e.g., electronic health records for medical diagnosis).
Figure 1: Illustration of continual pre-training and the evaluation protocol of anytime fine-tuning, in which a_{ij} in the accuracy table denotes the fine-tuned accuracy of the LM at any i-th stage, i.e., B_i, on the j-th pre-trained (blue), current (red), and unseen (orange) domains.
Continual pre-training methods (Jin et al., 2022; Ke et al., 2023) have recently emerged to address this by continually adapting an LM to a sequence of domains (e.g., T domains in Fig. 1). Two major lines of existing approaches, knowledge distillation (Jin et al., 2022) and parameter isolation (Ke et al., 2022a, 2023), make strides toward (1) maximizing adaptability, i.e., the performance of an LM (e.g., B_2 in Fig. 1) when fine-tuned on the domain where it is pre-trained (e.g., D_2 in Fig. 1), and (2) avoiding catastrophic forgetting (CF), which is measured by the fine-tuned performance of an LM (e.g., B_2 in Fig. 1) on the already pre-trained domains (e.g., D_1 in Fig. 1).
Beyond the above two criteria, in practice, a continually pre-trained LM is also anticipated to offer non-decreasing generalization capability on unseen domains. As illustrated in Fig. 1, it is likely that the unlabeled corpus for the domain of interest (e.g., electronic health records as D_T) remains inaccessible to an LM (e.g., B_2) beforehand, while this LM should be superior or at least on par with its preceding models (e.g., B_1) on the T-th domain.

Figure 2: … (Ke et al., 2023), and ours. Detailed settings are available in Sec. 5.2.

On this account, we propose the comprehensive evaluation protocol named anytime fine-tuning that subsumes all three aspects, where a continually pre-trained LM can be fine-tuned and evaluated on either previously pre-trained, current, or unseen domains. The effectiveness of current methods in terms of anytime fine-tuning remains largely unclear.
In this paper, we first conduct an empirical investigation of existing pre-training approaches under anytime fine-tuning (see Fig. 2) and identify the following two prominent unresolved research questions.
(1) Parameter-efficient pre-training, such as training adapters (Ke et al., 2021b) or prompts (Razdaibiedina et al., 2023; Smith et al., 2023) only for each individual domain, does not even contribute greater adaptability than that before pre-training (evidenced by negative diagonal values in Fig. 2(d)). (2) Almost all existing approaches suffer decreased generalization to unseen domains after continual pre-training.

To address the above issues, we propose a Hypernetwork Prompt guided Continual Pre-Training method (namely HPrompt-CPT) that strikes a balance between forgetting, adaptability, and generalization. First, inspired by the recent success of prompt engineering paired with full fine-tuning in domain adaptation (Radford et al., 2019; Brown et al., 2020), we introduce the hnet-prompt module consisting of a hypernetwork to automatically generate domain-specific prompts without handcrafted engineering. Different from parameter-efficient pre-training that trains prompts only, we optimize both the hypernetwork and the full LM so as to fully adapt to the current domain. An added benefit of hypernetwork prompts is that they eliminate the reliance on the domain identity to pinpoint prompts when fine-tuning. Second, we maximally preserve generalization while mitigating CF of a continually pre-trained LM via the agreement and disagreement losses. We prompt the previous and current LMs with a random prompt that simulates generic or learned domains and introduce the agreement loss to enforce consistency between their predictions, avoiding forgetting while preserving model plasticity on other prompts. On the other hand, the disagreement loss promotes the exclusiveness of the generated hidden states for the current domain, thus minimizing interference with established knowledge and encouraging generalization during fine-tuning through diverse domain knowledge. Noteworthily, the hypernetwork also favors knowledge generalization, compared to disparate prompts for different domains.
Main Findings and Contributions. (1) We establish a continual pre-training evaluation protocol, called anytime fine-tuning, and empirically verify that existing parameter-efficient approaches lose their competitive edge in adaptability and that almost all methods are at risk of impairing generalization to unseen domains (see Fig. 2). (2) We further conquer the two challenges by proposing a hypernetwork prompt guided continual pre-training (HPrompt-CPT) scheme, where we train the hypernetwork with both agreement and disagreement losses. HPrompt-CPT is effective, achieving state-of-the-art results on two real-world datasets.

Related Work
Continual Learning (CL) focuses on the problem of sequential learning from a stream of data that comes in different distributions. It has achieved great success in computer vision (Wang et al., 2022a,c; Smith et al., 2023), natural language processing (Sun et al., 2019; Ke et al., 2023), and data mining (Hao et al., 2023; Xue et al., 2023). In this paper, we focus on one of its important aspects, continual pre-training, and present recent progress below. More related works are given in Appendix A.

Continual Pre-training. Previous studies (Gururangan et al., 2020; Dery et al., 2022) have demonstrated that the fine-tuned performance of an LM on downstream tasks can be enhanced by continued training on a domain-related corpus. Recent works take this concept further by introducing Continual Pre-training (CPT), where the LM continually learns from streaming domain corpora. Jin et al. (2022) and Jang et al. (2022) investigate conventional CL methods in CPT using real-world datasets and highlight that the final LM can be fine-tuned to serve any task in pre-trained domains with improved performance, while Hu et al. (2022a) find CPT comparable with joint pre-training. To improve upon this, ELLE (Qin et al., 2022) progressively expands LMs with function-preserving initialization to inject knowledge from new corpora, while CPT (Ke et al., 2022a) designs specific adapters and utilizes hard-masking to avoid CF. Additionally, DGA (Ke et al., 2022b) and DAS (Ke et al., 2023) adopt soft-masking to directly control the update of the entire LM and contrast the previous and current representations.
Though these methods alleviate CF during CPT, they ignore the importance of adaptation to domain knowledge for better fine-tuned performance (Gururangan et al., 2020; Dery et al., 2022) and of generalization to unseen domains (Wortsman et al., 2022; Andreassen et al., 2022). Our work unlocks the potential of the LM and improves all three aspects.

Preliminaries
Our language model B is constructed using the RoBERTa architecture (Liu et al., 2019), which is based on a bi-directional Transformer structure. The LM takes a text sentence x_{1:T} = [x_1, x_2, ..., x_T] as input and encodes it into contextual embeddings h = [h_1, h_2, ..., h_T] = B(x_{1:T}).

Pre-training and Fine-tuning Tasks
During pre-training, the model is trained to predict missing words in a given text sentence x and thus acquires a general understanding of language, such as syntax, semantics, and context. The pre-training task is called masked language modeling (MLM) (Devlin et al., 2019a), with the objective
ℓ_mlm(x, W) = − ∑_{x̂ ∈ m(x)} log p(x̂ | x_{\m(x)}, W),
where W denotes the parameters of the language model B, and m(x) and x_{\m(x)} denote the masked words from x and the remaining words, respectively. The conditional probability is calculated by a prediction layer g_mlm as p(x̂ | x_{\m(x)}, W) = g_mlm(B_W(x_{\m(x)})).
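As a concrete sketch, the MLM objective above sums the negative log-likelihood over the masked positions given per-token vocabulary logits. The following minimal NumPy example is illustrative (the function and variable names are not from the paper):

```python
import numpy as np

def mlm_loss(logits, target_ids, mask_positions):
    """Masked-LM objective: sum of -log p(x̂ | x_{\m(x)}) over masked tokens.

    logits:         (T, V) per-token vocabulary scores from the prediction head
    target_ids:     length-T list of original token ids
    mask_positions: indices of the masked tokens m(x)
    """
    loss = 0.0
    for t in mask_positions:
        z = logits[t] - logits[t].max()          # stabilized softmax
        log_probs = z - np.log(np.exp(z).sum())  # log softmax over the vocabulary
        loss -= log_probs[target_ids[t]]         # -log p of the true masked token
    return loss
```

With uniform logits over a vocabulary of size V, each masked token contributes log V to the loss, as expected for an uninformed model.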
After pre-training, the model is fine-tuned on a smaller dataset specific to a downstream task, which enables it to learn the intricacies and details of the task. In our study, the downstream task contains labeled samples (x, y) (e.g., in a hashtag prediction task, x is a user's tweet and y is the selected hashtag). Its objective is to minimize ℓ_down(x, W) = − log p(y | x, W).

Soft Prompt Learning
Prompt tuning (Lester et al., 2021) is a lightweight alternative to full fine-tuning that introduces a trainable prompt P = [p_1, p_2, ..., p_L] as a prefix to the input embedding E = [e(x_1), e(x_2), ..., e(x_T)], replacing updates to the entire model. Here L is the prompt length, e denotes the embedding layer of the LM, and p_i ∈ R^d has the same dimension d as the token embeddings. During prompt tuning, the concatenated matrix [P; E] ∈ R^{(L+T)×d} is used as the input to the LM, expressed as B(x, P). The downstream task optimization is ℓ_down(x, P) = − log p(y | x, P) = − log g_down(B(x, P)), where g_down is the prediction layer for the task; the model B is not updated in conventional soft prompt learning.
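The prefixing step above is just a matrix concatenation; a minimal NumPy sketch (shapes and names are illustrative assumptions, and the LM itself is elided):

```python
import numpy as np

def prompted_input(P, E):
    """Prepend a soft prompt P (L x d) to token embeddings E (T x d),
    giving the (L+T) x d matrix [P; E] that is fed to the LM."""
    assert P.shape[1] == E.shape[1], "prompt and token embeddings share dim d"
    return np.concatenate([P, E], axis=0)

rng = np.random.default_rng(0)
L, T, d = 4, 6, 8
P = rng.normal(size=(L, d))   # trainable prompt [p_1, ..., p_L]
E = rng.normal(size=(T, d))   # input embeddings [e(x_1), ..., e(x_T)]
X = prompted_input(P, E)      # [P; E] in R^{(L+T) x d}
```

In conventional prompt tuning, only P receives gradient updates while the LM parameters stay frozen.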

Continual Pre-training for Anytime Fine-tuning

Continual pre-training (Jang et al., 2022; Meng et al., 2023) is a way to efficiently adapt to new domains while maintaining learned knowledge. The problem formulation is as follows (see Fig. 1): assume a stream of new domains (e.g., the latest news about "Omicron") sequentially appears as D_1, ..., D_N, where D_i is the distribution of the i-th domain over a finite vocabulary of tokens X. Initially, we have an LM that has been well pre-trained on a general corpus C_0, such as RoBERTa. Then at each stage i, a collection of new unlabeled corpus C_i drawn from D_i arrives, and the LM is continually pre-trained on it so as to gain greater capacity when fine-tuned on tasks from all pre-trained, current, and unseen domains. Each domain has its labeled dataset, where a labeling function F* provides ground-truth labels in Y for classification. During evaluation, the LM B_i, pre-trained up to the i-th domain, is fine-tuned on a train set D_j^tr and then tested on D_j^te to measure its domain performance, as illustrated in Fig. 1. The resulting accuracy, denoted as Acc_{B_i}(D_j) (simplified as a_{ij}), indicates the model capacity on task D_j as well as the degree of knowledge of the j-th domain maintained by the LM after being sequentially trained up to C_i.
Through the integration of results, an accuracy table is generated, allowing for the computation of three crucial metrics in anytime fine-tuning as discussed in Sec. 1: adaptability, generalization, and forgetting. The values used to calculate these metrics are indicated by different colors in Fig. 1. Red cells along the diagonal of the table represent adaptability, indicating the degree to which the LM learns knowledge relevant to the current domain. Yellow cells in the upper triangle represent generalization, signifying the ability to perform effectively on future domains. Blue cells in the lower triangle represent forgetting, reflecting a reduction in previously learned knowledge during training.

Method
A successful algorithm of continual pre-training for anytime fine-tuning should meet the following requirements: (1) effective adaptation to the current domain, capturing more domain knowledge, (2) strong generalization to tasks in unseen domains, and (3) minimal catastrophic forgetting of previously learned knowledge. To achieve this, we propose a framework, dubbed HPrompt-CPT, which consists of two components: the Hnet-Prompt module and the agreement and disagreement losses. An overview is presented in Fig. 3.

Hnet-Prompt for Pre-training and Fine-tuning
Previous soft prompt methods (Qin and Joty, 2022; Zhu et al., 2022; Razdaibiedina et al., 2023) have achieved great success in CL, with almost no catastrophic forgetting. However, these parameter-efficient methods fall short in model adaptation during the pre-training stage and fail to exhibit generalization capabilities when faced with new domains, as shown in Fig. 2. On the other hand, prompt engineering has shown exceptional performance in pre-training language models to better learn domain-specific knowledge (Radford et al., 2019; Brown et al., 2020). However, hard-coded prompts are difficult to design and contribute little to generalization. Therefore, inspired by previous meta-learning approaches (Qiao et al., 2018; Yao et al., 2019), we propose a prompt module with a meta hypernetwork (Hnet-Prompt) for automatic knowledge adaptation and cross-domain generalization.

Specifically, when a batch of data x_1, ..., x_n from a specific domain D_i arrives, the hypernetwork generates a prompt P_i for each sample (see Fig. 3(b)), taking into account both domain and sample properties while generalizing knowledge from learned domains. The process is parameterized as
P_i = F(ĥ_i), with ĥ_i = E(x_i),
where E refers to a text encoder, F corresponds to a hypernetwork, and ĥ_i represents the contextual embedding, which captures both the sentence and implicit domain information. The hypernetwork F encodes the domain feature of input samples (we use a 6-layer Transformer) and then projects the pooled feature to obtain the prompt (see Fig. 3(b)). Rather than directly generating the prompt, we set M prompt components V_m ∈ R^{L×d} and generate a weight vector α ∈ R^M to get the final prompt P = ∑_{m=1}^{M} α_m V_m. The vector α controls the contribution of each prompt component, which corresponds to a basic domain. This approach reduces the parameters of the linear projection layer and alleviates forgetting by shifting the learning problem from remembering the entire embedding to a weight vector.
Prompt components V, analogous to a set of basis vectors, are a set of prompt embeddings that are randomly initialized, trainable, and optimized through gradient descent. Well-trained prompt components are expected to offer greater generalization to future domains as long as they are as mutually exclusive as possible. For example, a prompt embedding directly optimized for the domain of "ACL papers" does not directly apply to the domain of "AI papers" due to the domain difference; however, one of the prompt components learned on "ACL papers", e.g., "deep learning", can be combined with another component, e.g., "statistics", to generalize to the domain of "AI papers".
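To make the component-based generation concrete, here is a minimal NumPy sketch. All shapes, the single projection matrix standing in for the hypernetwork head, and the softmax normalization of α are illustrative assumptions (the paper's F is a 6-layer Transformer over the contextual embedding):

```python
import numpy as np

rng = np.random.default_rng(0)
M, L, d, h = 3, 4, 8, 16     # components, prompt length, embed dim, encoder dim

V = rng.normal(size=(M, L, d))     # trainable prompt components V_1..V_M
W_alpha = rng.normal(size=(h, M))  # projection head of the hypernetwork F

def generate_prompt(h_pooled):
    """Map a pooled contextual embedding ĥ to a prompt P = sum_m alpha_m V_m."""
    logits = h_pooled @ W_alpha
    alpha = np.exp(logits - logits.max())
    alpha /= alpha.sum()                   # weight vector over prompt components
    return np.tensordot(alpha, V, axes=1)  # (L, d) sample-specific prompt

P = generate_prompt(rng.normal(size=h))
```

Only the small weight vector α depends on the sample, which is what shifts the learning problem from memorizing a full prompt embedding to predicting M mixture weights.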
During pre-training, the language model is conditioned on the prompt generated by the hypernetwork, which models p(output | input, domain) and injects the domain knowledge into the model in an explicit way. Then, we optimize the language model and hypernetwork in an end-to-end manner by minimizing
ℓ_mlm(x, W, Θ) = − ∑_{x̂ ∈ m(x)} log p(x̂ | x_{\m(x)}, P, W), with P = F(E(x)),
where Θ denotes the parameters of F. This approach allows qualified and automatic adaptation to domain knowledge and enables the transfer of this knowledge across domains through the hypernetwork.
During downstream task fine-tuning, the domain identity is not required anymore. The hypernetwork automatically maps the input samples to their unique prompt embeddings with the knowledge generalized from learned domains. Given a task t, the entire model is fine-tuned on the smaller labeled dataset, using the objective ℓ_down(x, W, Θ) = − log p(y | x, W, Θ). Here the hypernetwork F is also trainable to achieve the best adaptation to downstream tasks. The fine-tuned performance on the task reflects the degree of domain knowledge maintained by the LM.

Agreement and Disagreement Losses for Prompted Language Model
While preventing the forgetting of learned knowledge is always the key challenge in continual pre-training, existing solutions often come at the cost of adaptability and generalization. To overcome this, we propose a novel approach, named the agreement and disagreement losses.

Agreement loss. While knowledge distillation (KD) has been demonstrated to perform well in overcoming CF (Chuang et al., 2020; Dong et al., 2021), its alignment on the entire feature space can limit adaptation to new domains. To alleviate this, we propose to align the output p(output | input, domain) of the prompted language model instead of the p(output | input) used in conventional KD. We term this approach the agreement loss. Specifically, we begin with the previously learned LM B_{i−1}. Then, we initialize a random prompt P_rand and generate prompted hidden states using both the current LM B_i and the previous LM B_{i−1} (see Fig. 3(c)).
We then minimize a distance metric M between the outputs of the two models:
ℓ_a = M( B_i(x, P_rand), B_{i−1}(x, P_rand) ),
where P_rand simulates the condition that activates generic or learned domain knowledge. The agreement loss, which operates on B(·, P_rand), effectively prevents forgetting by enforcing consistency on multiple randomized conditions and preserves plasticity on new domains by maintaining model capacity conditioned on other prompts, as demonstrated by a comparison to KD. A smaller M indicates a closer distance between the two inputs. In this article, we use cosine similarity to calculate M, which performs better than the KL distance between logits in the experiments in Sec. 5.4.

Disagreement loss. Besides the consistency achieved by the agreement loss, we also expect exclusiveness of the generated hidden states for the current domain. This brings two advantages: (1) it reduces interference with established knowledge, which mitigates forgetting (Farajtabar et al., 2020; Wang et al., 2021b); (2) it encourages generalization during fine-tuning by incorporating a wider range of domain knowledge (Pagliardini et al., 2023). To achieve this exclusiveness, we add a loss function called the disagreement loss. Specifically, when a sample arrives, we generate the prompt using the hypernetwork F and train the prompted LM to maximally disagree with the output of the previous LM, which is prompted by the same embedding (see Fig. 3(c)). This involves minimizing an agreement metric A(·, ·) to push apart the two prompted hidden states:
ℓ_da = A( B_i(x, P), B_{i−1}(x, P) ),
thereby increasing the exclusiveness of the LM output for the current domain. In Sec. 5.4, we compare various implementations of A, including an orthogonal constraint (Smith et al., 2023), a softmax variant (Pagliardini et al., 2023), and the opposite value of the KL-divergence. Ultimately, we select the orthogonal constraint, which penalizes the squared inner product between the two prompted hidden states.

Finally, the loss function of HPrompt-CPT during pre-training can be summarized as
ℓ = (1/N) ∑_{i=1}^{N} [ ℓ_mlm + λ_1 ℓ_a + λ_2 ℓ_da ],
where N is the batch size and λ_1, λ_2 are trade-off hyper-parameters. The loss input x_i is omitted for brevity.
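A minimal sketch of the two losses on pooled hidden-state vectors, assuming cosine similarity for M and a squared-cosine orthogonality penalty for A (the exact pooling and penalty forms are assumptions of this sketch, not taken from the paper):

```python
import numpy as np

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def agreement_loss(h_cur_rand, h_prev_rand):
    """Agreement under a shared random prompt: pull the current and
    previous LMs' hidden states together (1 - cosine similarity)."""
    return 1.0 - cosine(h_cur_rand, h_prev_rand)

def disagreement_loss(h_cur, h_prev):
    """Disagreement under the hnet prompt: an orthogonality penalty
    pushing the two prompted hidden states apart (zero when orthogonal)."""
    return cosine(h_cur, h_prev) ** 2

def total_loss(l_mlm, h_cur_rand, h_prev_rand, h_cur, h_prev, lam1=1.0, lam2=1.0):
    """Per-sample pre-training loss: l_mlm + lam1 * l_a + lam2 * l_da."""
    return (l_mlm
            + lam1 * agreement_loss(h_cur_rand, h_prev_rand)
            + lam2 * disagreement_loss(h_cur, h_prev))
```

Note that identical hidden states give zero agreement loss (no forgetting penalty), while orthogonal hidden states give zero disagreement loss (fully exclusive representations).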

Experiment
In this section, we conduct experiments on two benchmarks to investigate the adaptability, generalization, and degree of forgetting of HPrompt-CPT.

Benchmarks
DAPset. This is a benchmark for continual domain-adaptive pre-training, originally constructed by Ke et al. (2023). It consists of six domains, each with an unlabeled corpus and a corresponding end-task classification dataset. Each domain contains a corpus of over 100 million tokens, and we follow the original data construction and task order.
TWEET. We develop a new benchmark based on a tweet dataset (Jin et al., 2022) to simulate distribution shift over time. The dataset includes tweets from 2015 to 2019 and is split into five time periods to form five domain corpora, each with over 50 million tokens. The tweet texts are preprocessed following Nguyen et al. (2020). For the downstream task, we build a single-label hashtag prediction dataset for each domain following Gong and Zhang (2016). TWEET keeps the chronological order of domains to simulate updating in a real-world system. Please refer to Appendix B for more information about the two benchmarks.

Metrics and Baselines
Metrics. We introduce three attributes of continual pre-training in Sec. 3.3 and explain their evaluation methods here. Formally, we use the adaptation accuracy A_Acc = (1/T) ∑_{i=1}^{T} a_{ii} to measure adaptability, the out-of-domain accuracy O_Acc = 2/(T(T−1)) ∑_{i=1}^{T} ∑_{j=i+1}^{T} a_{ij} to evaluate generalization, and the final accuracy F_Acc = (1/T) ∑_{i=1}^{T} a_{Ti} to assess the degree of catastrophic forgetting. Here, a_{ji} denotes the fine-tuned accuracy on the i-th downstream task after the LM is sequentially trained up to corpus C_j of the j-th domain.
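Given the T×T accuracy table a_{ij}, the three metrics reduce to averages over the diagonal, the upper triangle, and the last row. A small NumPy sketch (the function name is illustrative):

```python
import numpy as np

def anytime_metrics(acc):
    """Compute (A_Acc, O_Acc, F_Acc) from a T x T accuracy table,
    where acc[i, j] = a_ij: the stage-i model fine-tuned on domain j."""
    T = acc.shape[0]
    a_acc = np.mean([acc[i, i] for i in range(T)])            # diagonal
    o_acc = np.mean([acc[i, j] for i in range(T)              # strict upper triangle,
                     for j in range(i + 1, T)])               # 2/(T(T-1)) normalization
    f_acc = np.mean(acc[T - 1, :])                            # last row (final model)
    return a_acc, o_acc, f_acc
```

Averaging the strict upper triangle is equivalent to the 2/(T(T−1)) normalization in the O_Acc formula, since that triangle has exactly T(T−1)/2 entries.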
Baselines. We first evaluate algorithms that build a separate model for each domain: (1) Initial is fine-tuned from the initial pre-trained checkpoint. (2) Multi-Task is domain-adaptively pre-trained on the mixture of all domains. (3) One-Full is domain-adaptively pre-trained with updates on the full model. (4) One-Adapter is domain-adaptively pre-trained with an adapter layer (Houlsby et al., 2019). (5) One-Prompt is domain-adaptively pre-trained with a new prompt (Lester et al., 2021). Additionally, we test 7 continual pre-training methods: (6) NCL is sequentially pre-trained without any CL methods. (7) EWC (Kirkpatrick et al., 2017) is a regularization method that penalizes changes to important neurons. (8) DERpp (Buzzega et al., 2020) is a replay method at both the sample and feature levels. (9) …

For HPrompt-CPT, we adopt a 6-layer Transformer as our hypernetwork and a frozen RoBERTa as the text encoder. We set the prompt length to 50 and the number of prompt components to 100. In addition, we implement a replay loss on the hypernetwork with a memory buffer storing 300 samples to get the best performance; removing it results in a minimal drop of 0.24% in F_Acc on DAPset. During fine-tuning, we train each task for 15 epochs with an early stopping mechanism using the validation data (30% of the testing data). Additional implementation details are in Appendix C.

Results and Analysis
Comparison with the state-of-the-art. Table 1 shows the continual pre-training performance of different methods along the three dimensions. From these results, we make the following observations.

Observation 1: HPrompt-CPT outperforms baselines in terms of adaptability, generalization, and avoidance of catastrophic forgetting. Our approach achieves new state-of-the-art results across all three metrics, with increases of 1.38% and 1.09% on DAPset in terms of generalization and final performance compared to the most recent algorithm, DAS, as depicted in the last row of Table 1. These results highlight the advantages of injecting domain knowledge into the LM with the hnet-prompt module, which aids adaptation and promotes knowledge transfer.
Observation 2: Naive multi-task learning is sub-optimal for continual pre-training. Our hnet-prompt method achieves a relative improvement in F_Acc of 1.69% on DAPset and 2.35% on TWEET, suggesting that it can alleviate negative transfer between conflicting domains and minimize forgetting. It is worth noting that the O_Acc metric of multi-task learning cannot be compared fairly with other algorithms since it has already observed all domains. Nevertheless, our algorithm still achieves a 1.50% gain on TWEET, which may result from the generalization of the diverse domain knowledge in HPrompt-CPT.
Observation 3: Full model tuning achieves better results in learning and transferring domain knowledge. Our proposed method and NCL outperform parameter-efficient methods such as One-Adapter, One-Prompt, and CoDA-Prompt. Interestingly, methods that impose regularization terms on parts of the neurons, such as EWC and DAS, also yield lower A_Acc. This suggests that injecting a large amount of domain knowledge into the LM requires a sufficient number of trainable parameters. Our prompted LM, with all parameters trainable and no empirical constraints on updates, shows the best adaptation performance.

Data-efficient pre-training. We hypothesize that HPrompt-CPT is especially effective in the setting of anytime fine-tuning, so its performance on a small subset of the corpus is worth examining, since the model may be fine-tuned before training on a domain has finished. Fig. 4 illustrates the performance when training on datasets of different sizes and highlights the effectiveness of our method in low-resource environments, particularly in terms of generalization ability. Our design of the hnet-prompt module successfully promotes knowledge transfer across domains; besides, we observe that the structure of the hypernetwork matters in such settings. Transformers may underfit on smaller datasets, resulting in poor performance compared to the linear structure.
Analysis on the distributions of hnet-prompt embeddings and hidden states.We perform qualitative analyses on prompts and hidden states generated by HPrompt-CPT to investigate whether the hypernetwork can generalize domain information.
As depicted in Fig. 5, we use t-SNE (van der Maaten and Hinton, 2008) to visualize the model output before and after training on all six domains in DAPset. For prompts, we observe that the generated prompt embeddings effectively cluster similar domains together (e.g., overlapping embeddings for corpora C_2, C_3, and C_5 from the same paper dataset) while also achieving differentiation for dissimilar domains (e.g., distant embeddings for C_1 (restaurant) and C_5 (bio-chem)). This is an impressive result, i.e., it transfers information across domains, making it easier for the LM to effectively adapt and generalize knowledge.
For hidden states, our model generates distinguishable representations for downstream tasks based on pre-trained domain information, i.e., the initially mixed downstream representations (D_1–D_6 in Fig. 5, top right) are successfully separated in Fig. 5, top left. For instance, the model assigns overlapping representations to the similar tasks D_2 and D_3 (belonging to ACL and AI, respectively), while providing effective differentiation for the unrelated tasks D_1 (restaurant) and D_5 (biology).

Ablation Study
Table 2 and Table 3 present the results of different designs of HPrompt-CPT on DAPset, where hyper-parameters are fixed across all settings.

Effectiveness of the main components. To assess the impact of the hypernetwork, we replace the hnet-prompt with ProgPrompt (Razdaibiedina et al., 2023), which generates a new soft prompt for each domain and concatenates it with previously learned prompts while requiring the domain-id during fine-tuning. As shown in Table 2 (rows 1 and 3), this results in a significant decrease in performance, particularly in adaptability, with an almost 1.77% drop. This highlights the effectiveness of hnet-prompt in adapting and generalizing domain knowledge, providing great capacity for fine-tuning.
To examine the effect of the agreement and disagreement losses, we compare the results of training the progressive prompt and hnet-prompt with and without them. Incorporating the agreement and disagreement losses leads to a 1.15% and 1.20% improvement in F_Acc for the two models, respectively, demonstrating their efficiency in preventing CF. Furthermore, we observe that introducing the disagreement loss results in a 1.33% gain in O_Acc, which is attributed to the incorporation of a wider range of domain knowledge for adaptation, as discussed in Sec. 4.2.

Hypernetwork structure. We further investigate different designs of the hypernetwork and present the results in Table 3 (top). First, we compare the network structure with a Linear layer or a Multi-layer Perceptron (MLP) (the top two rows), but they show poor adaptability and a higher level of CF. Interestingly, we find that the linear structure is more stable in low-resource settings. Besides, we examine the performance of generating the prompt embedding directly to show the significance of the component-based method introduced in Sec. 4.1. The results reveal that the component-based approach performs better in generalization and in preventing forgetting, benefiting from shifting the learning problem from remembering the prompt to the weight vector, which is a simpler task.

Agreement and disagreement loss objectives. We first replace the agreement loss with conventional KD, and the result is presented in the first row of Table 3 (middle). It shows the agreement loss leads to a 1.06% improvement in adaptability while …

Table 3: Ablation results on the hypernetwork structure and agreement/disagreement loss objectives. Here, ℓ_a and ℓ_da denote the two losses. The content in parentheses represents the applied objective. The logit variant minimizes the mean square error on logits. The KL-distance variant maximizes the KL distance between hidden states. The softmax variant maximizes the softmax on logits, following Pagliardini et al. (2023).

Conclusion
This paper introduces HPrompt-CPT, a novel prompt-guided continual pre-training method towards anytime fine-tuning, which enables better performance when fine-tuning on seen and unseen domains. By training a hypernetwork to generate domain-specific prompts with the agreement and disagreement losses, it achieves (i) greater capacity on pre-trained domains by learning domain knowledge with generated prompts while preserving previous knowledge with random prompts, (ii) improved performance on unseen domains by retaining model plasticity with the agreement loss and the knowledge-transfer ability of the hypernetwork, and (iii) no need for a domain-id during fine-tuning. We set a new SOTA on both a well-established benchmark and a temporal-shift benchmark.

Limitations
While we have evaluated our approach on two continual pre-training benchmarks, it remains unknown how well our method would perform on benchmarks with severe domain conflicts. The domains in the benchmarks used in our paper are mostly transferable to each other; for example, the domains "ACL" and "AI" in DAPset are highly related. We are not sure how our method will perform on a sequence of domains with little to no shared knowledge or even conflicts. In addition, we currently only test our method on classification tasks, while the exploration of more types of downstream tasks is also important. Our future work will extend the benchmark to cover such cases.
Another problem for HPrompt-CPT is the selection of hypernetworks. Our experiments in Sec. 5.3 demonstrate that decreasing the size of the unlabeled corpus can cause the Transformer structure to underfit, while the Linear structure cannot capture all the information from a large corpus. In addition, we find the fine-tuning of the hypernetwork is sensitive to the learning rate and weight decay. We aim to enhance the capacity and stability of our hypernetwork. Moreover, it would be ideal to obtain a hypernetwork that generalizes well on downstream tasks without fine-tuning.

A Additional related work
Continual learning. Continual Learning (CL) focuses on the problem of sequential learning from a stream of data that comes in different distributions. It is desired to extend the acquired knowledge to future tasks while avoiding catastrophic forgetting (CF) (McCloskey and Cohen, 1989) of past tasks, and this has been successfully implemented in various fields, including computer vision (De Lange et al., 2021; Cha et al., 2021; Wang et al., 2022b), natural language processing (Sun et al., 2019; Huang et al., 2021; Qin and Joty, 2022), and robotics (Wołczyk et al., 2021; Wolczyk et al., 2022). Traditional CL approaches fall into three types: regularization methods (Kirkpatrick et al., 2017; Li and Hoiem, 2017; Zenke et al., 2017; Aljundi et al., 2018), replay methods (Aljundi et al., 2018; Buzzega et al., 2020; Sun et al., 2022), and architecture-based methods (Serra et al., 2018; Li et al., 2019; Kang et al., 2022). However, designing CL methods for language modeling (LM) is different due to the two-stage training scheme (Devlin et al., 2019b; Qiu et al., 2020), where the model first learns universal language representations on a large corpus (pre-training stage) and then fine-tunes on downstream tasks (fine-tuning stage). Therefore, CL methods for LMs can be categorized into Continual Pre-training and Continual Fine-tuning. In this paper, we focus on continual pre-training.
Continual Fine-tuning. Continual fine-tuning (FT) trains a model on a stream of downstream tasks after its initialization. The model is required to transfer knowledge to new tasks, avoid forgetting, and perform consistently well on all previously learned tasks. Unlike traditional CL, continual FT typically leverages the abilities of a pre-trained language model. For instance, among replay methods, MbPA+ (de Masson D'Autume et al., 2019) uses BERT (Devlin et al., 2019a) to obtain key representations of old examples for local adaptation at inference time, while LAMOL (Sun et al., 2019) generates old examples via the generative ability of GPT-2. IDBR (Huang et al., 2021) prevents forgetting by learning generic and task-specific knowledge with disentanglement-based losses. Recently, researchers have explored parameter-efficient tuning (Houlsby et al., 2019; Lester et al., 2021; Hu et al., 2022b) for CL. One line of approaches (Ke et al., 2021b,a; Jin et al., 2021) uses adapter modules (Houlsby et al., 2019) to incorporate task-specific parameters into frozen transformer layers, while another (Qin and Joty, 2022; Zhu et al., 2022) uses soft prompts (Brown et al., 2020) to activate the model's ability to solve different tasks.
Soft Prompt Learning. Recent works (Wang et al., 2021a; Vu et al., 2022) have shown the potential of parameter-efficient learning to achieve multitask performance at low cost by stacking lightweight units. Their success has attracted a surge of interest in adapting such methods to CL, especially soft prompt tuning (Lester et al., 2021; Liu et al., 2022). These methods aim to transfer knowledge between tasks through prompts while avoiding forgetting. For example, LFPT5 (Qin and Joty, 2022) employs a large soft prompt that is continually trained on all tasks while distilling previous knowledge. Continual Prompt Tuning (Zhu et al., 2022) uses initialization, memory replay, and A-GEM (Chaudhry et al., 2019) techniques to facilitate forward/backward prompt transfer. ProgPrompt (Razdaibiedina et al., 2023) leverages previously learned prompts by concatenating them with new embeddings. While most of these methods require a task id to select the appropriate prompt, DualPrompt (Wang et al., 2022a) and L2P (Wang et al., 2022c) create a prompt pool and learn a cluster-based mapping from input data to a specific prompt. Furthermore, CodaPrompt (Smith et al., 2023) learns a composition over the prompt pool, replacing index operations with a backpropagation-based approach.
Knowledge distillation. Knowledge distillation (KD) (Hinton et al., 2015) is a widely used technique for improving performance and efficiency in various tasks, such as model compression (Meng et al., 2021; Chen et al., 2022) and transfer learning (Xu et al., 2020; Fang et al., 2021). KD has also been applied in continual learning to transfer knowledge from old tasks to new ones and thus prevent forgetting (Chuang et al., 2020; Dong et al., 2021; Ke et al., 2021a). However, previous approaches mainly focus on aligning the entire feature space, which can limit the adaptation ability of the model. Instead, we propose using an agreement and a disagreement loss to decompose KD into prompt and feature space, achieving a better balance between plasticity and stability.
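The intuition behind the two losses can be illustrated with a minimal sketch: an agreement term that pulls the current model's hidden states toward those of the frozen previous model (stability), and a disagreement term that penalizes different domains for producing aligned representations (exclusiveness). The specific distance functions below (MSE and a cosine hinge) are our assumptions for illustration, not the paper's exact formulations.

```python
import math

def agreement_loss(h_new, h_old):
    """Pull the prompted LM's hidden states toward the frozen previous
    model's states, preserving generalization. MSE over a single hidden
    vector is an illustrative assumption."""
    return sum((a - b) ** 2 for a, b in zip(h_new, h_old)) / len(h_new)

def disagreement_loss(h_a, h_b):
    """Push hidden states produced for different domains apart, guarding
    an exclusive representation per domain. Penalizing positive cosine
    similarity is a hypothetical choice for illustration."""
    dot = sum(a * b for a, b in zip(h_a, h_b))
    na = math.sqrt(sum(a * a for a in h_a))
    nb = math.sqrt(sum(b * b for b in h_b))
    cos = dot / (na * nb + 1e-8)
    return max(0.0, cos)  # zero once the two representations diverge
```

Identical states yield zero agreement loss (nothing to distill), while orthogonal or opposed domain states yield zero disagreement loss (already exclusive).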

B Dataset Details
DAPset. DAPset is a benchmark for continual domain-adaptive pre-training constructed by Ke et al. (2023). It consists of six domains, each with an unlabeled corpus and a corresponding downstream classification dataset. The domains come from two sources: three are about reviews, Yelp Restaurant (Xu et al., 2019) / Restaurant (Ding et al., 2008), Amazon Phone (Ni et al., 2019) / Phone (Ding et al., 2008), and Amazon Camera (Ni et al., 2019) / Camera (Ding et al., 2008); the other three are academic papers, ACL papers (Lo et al., 2020) / ACL-ARC (Jurgens et al., 2018), AI papers (Lo et al., 2020) / SCIERC (Luan et al., 2018), and PubMed papers (Lo et al., 2020) / CHEMPROT (Kringelum et al., 2016). In each pair, the former is the unlabeled corpus and the latter is the corresponding downstream task. We show the statistics of these datasets in Table 4.
The downstream tasks in DAPset can be divided into four types. (1) Aspect sentiment classification: given an aspect (e.g., environment in a restaurant review) and the corresponding review text, classify the sentiment as positive, negative, or neutral. (2) Citation intent classification: given a sentence containing a citation, classify the intent of the citation. (3) Relation classification: given a sentence together with its entities, classify the relation between these entities. (4) Chemical-protein interaction classification: given a sentence containing a pair of a chemical and a protein, classify the interaction between the two.
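To make the input/output shape of the four task types concrete, here is a hypothetical input-formatting sketch; the templates and the `</s>` separator are our assumptions (the paper does not specify its exact templates):

```python
# Hypothetical input templates for the four DAPset task types.
def format_input(task_type, **fields):
    sep = " </s> "
    if task_type == "aspect_sentiment":
        # aspect + review -> {positive, negative, neutral}
        return fields["aspect"] + sep + fields["review"]
    if task_type == "citation_intent":
        # sentence containing a citation -> intent label
        return fields["sentence"]
    if task_type in ("relation", "chemprot"):
        # sentence + entity (or chemical/protein) pair -> relation label
        return fields["e1"] + sep + fields["e2"] + sep + fields["sentence"]
    raise ValueError(f"unknown task type: {task_type}")
```

Each formatted string would be fed to the classification head attached during fine-tuning.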
TWEET. We develop a new benchmark, TWEET, following Jin et al. (2022), to simulate distribution shift over time. The data is collected by the Archive team, and we select the text data from 2015 to 2019, dividing it into five time periods to create five domain corpora. We pre-process the text data according to Nguyen et al. (2020), with hashtags removed. We randomly select 300 MB of data from each period; the statistics of each domain are presented in Table 4.
For the downstream tasks, we build a single-label hashtag prediction task for each domain following Gong and Zhang (2016). We count the 10 most frequent hashtags (e.g., "#hiring", "#music") in each time period and extract 700 tweets for each label: 200 for training, 200 for validation, and 300 for testing. Before a tweet is input to the model, we remove the hashtags themselves and ask the model to predict the most appropriate hashtag for the sentence.
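The construction above can be sketched as follows; details such as how tweets with multiple frequent hashtags are handled are our assumptions (here they are simply discarded to keep labels unambiguous):

```python
import re
from collections import Counter

def build_hashtag_task(tweets, n_labels=10, per_label=700):
    """Sketch of the single-label hashtag prediction construction:
    pick the n_labels most frequent hashtags, then collect up to
    per_label tweets per label with all hashtags stripped from the text."""
    tag_counts = Counter(
        t.lower() for tw in tweets for t in re.findall(r"#\w+", tw)
    )
    labels = [t for t, _ in tag_counts.most_common(n_labels)]
    examples = {lab: [] for lab in labels}
    for tw in tweets:
        tags = {t.lower() for t in re.findall(r"#\w+", tw)}
        hits = [lab for lab in labels if lab in tags]
        # assumption: keep only tweets matching exactly one label
        if len(hits) == 1 and len(examples[hits[0]]) < per_label:
            text = re.sub(r"#\w+", "", tw).strip()  # remove hashtags
            examples[hits[0]].append((text, hits[0]))
    return labels, examples
```

Each label bucket would then be split 200/200/300 into train/validation/test.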

C Implementation Details
We adopt RoBERTa-BASE (Liu et al., 2019) as our backbone language model and a 6-layer Transformer (Vaswani et al., 2017) as our hypernetwork. In the pre-training phase, we apply a masked language modeling head on top of the LM, which is replaced with a classification head during fine-tuning. Each downstream task has its own classification head. In pre-training, the trainable parameters depend on the algorithm design, while in fine-tuning all parameters, including the language model and the added modules, are trainable.
The maximum input sequence length is set to 164, following Ke et al. (2023), and we use an Adam optimizer with a weight decay of 0.01 for both pre-training and fine-tuning. During pre-training, the learning rate is set to 1e-4 and the batch size to 128. We train for 5K and 2.5K steps per domain in DAPset and TWEET, respectively, which is roughly one full pass through the domain data. We set the prompt length to 50, the size of the prompt components to 100, and the size of the memory buffer to 300. As for the trade-off hyperparameters, we set λ_r to 1, λ_a to 0.01, and λ_da to 0.01. During fine-tuning, the learning rate is set to 3e-5 and the batch size to 16. We train on the downstream datasets for 15 epochs with early stopping. Unless otherwise stated, the same hyperparameters are used in all experiments.
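For reference, the hyperparameters above can be collected into a single configuration sketch; the dictionary names and grouping are ours, the values are transcribed from the text:

```python
# Hyperparameters from the implementation details; key names are ours.
PRETRAIN = {
    "optimizer": "Adam", "weight_decay": 0.01, "lr": 1e-4,
    "batch_size": 128, "max_seq_len": 164,
    "steps_per_domain": {"DAPset": 5_000, "TWEET": 2_500},
}
FINETUNE = {
    "optimizer": "Adam", "weight_decay": 0.01, "lr": 3e-5,
    "batch_size": 16, "epochs": 15, "early_stopping": True,
}
HPROMPT = {
    "prompt_length": 50, "prompt_components": 100, "memory_buffer": 300,
    "lambda_r": 1.0, "lambda_a": 0.01, "lambda_da": 0.01,
}
```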

D Robustness on different orders
As several works (Yoon et al., 2020; Evron et al., 2022) suggest that task order may significantly affect the performance of CL approaches, we also conduct experiments to test the robustness of our method to different task orders. We run the methods on several orders of the DAPset benchmark and list the results in Table 5. Our method shows average improvements of 0.99%, 1.22%, and 1.36% on the three metrics compared to DAS, demonstrating its robustness.

Figure 2: Evaluation of separate and continual pre-training methods under anytime fine-tuning, where we modify each value a_ij by subtracting a_0j, the fine-tuned accuracy of the initial LM B_0. (a)-(e) show the accuracy tables obtained by pre-training on each domain separately w.r.t. different sets of parameters (e.g., top layers); (f)-(h) are for the naive continual pre-training method (NCL), DAS (Ke et al., 2023), and ours. Detailed settings are available in Sec. 5.2.
(e)). Likewise, pre-training only parts of the parameters for each domain may also diminish adaptability, as seen by comparing Fig. 2(b)(c)(g) with (a). (2) Continual pre-training likely comes at the cost of sacrificing generalization to unseen domains, as shown by the large negative values in the third column of Fig. 2(f)(g).

Figure 3: An overview of the model structure, with dotted lines indicating trainable modules and solid lines indicating frozen modules. (a) depicts soft prompt tuning (Sec. 3.2). (b) shows pre-training on domain 4 with the hnet-prompt module (Sec. 4.1). The hypernetwork takes the contextual embedding ĥ as input and automatically generates a prompt P according to domain and sample properties, which clusters the prompts for similar domains (D_2, D_3, D_4) together and facilitates knowledge generalization. (c) computes the agreement and disagreement losses (Sec. 4.2).
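The hnet-prompt mapping in Figure 3(b) can be illustrated with the simple Linear hypernetwork variant discussed in the limitations: a single matrix maps the pooled contextual embedding to a fixed-length prompt. The pooling choice and shapes are our assumptions for illustration; the actual model uses a 6-layer Transformer hypernetwork.

```python
def linear_hypernet(h_ctx, W, prompt_len):
    """'Linear' hypernetwork sketch: map a pooled contextual embedding
    h_ctx (length d) through one matrix W of shape (prompt_len*d, d),
    then reshape the output into prompt_len prompt vectors of length d."""
    d = len(h_ctx)
    flat = [sum(w * x for w, x in zip(row, h_ctx)) for row in W]
    assert len(flat) == prompt_len * d, "W must have prompt_len*d rows"
    return [flat[i * d:(i + 1) * d] for i in range(prompt_len)]
```

Because the prompt is a function of the input embedding rather than a stored per-domain parameter, no domain identity is needed at fine-tuning time.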

Figure 4: Performance on DAPset with different corpus sizes. "Ours (trans/lin)" refers to using a Transformer/Linear hypernetwork in HPrompt-CPT, respectively.

Figure 5: The t-SNE map of prompt embeddings and last-layer hidden states. C_i and D_i denote the corpus and downstream task of the i-th domain, respectively.

Table 1: Performance of baselines on the DAPset/TWEET benchmarks (all results reported in this paper are averaged over 4 random seeds). The symbol "−" appears because F_Acc equals the average accuracy A_Acc in the separate pre-training setting. We also report results for different domain orders in Appendix D.

Table 2: Ablation results on the main components.

Table 4: Statistics of the datasets in DAPset and TWEET.

Table 5: Performance on different orders of domains on DAPset.