Recyclable Tuning for Continual Pre-training

Continual pre-training is the paradigm where pre-trained language models (PLMs) continually acquire fresh knowledge from growing data and gradually get upgraded. Before an upgraded PLM is released, we may have tuned the original PLM for various tasks and stored the adapted weights. However, when tuning the upgraded PLM, these outdated adapted weights will typically be ignored and discarded, causing a potential waste of resources. We bring this issue to the forefront and contend that proper algorithms for recycling outdated adapted weights should be developed. To this end, we formulate the task of recyclable tuning for continual pre-training. In pilot studies, we find that after continual pre-training, the upgraded PLM remains compatible with the outdated adapted weights to some extent. Motivated by this finding, we analyze the connection between continually pre-trained PLMs from two novel aspects, i.e., mode connectivity and functional similarity. Based on the corresponding findings, we propose both an initialization-based method and a distillation-based method for our task. We demonstrate their feasibility in improving the convergence and performance for tuning the upgraded PLM. We also show that both methods can be combined to achieve better performance. The source codes are publicly available at https://github.com/thunlp/RecyclableTuning.


Introduction
The emergence of pre-trained language models (PLMs) has revolutionized the entire field of natural language processing (NLP) (Bommasani et al., 2021). Through downstream adaptation, PLMs effectively stimulate the knowledge acquired during pre-training and achieve remarkable success in various downstream tasks (Devlin et al., 2019; Liu et al., 2019; Raffel et al., 2020). Such adaptation can be achieved by either full-parameter fine-tuning or parameter-efficient tuning (Houlsby et al., 2019), where the latter learns lightweight adapted modules for downstream tasks. Currently, a de facto paradigm for handling NLP tasks has formed, dividing practitioners into two groups: (1) upstream suppliers, who pre-train PLMs on task-agnostic data and release them on public platforms, e.g., HuggingFace (Wolf et al., 2020), and (2) downstream consumers, who download the PLMs and conduct personalized adaptation using task-specific data. The corresponding adapted weights might then be shared with third parties via platforms such as AdapterHub (Pfeiffer et al., 2020).
In real-world scenarios, PLMs may constantly get upgraded and released by the supplier. Correspondingly, a compatible update of the adapted weights on the consumer side becomes necessary. Continual pre-training (Qin et al., 2022c) is a typical scenario where PLMs continually acquire fresh knowledge from growing data and gradually get upgraded. Before an upgraded PLM is released, consumers may have tuned the original PLM for various tasks and stored the adapted weights. However, when tuning the upgraded PLM, these outdated adapted weights are typically ignored and discarded. This can lead to a loss of the downstream-task knowledge encapsulated in the outdated weights, as well as a potential waste of computational resources. In this paper, we bring this issue to the forefront and argue that proper algorithms for recycling outdated adapted weights should be developed. To this end, we formulate the task of recyclable tuning for continual pre-training, which is illustrated in Figure 1.
Due to the parameter change during continual pre-training, one potential concern for recycling outdated adapted weights is their mismatch with the upgraded PLM. However, our pilot studies reveal that directly applying the outdated weights to the upgraded PLM yields substantial performance improvements over the PLM's zero-shot inference. This shows that the upgraded PLM remains compatible with the outdated weights to some extent, indicating a close connection between continually pre-trained PLMs. Intuitively, such a connection provides a strong basis for our assertion that outdated weights are recyclable and useful.
To uncover hints for solving our task, we further investigate this connection from two aspects: (1) linear mode connectivity (Qin et al., 2022b). We demonstrate that after adapting both the upgraded PLM and the original PLM to the same task, linearly interpolating the parameters of the two adapted models produces a series of checkpoints with high task performance (low loss). This property indicates a close parametric connection between the two PLMs in the loss landscape. (2) Functional similarity. After adapting both PLMs to the same task, we observe that their corresponding attention heads exhibit similar patterns given the same input. Such representational proximity implies that both PLMs own similar functionalities during text processing.
Both analyses above demonstrate the close connections between continually pre-trained PLMs. Based on the corresponding findings, we propose two methods for recyclable tuning: (1) Initialization-based method, which leverages the adapted weights of the original PLM as the initialization for the upgraded PLM. This method is motivated by the close parametric connection of the two PLMs in the loss landscape. We demonstrate that for a target task, initializing the tunable parameters with the outdated weights from a similar source task accelerates convergence and improves training efficiency, compared with random initialization. In addition, after sufficient training, this method generally improves the final performance. We also observe that the benefits of this method in terms of convergence and performance are greater when the source and target tasks are more similar.
(2) Distillation-based method, which distills the knowledge stored in the outdated weights for tuning the upgraded PLM. We demonstrate that knowledge distillation effectively facilitates knowledge transfer between continually pre-trained PLMs. Using only a small number of labeled examples, the upgraded PLM can outperform the original PLM trained with far more examples. We also show that the initialization-based and distillation-based methods can be combined to further improve performance. This means that knowledge transfer through the parameter space and through model outputs are complementary to each other.
In a nutshell, these results highlight the practical benefits of recyclable tuning and point to an important future direction in sustainable NLP.

Related Work
Continual Pre-training. Conventionally, PLMs are trained on static data, ignoring that streaming data from various sources could continually grow. Continual pre-training requires PLMs to accumulate new knowledge in a continual manner (Gururangan et al., 2020), meanwhile alleviating the catastrophic forgetting problem. Prior works in this field focus on building benchmarks and analyses (Jang et al., 2021, 2022). Later works explored the applicability of traditional continual learning algorithms in this setting (Jin et al., 2022; Wu et al., 2021). Recent efforts have also targeted computationally efficient continual pre-training (Qin et al., 2022c).
Previous works focus on improving the capabilities of PLMs during pre-training from the standpoint of upstream suppliers. Instead, we shift the focus to downstream adaptation from the perspective of customers. We highlight a previously overlooked issue: the incompatibility between upgraded PLMs and existing adapted weights. For the first time, we examine the connections between continually pre-trained models and demonstrate the potential benefits of recycling outdated weights.
Knowledge Transfer for PLMs. Transfer learning for PLMs has gained increasing attention recently. Some works study task-level transferability for an individual PLM and find that fine-tuning on certain source tasks conduces to the performance on similar target tasks (Vu et al., 2020; Poth et al., 2021; Aghajanyan et al., 2021). Differently, we also study cross-task knowledge transfer between two different PLMs in the continual pre-training scenario (§ 5.1). Besides, researchers have investigated cross-model knowledge transfer, attempting to recycle lightweight adapted weights of the same task between two independently pre-trained PLMs, e.g., PLMs trained on distinct data (Su et al., 2022). As we show later, unlike independently trained PLMs, continually pre-trained PLMs enjoy close connections. This distinction makes our setting unique and may require different solutions.

Problem Formulation
Continual Pre-training. Following Qin et al. (2022c), we simulate the scenario where new data from 4 domains is gathered sequentially, i.e., biomedical papers (BIO, D_1) (Lo et al., 2020), Amazon reviews (REV, D_2) (He and McAuley, 2016), computer science papers (CS, D_3) (Lo et al., 2020), and news articles (NS, D_4) (Zellers et al., 2019). Starting from the official RoBERTa_BASE (Liu et al., 2019) (denoted as M_0), we continually pre-train M_0 on the 4 domains. For each domain, we set the pre-training steps to 12.5k and the batch size to 2048. Denote M_i as the PLM that finishes training on D_i, and M_i(t) as the PLM that starts from M_{i-1} and is trained on D_i for t steps. We assume the suppliers only release the PLM that finishes training on each domain, i.e., {M_1, ..., M_4} are developed and released. The pre-training details are described in appendix D.1.

Downstream Adaptation. At the same time, we have a set of downstream tasks to handle. To adapt M_i (0 ≤ i ≤ 4) to a task T_j, we conduct supervised training using the loss function L_{T_j}. Denoting the pre-trained weights of M_i as θ_i^0, we obtain its adapted weights ∆_i^{T_j} for T_j after training. By assembling θ_i^0 and ∆_i^{T_j}, the resultant model θ_i^{T_j} = θ_i^0 ⊕ ∆_i^{T_j} can be deployed to handle T_j. Throughout this paper, we consider two tuning methods: full-parameter fine-tuning and a representative parameter-efficient tuning method, adapter tuning (Houlsby et al., 2019) (see appendix A.1 for more background). For the former, ∆_i^{T_j} spans all PLM parameters; for the latter, it consists only of the inserted adapter modules.

Recyclable Tuning. Before the release of an upgraded PLM M_{i'} (i < i'), we have obtained adapted weights ∆_i^{T_j} of an old PLM M_i for task T_j. Recyclable tuning aims at transferring the knowledge of ∆_i^{T_j} to assist tuning M_{i'} (i.e., learning new weights ∆_{i'}^{T_j}). We denote the above process as ∆_i^{T_j} → ∆_{i'}^{T_j}. Intuitively, ∆_i^{T_j} encapsulates abundant knowledge about the task T_j, which should benefit learning ∆_{i'}^{T_j} if exploited properly. Such benefits may include improved training efficiency or performance. To gain insights into solving the task, we first conduct a series of empirical analyses in § 4 to understand the connections among M_i, M_{i'}, ∆_i^{T_j}, and ∆_{i'}^{T_j}.
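To make the notation concrete, the assembly operator θ_i^{T_j} = θ_i^0 ⊕ ∆_i^{T_j} could be sketched as follows. This is a minimal illustration with invented dictionary keys and toy tensors, not the actual RoBERTa parameterization: for fine-tuning, ∆ stores a dense difference over all pre-trained tensors; for adapter tuning, ∆ only adds new adapter tensors.

```python
import numpy as np

def assemble(theta_0, delta, mode):
    """Return the deployable weights theta = theta_0 (+) delta."""
    theta = dict(theta_0)
    if mode == "fine-tuning":
        # delta stores (theta - theta_0) for every pre-trained tensor
        for name, diff in delta.items():
            theta[name] = theta_0[name] + diff
    elif mode == "adapter":
        # pre-trained weights stay frozen; adapter tensors are inserted alongside
        theta.update(delta)
    return theta

theta_0 = {"layer.0.w": np.ones(4)}                 # toy pre-trained weights
ft_delta = {"layer.0.w": 0.5 * np.ones(4)}          # fine-tuning diff
ap_delta = {"layer.0.adapter.w": np.zeros(2)}       # inserted adapter weights

print(assemble(theta_0, ft_delta, "fine-tuning")["layer.0.w"])  # [1.5 1.5 1.5 1.5]
print(sorted(assemble(theta_0, ap_delta, "adapter").keys()))
```

The same operator also describes the pilot study in § 4.1, where an outdated ∆ is plugged into a newer backbone θ.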

Empirical Analysis
We first investigate the compatibility of the outdated weights and the upgraded PLM (§ 4.1), then we explore the connections between continually pre-trained PLMs from two aspects: (1) linear mode connectivity (§ 4.2), which reflects parametric connections, and (2) functional similarity (§ 4.3), which reflects representational connections. The implementation details are provided in appendix D.2.

Model Compatibility Analysis
We explore to what extent the outdated weights are compatible with the upgraded PLM and how this compatibility changes during continual pre-training. Specifically, we directly apply the outdated weights to the upgraded PLM and record the performance variation during continual pre-training.
Settings. We first investigate the process when upgrading M_0 to M_1 on the BIO domain (D_1). For downstream evaluation, we choose two classification tasks: CHEMPROT (Kringelum et al., 2016), a downstream task relevant to the BIO domain, and MNLI (Williams et al., 2018). Denote the model continually pre-trained on D_1 for t steps as M_1(t), its pre-trained weights as θ_1^0(t), and the adapted weights of M_0 for the downstream task as ∆_0^T. We directly apply ∆_0^T to the upgraded PLM M_1(t), i.e., θ_1^0(t) ⊕ ∆_0^T, and evaluate the performance on the test set of the downstream task. In experiments, t is selected from 1.25k to 12.5k with an interval of 1.25k. We also report M_1(t)'s zero-shot inference performance by testing θ_1^0(t).

Results. From the results in Figure 2 (a, b), we observe that for both adapter tuning and fine-tuning: (1) with t increasing, the performance of θ_1^0(t) ⊕ ∆_0^T drops quickly at first. This means that ∆_0^T becomes outdated shortly after the backbone model M_1(t) changes. (2) After sufficient pre-training steps, the performance converges to a plateau that is still much higher than the zero-shot inference performance of M_1(t). This implies that continually pre-trained PLMs are intrinsically connected with their "ancestors"; otherwise, the ancestor's adapted weights ∆_0^T would not improve the performance of its offspring M_1(t).
Extension to Multiple Domains. Next, we extend the above experiments to the 4 sequentially released PLMs mentioned in § 3 by directly applying ∆_0^T to {M_1, ..., M_4}. We derive from Figure 2 (c, d) that: (1) applying outdated weights consistently performs better than zero-shot inference, even after the backbone PLM has been trained over multiple domains; (2) the performance of M_4 is the best among {M_1, ..., M_4}, although M_4 is trained for the longest time. This may be because the NS domain (D_4) is the most similar to M_0's pre-training data (Gururangan et al., 2020), and continual pre-training on a domain similar to the original PLM's data mitigates the incompatibility.

Linear Mode Connectivity Analysis
Backgrounds. Linear mode connectivity measures whether two sets of model weights can be connected via a linear parametric path along which the performance (loss) of the downstream task remains high (low) (Frankle et al., 2020). In other words, it tests whether linear interpolations of two sets of model weights perform comparably to both endpoints. If this property holds, then both sets of weights probably lie in the same loss basin, which indicates a close connection between them in the parameter space (Qin et al., 2022b). For more detailed background, please refer to appendix A.2.
Settings. Following most of the settings in § 4.1, we adapt both M_0 and M_1(t) to the task CHEMPROT and obtain the weights θ_0^T and θ_1^T(t), where θ_0^T = θ_0^0 ⊕ ∆_0^T and θ_1^T(t) = θ_1^0(t) ⊕ ∆_1^T(t). Then we linearly interpolate θ_0^T and θ_1^T(t) as follows:

θ(µ) = (1 − µ) · θ_0^T + µ · θ_1^T(t), where µ ∈ (0, 1).

In experiments, we evaluate the performance of 25 evenly distributed interpolations and the two endpoints (i.e., µ = 0 and µ = 1). If there does not exist a significant performance drop along the linear path, we deem both endpoints linearly mode connected. We choose M_1(t) that is continually pre-trained for {2.5, 5.0, 7.5, 10.0, 12.5}k steps and evaluate mode connectivity between each M_1(t) and M_0. In addition, we pre-train a new RoBERTa_BASE (dubbed M_IND) from scratch (details in appendix D.1) and test its connectivity with M_0, i.e., θ(µ) = (1 − µ) · θ_0^T + µ · θ_IND^T. In this way, we can compare the difference between continually pre-trained models (M_0 and M_1(t)) and independently pre-trained models (M_0 and M_IND).
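The interpolation sweep described above could be sketched as follows. This is a minimal numpy illustration: the toy weight dictionaries stand in for real adapted checkpoints, and the per-checkpoint task evaluation is omitted.

```python
import numpy as np

def interpolate(theta_a, theta_b, mu):
    """theta(mu) = (1 - mu) * theta_a + mu * theta_b, applied tensor-wise."""
    return {k: (1.0 - mu) * theta_a[k] + mu * theta_b[k] for k in theta_a}

theta_a = {"w": np.zeros(3)}  # stand-in for theta_0^T
theta_b = {"w": np.ones(3)}   # stand-in for theta_1^T(t)

# 25 evenly distributed interior points plus the two endpoints mu=0 and mu=1
mus = np.linspace(0.0, 1.0, 27)
path = [interpolate(theta_a, theta_b, mu) for mu in mus]
# each checkpoint on the path would then be evaluated on the downstream task
print(len(path))  # 27
```

If no point on the path shows a significant performance drop relative to the endpoints, the two checkpoints are deemed linearly mode connected.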
Results. We illustrate the performance of the interpolations and the two endpoints in Figure 3, from which we conclude that: (1) for continually pre-trained PLMs, although there exists a small performance drop at the midpoint, the interpolations generally achieve performance comparable to the endpoints; (2) the connectivity does not vary much as t increases, which means that within a reasonable range, the connectivity is not sensitive to longer pre-training; (3) for independently trained PLMs, in contrast, the performance drops significantly in the middle, which means the adapted weights of these PLMs cannot be linked by a high-performance linear path; (4) the above conclusions hold for both adapter tuning and fine-tuning.
The above findings imply that when learning the same task, two continually pre-trained PLMs would probably be optimized into two minima lying in the same loss basin, or at least the optimal regions corresponding to both minima have a substantial intersection; otherwise, there should exist a significant performance drop in between.
Intuitively, the existence of a high-performance (low-loss) path between two optimal regions implies that model weights can be easily optimized from one optimal region to another without incurring a loss barrier. In this regard, it is promising to use outdated adapted weights as the initialization when finding the optimal solution for the upgraded PLM, which we explore in § 5.1. In this way, we explicitly facilitate cross-model knowledge transfer through the parameter space.
Extension to Multiple Domains. Next, we evaluate linear mode connectivity between the initial M_0 and M_i (1 ≤ i ≤ 4) on the task CHEMPROT. We derive from the results in Figure 4 that although the performance tends to drop slightly near the midpoint, the connectivity of all continually pre-trained models is still far better than that of independent PLMs (i.e., M_IND in Figure 3). We also observe that the performance drop between M_0 and M_2 is larger than that between M_0 and M_4, although M_4 is trained for a longer time than M_2. This means longer pre-training does not necessarily result in poorer connectivity; rather, the pre-training domain has a great impact.

Functional Similarity Analysis
The close parametric connection revealed by linear mode connectivity does not guarantee that continually pre-trained PLMs share similar functionalities when processing text. Following Gong et al. (2019), we explore functional similarity through the lens of attention distribution. Specifically, we investigate three continually pre-trained models (M_0, M_1, and M_2) and fine-tune them on CHEMPROT to obtain adapted models (θ_0^T, θ_1^T, and θ_2^T). We feed the same input sampled from CHEMPROT to the three adapted models. Then we select attention heads from the same position (i.e., the h-th head in the l-th layer) in the three models and visualize their attention distributions. Note that the selected head of M_{i+1} is trained from that of M_i.

Figure 5: Visualization of attention heads in fine-tuned M_0, M_1, and M_2 given the same input. For instance, "L4 H10" refers to the 10-th head in the 4-th layer. An attention head of M_{i+1} is trained from that of M_i in the same column. In the heatmap, the color of the i-th element in the j-th row indicates the attention value from the j-th token to the i-th token. For more visualizations (including M_IND), please refer to appendix C.2.
From Figure 5, we find that the attention patterns of M_1 and M_2 are quite similar to those of their "ancestor" M_0. Such representational proximity indicates that the corresponding modules of continually pre-trained PLMs own similar functionalities. Since adapted weights play a pivotal role in stimulating a PLM's abilities and functionalities (Ding et al., 2022), this functional similarity partially explains why the outdated adapted weights can be directly applied to the upgraded PLM and achieve non-trivial performance in § 4.1.
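While the paper compares attention patterns visually, one hypothetical way to quantify the similarity of two attention rows from corresponding heads is the Jensen-Shannon divergence; the toy attention vectors below are invented for illustration.

```python
import numpy as np

def js_divergence(p, q, eps=1e-12):
    """Jensen-Shannon divergence between two attention distributions."""
    p, q = p + eps, q + eps
    p, q = p / p.sum(), q / q.sum()
    m = 0.5 * (p + q)
    kl = lambda a, b: float(np.sum(a * np.log(a / b)))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# toy attention rows for the same head (layer l, head h) over 3 tokens
attn_m0 = np.array([0.70, 0.20, 0.10])
attn_m1 = np.array([0.65, 0.25, 0.10])   # continually pre-trained: similar
attn_ind = np.array([0.10, 0.10, 0.80])  # independently trained: different

# a head inherited by continual pre-training stays closer to its ancestor
assert js_divergence(attn_m0, attn_m1) < js_divergence(attn_m0, attn_ind)
```

A lower divergence for M_0 vs. M_1 than for M_0 vs. M_IND would mirror the visual observation in Figure 5.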
In a nutshell, all the analyses in this section validate the close connection between continually pre-trained PLMs. Intuitively, such a connection implies that the adaptation processes of these PLMs towards downstream tasks should be closely related and transferable as well, which provides a strong basis for our recyclable tuning.

Methods and Experiments
Based on the findings in § 4, we propose two methods to explore the practical benefits of recyclable tuning: an initialization-based method (§ 5.1) and a distillation-based method (§ 5.2). The training details for this section are discussed in appendix D.3.

Initialization-based Recyclable Tuning
We first investigate directly using outdated weights as the initialization for tuning the upgraded PLM.
Framework. Without loss of generality, we experiment with the case where the initial PLM M_0 is continually pre-trained on the BIO domain (D_1) and upgraded to M_1. Before the release of the new PLM M_1, assume we have tuned M_0 on N tasks {T_0, ..., T_N} and obtained the corresponding adapted weights {∆_0^{T_0}, ..., ∆_0^{T_N}}. When tuning M_1 on a target task T_t, instead of using random initialization for the tunable weights, we initialize them using M_0's adapted weights ∆_0^{T_s} trained on a source task T_s. In practice, the outdated weights of exactly the same task may not be available, i.e., T_t ≠ T_s. Thus we explore whether initialization from the outdated weights of a different task suffices for our goal. Specifically, we consider three types of source tasks: (1) T_same, which is the same task as the target one; (2) T_sim, which denotes a task similar to T_t (both T_sim and T_t typically belong to the same task type); and (3) T_diff, which belongs to a different task category from T_t.
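A minimal sketch of this initialization strategy follows; the tensor names, shapes, and the Gaussian fallback initializer are illustrative assumptions rather than the exact scheme used in the paper.

```python
import numpy as np

def init_tunable(outdated_delta=None, shapes=None, rng=None):
    """Initialize tunable (e.g., adapter) weights for the upgraded PLM.

    If outdated adapted weights from a source task are available, reuse
    them as a warm start; otherwise fall back to random initialization.
    """
    if outdated_delta is not None:
        return {k: v.copy() for k, v in outdated_delta.items()}
    rng = rng or np.random.default_rng(0)
    return {k: 0.02 * rng.standard_normal(s) for k, s in shapes.items()}

# outdated weights of a source task T_s, tuned on the original PLM M_0
outdated = {"adapter.down.w": np.full((2, 2), 0.3)}

warm = init_tunable(outdated_delta=outdated)                  # our method
cold = init_tunable(shapes={"adapter.down.w": (2, 2)})        # random baseline
print(np.allclose(warm["adapter.down.w"], 0.3))  # True: warm start from T_s
```

After initialization, the tunable weights are trained on the target task T_t as usual; only the starting point differs from the random baseline.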
We compare the proposed initialization strategies with random initialization and record (1) the test performance variation (w.r.t. training steps) during the early stage of downstream adaptation (Figure 6), and (2) the best test performance after the adaptation converges (Table 1). For adaptation, we mainly investigate adapter tuning and leave the fine-tuning experiments to appendix C.3.
Results. The observations and corresponding conclusions are summarized as follows: (1) Faster convergence: we observe from Figure 6 that, compared with the random initialization baseline, our method significantly accelerates the convergence of downstream adaptation. This suggests that the outdated weights provide a more effective initialization, allowing the PLM to be more easily optimized to the desired local optima. In practice, this method could improve the training efficiency of tuning the upgraded PLM, saving the computation needed for adaptation.
(2) Improved task performance: we also conclude from Table 1 that, after sufficient training, initialization from the outdated weights of each type of source task (even T_diff) could improve the final performance (up to +1.9 average improvement). This demonstrates that initialization serves as a valid way for cross-model knowledge transfer.
(3) Similar source tasks benefit more: comparing the results of initialization from different source tasks, we find that the improvements in both convergence and performance can generally be ranked as T_same > T_sim > T_diff. This is because the knowledge required by more similar tasks has greater overlap; thus knowledge transfer benefits more when the target and source tasks are more similar. In practice, this finding expands the selection scope of source adapted weights, broadening the application scenarios of our initialization-based method.

Distillation-based Recyclable Tuning
According to Lin et al. (2021), model outputs often contain supervision that is complementary to the knowledge stored in parameters. Therefore, besides the initialization-based method, we also explore knowledge distillation (Hinton et al., 2015) to recycle the outdated weights.
Framework. Given a task T_j, assume we have optimized an outdated PLM M_i and obtained its adapted weights ∆_i^{T_j}. Our goal is to distill the knowledge stored in ∆_i^{T_j} to optimize an upgraded PLM M_{i+1}. We follow Sun et al. (2019) to construct our framework. For each data point x from T_j, denote P(x, θ_i^{T_j}) as the probability distribution the adapted M_i assigns over the label space, where θ_i^{T_j} = θ_i^0 ⊕ ∆_i^{T_j}. We minimize the KL divergence between the probabilities predicted by M_i and M_{i+1}. In addition, M_{i+1} mimics M_i's intermediate hidden representations of each layer. Specifically, given the same input x, denote h_k(x, θ) as the normalized hidden states of the k-th layer of M_i and M_{i+1}; we minimize the mean-square loss of hidden states together with the KL divergence as follows:

L_KD = KL(P(x, θ_i^{T_j}) || P(x, θ_{i+1}^{T_j})) + α · Σ_k || h_k(x, θ_i^{T_j}) − h_k(x, θ_{i+1}^{T_j}) ||^2,

where α denotes a hyper-parameter. During optimization, only ∆_{i+1}^{T_j} is tunable.
Table 2: Results of distillation-based recyclable tuning under setting (a) and setting (b). M_i → M_{i+1} denotes distilling the knowledge of M_i to M_{i+1} for task T_i. Here AP / FT refers to adapter tuning / fine-tuning.
Besides L_KD, we also introduce the original task loss L_{T_j}, which is calculated using supervised training examples from task T_j, weighted by another hyper-parameter β:

L_final = L_KD + β · L_{T_j}.    (3)

Settings. We consider the sequentially released PLMs {M_0, ..., M_4} as mentioned in § 3. Following Gururangan et al. (2020), we choose three tasks, T_1: CHEMPROT, T_2: IMDB (Maas et al., 2011), and T_3: ACL-ARC (Jurgens et al., 2018), which are relevant to domains D_1, D_2, and D_3, respectively. We mainly consider recyclable tuning between adjacent PLMs, i.e., M_i and M_{i+1}, and also evaluate non-adjacent PLMs (e.g., M_i and M_{i+2}) in appendix C.7. For each task T_i (i ∈ {1, 2, 3}), we consider two settings: (a) First, we recycle M_i's outdated weights to M_{i+1}, which is denoted as M_i → M_{i+1}. Here the evaluated task T_i is relevant to the pre-training domain D_i of the original PLM M_i. During continual pre-training on D_{i+1}, M_{i+1} suffers from catastrophic forgetting of D_i. Hence M_{i+1} should perform worse on T_i than M_i. In experiments, both PLMs are adapted using the same 32-shot dataset.
(b) Second, we evaluate recyclable tuning from M_{i−1} to M_i, which is denoted as M_{i−1} → M_i. Different from setting (a), here the evaluated task T_i is relevant to the pre-training domain D_i of the newly released PLM M_i. M_i performs better than M_{i−1} since M_i has acquired more knowledge related to T_i when learning D_i. In light of this, we explore whether M_i can achieve better performance than M_{i−1} even when trained with fewer supervised examples. Specifically, the data size for M_{i−1} is set to {32, 256, 32}-shot for {T_1, T_2, T_3}, and the data size for M_i is set to {16, 32, 16}-shot, respectively. We also evaluate our method under the zero-shot setting in appendix C.4.
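The distillation objective (L_KD plus the β-weighted task loss) could be sketched as follows. The toy probability vectors and hidden states stand in for real model outputs, and the reduction details (e.g., averaging over layers and tokens) are assumptions made for illustration.

```python
import numpy as np

def kl(p, q, eps=1e-12):
    """KL divergence between two discrete probability distributions."""
    p, q = p + eps, q + eps
    return float(np.sum(p * np.log(p / q)))

def distill_loss(p_teacher, p_student, h_teacher, h_student,
                 task_loss, alpha=1.0, beta=1.0):
    """L_final = KL(P_teacher || P_student)
               + alpha * sum_k MSE(h_k_teacher, h_k_student)
               + beta  * L_task  (hyper-parameter values are assumptions)."""
    l_kd = kl(p_teacher, p_student)
    l_hidden = sum(float(np.mean((ht - hs) ** 2))
                   for ht, hs in zip(h_teacher, h_student))
    return l_kd + alpha * l_hidden + beta * task_loss

# toy outputs: teacher = adapted M_i (frozen), student = M_{i+1} (tunable)
p_t = np.array([0.8, 0.2]); p_s = np.array([0.6, 0.4])
h_t = [np.zeros(4)]; h_s = [0.1 * np.ones(4)]
loss = distill_loss(p_t, p_s, h_t, h_s, task_loss=0.5)
print(loss > 0)  # True
```

In training, this scalar would be backpropagated through the student only, matching the constraint that only ∆_{i+1}^{T_j} is tunable.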
We compare our method against using only the task loss (L_final − L_KD) to validate the benefits of knowledge distillation. Further, we explore combining distillation-based and initialization-based recyclable tuning (L_final + Init.), implemented by first using the outdated weights as the initialization and then tuning with L_final. We also report the teacher performance (Teacher) as a reference.
Results. It can be concluded from Table 2 that: (1) compared with optimizing only the task loss (L_final − L_KD), distilling knowledge from the outdated weights (L_final) significantly improves the performance, which shows that knowledge distillation is an effective way to achieve recyclable tuning.
(2) In general, L_final + Init. leads to better performance than L_final. This finding reveals that the distillation-based and initialization-based methods are complementary and can be combined to fully exploit the knowledge in outdated weights. (3) In Table 2 setting (a), M_{i+1} performs worse than M_i on task T_i because M_{i+1} forgets some knowledge of domain D_i when learning D_{i+1}. However, such forgetting can be mitigated by designing better continual pre-training algorithms (Qin et al., 2022c). (4) In Table 2 setting (b), M_i outperforms M_{i−1} despite being trained with fewer examples. This shows that the knowledge newly acquired on domain D_i conduces to M_i's performance on D_i's relevant task T_i and improves data efficiency. We further discuss the differences between the distillation-based and initialization-based methods in appendix F.

Discussion
Training-free Weight Recycling. Both methods proposed in § 5 necessitate tuning the upgraded PLM. Such a process often relies on abundant computational resources and may be infeasible in practice. Given the close connections among continually pre-trained PLMs, we contend that weight recycling can be realized without training. As a preliminary exploration, we show in appendix B that it is possible to learn a cross-task generalizable projection to directly upgrade the outdated weights and make them compatible with the new PLM. Upgrading outdated weights with such a projection requires far fewer computations (< 0.002‰) and still achieves satisfactory performance.
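A toy illustration of such projection-based, training-free recycling follows, assuming flattened adapted weights and a linear map fit by least squares; the actual projection in appendix B may differ in form and scale.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins: flattened outdated adapted weights (one row per "source"
# task) and their re-tuned counterparts for the upgraded PLM.
W_old = rng.standard_normal((8, 5))            # 8 tasks, 5-dim weight vectors
P_true = np.eye(5) + 0.1 * rng.standard_normal((5, 5))
W_new = W_old @ P_true                         # pretend these were re-tuned

# Fit a cross-task projection P with least squares: W_old @ P ~= W_new
P, *_ = np.linalg.lstsq(W_old, W_new, rcond=None)

# Apply P to an unseen task's outdated weights: far cheaper than re-tuning
w_unseen_old = rng.standard_normal(5)
w_upgraded = w_unseen_old @ P
print(np.allclose(w_upgraded, w_unseen_old @ P_true, atol=1e-6))  # True
```

The key property is cross-task generalization: P is fit once on a handful of task pairs and then applied to any new outdated weights without gradient updates.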

Downstream-compatible Continual Pre-training. From another angle, recyclable tuning addresses the incompatibility between outdated adapted weights and the upgraded PLM from the customer perspective, analogous to the concept of forward compatibility in software engineering. In fact, the responsibility for maintaining compatibility can also be shifted to upstream suppliers during PLM upgrading (i.e., backward compatibility). Potential solutions include adding regularization terms during continual pre-training to maintain compatibility with existing adapted weights. In this way, the incompatibility problem is solved once and for all, which is more customer-friendly. However, modifying pre-training objectives may come at the cost of reduced model performance.
Broader Application Scenarios. Although we primarily focus on recyclable tuning for one specific scenario (i.e., continual pre-training), PLMs may be subject to various types of evolution in practice, for instance, the expansion of model size (e.g., from T5_BASE (Raffel et al., 2020) to T5_LARGE), the upgrading of model architecture (Chen et al., 2022; Lee-Thorp et al., 2022), and the alteration of the optimization objective (e.g., from T5 to T0 (Sanh et al., 2021) and UNIFIEDQA (Khashabi et al., 2020)). Once the backbone infrastructure is upgraded, massive adapted weights would become outdated and potentially wasted. Hence we believe recyclable tuning has broader application scenarios, and we hope our findings and solutions inspire more future research in this area.

Conclusion
In this paper, we formulate the task of recyclable tuning for continual pre-training. We conduct empirical analyses for this task through the lens of model compatibility, linear mode connectivity, and functional similarity. Inspired by the corresponding findings, we explore the practical benefits of recyclable tuning through parameter initialization and knowledge distillation. We also envision our setup serving as a testbed for other topics, e.g., cross-model knowledge transfer and continual learning.

Acknowledgments

This work is supported by Grant No. 2022ZD0116312, Institute Guo Qiang at Tsinghua University, and the Beijing Academy of Artificial Intelligence (BAAI).
Yujia Qin and Cheng Qian designed the methods. Yujia Qin wrote the paper. Cheng Qian conducted the experiments. Yankai Lin, Huadong Wang, Zhiyuan Liu, Maosong Sun, and Jie Zhou advised the project. All authors participated in the discussion.

Limitations
We only experiment with two kinds of PLMs (RoBERTa BASE and RoBERTa LARGE (appendix C.3 and appendix C.8)), leaving more diverse kinds of PLMs unexplored.While this allows us to demonstrate the effectiveness of our approach on these specific PLMs, it is important for future work to extend our problem setup to a wider range of PLMs in order to fully understand the generalizability of our findings.

Ethical Statement
In this research, we consider the following ethical issues:

• Privacy. Outdated adapted weights may contain information about the data and tasks they were trained on. Thus it is important to consider the potential privacy implications when recycling these weights. Efforts should be taken to ensure that personal or sensitive information is not disclosed during weight recycling.

• Fairness. It is crucial to guarantee that recycling adapted weights does not introduce biases or unfairly advantage certain tasks or domains. Thorough analysis and testing are needed to ensure that recyclable tuning does not perpetuate or amplify existing inequalities.

Appendices A Additional Backgrounds
A.1 Parameter-efficient Tuning

Conventional downstream adaptation of PLMs involves optimizing all parameters (i.e., fine-tuning), which may place a heavy burden on computational infrastructure and storage. To efficiently utilize the knowledge contained in PLMs, parameter-efficient tuning (PET) is proposed, which optimizes only a few parameters and freezes the majority (Houlsby et al., 2019). Despite extensively reducing the number of tunable parameters, PET achieves performance comparable to fine-tuning. Besides, due to its lightweight nature, adapted weights produced by PET are easier to train, store, and share among consumers. Thus we deem PET an essential component of our problem setup. Without loss of generality, we consider a representative PET algorithm, i.e., adapter tuning (Houlsby et al., 2019), in this paper.
Adapter inserts tunable modules into both the feedforward module and multi-head attention module of each Transformer (Vaswani et al., 2017) layer.
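As a concrete illustration, the adapter computation can be sketched as a bottleneck module with a residual connection. The sizes below (hidden size 768, bottleneck 64) and the zero initialization of the up-projection are illustrative assumptions, not the paper's exact configuration:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

class Adapter:
    """Bottleneck adapter sketch: down-project, nonlinearity, up-project,
    then add the residual. Dimensions here are illustrative choices."""
    def __init__(self, d=768, r=64, seed=0):
        rng = np.random.default_rng(seed)
        self.W_down = rng.normal(scale=0.02, size=(d, r))
        self.W_up = np.zeros((r, d))  # zero init => module starts as identity

    def __call__(self, h):
        return h + relu(h @ self.W_down) @ self.W_up

h = np.random.default_rng(1).normal(size=(4, 768))  # (seq_len, hidden)
adapter = Adapter()
out = adapter(h)
assert out.shape == h.shape
assert np.allclose(out, h)  # zero-initialized up-projection acts as identity
```

Because only `W_down` and `W_up` are tuned per layer, the adapted weights for a task are a small fraction of the full PLM, which is what makes them cheap to store and share.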

A.2 Mode Connectivity
Mode connectivity measures whether two minima in the parameter space can be connected by a parametric path along which the loss stays low (i.e., performance stays high) (Garipov et al., 2018; Freeman and Bruna, 2017; Draxler et al., 2018). Such a property implies that different minima can potentially form a connected manifold in the loss landscape. For two connected minima, we can interpolate between them to obtain a series of high-performance solutions. These solutions can be ensembled to achieve performance better than that of the endpoints (Garipov et al., 2018).
Prior works on mode connectivity show that, in most cases, a non-linear low-loss path exists between different minima of a neural network; only occasionally can a linear low-loss path connect different minima. Later works further contend that it is non-trivial for two minima to be connected by a linear path (Frankle et al., 2020; Mirzadeh et al., 2020). Linearity indicates that both minima probably lie in the same loss basin (Qin et al., 2022b), which is a more favorable property and indicates a closer connection between the two minima. In view of this, we focus on analyzing linear mode connectivity in this paper.
Previous efforts mainly investigated mode connectivity for non-pre-trained models, until recently Qin et al. (2022b) explored this property for PLMs. They focus on tuning one static base model with different adaptation strategies. In contrast, we take the first step toward exploring mode connectivity across different backbone models (continually pre-trained PLMs) and reveal novel insights. Following Qin et al. (2022b), we present the results of task performance (e.g., accuracy) to evaluate mode connectivity in the main paper and also report the results of task loss in appendix E.
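Linear mode connectivity can be probed by evaluating checkpoints along the straight line between two sets of weights. The sketch below is a generic illustration; the evaluation function and the toy quadratic loss are placeholders, not the paper's actual tasks or models:

```python
import numpy as np

def interpolate(theta_a, theta_b, mu):
    """theta(mu) = (1 - mu) * theta_a + mu * theta_b, applied parameter-wise
    to two checkpoints with identical structure."""
    return {k: (1.0 - mu) * theta_a[k] + mu * theta_b[k] for k in theta_a}

def connectivity_curve(theta_a, theta_b, evaluate, n_points=5):
    """Evaluate (loss or accuracy) along the linear path; a curve without a
    barrier between the endpoints indicates linear mode connectivity."""
    mus = np.linspace(0.0, 1.0, n_points)
    return [(float(mu), evaluate(interpolate(theta_a, theta_b, mu))) for mu in mus]

# toy check with a quadratic "loss" whose minimum sits between the endpoints
theta_a = {"w": np.array([1.0, 0.0])}
theta_b = {"w": np.array([0.0, 1.0])}
loss = lambda th: float(np.sum((th["w"] - 0.5) ** 2))
curve = connectivity_curve(theta_a, theta_b, loss)
```

In the toy example the midpoint has lower loss than both endpoints, i.e., the straight path between the two minima stays inside a low-loss region.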

B Training-free Weight Recycling
Although we have shown that initialization-based recyclable tuning can accelerate convergence and improve training efficiency, tuning the upgraded PLM still requires substantial training computation. Especially considering the massive number of tasks to handle, conducting adaptation for all of them whenever the PLM is upgraded is computationally expensive.
In this section, we explore whether we can alleviate the burden of supervised training and directly upgrade the outdated weights at a small cost. A desirable algorithm should consume significantly less computation than training the new PLM from scratch. Meanwhile, it should achieve satisfactory task performance.

B.1 Framework
Inspired by Qin et al. (2021) and Yi et al. (2022), we propose a training-free weight recycling method. Specifically, we learn a cross-task generalizable projection that can directly produce upgraded adapted weights for a specific task, omitting the labor of supervised training. We contend that although massive numbers of downstream tasks exist, a large percentage of them are intrinsically similar and can be categorized into the same task type (e.g., sentiment analysis, question answering, etc.). Intuitively, the upgrading of a certain task T 1 should provide referential experience for that of a similar task T 2 . In view of this, we propose to make the upgrading process of T 1 recyclable so that the upgrading of T 2 can be achieved efficiently.
For two sequentially released PLMs M i and M i+1 , assume we have the adapted weights of M i for both T 1 and T 2 . We aim to recycle these adapted weights for tuning M i+1 on both tasks. As illustrated in Figure 7, our framework consists of two stages: (1) projection learning and (2) projection transferring. We learn an upgrading projection using task T 1 in the first stage, and then apply (transfer) the learned projection to task T 2 in the second stage. Note that the first stage requires training while the second stage is training-free. Next, we introduce the details of the two stages.
Projection Learning. Instead of directly optimizing the parameters in ∆ T 1 i+1 , we learn a low-rank projection Proj such that ∆ T 1 i+1 = Proj(∆ T 1 i ), where ∆ T 1 i is kept frozen and only the parameters in Proj are tuned. Note that the dimensions of ∆ T 1 i and ∆ T 1 i+1 are the same. The output Proj(∆ T 1 i ) is then applied to the upgraded PLM M i+1 to compute the loss L.
Projection Transferring. When upgrading the outdated weights of a similar task T 2 , we directly apply the projection Proj * i→i+1 learned on T 1 to ∆ T 2 i and obtain the approximated upgraded weights ∆ T 2 * i+1 = Proj * i→i+1 (∆ T 2 i ). We formulate downstream tuning as prompt learning (Schick and Schütze, 2021), instead of introducing additional classification heads for different tasks. Hence the number of parameters in ∆ T 1 i and ∆ T 2 i is the same, i.e., |∆ T 1 i | = |∆ T 2 i |. Note that merely applying the projection to compute the upgraded weights consumes very limited computation (see Figure 8); hence we significantly reduce the computation of learning ∆ T 2 i+1 compared with the conventional tuning-based method.
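To make the two stages concrete, the sketch below instantiates Proj as a simple low-rank linear map over the flattened adapted weights. The exact parameterization of Proj in the paper is not reproduced here; the shapes, rank `r`, and the matrices `A` and `B` are hypothetical:

```python
import numpy as np

def proj(delta, A, B):
    """Hypothetical low-rank projection: maps outdated adapted weights `delta`
    (flattened to a matrix) to upgraded weights of the same shape.
    Only A and B are trained; delta stays frozen."""
    return A @ (B @ delta)

rng = np.random.default_rng(0)
m, n, r = 32, 16, 4                      # illustrative sizes, r << m
delta_t1_old = rng.normal(size=(m, n))   # outdated weights for task T1
A = rng.normal(scale=0.1, size=(m, r))   # trainable
B = rng.normal(scale=0.1, size=(r, m))   # trainable

# Stage 1 (projection learning) would tune A, B so that applying
# proj(delta_t1_old, A, B) to the upgraded PLM minimizes the task loss.
# Stage 2 (projection transferring) reuses the learned A, B on a similar task:
delta_t2_old = rng.normal(size=(m, n))
delta_t2_new = proj(delta_t2_old, A, B)  # training-free upgrade
assert delta_t2_new.shape == delta_t2_old.shape
```

Because stage 2 is a single matrix product rather than a training run, upgrading the weights of a similar task costs only one forward application of the projection.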
Besides, since the projection Proj comprises an integral multiple (d×) of ∆ T 1 i 's parameters, our solution is only feasible for parameter-efficient tuning. For fine-tuning, training the projection is computationally intractable due to the tremendous number of parameters in ∆ T 1 i and Proj. Being the first attempt in this research direction, we leave corresponding explorations for fine-tuning as future work.
We consider the zero-shot setting for the first stage (projection learning) and use the knowledge distillation loss function L KD . Here the teacher model weights are the adapted M 1 , and the student model weights are obtained by applying Proj(∆ T 1 1 ) to the pre-trained weights of M 2 . For the unlabeled corpus used for distillation, we evaluate both the target task data (denoted as L T KD ) and Wikipedia corpora (L wiki KD ). Note that for the former, we only use the input x and discard the corresponding label y (i.e., the zero-shot setting). The former can be seen as an upper bound for the latter, since the data format of the latter may not be compatible with the target task. After that, we directly utilize the learned projection to upgrade the outdated weights of similar target tasks.

Table 3: Performance evaluation for our training-free upgrading algorithm. We distill the knowledge of outdated weights using both unlabeled task data (L T KD ) and Wikipedia (L wiki KD ). We also report the performance of directly tuning the upgraded PLM using the full dataset (FD) or 32-shot data (FS), and the demonstration learning baseline (Demo.).
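A minimal sketch of the zero-shot distillation objective: the student is pulled toward the teacher's predictive distribution on unlabeled inputs via a temperature-scaled KL divergence. The temperature value and logit shapes below are illustrative, not the paper's grid-searched settings:

```python
import numpy as np

def softmax(z, T=1.0):
    z = z / T
    z = z - z.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def kd_loss(teacher_logits, student_logits, T=10.0):
    """L_KD = KL(P_teacher || P_student), averaged over unlabeled inputs x;
    the temperature T softens both distributions."""
    p = softmax(teacher_logits, T)
    q = softmax(student_logits, T)
    return float(np.mean(np.sum(p * (np.log(p) - np.log(q)), axis=-1)))

rng = np.random.default_rng(0)
t = rng.normal(size=(8, 3))  # teacher logits on a batch of unlabeled data
assert kd_loss(t, t) < 1e-12          # identical predictions => zero divergence
assert kd_loss(t, rng.normal(size=(8, 3))) > 0.0
```

No labels y appear anywhere in the objective, which is what makes the projection-learning stage compatible with the zero-shot setting.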
Baselines. We consider demonstration learning (Brown et al., 2020) as the baseline, which integrates a few labeled examples into the input text as additional context. The PLM directly performs inference on the test set without any training. For reference, we also report the performance when M 2 is adapted using the full dataset (FD) and the 32-shot dataset (FS). In contrast, our method requires no labeled data.
Efficiency Evaluation. We compare the computational costs of our training-free method and the conventional tuning-based method in Figure 8. For the former, we record the time needed for projection transferring (i.e., computing the upgraded weights ∆ T 2 * i+1 ). For the latter, we record the training time needed until adaptation converges. Our method requires significantly less computation, which demonstrates its efficiency. In practice, such a projection can be trained once and for all: once the projection is obtained, we can directly upgrade potentially massive numbers of outdated weights in an efficient manner, and the computation spent on projection learning can be neglected. Although we currently only support projection transferring to a similar target task that belongs to the same category as the source task, we expect future work to explore how to train a universal projection that could be applied to an arbitrary task.
Performance Evaluation. The results are shown in Table 3, from which we find that: (1) our method generally outperforms the demonstration baseline and can surpass the supervised performance (FD and FS) in certain cases, despite not using any labeled data. Hence, besides being computationally efficient, our method achieves satisfactory performance in general. This also validates our intuition that for continually pre-trained PLMs, the upgrading of a specific task can provide referential experience for similar tasks; (2) using the task data (L T KD ) for distillation generally performs better than using Wikipedia (L wiki KD ), showing the importance of a proper data distribution for knowledge distillation.
smaller than that brought by continual pre-training (||θ 0 1 (t ′ ) − θ 0 1 (t)||). This is because downstream adaptation converges quickly; after convergence, the model parameters generally stay in a specific optimal region, while continual pre-training constantly pushes the model weights away from the previous checkpoints in the parameter space. Another reason is that continual pre-training uses a large batch size (2,048), while downstream adaptation often uses a much smaller batch size (e.g., 16).

C.2 More Visualization for Functional Similarity Analysis
In the main paper (Figure 5), we visualize three different attention heads of M 0 , M 1 , and M 2 .
In this section, we present more visualizations to further support our claim. We also visualize the attention pattern of an independently trained PLM M IND . The results in Figure 9 again demonstrate that continually pre-trained PLMs exhibit similar attention patterns, which independently trained PLMs do not share.

C.3 Initialization-based Recyclable Tuning for Fine-tuning and RoBERTa LARGE
In § 5.1, we mainly evaluate initialization-based recyclable tuning using RoBERTa BASE and adapter tuning. Here we extend the experiments to either fine-tuning (Table 4) or RoBERTa LARGE (Table 5). We choose 3 tasks from Table 1 and follow most of the settings. From Table 4 and Table 5, we find that the main conclusions are generally consistent with those in the main paper. This implies that the initialization-based method can be applied to different tuning methods and PLMs.

C.4 Distillation-based Recyclable Tuning under the Zero-shot Setting
We extend our distillation-based recyclable tuning to the zero-shot setting, where there is no labeled data for tuning the upgraded PLM. We show that it is able to utilize unlabeled raw corpora to distill the knowledge of outdated weights. Specifically, we remove the task loss L T in L final and only retain L KD . Instead of using supervised examples, we sample unlabeled data x from Wikipedia to compute L KD . We evaluate recyclable tuning between M 1 and M 2 and choose 4 downstream tasks, i.e., CHEMPROT, IMDB, SST-2, and MNLI. For each task, the outdated weights of M 1 are obtained with the full dataset, and our goal is to distill their knowledge and optimize M 2 's weights.
Two training-free baselines are considered: (1) manual prompting (Schick and Schütze, 2021), which restructures the input into templates by inserting prompts, and (2) demonstration learning, which has been introduced in appendix B.2. For both baselines, the PLM directly performs inference on the test set without any training. Moreover, we also evaluate the performance when knowledge distillation is combined with the initialization-based method.
We list the results in Table 6, from which it can be derived that: (1) our method surpasses manual prompting and demonstration learning by a large margin, which shows the benefits of recycling outdated adapted weights in the zero-shot setting; (2) initializing tunable weights with the outdated weights can further improve the performance of L KD , which again demonstrates that both methods can be combined to achieve better performance.

C.5 Interpolation Distillation
Traditional knowledge distillation frameworks make no assumptions about a parametric connection between the teacher and the student, and resort to pulling their predictions (P) or inner representations (h) closer. As we have shown in the main paper, continually pre-trained PLMs have close parametric connections. Therefore, traditional knowledge distillation methods may fail to exploit the parametric knowledge contained in the teacher model's parameters. Here we explore another way to achieve more effective distillation-based recyclable tuning under our setting.
Framework. Inspired by MC-SGD (Mirzadeh et al., 2020), we propose an interpolation distillation technique to fully exploit the parametric knowledge contained in outdated adapted weights. Specifically, for recyclable tuning between M i and M i+1 , instead of optimizing the overall loss function only at the endpoint checkpoint (θ T j i+1 ) for task T j , we linearly interpolate θ T j i and θ T j i+1 to obtain θ(µ). After that, we feed data into θ(µ) and minimize the corresponding loss.

The teacher trained using the full data is much better, which shows the benefits of learning from a more advanced teacher.
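The interpolation step above can be sketched as follows; the set of interpolation coefficients and the toy quadratic loss are illustrative assumptions, not the paper's exact training recipe:

```python
import numpy as np

def interpolated_params(theta_old, theta_new, mu):
    """theta(mu) = (1 - mu) * theta_old + mu * theta_new, applied parameter-wise."""
    return {k: (1.0 - mu) * theta_old[k] + mu * theta_new[k] for k in theta_old}

def interpolation_loss(theta_old, theta_new, loss_fn, mus=(0.25, 0.5, 0.75, 1.0)):
    """Accumulate the loss at several points on the linear path between the
    outdated weights and the new weights being tuned, so that the whole path,
    not just the endpoint, is encouraged to stay low-loss."""
    return sum(loss_fn(interpolated_params(theta_old, theta_new, mu)) for mu in mus)

# toy check: quadratic loss minimized at theta_old
theta_old = {"w": np.zeros(2)}
theta_new = {"w": np.ones(2)}
quad = lambda th: float(np.sum(th["w"] ** 2))
total = interpolation_loss(theta_old, theta_new, quad)
```

Penalizing the interpolated checkpoints ties the student parametrically to the outdated weights, which is exactly the connection that prediction-only distillation ignores.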

C.8 Distillation-based Recyclable Tuning Experiments using RoBERTa LARGE
Previous experiments for distillation-based recyclable tuning were based on RoBERTa BASE ; we now turn to RoBERTa LARGE to show that our proposed methods are model-agnostic. We experiment with M 1 and M 2 using the task CHEMPROT. Other settings are kept the same as those in appendix C.7.
In Table 10, we show that the results are generally aligned with our previous conclusions. These results also reflect that our proposed method is agnostic to the specific PLM chosen.
C.9 Effects of Data Size for Distillation-based Recyclable Tuning
Taking a step further, we study the performance of our distillation-based recyclable tuning at different data scales. Specifically, we focus on T 2 (IMDB) for recycling M 1 's outdated weights to M 2 , where M 1 is adapted using the full dataset and M 2 is trained with the {8, 16, 32, 64}-shot datasets.

are set to 1 × 10 −4 and 2, respectively. We warm up the learning rate for the first 8% of total training steps. We report the average results over 3 different random seeds. As for the other hyper-parameters discussed in § 5.2, we perform a grid search for β over {0.1, 0.3} and for α(1 − β) over {0, 1, 5, 10, 50, 100}. We also conduct a grid search for the temperature in the knowledge distillation loss over {10, 20} when calculating KL(P(x, M i )||P(x, M i+1 )). We select the best-performing combination of these hyper-parameters and then report the performance. The grid search is performed for our method and all the baseline methods for a fair comparison.

E The Visualization of Loss for Linear Mode Connectivity Analysis
When conducting the mode connectivity analysis in the main paper, we mainly resort to performance as the evaluation protocol for the interpolations, following Qin et al. (2022b). In this section, we show the corresponding visualization of loss for Figure 3 and Figure 4; see Figure 12 and Figure 13. From these figures, we conclude that a significant loss barrier generally indicates the existence of a large performance drop.

F Comparison of Initialization-based and Distillation-based Recyclable Tuning
Both initialization-based and distillation-based methods serve as powerful ways to achieve recyclable tuning under the continual pre-training scenario. Each method has its own advantages: the initialization-based method brings faster convergence as well as performance improvement, while the distillation-based method also improves performance but may be less efficient.
In addition, both methods can be combined with each other to further improve performance.
In terms of practical application scenarios, the two methods differ slightly. For one thing, the initialization-based method requires that the architectures of the new PLM and the old PLM be the same. This requirement may be infeasible for broader application scenarios, such as recyclable tuning between different PLMs as discussed in § 6. For another, the initialization-based method typically requires access to the parameters of the outdated adapted weights. This can be a practical issue due to model privacy concerns. While some customers are willing to share their adapted weights on public platforms like AdapterHub (Pfeiffer et al., 2020), a majority of adapted weights are publicly unavailable. In contrast, the distillation-based method can be achieved without access to the model weights, but through receiving model inference from the owner (e.g., API-based online knowledge transfer (Krishna et al., 2019)). In this sense, the distillation-based method can protect model privacy to a certain degree.

Broader Impacts
This research has the potential for broad impact in several ways:
• First, recyclable tuning could improve the efficiency of adapting PLMs to new tasks. By recycling adapted weights from previous tasks, the need for costly retraining can be reduced, potentially making it more feasible to apply PLMs in a wider range of scenarios.
• Second, the results of this research could have implications for the sustainability of machine learning systems. Reusing adapted weights rather than discarding them can help reduce the carbon footprint and resource consumption of PLM adaptation, making it more environmentally friendly.
• Third, this research has the potential to benefit a wide range of stakeholders, including researchers, developers, and users of PLMs. Researchers can use the proposed task and benchmark to develop and evaluate new techniques for recyclable tuning, while developers can apply these techniques to improve the efficiency and sustainability of PLM-based systems. Finally, users of PLMs can benefit from the reduced costs and improved performance made possible by recyclable tuning.
Overall, this research on recyclable tuning for continual pre-training has the potential to have a wide-ranging impact on the efficiency, sustainability, and practicality of machine learning systems.

Figure 1 :
Figure 1: Task formulation. The original PLM M i is upgraded to M i+1 through continual pre-training on emerging data D i+1 . Our goal is to recycle the existing adapted weights ∆ i of M i for tuning M i+1 .

Figure 6 :
Figure 6: The performance variation in the early stage of adapter tuning for 6 target tasks from different initializations. Different tasks are evaluated with different intervals of training steps; see appendix D.3 for details.

Figure 7 :
Figure 7: Illustration of our training-free weight recycling algorithm. We learn a cross-task generalizable projection (i.e., the projection learning stage) that can directly upgrade outdated adapted weights (i.e., the projection transferring stage).

Figure 8 :
Figure 8: Time needed for upgrading outdated weights on different target tasks. Our proposed training-free weight recycling method (Proj) consumes < 0.002‰ of the computation of the conventional tuning-based method (Tune).

Figure 9 :
Figure 9: More visualizations of attention heads in fine-tuned M 0 , M 1 , and M 2 given the same input. Apart from the above PLMs, we also present the attention pattern of an independently trained PLM M IND .

Table 1 :
The best test performance on 6 target tasks with adapter tuning from different initialization.

Table 2 :
Experiments of distillation-based recyclable tuning.For each task T i , we evaluate two settings

Table 5 :
The best test performance on 3 target tasks with adapter tuning from different initialization using RoBERTa LARGE .

Table 6 :
Zero-shot experiments for distillation-based recyclable tuning between M 1 and M 2 . The outdated weights of M 1 are trained using the full dataset. Both manual prompting (Prompt) and demonstration learning (Demo.) are compared as baselines.

Table 7 :
Performance of interpolation distillation. We follow the settings in § 5.2. The results of L final -L KD , L final , and L final +Init. are borrowed from Table 2.

Table 10 :
Experiments of distillation-based recyclable tuning for RoBERTa LARGE (M 1 , M 2 ). We follow setting (a) in § 5.2. The teacher model is trained using either the 32-shot data or the full data. The student model is trained using the 32-shot data.