ColD Fusion: Collaborative Descent for Distributed Multitask Finetuning

Pretraining has been shown to scale well with compute, data size and data diversity. Multitask learning trains on a mixture of supervised datasets and produces improved performance compared to self-supervised pretraining. Until now, massively multitask learning required simultaneous access to all datasets in the mixture and heavy compute resources that are only available to well-resourced teams. In this paper, we propose ColD Fusion, a method that provides the benefits of multitask learning but leverages distributed computation and requires limited communication and no sharing of data. Consequentially, ColD Fusion can create a synergistic loop, where finetuned models can be recycled to continually improve the pretrained model they are based on. We show that ColD Fusion yields comparable benefits to multitask training by producing a model that (a) attains strong performance on all of the datasets it was multitask trained on and (b) is a better starting point for finetuning on unseen datasets. We find ColD Fusion outperforms RoBERTa and even previous multitask models. Specifically, when training and testing on 35 diverse datasets, the ColD Fusion-based model outperforms RoBERTa by 2.45 points on average without any changes to the architecture.


Introduction
Faced with a task and some data, improved performance can often be attained by finetuning a pretrained model, i.e., further training the pretrained model on the task data. Consequently, improving a pretrained model has the potential to improve every model finetuned on it. However, pretraining is often computationally expensive, so practitioners rarely seek to pretrain new and improved models. In contrast, finetuning is typically dramatically cheaper, and a given pretrained model may therefore be finetuned many times (for example, there are thousands of finetuned BERT variants on the Hugging Face Hub). Motivated by this, we study whether finetuned models can be "recycled" to create better pretrained models (cf. Raffel, 2021).
In multitask learning, a single model is trained over multiple datasets at once, to fulfill one of two goals: (a) to single-handedly perform the tasks that would otherwise require multiple dedicated models; (b) to provide a better starting point than the pretrained model. Given the availability of many finetuned models, our aim is to obtain the benefits of multitask learning by mixing multiple models rather than multiple datasets.
Towards this end, we focus on the setting of collaborative multitask learning, which involves performing multitask learning in a constrained environment: we assume that multiple contributors each finetune a model on a single dataset. The contributors do not share their datasets with one another or change the way they finetune, but they do agree to share their produced models. This setting fits preexisting finetuning pipelines typically used by practitioners.
With a method for collaborative multitask learning, we can recycle finetuned models to improve pretrained models. ColD multitask fits the common finetuning scenario, where each contributor finetunes for their own benefit. However, by requiring only the finetuned model to be shared, the finetuning step can be recast as a training step for the collective's benefit. In doing so, our method allows reusing compute and data consumed by practitioners and researchers. We call this method Collaborative Descent, or ColD for short.
Our approach of combining finetuned models not only produces a better pretrained model but also allows the model to keep evolving. Instead of pretraining or multitasking on a predefined amount of data, we suggest constantly accumulating finetuned models to improve the model continuously. Our method is hence limited only by the number of finetuned models shared by the entire community. We discuss limitations in §8.
At a high level, ColD Fusion works by fusing multiple finetuned models iteratively ( §2). In each iteration, contributors finetune the most up-to-date model (which is presumably also the most performant) on their tasks. Then, those models are fused together (Choshen et al., 2022b) by averaging their parameters to create the base model for the next iteration.
We show that this method produces a model that performs well on the finetuning tasks, despite never manipulating more than one task at a time, neither by the constituent models nor by their fusing ( §4). Moreover, we show that ColD Fusion increases the performance of the base model substantially, outperforming the pretrained model by 2.45 points on average on 35 datasets. Through additional analysis, we further show that improvements are similar on tasks seen and unseen in training ( §4.2) and that accumulating data is beneficial ( §5.1, §5.2).

Goals of Multitask Learning
Multitask learning is typically used towards one of two goals: Either to produce a single model that performs well on many tasks, or to produce a base model that performs well on new tasks after adaptation (e.g., finetuning).
Single model. To produce a single multitask model, a set of datasets D is used to train a base model M ∈ ℝ^w, i.e., to find parameters θ ∈ ℝ^w that minimize the loss over each dataset in D. This reflects the traditional objective of multitask learning: to produce a set of weights that performs well on multiple tasks (Caruana, 1997).
Base model. An alternative goal of multitask learning (and the primary goal in our work) is to produce a base model that attains strong performance after adaptation. Multitask learning does not directly optimize towards this goal, but has been found to do so indirectly (Aghajanyan et al., 2021a; Liu et al., 2022). In this setting, the out-of-the-box performance of the produced model on seen tasks is less important than the performance after finetuning over it, i.e., initializing with the found weights θ ∈ ℝ^w and then finetuning on a desired dataset d. In §4.2, we empirically show that our method works well both when d ∈ D and when d ∉ D, i.e., whether or not d was used during the multitask training.
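To make the two goals concrete, one way to formalize them (our own sketch, following the notation above, writing L_d for the loss on dataset d and FT for a finetuning procedure) is:

```latex
% Single model goal: one set of weights that minimizes the loss on every dataset.
\theta^{\mathrm{single}} = \operatorname*{arg\,min}_{\theta \in \mathbb{R}^{w}}
    \sum_{d \in D} \mathcal{L}_{d}(\theta)

% Base model goal: weights that perform well *after* finetuning on a target
% dataset d'; multitask learning optimizes this only indirectly.
\theta^{\mathrm{base}} = \operatorname*{arg\,min}_{\theta \in \mathbb{R}^{w}}
    \mathbb{E}_{d'}\!\left[\mathcal{L}_{d'}\big(\mathrm{FT}(\theta, d')\big)\right]
```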
We note that our formulation places no restrictions on the dataset group D. Thus, in a common scenario, some datasets may not share the same label space, number of examples, etc. It is also possible that some datasets are complementary samples from the same distribution. In this case, our approach amounts to training on that distribution distributively, without communicating every batch as in federated learning (Yang et al., 2019). We demonstrate that our approach also works well in this setting in §5.1.

Collaborative Constraints
In this work, we target the goals of multitask learning discussed above, but focus on a specific setting with additional constraints. In our setting, multiple contributors have access to datasets that they do not share. A central Repository can only perform minimal computation (i.e., it does not perform any training). Communication happens between the contributors and the Repository and only occurs when a given contributor finishes finetuning on their data. Our goal in setting these constraints is to restrict our focus to collaborative and distributed multitask learning.

ColD Fusion
Our proposed method, called ColD (Collaborative Descent) Fusion, is an iterative process that aims to perform multitask learning in the constrained setting listed above. Specifically, in each iteration of ColD Fusion, each contributor finetunes the current model on their dataset and communicates the resulting model back to the Repository; the Repository then fuses (Choshen et al., 2022b) all of the contributors' models and sets the result as the current model.
At first, the Repository initializes the shared model parameters θ_0 using a preexisting pretrained model. Then, at each iteration i ∈ {0, 1, 2, . . .}, each contributor c ∈ C receives a dataset d ∈ D and finetunes θ_i over it to produce parameters θ_i^c. For the purposes of our study, finetuning is any optimization process that aims to minimize the loss over the dataset d. Typically, finetuning involves minimizing the loss using a variant of gradient descent. After finetuning, each contributor sends their model's parameters θ_i^c to the Repository. Next, the Repository fuses the contributors' models by averaging all of their parameters to produce a new shared model, θ_{i+1} = (1/|C|) Σ_{c∈C} θ_i^c. Finally, the process repeats for iteration i + 1.
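To make the iteration concrete, below is a minimal PyTorch-style sketch of the update θ_{i+1} = (1/|C|) Σ_{c∈C} θ_i^c. The names (fuse, cold_fusion, finetune) are illustrative rather than the authors' implementation; finetune stands for whatever training routine each contributor already uses, and task-specific heads are assumed to be excluded from fusing or handled separately.

```python
import copy

import torch


def fuse(contributor_models):
    """Average the parameters of the contributors' models (simple fusing).

    Assumes all models share the same architecture.
    """
    fused = copy.deepcopy(contributor_models[0])
    with torch.no_grad():
        for name, param in fused.named_parameters():
            stacked = torch.stack(
                [dict(m.named_parameters())[name] for m in contributor_models]
            )
            param.copy_(stacked.mean(dim=0))
    return fused


def cold_fusion(shared_model, contributor_datasets, finetune, num_iterations):
    """One possible ColD Fusion loop: each contributor finetunes the current
    shared model on its own dataset, then the Repository averages the results."""
    for _ in range(num_iterations):
        finetuned = [
            finetune(copy.deepcopy(shared_model), dataset)
            for dataset in contributor_datasets
        ]
        shared_model = fuse(finetuned)
    return shared_model
```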

Experimental Setup
In this section, we detail the datasets, models, baselines, general experiment setup, and specific experiments settings.

Datasets
In all of our experiments, we define the dataset group D to be a group of 36 English-language datasets, including most GLUE and SuperGLUE datasets, in addition to other NLI, sentiment, and topic classification datasets, as well as datasets based on Twitter data. A full list of the datasets we use is provided in App. A.
For efficiency reasons, we randomly sampled 5 datasets to act as a consistent test set. We state explicitly when we use this test set instead of the full dataset group.

Models and Baselines
For experiments in the main text, we use RoBERTa-base (Liu et al., 2019) as our initial model θ_0. To demonstrate the generality of our approach, we additionally replicate some results on T5 (Raffel et al., 2020; see App. D).
For baseline pretrained models, we consider RoBERTa-base as well as RoBERTa-base multitask finetuned on all datasets (except STSB, which, being a regression task, incurred technical difficulties). The multitask variant trains a dedicated classification head for each dataset. In addition, we consider the MUPPET (Aghajanyan et al., 2021a) model, a highly optimized multitask model trained on more datasets than we consider. MUPPET is the current state-of-the-art base pretrained model that uses the RoBERTa-base architecture (Choshen et al., 2022a).
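For illustration, a minimal sketch of what such a per-dataset-head multitask baseline could look like with Hugging Face Transformers. The class and argument names are ours, and the head design (a single linear layer on the first token) is an assumption, not necessarily the exact baseline used here.

```python
import torch.nn as nn
from transformers import AutoModel


class MultitaskBaseline(nn.Module):
    """Shared encoder with a dedicated classification head per dataset."""

    def __init__(self, dataset_num_labels, model_name="roberta-base"):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(model_name)
        hidden_size = self.encoder.config.hidden_size
        # One linear head per dataset, e.g. {"mnli": 3, "sst2": 2, ...}
        self.heads = nn.ModuleDict(
            {name: nn.Linear(hidden_size, n) for name, n in dataset_num_labels.items()}
        )

    def forward(self, dataset_name, **encoder_inputs):
        # Classify from the representation of the first (<s>) token.
        hidden = self.encoder(**encoder_inputs).last_hidden_state[:, 0]
        return self.heads[dataset_name](hidden)
```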

Finetuning Process
Finetuning is used in this paper for two purposes: (a) as a way to infer and evaluate the performance of a base model and (b) as a part of the ColD Fusion scheme. We follow the exact same finetuning procedure in either case. Finetuning hyperparameters and time and memory estimates are provided in App. B.

ColD Fusion Procedure
The general course of the experiments is as follows: On each iteration, several datasets are sampled and the latest base model is finetuned separately on each dataset. Then the resulting finetuned models are fused to create the next base model. This new model is evaluated on the test datasets at each iteration. When we mention ColD Fusion without specifying the iteration explicitly, we refer to the model that corresponds to the final iteration.
The evaluation reflects both multitask goals ( §2.2): (a) To evaluate the single model goal, we train only the classification head (equivalent to Linear Probing; Alain and Bengio, 2016), freezing the rest of the layers. We refer to this as ColD Frozen. (b) To evaluate the base model goal, we take the ColD model and use it as the initialization for finetuning. We finetune separately on each dataset and report the results on the corresponding test set. We refer to this as ColD.
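As a sketch of the ColD Frozen (linear probing) evaluation, one might freeze everything except the classification head before training. The head module name here is an assumption; for RoBERTa sequence-classification models in transformers it is typically `classifier`.

```python
def freeze_all_but_head(model, head_prefix="classifier"):
    """Freeze the fused body and leave only the classification head trainable."""
    for name, param in model.named_parameters():
        param.requires_grad = name.startswith(head_prefix)
    return model
```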

ColD Multitask Results
In this section, we show that ColD Fusion can produce multitask models. We show in §4.1 that ColD Fusion fulfils both multitask objectives defined in §2. We then verify that improvements replicate on datasets that were not seen during training ( §4.2) and find that base model improvements are even more apparent in few-shot settings ( §4.3).

Collaborative Multitask
We show that ColD Fusion achieves the two multitask objectives (see Fig. 2). We train and test ColD Fusion for 30 iterations using the entire dataset group except STSB to provide a fair comparison to the multitask baseline. We simulate 8 contributors by sampling 8 datasets at each iteration and repeat the whole experiment using 5 different random seeds. We consider the importance of this sampling hyperparameter in §4.4.
We find that ColD Fusion creates a superior base model (see Fig. 2b). The average result after finetuning the ColD Fusion model is superior to the RoBERTa pretrained model by up to 2.45 points on average over the 35 datasets (see App. §C for full results). The result can be deemed significant with a difference of over 20 standard errors of the mean between the original pretrained model and the model produced by ColD Fusion.
In comparison, the standard multitask model outperforms the original RoBERTa pretrained model by only 1.31 points. We also consider the highly optimized MUPPET model, trained on more datasets and without the ColD multitask restrictions. MUPPET indeed outperforms our standard multitask baseline model, but is outperformed by our ColD Fusion model.
Another important comparison is the consistency of the improvement. We find (see App. C) that the model produced by ColD Fusion is better than the pretrained model on 75% of the datasets and attains only 1.73% worse performance on the worst-case dataset. In contrast, MUPPET hurts performance on as many datasets as it helps and is 40% worse on some datasets.
ColD Fusion also achieves the single model goal: the ColD model has high inference scores on the datasets seen in training (see Fig. 2a), higher in fact than those of the standard multitask baseline. Moreover, it is not far behind the pretrained model finetuned on each task separately. This implies that despite learning in a distributed way and fusing by averaging the non-linear weights of the model, the process incorporates the data well.

Unseen Datasets
We have found ColD Fusion to create a strong base model ( §4). Next, to meet the requirement of improving results for new datasets, we test the ColD fused model on unseen datasets not included in the training (see Fig. 4). We achieve this by performing 3-fold cross-validation. The folds are set arbitrarily such that each fold contains 24 seen datasets (24 contributors) and 12 unseen ones that we keep for evaluation only. This ensures that each dataset has the same weight in the average score of the seen datasets and unseen datasets.
We find that the model performs on unseen datasets just as well as it does on seen ones. The strikingly similar performance between seen and unseen tasks (analogous to in-domain vs. out-of-domain) should raise a red flag in most scenarios. However, in the unique scenario of ColD multitasking, it meets our expectations. Both seen and unseen datasets are exposed to the model at some point, during ColD Fusion iterations or during finetuning. The only difference is that seen datasets are also used to finetune the contributed models. Hence, in the seen case, the model trains twice on the same data, first during base model creation and again during finetuning. It is less of a surprise that training twice on the same data doesn't improve results. The improvement over the original pretrained model is likely due to positive transfer across datasets.
Where finetuning is restricted to only the classification head (ColD Frozen in Fig. 4), the model achieves much better performance on the seen datasets than on the unseen datasets. These results are also in line with the fact that the model (apart from the classification head) was never exposed to the unseen datasets, while the entire model's weights were trained on the seen datasets. We further test ColD Fusion's capacity to scale with more data in §5.2. We note that the unseen curve consistently increases, which may suggest that the model has acquired general skills.
Note that the scores in Fig. 4 are a bit lower than in the main experiment in Fig. 2b. This is most likely due to scaling, as here we keep unseen datasets aside and use fewer datasets for training. We show in a controlled experiment in §5.2 that using more datasets improves results.

Few-shot
In order to assess the benefit of ColD Fusion in few-shot scenarios, we repeat the setting in §4.2, but finetune only on 100 examples from each unseen dataset during evaluation. Fig. 3 shows a great increase in performance over the RoBERTa pretrained model, reaching an improvement of 6.73 points after 20 iterations. This provides an even stronger case for ColD Fusion in the few-shot setting.

Figure 4: Finetuned and frozen results for ColD Fusion on datasets that were used for training ("Seen", in blue) vs. datasets that were not ("Unseen", in orange). The model produced by ColD Fusion is a good base model for both seen and unseen datasets. While using a frozen model is better for seen datasets, unseen datasets still benefit from the ColD Fusion process.

Number of fused datasets
An important factor in ColD Fusion is the number of contributors in each iteration. Having fewer contributors implies effectively training on fewer datasets in each iteration; on the other hand, fusing fewer models may give more importance to each.
We find that the performance as a base model is hardly affected by the number of contributors in each iteration. More specifically, we observe in Fig. 5 that adding contributors makes the process more stable but has little effect on convergence speed. A possible reason is that some of the improvement comes from the iterations themselves and the ability to correct overfitting done in previous steps by some contributors. We further test the effect of the number of contributors under controlled settings in §5.2.

Single Dataset Analysis
We analyze a unique case of ColD Fusion, where we run the distributed learning over a single dataset. For the single dataset, we chose MNLI for its size (433K examples). We start by introducing the concept ( §5.1). Then, we use this setting to control interfering factors and examine the algorithm's characteristics ( §5.2).

Federated Learning
A special case of ColD multitask is training on data from the same distribution. This case resembles the Federated Learning scenario (Yang et al., 2019), where multiple contributors collaborate to train a model without having to exchange the actual data.
The Federated Learning experiments are run as follows: at each iteration, 5 contributors each sample 5K examples from the MNLI dataset, and another such subsample is used for evaluation. This setting simulates the never-ending data flow that often characterizes federated learning.
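As a sketch of how such per-iteration subsamples could be drawn (our illustration using the Hugging Face datasets library; the actual sampling details may differ), the last split is reserved for evaluation:

```python
from datasets import load_dataset


def sample_iteration_splits(num_contributors=5, per_split=5_000, seed=0):
    """Draw disjoint MNLI subsamples: one per contributor plus one for evaluation."""
    mnli = load_dataset("glue", "mnli", split="train").shuffle(seed=seed)
    splits = [
        mnli.select(range(i * per_split, (i + 1) * per_split))
        for i in range(num_contributors + 1)
    ]
    return splits[:-1], splits[-1]  # contributor splits, evaluation split
```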
As presented in Fig. 6, performance increases throughout the iterations. Thus, we conclude that ColD Fusion is also effective in this federated-style setting. Note the superiority of ColD Frozen over ColD in this experiment. A possible explanation is overfitting: in evaluation, finetuning all the parameters on only part of the data is worse than keeping the fused weights that were trained on several splits.

Controlled Experiments
We consider the interacting effects of the core characteristics of ColD Fusion: the number of contributors, the amount of data each contributor has, and the overall amount of data.
To prevent the interference of the sampled datasets' properties (e.g., size, distribution, task, etc.), we experiment in the Federated Learning setting. Moreover, we use disjoint and consistent datasets for all iterations, i.e., we do not sample datasets. We analyze the ColD Frozen performance, as this better reveals the ability of the model to aggregate capabilities of the constituent models during fusion.
Effect of dataset size. We fix the number of contributors to 10 and test how the number of examples each contributor trains on affects results. We experiment with 1.25K, 2.5K, 5K and 10K examples. A priori, we would have expected large amounts of data in each contributor's model to obstruct the fusing process, as each model changes more. In Fig. 7, we see the opposite: more data does not hurt results and might even lead to improved stability.

Figure 7: Results keep increasing with data size; convergence to centralized finetuning takes more iterations with less data per contributed model.
The low data regime, on the other hand, does not reach its full-training baseline performance. Presumably, this is an overfitting artifact in evaluation, similar to §5.1, i.e., each dataset split is too small to fit even the classification head, while the combined data shown to the baseline is enough.
Effect of the number of contributors. We show each contributor 5K examples and see how results change with 2, 5, 10 and 20 contributors. We see in Fig. 8 that increasing the number of contributors increases performance. Moreover, the results are not only better at every step, but also keep increasing for longer. This is a positive result in terms of the expected end result, but it also means that convergence is slower.
To isolate the effect of the number of contributors from that of the overall data, we fix the overall amount of data to 50K and split it evenly among the contributors. Fig. 9 replicates the above with more contributors. In this case, reaching convergence takes longer: approximately 2 more iterations for double the contributors and half the data seen by each.
We conclude that additional data aids performance and additional contributors hardly change results, but delay convergence.

Related Work
Our work strongly relies on model fusion. Model fusion was first introduced as a way to improve pretrained models by Choshen et al. (2022b). In parallel, several works (Matena and Raffel, 2021; Wortsman et al., 2022b) suggested different ways of fusing for other purposes, such as improved finetuning.
Low-communication distributed training has been proposed in settings similar to ours. Wortsman et al. (2022a) proposed distributed finetuning and model fusing in order to produce better finetuned models; this suggestion is equivalent to one iteration of ColD Fusion where all models share the same dataset. Li et al. (2022) also share the similarity of distributed training, but during pretraining on unlabeled data.
Understanding why averaging different models improves quality may be related to theoretical works discussing weight and loss spaces. These works state there is a path of minimum loss between models (Garipov et al., 2020), and it has further been claimed that under some constraints this path is linear, which suggests that fusing the weights could produce a model that retains the capabilities of the fused models. Although different models on the same task may converge to different locations in the loss space without linear connectivity (Juneja et al., 2022), and although the case of multitask is more complex (Mirzadeh et al., 2020), we still believe that these works can partially explain why fusing preserves the capabilities gained by the constituent models, and that iterations fix it when it does not.

Figure 9: Results when using fixed data on each iteration of ColD Fusion with a varying number of contributors (dotted lines), training on a total of 50K examples from MNLI. The pretrained model performance is highlighted by dashed lines. Convergence times are longer as we split the data more.
The literature also includes methods for better aligning models during training (Javaloy and Valera, 2021; Yu et al., 2020; Chen et al., 2018) or after it (Ainsworth et al., 2022; Jordan et al., 2022) to aid in fusing. We did not use those as we wanted to reduce the load on the Repository and avoid restricting the contributors' finetuning. However, these methods may improve results in ColD Fusion-like scenarios when applicable.
We mention that multitask learning does not optimize the base model objective directly ( §2.2). Some works aim to do so through meta-learning (Bansal et al., 2019), finding models that can learn a new task well or efficiently (Hospedales et al., 2021). REPTILE (Nichol et al., 2018) meta-learns in a way that resembles ours, by iteratively using models trained for several batches.

Conclusion and Discussion
This paper suggested a scheme that can utilize finetuning to improve a pretrained model. The method doesn't assume datasets are shared, but rather that each contributor finetunes solely on their own dataset. Therefore, we believe that applying this scheme as a collaborative pretraining platform is realistic and that doing so would continually improve base models.
To scale this approach, it would be beneficial if the Repository were updated asynchronously. Furthermore, removing the centralized Repository from the process would make it fully decentralized. In the usual finetuning setting, one might achieve robustness by tuning batch size and learning rate. Carrying this metaphor over to ColD Fusion, one can either increase the number of contributors (batch size) or restrict the effect of each iteration (learning rate) (Smith and Le, 2018). Following this line, future work may consider regularizing the distance from the pretrained model (learning rate) when only a small number of contributors exist (batch size), or consider assigning individual weights to each contributor.
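As one possible illustration of the "learning rate" side of this metaphor (our sketch, not an experiment in this paper), the fusion step could be damped by interpolating between the previous shared model and the contributors' average, recovering the plain ColD Fusion update at η = 1:

```latex
% Damped fusion update: eta plays the role of a learning rate.
\theta_{i+1} = (1-\eta)\,\theta_{i} + \eta \cdot \frac{1}{|C|}\sum_{c \in C}\theta_{i}^{c},
\qquad 0 < \eta \le 1
```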
There are many parameters to optimize that might improve the method substantially. For example, fusing the contributions with a weighted average, improving fusing itself (Matena and Raffel, 2021; Ainsworth et al., 2022), controlling the datasets seen in each iteration (related to Choshen et al., 2021; Hacohen and Weinshall, 2019), and backtracking when a harmful update was made to the model. We would be excited to see work exploring these methods for improving ColD Fusion.

Limitations
Perhaps the most important limitation of ColD Fusion is its deployment. This paper presents a method for multitasking, not a platform. In that sense, it solves both multitask learning goals under the constraints resulting from collaboration. However, using ColD Fusion in practice might require much more effort: it would require a place to host the models, a way to make sure no malicious or erroneous model is sent, and other aspects of a platform to support this training. This is the first method to tackle collaborative multitasking, and we scaled it to 35 datasets. However, future methods may prove more efficient or scale better with the amount of data and computation.
ColD Fusion with many iterations and models might require more computational effort per a given amount of data ( §5.2) and is hence less efficient than regular multitask learning. As a result, while our bottom-line performance is encouraging, ColD Fusion might not be the preferred approach under every possible scenario. Still, some of the costs may be alleviated by future work; for example, the additional iterations needed when fusing many models might be reduced by permuting before fusing (Ainsworth et al., 2022).
While this paper studied the impact of various ColD Fusion parameters, it is unclear how finetuning or even pretraining parameters affect results. However, we have reason to believe the method is relatively robust to these factors, based on our initial results and the replication on another architecture (App. D).

C Datasets Accuracy
The full results of the main experiment ( §4) can be found in Table 1. It contains the accuracy score for each dataset separately.
For ease of comparison, we also supply two figures (Fig. 10) comparing the MUPPET and ColD multitask models to the pretrained model. They show that ColD is much more consistent: it has fewer datasets that lose from switching from the pretrained model to ColD, and smaller negative effects when there are such datasets. MUPPET, however, also has a larger maximal gain when it does show gains, which shines favourably on the average. This makes ColD a better choice for an off-the-shelf model, but gives MUPPET an advantage when one tests a target dataset on several pretrained models.

D T5
We present initial results to confirm that our method is not unique to RoBERTa. Specifically, we train T5 (Raffel et al., 2020) with default hyperparameters, except for a 256 batch size and a 0.0004 learning rate. We replicate the main experiment ( §4) on a smaller scale, running with one seed and only 5 iterations. For ColD Frozen, we train only the language model head. Fig. 11 shows that the main effect remains: both ColD and ColD Frozen keep improving with the iterations.

E Multitask Scale
We test the effect of the number of datasets used for multitasking on the performance of the resulting model as a base model. We take a random permutation of all 36 datasets. We ColD fuse on the first 4 datasets, then the first 8, 16, and finally all the datasets. In Fig. 12, we see that the 8-dataset setting performs worse than the 4-dataset one, and that the high regime (16 and 36 datasets) performs much better than the low regime (4 and 8 datasets). These results align with the observation of Aghajanyan et al. (2021b) that under 15 datasets, more datasets decrease performance, but past some critical point, more datasets increase it.

F Fixed Number of Examples
We depict the ColD Fusion process with multiple tasks (Fig. 13), but only 4K examples per contributor. This simulates a case where contributors keep streaming new information of different kinds. While this cannot fully predict the effect of streaming new tasks, it shows initial positive results in this regard. We can see that although the absolute results are degraded relative to the regular configuration, performance increases monotonically both for ColD and ColD Frozen, meaning more data yields better performance.