Where to start? Analyzing the potential value of intermediate models

Previous studies observed that finetuned models may be better base models than the vanilla pretrained model. Such a model, finetuned on some source dataset, may provide a better starting point for a new finetuning process on a desired target dataset. Here, we perform a systematic analysis of this intertraining scheme over a wide range of English classification tasks. Surprisingly, our analysis suggests that the potential intertraining gain can be analyzed independently for the target dataset under consideration and for a base model being considered as a starting point. This is in contrast to the current perception that the alignment between the target dataset and the source dataset used to generate the base model is a major factor in determining intertraining success. We analyze different aspects that contribute to each. Furthermore, we leverage our analysis to propose a practical and efficient approach to determine if and how to select a base model in real-world settings. Last, we release a continuously updated ranking of the best models in the HuggingFace hub per architecture: https://ibm.github.io/model-recycling/.


Introduction
Finetuning pretrained models (Devlin et al., 2019) is currently the standard and best approach for adjusting such models to perform a downstream task (Chen et al., 2022). The resulting finetuned models are typically used for inferring the labels of new examples that are reminiscent of the data used for finetuning. However, it was shown (Phang et al., 2018a) that finetuned models, trained on some source dataset, may represent better base models, namely a better starting point for a new finetuning process on a desired target dataset. This scheme, often referred to as intertraining, is the focus of the present work.
Given a target dataset, one may wonder what the intertraining gain could be, to determine whether it is worthwhile spending resources on selecting a base model. Assuming the potential gain is high, the following natural question is which base models are most promising, out of countless options available through hubs such as HuggingFace (Wolf et al., 2020). We propose pragmatic methods to answer both questions, supported by extensive experiments.
We begin with two observations: (i) some target datasets are intertraining-sensitive, i.e., have the potential to gain significantly from intertraining, while others are not, and are typically indifferent to the base model selection. Furthermore, revealing this property of the target dataset can be done efficiently, by examining the gains obtained when using a single representative base model as a starting point; (ii) some base models are of high quality, i.e., finetuning on them provides consistent improvements on target datasets, but most base models are inferior and degrade performance. Furthermore, ranking base models by quality can be done on one target task, and efficiently, via linear probing, namely training only the base model's classification head over a single representative dataset. Thus, we argue that a preferable base model can be selected independently of the target dataset. This is in contrast to the common perception (cf. §7) that the alignment of the target dataset and the source dataset used to generate the base model is a major factor in determining intertraining success. We substantiate our observation of independence by conducting experiments on a comprehensive set of target datasets and base models, comprising models obtained under controlled conditions as well as models from HuggingFace. In addition to these findings, we analyze attributes of the source and target datasets that affect gains (§6).
As some models are simply better, regardless of the choice of target dataset, it makes sense to rank the models once and pick the best ones. But even ranking a thousand models is costly. In §8, we rely on our analysis to propose a practical approach to efficiently select models in a real-world setting. Moreover, instead of expecting others to rank the models, we share a continuously updated site featuring the best models found so far. To date, we have tested over 2.5K models.

Preliminaries
In this paper, we use the following terminology. A dataset is a set of examples and labels. Our goal is to maximize accuracy on the test set of the target dataset, "target" for short. We discuss the difference between domain, task, and dataset in App. A.
A pretrained (PT) model is a self-supervised model, e.g., RoBERTa (Liu et al., 2019). A finetuned model is a PT model that was further trained over some source dataset, denoted henceforth as "source". We assume access to many such models, e.g., through HuggingFace. A base model can be either a PT model or a finetuned model. When finetuning over the target train data, one can start from any base model. Intertraining refers to starting from a finetuned model as a base model, and in this case, we refer to this base model as an intermediate model. We denote by s_m^t the accuracy score obtained over the target test set t after finetuning some base model m over the target train set. The intertraining gain of model m w.r.t. using the PT model is thus defined as gain(m, t) = s_m^t − s_PT^t. Note that the gain may be negative. Given a set of intermediate models M = {m_1, ..., m_n}, the intertraining max-gain is defined as max_{m∈M} s_m^t − s_PT^t. Thus, theoretically, max-gain is achieved by finetuning all the available intermediate models and picking the one performing best on the target test set. To avoid overfitting and reduce costs, our goal is to find an intermediate model with a gain that is as close as possible to the max-gain, without explicitly finetuning all the intermediate models.
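The gain and max-gain definitions above amount to simple arithmetic over accuracy scores. A minimal sketch (the accuracy values are hypothetical placeholders, not measured results):

```python
# Sketch of the gain and max-gain definitions from the Preliminaries.
# All accuracy values here are hypothetical placeholders.

def gain(score_m, score_pt):
    """Intertraining gain of base model m over the PT baseline on target t."""
    return score_m - score_pt

def max_gain(scores, score_pt):
    """Best gain achievable by any intermediate model in the set M."""
    return max(s - score_pt for s in scores.values())

score_pt = 85.0                                       # s_PT^t
scores = {"mnli": 87.1, "qqp": 84.2, "anli": 86.5}    # s_m^t per model
print(gain(scores["qqp"], score_pt))                  # gains can be negative
print(max_gain(scores, score_pt))
```

In practice, computing the true max-gain requires finetuning every candidate, which is exactly the cost the rest of the paper seeks to avoid.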

Experimental Setup
Our experimental setup is described next. The parameters for reproducibility are detailed in App. B.

Dataset Groups
Our experiments are based on 3 groups of English text classification datasets, described next (App. C).
We focus on text classification for ease of evaluation, but assume the tasks are diverse enough for our conclusions to extend to other settings.
General. Containing GLUE (Wang et al., 2018) and SuperGLUE classification datasets (Wang et al., 2019a), excluding test-only and regression datasets. The datasets cover a wide range of tasks, from sentiment analysis through linguistic acceptability to natural language inference. It is the most commonly used benchmark in related work (§7).
NLI. Containing Natural Language Inference and entailment tasks. The datasets in this group all share the same task. There is some overlap between NLI and General; in Fig. 1 and in mean comparisons we exclude the overlapping datasets from General.
Twitter. Containing 6 Twitter datasets collected by TweetEval (Barbieri et al., 2020). The tasks range from irony detection to emoji prediction. The datasets in this group all share the same domain.

Models
Unless stated otherwise, our PT model of choice is RoBERTa (Liu et al., 2019). We acquire intermediate models in two ways:

In-house. Obtained by finetuning the PT model over the General, NLI, and Twitter dataset groups as the source datasets, with 5 seeds. Since we control these datasets and have knowledge about their features, this enables us to find relations between dataset properties and the intermediate models generated from them.
Off-the-shelf. 66 RoBERTa models downloaded from HuggingFace (see App. §E for more details). Since these models carry no information beyond their names, this set allows us to validate our claims on a "real-world" model distribution.

Models/Targets experiments
We test many intermediate models on various target datasets. We finetune each intermediate model and the PT model on the target train set, and report the intertraining gain over the target test set. In the in-house models/targets experiment, all 22 datasets from the General, NLI, and Twitter groups act as both source and target, and gains are averaged over 5 seeds. In the off-the-shelf models/targets experiment, we download the 66 source models from HuggingFace and test on the 14 General datasets as targets.

Results
Most models are worse than the PT model, and about 1 in 6 are better, providing a positive intertraining gain. The in-house models/targets results are depicted in Fig. 1; STDs and reference results are given in App. §D. App. §E reports results with off-the-shelf RoBERTa and T5 intermediate models.
The rows and columns in Fig. 1 are similarly ordered: first the General datasets, then the NLI datasets, and last the Twitter datasets. Loosely speaking, we do not recognize an approximate green block structure across the diagonal; namely, we do not observe a clear intertraining gain for similar tasks (NLI), nor for similar domains (Twitter). However, some columns and some rows depict higher intertraining gains, while for others, the impact is minor. Taken together, these observations suggest little dependence between the source used to generate the intermediate model and the performance over the target. This is in contrast to the common assumption (§7) that the source and target need to be similar for intertraining to work. Next, we delve deeper into these observations.

Target Sensitivity to Intertraining
Considering the columns of Fig. 1, we notice that for some target datasets (e.g., ESNLI) intertraining makes little difference, while for others (e.g., COPA) the impact is quite significant. We argue that this target property can be predicted via an efficient and straightforward method. Specifically, the gains of one strong intermediate model should resemble the max-gains of a group of models. Indeed, MNLI correlates highly both with the max-gain of in-house models tested on the 22 targets in Fig. 1 (Spearman: 0.89, Pearson: 0.99) and with that of off-the-shelf models tested on the 14 General targets (Spearman: 0.90, Pearson: 0.94; p < 0.01 for all). The replication on off-the-shelf models shows that this is a general result and not a reflection of MNLI being the top model of the in-house group. Overall, we find that sensitivity is a characteristic of the target dataset, separate from the source factor.
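The sensitivity test above amounts to correlating the per-target gains of a single strong model with the per-target max-gains. A self-contained sketch with hypothetical gain values (implemented in pure Python to avoid dependencies; no tie handling in the rank correlation):

```python
# Sketch: predicting target sensitivity from the gains of a single strong
# intermediate model (e.g., MNLI). Gain values below are hypothetical.

def ranks(xs):
    """Rank positions of each value (assumes no ties)."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0.0] * len(xs)
    for rank, i in enumerate(order):
        r[i] = float(rank)
    return r

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs) ** 0.5
    vy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (vx * vy)

def spearman(xs, ys):
    # Spearman correlation = Pearson correlation on ranks.
    return pearson(ranks(xs), ranks(ys))

mnli_gain = [8.0, 3.1, 0.1, 1.5, 5.2]   # gain of the MNLI model per target
max_gain = [9.5, 4.0, 0.3, 2.2, 6.0]    # max-gain over all models per target
print(spearman(mnli_gain, max_gain))
```

A high correlation here would indicate, as in the paper's experiments, that one representative model suffices to reveal which targets are intertraining-sensitive.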

Ranking Intermediate Models
Considering Fig. 1 rows, we notice that some intermediate models -e.g., MNLI -provide relatively high gains for many targets.We observe that this is a general phenomenon -stronger models are typically stronger for many datasets.
Identifying such models in advance could be practically valuable, since for a new target, one would consider only the highly ranked intermediate models (see §8). In the following, we propose a simple yet efficient method to obtain such a static ranking, which is made once, without accounting for the target. A more comprehensive ranking alternative is described in App. §F.
Given an intermediate model m, we train a linear probe (LP), i.e., train only the classification head, over MNLI, and consider the gain, denoted LP(m, MNLI). Evidently, this gain is a good proxy for the quality of m. Specifically, let g_m^avg be the average gain of m over a set of target datasets. As depicted in Fig. 2, we observe that LP(m, MNLI) and g_m^avg are highly correlated in the in-house models/targets experiment (Spearman: 0.46, Pearson: 0.78, p < 0.01) and the off-the-shelf models/targets experiment (Spearman: 0.51, Pearson: 0.66, p < 0.01). In other words, if the available intermediate models are ranked by LP(m, MNLI), highly ranked models represent the most promising starting points on average. The high correlation means this connection holds not only for the top-ranked models, but throughout. Moreover, the replication on off-the-shelf models shows this is robust not only across targets but across sources.
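Mechanically, linear probing only trains a classification head over frozen features, which is what makes the ranking cheap. The following is a minimal sketch of that idea on toy two-dimensional "features" standing in for frozen-encoder representations; all data and names here are synthetic illustrations, not the paper's setup:

```python
# Sketch of linear probing (LP): train only a classification head over frozen
# features, and use the resulting score as a cheap model-quality proxy.
# Features and labels below are synthetic toys, not real MNLI encodings.
import math
import random

def train_probe(feats, labels, epochs=100, lr=0.1):
    """Logistic-regression head on frozen features (binary toy case)."""
    dim = len(feats[0])
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        for x, y in zip(feats, labels):
            z = sum(wi * xi for wi, xi in zip(w, x)) + b
            p = 1.0 / (1.0 + math.exp(-z))
            g = p - y                      # gradient of log-loss w.r.t. z
            w = [wi - lr * g * xi for wi, xi in zip(w, x)]
            b -= lr * g
    return w, b

def accuracy(w, b, feats, labels):
    correct = 0
    for x, y in zip(feats, labels):
        z = sum(wi * xi for wi, xi in zip(w, x)) + b
        correct += int((z > 0) == (y == 1))
    return correct / len(labels)

random.seed(0)
# Toy "frozen features": class 0 near (-1, -1), class 1 near (+1, +1).
feats = [(random.gauss(m, 0.3), random.gauss(m, 0.3))
         for m in (-1, 1) for _ in range(50)]
labels = [0] * 50 + [1] * 50
w, b = train_probe(feats, labels)
lp_score = accuracy(w, b, feats, labels)   # proxy score for the frozen model
print(lp_score)
```

With a real encoder, one would extract the frozen representations once and rank all candidate base models by their LP score on the single probing dataset.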
To further support this claim, we use this ranking to find the most promising intermediate models.
For each target t, we consider the gain obtained by the top-ranked model and the max-gain obtained by one of the three top-ranked models, denoted g_t(1) and g_t(3), respectively. In comparison, we consider the max-gain obtained when considering all available models, denoted g_t(max). We further denote loss_t(1) = g_t(max) − g_t(1) and loss_t(3) = g_t(max) − g_t(3). In other words, loss_t(1) represents the potential gain lost when using the top statically ranked model instead of the best available model for the target under consideration. Similarly, loss_t(3) represents the gain lost when using the best of the top 3 ranked models instead of the best available model. Note that even in this latter case, finding the best model involves finetuning only 3 models over the target train data, which is far less demanding than finetuning all available models.
In Table 1, we report statistics for loss_t(1) and loss_t(3) over the in-house and off-the-shelf models/targets experiments. Although the ranking is target-independent, the top 3 ranked models typically provide most of the potential intertraining gain. For example, instead of checking all 66 available models, using this simple strategy of checking the top 3 ranked models, each of the 14 targets lost at most 1.62 points.
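Given per-model gains on a target and the static ranking, loss_t(1) and loss_t(3) reduce to a simple regret computation. A sketch with hypothetical gains and a hypothetical ranking:

```python
# Sketch: regret of using the statically top-ranked models instead of the
# best model for a given target. Gains and ranking below are hypothetical.

def top_k_regret(gains, static_ranking, k):
    """g_t(max) minus the best gain among the k top-ranked models."""
    best_overall = max(gains.values())
    best_of_top_k = max(gains[m] for m in static_ranking[:k])
    return best_overall - best_of_top_k

gains = {"mnli": 2.0, "anli": 1.5, "stsb": 2.4, "qqp": -1.0, "cola": 0.3}
static_ranking = ["mnli", "anli", "stsb", "cola", "qqp"]  # target-independent

loss_1 = top_k_regret(gains, static_ranking, 1)  # top-1 misses stsb
loss_3 = top_k_regret(gains, static_ranking, 3)  # stsb is in the top 3
print(loss_1, loss_3)
```

In this toy case the best model for the target (stsb) is only third in the static ranking, so loss_t(1) is positive but loss_t(3) is zero, mirroring the paper's observation that checking the top 3 recovers most of the potential gain.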

Source and Target Interaction Analysis
Next, we analyze the interaction between the source dataset and the target dataset. Obviously, such interaction may impact the gain. At one extreme, since finetuning on the same data twice does not improve performance, intertraining is not valuable when the source and target data are identical. At the other extreme, consider partitions of the same data: training over one half as the source would obviously be beneficial for the other half, the target. Thus, we do not question whether interaction may exist, but rather investigate how common and strong it is.
Interaction between dataset groups. Our General dataset group consists of diverse datasets, while the other groups share a domain (Twitter) or a task (NLI). We analyze whether datasets that share a domain or task with the target represent better source datasets than others.
Table 2 depicts the average gain of each source group vs. each target group. Comparing targets (table columns), we find that models trained on similar tasks, as a group, have a distinct behavior (ANOVA p < 0.01). On average, using NLI intermediate models on NLI targets yields a gain of 0.63, compared to a strong negative gain when using intermediate models from other groups. Similarly, there is a same-group gain of 0.5 on Twitter. Comparing sources (table rows), however, NLI models improve General targets even more than NLI ones. Possibly, NLIs are just good intermediate models. Twitter models, however, do seem to improve Twitter targets more (ANOVA p < 0.01), by 1 point. Hence, the effects seem mixed.
In summary, as a group, a similar source tends to yield greater gains than an unrelated source. However, in the rest of this section, we find this effect to be of secondary importance for predicting model gains: a similar source may be more beneficial than a random one, but a well-chosen source produces larger benefits regardless of similarity.
Symmetry as a similarity bound. We consider dataset similarity from another perspective. Similarity is a symmetric relation. Hence, if source-target similarity were a main driver of gains, we would expect that when source A helps target B, source B would help A as well. We assess the symmetry of the in-house models/targets experiment and find that the gains are far from symmetric (details in App. §G). Thus, (symmetric) similarity seems like a weak explanation of which source data would help which target.

Regression. As additional support for the relatively small impact of the source-target interaction, we show that the interaction is hardly needed to predict the scores. Specifically, a simple regressor can approximate the gain without considering such interactions. The regressor fits 3 sets of coefficients: for each target j, two coefficients t_j and t'_j, which one may think of as capturing the average gain and the sensitivity to intertraining of that target; and for each base model i, one coefficient b_i, which one may think of as capturing the average intertraining gain of that base model. We then define gain(base_i, target_j) = (b_i + t_j) · t'_j, where i and j are the base model and target indices, respectively. Note that by construction, this regressor has 2N + n parameters, where N is the number of targets and n the number of base models, while trying to model N · n interactions; thus, it does not have enough degrees of freedom to explicitly model all source/target interactions. Still, it obtains satisfactory performance, as shown next. As a baseline, we fit the same regressor after randomly shuffling all the gains in the experiment. This shuffling ensures no information comes from the specific source and target, while maintaining the distribution of gain values. We minimize the Mean Squared Error (MSE) and fit with SGD until convergence.
Before presenting the results, it is worth giving some intuition about MSE. Since MSE is the square of the error, an average MSE of 4 means that the predictions were accurate to within 2 (= √4) points on average, or better (as outliers affect MSE more than they affect the average error).
We obtain an MSE of 4.2 on the in-house models/targets experiment (§3.3), versus 9.6 (σ = 0.9) when the gains are shuffled. Thus, knowing the source id and the target id provides significant information about the expected gain, with no need to keep a designated parameter for each source-target combination. We further compare this regressor with two other regressors, one that considers only base model information (gain(base_i, target_j) = b_i) and one that considers only target-related information (gain(base_i, target_j) = t_j). The MSE fit is 10.4 and 8.2, respectively, compared to 10.8 (σ = 0.4) on shuffled gains. This suggests both the source and the target impact the intertraining gain, with a potentially somewhat stronger influence from the target.
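The additive regressor described above can be fit with plain SGD. A sketch on synthetic gains (generated from hidden parameters purely for illustration; this is not the paper's data, and the learning rate and epoch count are arbitrary choices):

```python
# Sketch: fitting gain(i, j) = (b_i + t_j) * t2_j by SGD, where b_i is the
# base-model coefficient and (t_j, t2_j) are the target's average gain and
# sensitivity. The gain matrix is synthetic, generated from hidden parameters.
import random

random.seed(0)
n_models, n_targets = 6, 5
true_b = [random.uniform(-2, 2) for _ in range(n_models)]
true_t = [random.uniform(-1, 1) for _ in range(n_targets)]
true_t2 = [random.uniform(0.5, 2) for _ in range(n_targets)]
gains = [[(true_b[i] + true_t[j]) * true_t2[j] for j in range(n_targets)]
         for i in range(n_models)]

# 2N + n parameters for N*n observed gains: no room for per-pair interactions.
b = [0.0] * n_models          # average gain per base model
t = [0.0] * n_targets         # average gain per target
t2 = [1.0] * n_targets        # sensitivity per target
lr = 0.01
for _ in range(3000):
    for i in range(n_models):
        for j in range(n_targets):
            err = (b[i] + t[j]) * t2[j] - gains[i][j]   # d(MSE)/d(pred) / 2
            b[i] -= lr * err * t2[j]
            t[j] -= lr * err * t2[j]
            t2[j] -= lr * err * (b[i] + t[j])

mse = sum(((b[i] + t[j]) * t2[j] - gains[i][j]) ** 2
          for i in range(n_models)
          for j in range(n_targets)) / (n_models * n_targets)
print(mse)
```

Since the synthetic gains are exactly realizable by the model, the fit should reach a low MSE; on real gains, the residual MSE measures how much the unmodeled source-target interactions actually matter.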
In conclusion, our observations suggest that when considering intertraining potential, loosely speaking, it is enough to consider two separate issues: (i) choosing a base model; and (ii) determining the expected effect on the target.

Factors Contributing to Gains
So far, we showed that the effects of the intermediate models are largely separate from those of the target. We proceed to examine how specific factors in each contribute to intertraining gains.

Source and target sizes
Following the above observations, we ask what makes a base model good, or a target sensitive. We examine the effects of architecture (§6.3) and source score (§6.2), but start by examining the effect of data size on the gain. First, we identify the effects of dataset sizes in controlled experiments. Next, we assess the extent to which the effect of dataset size is evident in the previous experiments. For more related analysis we refer to Phang et al. (2018a) and Pruksachatkun et al. (2020).
Effect of dataset size. We first control the source dataset train size. We create intermediate models on 4 source datasets: the top 2 and lowest 2 in-house models, according to the static ranking (§4.2). For each, we limit the training to 50–3200 samples and use the General group datasets as targets. Evidently, for good sources (ANLI, MNLI), more training data yields greater gains (Fig. 3). However, additional data decreases performance for bad sources (MultiRC, QQP). We conclude that the amount of source data amplifies the effect determined by the type of data.
We experiment with the effect of target size, using General sources and General targets with train data of more than 1600 examples. We limit the target train sizes to between 50 (namely, a few-shot setting) and 1600. As depicted in Fig. 4, the gain decreases as the target training size increases, implying a larger potential for intertraining when the target training data is limited. Interestingly, for 3 targets we see positive gains on small datasets, which drop to negative gains as training data increases, and then seem to rise again towards zero. This 'U-shape' effect hints at two competing effects that should be examined in future work (cf. App. H).
Training size effects in practice. We examine whether the effect above is strong in comparison to other unknown factors. Considering the in-house models/targets experiment, the Pearson correlation between source training size and source average gain is 0.75. Out of the 5 largest sources (ESNLI, MNLI, QQP, ANLI, and QNLI), 3 are also the source tasks with the highest average gain (MNLI, ANLI, and ESNLI), while QQP is the source dataset with the second-lowest (negative) gain. This is in line with our previous observation that additional source training data magnifies the positive or negative intertraining effect of the source data.
We observe no significant correlation between target training size and target average gain, where the average is taken across sources. Still, targets with small training data seem to be more sensitive to intertraining. Specifically, the 5 targets with the smallest training data (CB, COPA, WSC, WNLI, RTE) are also those for which we observe the maximal gain difference across sources.

Similar Source Score, Different Gain
One might expect two models with similar performance on a source dataset to also have similar intertraining gains. Our results suggest otherwise. We finetune 20 models over the MNLI source and use them as intermediate models on the General target group. We compare the scores on the source test data to the average score obtained on the target datasets. Source task scores vary between 86.5 and 87.5, while the General target average varies between 74.5 and 79, without correlation (cf. App. I).

McCoy et al. (2019) found that finetuned models showing similar performance on their test data still perform differently on out-of-domain challenge sets, suggesting the models learned different generalization schemes. Juneja et al. (2022a) suggested that those models converged to different basins in the loss space. They tagged models from one basin, which tend to generalize better, as good, and the rest as bad. We check whether good models are also better intermediate models for other tasks. We took their BERT models finetuned on MNLI as intermediate models (12 good and 12 bad models) and used the General datasets as targets, finding that good models are indeed better for intertraining (3.65 avg. gain for good vs. 2.16 for bad).
The differences discussed above are due to using different seeds. In App. I.1 we show that hyperparameter selection can also impact the quality of an intermediate model, regardless of the score observed on the source test data.
In summary, similar training and/or similar performance on the source test set do not translate to similar intertraining gains on new target tasks.

Effect of different architectures
We validate our main conclusions across different architectures. To that end, we repeat the in-house models/targets experiment with the BERT (Devlin et al., 2019) and T5 (Raffel et al., 2020) architectures (see full tables in App. J).
We start by showing that the loose source-target coupling holds across architectures. We then show that different source datasets rank differently across architectures, but that target sensitivity is similar.
To show the source-target independence, we repeat the regression fit (§5). As before, the fit explains each architecture's gains much better than when the data is shuffled (BERT MSE 10.5, random 30.1, σ = 4.17; T5 MSE 8.11, random 13.51, σ = 1.5). Neither the average gain of intermediate models trained over various sources, nor the average gain for target tasks, correlates across different architectures. However, the target sensitivity, measured by max-gain, is correlated across all architectures (pairwise Pearson 0.6–0.94, p < 0.05). Thus, although the source-target decoupling and the target sensitivity are shared across architectures, a source dataset that produces high gains in one architecture might not do so in another.
A notable exception is the MNLI source dataset, which achieves the highest gain in all three architectures. Possibly, some data sources provide a strong intermediate model regardless of the PT model, with MNLI as a prominent example.

Related Work
Various works use intertraining, often following the assumption of task alignment necessity (Ein-Dor et al., 2022; Don-Yehiya et al., 2022a; Awasthy et al., 2020), namely, that the source acts as weakly labeled data (Shnarch et al., 2018; Yu et al., 2021). While we consider intertraining through finetuning, adaptation to the target (Shnarch et al., 2022) or the domain (Gururangan et al., 2020) was also suggested. Such adaptation may be applied to any base model, and is complementary to the choice among base models. The need for alignment was also previously debated in the context of pretraining tasks (Krishna et al., 2021; Rothe et al., 2021; Zhang et al., 2020; Ram et al., 2021).
The properties of intertraining were studied in other contexts. Phang et al. (2018a) suggested the intertraining scheme. Pruksachatkun et al. (2020) probed the changes in linguistic knowledge after intertraining, noting correlations with some target tasks, and hypothesized why some source tasks are good for intertraining. Mosbach et al. (2020) and Chang and Lu (2021) replicated the existence of good sources, but rejected the hypothesis. Others tried to find which tasks have an affinity to each other (Poth et al., 2021; Bassignana et al., 2022a,b; Vu et al., 2020), or when to prefer multitask learning (Weller et al., 2022). We study a much larger number of sources and targets, aiming to describe their natural distribution (cf. §9), and also find that while a specific choice may help, given enough models, it is safe to identify the models that simply excel in general.
Recent work focuses on fusing multiple base models rather than picking just one (Choshen et al., 2022; Matena and Raffel, 2021; Wortsman et al., 2022; Yadav et al., 2023). We expect our understanding to aid in choosing a subset of models to fuse as well.
Multi-task learning is another related field. It studies finetuning on different tasks at once (Aribandi et al., 2021; Aghajanyan et al., 2021), and recently a way to recycle models for this purpose was proposed (Don-Yehiya et al., 2022b). In contrast to our analysis, similarity between tasks aids multi-task learning (Abnar et al., 2021).
Deciding which information should accompany publications is an active research field, covering datasets (Holland et al., 2018; Gebru et al., 2021), human annotation sheets (Shimorina and Belz, 2022), and models (McMillan-Major et al., 2021; Mitchell et al., 2019). We propose that, upon sharing a model, one reports the aspects shown to affect its quality, such as LP(m, MNLI). For datasets, we propose to report intertraining sensitivity.

Practical recommendations
Based on the observations (§4) and analysis (§5), we propose a methodology for efficiently utilizing intertraining in real-world settings. We suggest to collaboratively rank all available base models for intertraining, and to utilize this list whenever intertraining is applied to a new target.
New model. When a new model becomes available, we encourage practitioners to assess and share its quality. This can be done efficiently, by linear probing on MNLI (§4.2), or comprehensively (App. §F), by finetuning on various datasets.
We created statically ranked lists for RoBERTa-base and T5-small in App. §E. Moreover, we apply our methods to many architectures and 36 test sets on a continuously updated site, reporting the best models.
New target. When considering intertraining on a new task, we recommend first checking the target dataset's sensitivity, and then choosing the base model. Since a model's rank hardly depends on the target dataset, we suggest using the static ranking. Specifically, we propose to finetune the top-ranked model and compare the results to those obtained when finetuning the respective PT model. Assuming the gain justifies it, one should consider a few additional intermediate models, in descending rank order, according to the allocated resources.

Discussion
In §8, we highlighted our practical recommendations for intertraining. Those, together with a systematic analysis of what affects intertraining, cover our main contributions. We hope this analysis will also aid future work on interactions between datasets, intertraining practices, and the reuse of finetuned models.
Our experiments mainly characterize what is probable rather than what is possible. We do not create specific models or aim to improve a specific task. Instead, we investigate what is likely to be found in typical practical settings. Accordingly, our findings are probabilistic in nature: most models are not beneficial as intermediate models, but enough are. Mostly, beneficial models are beneficial for many targets.
As a side effect, we do identify specific strong source models. MNLI was already considered a beneficial source dataset (Phang et al., 2018a), a finding which we corroborate in a comprehensive manner. Furthermore, when considering off-the-shelf models, we find models that outperform it (e.g., STS-B-based for RoBERTa, and Quora for T5). To facilitate finding additional and better base models, we will continuously report the best models per architecture on the website.

Limitations
One obvious limitation is that our work is empirical in nature. As such, we report observations, sampled or common, with no theoretical guarantees, and one should recognize the existence of exceptions. Specifically, even though we have not observed it, there might exist target tasks that benefit greatly from a certain type of intermediate model, or intermediate models that help greatly on many targets while degrading performance on others.
Moreover, while testing 22 source datasets, many of which were previously untested, we did not find a new top source for intertraining. The best one we found for RoBERTa was already known to be good (MNLI; Phang et al., 2018b; Pruksachatkun et al., 2020). That said, by checking dozens of off-the-shelf models, we did uncover new intermediate models that seem to outperform MNLI (e.g., STS-B for RoBERTa and QQP for T5; cf. App. §E). More work is needed to assess the potential intertraining gain of the available datasets and models.
We ran thousands of finetuning experiments, spanning a vast number of tasks and base models. Thus, while we believe it is unlikely that reproducing our experiments will result in different outcomes, the large scale of our experiments places a heavy burden on trying to replicate our results. Moreover, the off-the-shelf models used in our experiments might not be hosted publicly in the future.
Another limitation is that we could not upload all of the models to a shared location. This project was very computationally demanding, but more than that, it was demanding in terms of disk space; hence, we had to delete many models along the way.
Finally, for practical reasons, our results are limited to classification tasks in English. We hope that future work will test our conclusions beyond this scope. Overall, within the space of classification, we see our results as robust, testing on 22 datasets (about double the number in previous work (Pruksachatkun et al., 2020)). We hope the diversity of targets introduces large enough differences between datasets that the results would hold in other scopes as well.

A Task, Domain and Dataset
A task is defined by the input and the output. The input in our context is a text instance. The output could be, e.g., positive/negative/neutral for a sentiment analysis task, entailed/not-entailed for a textual entailment task, etc.
A domain is the type of text found in the examples, regardless of the labels. For example, a domain may be financial text or Twitter comments.
A dataset, for our purposes, is a set of examples and their labels, divided into train, dev, and test folds. As such, each dataset has a domain (characterizing its examples) and a task (for which its labels are annotated). Hence, we consider a subset of a dataset as another dataset. Note that in the literature these terms are often not well defined and may even be used interchangeably (Wang et al., 2019b).

B Hyperparameters
For RoBERTa, we train for 10 epochs with a linear learning-rate schedule of 5e-5 and warm-up over 0.6% of training, a batch size of 256, an early-stop epsilon of 0.001 accuracy points, patience of 20 × 50 × 256 examples, validating every 50 × 256 examples, using the AdamW optimizer (Loshchilov and Hutter, 2019) with weight decay 0.01 or 0. For BERT-base-uncased we use a 2e-5 learning rate and never use decay. For T5 we use a 1e-4 learning rate and train until early stopping occurs. We used A100 and V100 GPUs. Finetuning times vary, but all runs end within a few hours, most in less than an hour, and some take up to 8 hours.
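The RoBERTa settings above can be collected into a plain configuration object (the values come from the text; the field names are ours):

```python
# RoBERTa finetuning configuration as stated above (field names are ours).
roberta_config = {
    "epochs": 10,
    "learning_rate": 5e-5,
    "lr_schedule": "linear",
    "warmup_fraction": 0.006,        # 0.6% of training
    "batch_size": 256,
    "optimizer": "AdamW",
    "weight_decay": (0.01, 0.0),     # both values were used
    "early_stop_epsilon": 0.001,     # accuracy points
    "validate_every_examples": 50 * 256,
    "patience_examples": 20 * 50 * 256,
}

# Patience expressed in validation rounds: 20 checks without improvement.
patience_rounds = (roberta_config["patience_examples"]
                   // roberta_config["validate_every_examples"])
```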

C Datasets used
All datasets can be downloaded from HuggingFace datasets. As we used groups of datasets, we report here the full list of datasets each group contains.
Twitter: EmoInt (Mohammad and Bravo-Marquez, 2017), Emoji (Barbieri et al., 2018), Irony (Van Hee et al., 2018), OffenseEval (Zampieri et al., 2019), HatEval (Basile et al., 2019), Sentiment Analysis (Rosenthal et al., 2017). Whenever the test set is held out (as in GLUE and SuperGLUE), we extracted the smaller of 1K or 10% of the training examples as a test set. To reduce noise, we did not experiment with the small Stance (Mohammad et al., 2016) Twitter dataset originally found in TweetEval (Barbieri et al., 2020). In the T5 experiment (§E) we used the stance datasets as well, to have a large group. For MNLI we use the mismatched validation set as a test set and the matched one as a validation set.

D In-house models/targets additional information
We report in Table 3 the score of finetuning RoBERTa without intertraining.
We also report the standard deviation for each cell in the experiment, i.e., taking into account differences due to finetuning both the intermediate model and the target dataset, in Figure 5. For each seed, we finetune the PT over the source dataset to produce the base model, and use it to finetune on the target task. This means that each seed utilises a different base model. Note that §6.2 suggests different seeds may create base models of different quality. Note that to assess the variability of the averages reported in the main text (§4), the Standard Error of the Mean is required; this is the STD divided by the square root of the number of seeds: SE = STD/√5.
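The standard-error computation above is straightforward; a minimal sketch over five hypothetical seed scores for one (source, target) cell:

```python
import statistics
from math import sqrt

def standard_error(scores):
    """SE of the mean: sample STD divided by the square root of the number of seeds."""
    return statistics.stdev(scores) / sqrt(len(scores))

# Five hypothetical seed scores for one (source, target) cell.
seed_scores = [81.2, 80.7, 81.5, 80.9, 81.0]
se = standard_error(seed_scores)
```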

E.1 Models used
We manually collected 66 RoBERTa-base models.
Judging by their names, most were finetuned from vanilla RoBERTa, but a few were trained from scratch. They vary in hyperparameters and training details, just as one would expect training approaches to vary between different researchers. We supply the full lists of models in Tables 4 and 5 for RoBERTa and T5, respectively.

E.2 Results and discussion
We report the full results in Fig. 6. T5 models are reported in §7 (off-the-shelf models are evaluated on Twitter datasets only, as most of the General and NLI datasets were part of T5's pretraining; in-house models use t5.1.1). Seemingly, some traits of the training that we did not account for are important. This is exemplified by models that are associated with the same datasets but differ in their gains. For example, cross-encoder implementations outperform other models, and sentence-transformers underperform them.
Notably, most models are not useful for intertraining. Still, many are.
The best model on average is MUPPET (Aghajanyan et al., 2021), which uses a massive multitask learning approach. However, our results show that finetuning only on STS-B (rather than on 40 datasets) yielded similar results. We note that unlike MNLI, which is known to be a good source (Phang et al., 2018a), STS-B was previously considered only as a target task. The results suggest it might also be a good source.
We gathered the models by manually searching for 'RoBERTa-base' models, ignoring ones trained on languages other than English. It is possible we missed models that did not clearly state their architecture as part of the model name. We are aware of such models, for which the PT is not deducible from the title, for example those recently released by Juneja et al. (2022b).

F Rank by Average over Targets
In §4.2 we show that training one Linear Probe is enough to rank base models. Although tested on a large number of target datasets, presumably this method does not always achieve accurate predictions. For example, the target domain might be so different that MNLI would not be relevant. As a more accurate alternative, one can use several datasets to obtain a more reliable picture. We show that the average of finetuning gains over different datasets is a reliable way of choosing a base model. As in the simpler case of LP, this supports the decoupling.
We report in Table 6 the lost gain when choosing the highest-ranked models, ranked by average gain over the General group. This ranking method generalizes well: the 1 or 3 best-ranked models are close to the best possible model for each target. For example, only 2 targets lose more than 1 point by choosing among the top 3 models.
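Ranking by average gain over a dataset group can be sketched as follows (the model names and gain values are illustrative, not measured):

```python
def rank_by_average_gain(gains):
    """Rank base models by their mean intertraining gain over a dataset group."""
    avg = {model: sum(v) / len(v) for model, v in gains.items()}
    return sorted(avg, key=avg.get, reverse=True)

# Hypothetical gains of three base models over four General targets.
gains = {
    "mnli-model": [1.2, 0.8, 1.5, 0.9],
    "qqp-model":  [-0.5, 0.1, -0.2, 0.0],
    "stsb-model": [1.0, 1.1, 1.4, 1.2],
}
ranking = rank_by_average_gain(gains)
```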
Practically, we suggest ranking either by this method or by LP. If some use one method and others another, it might be hard to compare the two rankings. Thus, we report that in our experiments, the best predictor of the average score from the LP score is g_avg(m) = LP(m, MNLI) · 0.0822 − 0.940.
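The reported linear fit can be applied directly (the coefficients are those in the text; the LP score itself must be measured, and the input value below is hypothetical):

```python
def predicted_avg_gain(lp_mnli_score):
    """Predict a model's average finetuning gain over the General targets
    from its MNLI linear-probe score, using the reported linear fit:
    g_avg(m) = LP(m, MNLI) * 0.0822 - 0.940."""
    return lp_mnli_score * 0.0822 - 0.940

# Hypothetical linear-probe accuracy of 50 on MNLI.
g = predicted_avg_gain(50.0)
```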

G Symmetry metric
To measure the symmetry of a matrix M, we decompose it into its symmetric and skew-symmetric parts: M = S + V, where S = 1/2 (M + M^T) and V = 1/2 (M − M^T). S is symmetric (S = S^T) and V is skew-symmetric (V = −V^T). The measure of symmetry s of M considers the relation between S and V: s = (|S| − |V|)/(|S| + |V|), with s ∈ [−1, 1]. If s is close to −1, M is almost skew-symmetric (anti-symmetric); if s is close to 1, M is almost symmetric; if s is around zero, M is neither symmetric nor skew-symmetric.
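A minimal sketch of this decomposition and measure, assuming |·| denotes the entry-wise sum of absolute values (the paper does not specify the norm):

```python
def symmetry(M):
    """Symmetry measure s in [-1, 1]: +1 for a symmetric matrix,
    -1 for a skew-symmetric one. M is a square matrix as nested lists.
    Assumes |.| is the entry-wise sum of absolute values."""
    n = len(M)
    S = [[(M[i][j] + M[j][i]) / 2 for j in range(n)] for i in range(n)]  # symmetric part
    V = [[(M[i][j] - M[j][i]) / 2 for j in range(n)] for i in range(n)]  # skew part
    abs_sum = lambda A: sum(abs(x) for row in A for x in row)
    return (abs_sum(S) - abs_sum(V)) / (abs_sum(S) + abs_sum(V))
```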

H U-shape
We analyse how the intertraining gains change when more target data is available. We find that while intertraining often improves results for small data sizes, the effect decreases with size. Surprisingly, the gain drops below zero and at some size increases again. This suggests an unexplained underlying behaviour, presumably of two competing effects: one that decreases gains with size and one that improves them (in general, or toward 0). We show three examples of the U-shape in Figures 8, 9, and 10.

I Scores
We report the source and target scores of training on the MNLI dataset with 20 seeds. Target scores are the average over the General datasets. In Fig. 11, we present the results. Evidently, the two are not correlated.

I.1 Forgetting
If PT's success comes from honing the parameters, shifting away from them and forgetting the knowledge gained during pretraining is inadvisable in general (Chen et al., 2020) and possibly for intertraining (Pruksachatkun et al., 2020). With more training data comes more forgetting. This may also explain why most source models have a negative gain and intertraining hurts. Despite that, we observed in §6.1 that more source data empowers intertraining and improves gains. Following that observation, we analyze the importance of forgetting to the choice of a source model.
One common practice that causes forgetting is weight decay (Hanson and Pratt, 1988; Loshchilov and Hutter, 2019): a regularization term added to the model updates. The decay term penalizes large weights, shrinking PT's large weights that are not necessary for the target objective.
For this experiment we use the following setup: the AdamW (Loshchilov and Hutter, 2019) optimizer with weight decay 1 in the decay condition, and 0 (i.e., Adam; Kingma and Ba, 2014) otherwise. L2 regularization is 0.1; results with other rates showed similar tendencies, with effect sizes corresponding to the rate.
We find intermediate models trained with weight decay to be worse, but only if the pretraining did not include weight decay. Specifically, RoBERTa had weight decay during pretraining and T5 did not. We consider g_avg(MNLI), the average gain when the source is MNLI and the targets are General, with and without weight decay. With RoBERTa as PT, the gain with decay was slightly better than without, by 0.02 points, while with T5, decay lost 3.3 points. These changes were not reflected in the source task performance.
Second, we limit forgetting by adding a regularization term forcing the model to stay close to the pretrained model. This can be seen as the complement of the previous method: rather than updating toward zero, it updates toward the pretrained model. We find this method did not change much for MNLI (−0.28), but for models that hurt overall performance it prevented some of the loss, e.g., QQP (4.3). We note that while this regularization did not improve the top base model's gain, it did hurt the original finetuning on the source task (−5.2 points on MNLI). We further address the effect of source task score on base model quality in §6.2.
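The penalty pulling the weights back toward the pretrained model can be sketched as an L2 distance term (a minimal sketch over flattened weights with hypothetical values; the real version operates on the model's parameter tensors):

```python
def reg_toward_pretrained(theta, theta_pt, lam):
    """L2 penalty on the distance from the pretrained weights theta_pt.
    Contrast with weight decay, which pulls theta toward zero instead."""
    return lam * sum((t - p) ** 2 for t, p in zip(theta, theta_pt))

# Hypothetical flattened weights of the finetuned and pretrained models.
theta    = [0.5, -0.2, 1.0]
theta_pt = [0.4, -0.1, 1.1]
penalty = reg_toward_pretrained(theta, theta_pt, lam=0.1)
```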
We also followed the method of Kumar et al. (2022), which should reduce forgetting with LP, but it had little effect. All of the above findings imply that what determines a base model's effectiveness may be hidden in the training hyperparameters. For example, training on the same data and achieving similar results on the source may still yield quite different results on the target, depending on whether weight decay was used.

J Architectures
Figs. 12 and 13 depict the gains of in-house models trained on the General datasets over the General datasets. In Fig. 7 we report the gains from training on off-the-shelf T5 models. Interestingly, QQP, which is known to be a bad source for RoBERTa, is among the best intermediate models for T5. Presumably, this is due to the different training, where T5 generates the paraphrases rather than choosing between several. On a similar note, many of the top base models are trained on non-classification tasks, such as paraphrasing and question answering. This implies that the model weights converge to something quite general, learning linguistic traits that are not all discarded during finetuning. We say these are linguistic in the sense that the language, and perhaps common knowledge, is what makes the datasets similar; the tasks are quite different.

Figure 1 :
Figure 1: Results of the in-house models/targets experiment. Columns correspond to target datasets and rows correspond to intermediate models generated from the same datasets as sources. The 22 datasets come from the General, NLI and Twitter groups. Each value indicates the intertraining gain w.r.t. the PT model, averaged over 5 seeds. Sorted by group and source average gain (bottom row). Positive significant cells (>2 STD) are italicized.

Figure 2 :
Figure 2: Linear probing of MNLI (x) is enough to predict finetuning gains (y) averaged over 14 General datasets. Each point corresponds to one off-the-shelf base model.

Figure 3 :
Figure 3: For 'good' sources the average gain increases as the source training size increases, while for 'bad' sources it decreases.

Figure 4 :
Figure 4: The average gain across targets decreases as the target training size increases.

Figure 5 :
Figure 5: Standard deviation of the in-house models/targets experiment. Rows correspond to intermediate models, generated from 22 source datasets from the General, NLI and Twitter groups. Columns correspond to the same datasets, now used as target datasets. Each value indicates the standard deviation over 5 seeds.

Figure 6 :
Figure 6: Results of the off-the-shelf models/targets experiment. Rows correspond to off-the-shelf RoBERTa models downloaded from the HuggingFace model hub. Columns correspond to the General datasets group. Each value indicates the intertraining gain w.r.t. using the PT model.

Figure 8 :
Figure 8: Gains of QNLI from intertraining with different amounts of training data (x-axis) and different base models (lines).

Figure 9 :
Figure 9: Gains of SST2 from intertraining with different amounts of training data (x-axis) and different base models (lines).

Figure 10 :
Figure 10: Gains of WIC from intertraining with different amounts of training data (x-axis) and different base models (lines).

Figure 11 :
Figure 11: Source (MNLI) score against the average score on General datasets after intertraining. Each point represents a different MNLI intermediate model trained with a different seed.

Figure 12 :
Figure 12: BERT General sources and targets. The intertraining gain over the pretrained model for each source (row) and target (column) dataset.

Table 2 :
Intermediate models trained on sources from the same domain (Twitter) or task (NLI) as the target yield greater gains. Numbers represent the average gain of intermediate models of a source group (rows) on a given target group (columns).
Comparing sources (table rows), while NLI is best improved by NLI models, NLI models im-

Table 4 :
The RoBERTa models we used, collected from the HuggingFace model hub. Models are sorted by average gain over the General targets.

Table 5 :
The T5 models we used, collected from the HuggingFace model hub. Models are sorted by average gain over the General targets.

Table 6 :
Lost gain per target is minimal when choosing the highest-ranked models, ranked by average intertraining gain on the General datasets. Results are reported when selecting the top-ranked model or the best of the 3 top-ranked models (@Top column). Columns represent the aggregation of the lost gain: average, max, and the number of target datasets that lose at least one point. Rows represent two sets of experiments: in-house (with 22 models and 22 target datasets) and off-the-shelf (with 66 base models and 14 target datasets).