What to Pre-Train on? Efficient Intermediate Task Selection

Intermediate task fine-tuning has been shown to yield large transfer gains across many NLP tasks. With an abundance of candidate datasets as well as pre-trained language models, it has become infeasible to experiment with all combinations to find the best transfer setting. In this work, we provide a comprehensive comparison of different methods for efficiently identifying beneficial tasks for intermediate transfer learning. We focus on parameter- and computationally efficient adapter settings, highlight different data-availability scenarios, and provide expense estimates for each method. We experiment with a diverse set of 42 intermediate and 11 target English classification, multiple choice, question answering, and sequence tagging tasks. Our results demonstrate that efficient embedding-based methods, which rely solely on the respective datasets, outperform computationally expensive few-shot fine-tuning approaches. Our best methods achieve an average Regret@3 of 1% across all target tasks, demonstrating that we are able to efficiently identify the best datasets for intermediate training.


Introduction
Large pre-trained language models (LMs) are continuously pushing the state of the art across various NLP tasks. The established procedure performs self-supervised pre-training on a large text corpus and subsequently fine-tunes the model on a specific target task (Devlin et al., 2019; Liu et al., 2019b). The same procedure has also been applied to adapter-based training strategies, which achieve on-par task performance to full model fine-tuning while being considerably more parameter efficient (Houlsby et al., 2019) and faster to train (Rücklé et al., 2021). Besides being more efficient, adapters are also highly modular, enabling a wider range of transfer learning techniques (Pfeiffer et al., 2020b, 2021a; Üstün et al., 2020; Vidoni et al., 2020; Rust et al., 2021; Ansell et al., 2021).
Extending upon the established two-step learning procedure, incorporating intermediate stages of knowledge transfer can yield further gains for fully fine-tuned models. For instance, Phang et al. (2018) sequentially fine-tune a pre-trained language model on a compatible intermediate task before target task fine-tuning. It has been shown that this is most effective for low-resource target tasks; however, not all task combinations are beneficial and many yield decreased performance (Phang et al., 2018; Wang et al., 2019a; Pruksachatkun et al., 2020). The abundance of diverse labeled datasets as well as the continuous development of new pre-trained LMs calls for methods that efficiently identify intermediate datasets that benefit the target task.
So far, it is unclear how adapter-based approaches behave with intermediate fine-tuning. In the first part of this work, we thus establish that this setup results in similar gains for adapters, as has been shown for full model fine-tuning (Phang et al., 2018;Pruksachatkun et al., 2020;Gururangan et al., 2020). Focusing on a low-resource target task setup, we find that only a subset of intermediate adapters yield positive gains, while others hurt the performance considerably (see Table 1 and Figure 2). Our results demonstrate that it is necessary to obtain methods that efficiently identify beneficial intermediately trained adapters.
In the second part, we leverage the transfer results from part one to automatically rank and identify beneficial intermediate tasks. With the rise of large publicly accessible repositories for NLP models (Wolf et al., 2020; Pfeiffer et al., 2020a), the chances of finding pre-trained models that yield positive transfer gains are high. However, it is infeasible to brute-force the identification of the best intermediate task. Existing approaches have focused on beneficial task selection for multi-task learning (Bingel and Søgaard, 2017), full fine-tuning of intermediate and target transformer-based LMs for NLP tasks (Vu et al., 2020), adapter-based models for vision tasks (Puigcerver et al., 2021), and unsupervised approaches for zero-shot transfer for community question answering (Rücklé et al., 2020). Each of these works requires different types of data, such as intermediate task data and/or intermediate model weights, which, depending on the scenario, are potentially not accessible.

In this work we thus aim to address the efficiency aspect of transfer learning in NLP from multiple different angles, resulting in the following contributions:
1) We focus on adapter-based transfer learning, which is considerably more parameter (Houlsby et al., 2019) and computationally efficient than full model fine-tuning (Rücklé et al., 2021), while achieving on-par performance.
2) We evaluate sequential fine-tuning of adapter-based approaches on a diverse set of 42 intermediate and 11 target tasks (i.e., classification, multiple choice, question answering, and sequence tagging).
3) We identify the best intermediate task for transfer learning without the need for computationally expensive, explicit training on all potential candidates, comparing different selection techniques and consolidating previously proposed and new methods.
4) We provide a thorough analysis of the different techniques, available data scenarios, and task and model types, thus presenting deeper insights into the best approach for each respective setting.
5) We provide computational cost estimates, enabling informed decision making for trade-offs between expense and downstream task performance.
Related Work

Transfer Between Tasks

Phang et al. (2018) show that training on intermediate tasks results in performance gains for many target tasks. Subsequent work further explores the effects on more diverse sets of tasks (Wang et al., 2019a; Talmor and Berant, 2019; Liu et al., 2019a; Sap et al., 2019; Pruksachatkun et al., 2020; Vu et al., 2020). Wang et al. (2019a), Yogatama et al. (2019), and Pruksachatkun et al. (2020) emphasize the risks of catastrophic forgetting and negative transfer, finding that the success of sequential transfer varies largely when considering different intermediate tasks.
While previous work has shown that intermediate task training improves the performance on the target task in full fine-tuning setups, we establish that the same holds true for adapter-based training.

Predicting Beneficial Transfer Sources
Automatically selecting intermediate tasks that yield transfer gains is critical when considering the increasing availability of tasks and models.
Proxy estimators have been proposed to evaluate the transferability of pre-trained models towards a target task. Nguyen et al. (2020), Li et al. (2021), and Deshpande et al. (2021) estimate the transferability between classification tasks by building an empirical classifier from the source and target task label distributions. Puigcerver et al. (2021) experiment with multiple model selection methods, including kNN proxy models, to estimate the target task performance. In a similar direction, Renggli et al. (2020) study proxy models based on kNN and linear classifiers, finding that a hybrid combination of task-aware and task-agnostic strategies yields the best results. Bingel and Søgaard (2017) find that gradients of the learning curves correlate with multi-task learning success. Zamir et al. (2018) build a taxonomy of vision tasks, giving insights into non-trivial transfer relations between tasks. Multiple works propose using embeddings that capture statistics, features, or the domain of a dataset (Edwards and Storkey, 2017). While many different methods have been proposed, a direct comparison among them is lacking. Additionally, previous work has only focused on BERT, which we find to behave considerably differently from other model types, such as RoBERTa, for some methods. In this work we aim to consolidate all methods and experiment with newer model types to provide a more thorough perspective.

Adapter-Based Sequential Transfer
We present a large-scale study on adapter-based sequential fine-tuning, finding that around half of the task combinations yield no positive gains. This demonstrates the importance of finding approaches that efficiently identify suitable intermediate tasks.

Tasks
We experiment with a diverse set of 42 intermediate and 11 target tasks, covering classification, multiple choice, question answering, and sequence tagging; our full task selection is described in Appendix A.

Experimental Setup
We experiment with BERT-base (Devlin et al., 2019) and RoBERTa-base (Liu et al., 2019b), training adapters with the configuration proposed by Pfeiffer et al. (2021a). We adopt the two-stage sequential fine-tuning setup of Phang et al. (2018), splitting the tasks into two disjoint subsets S and T, denoted as intermediate and target tasks, respectively; the choice of this split is motivated by previous work (Sap et al., 2019; Vu et al., 2020, inter alia). For each pair (s, t) with s ∈ S and t ∈ T, we first train a randomly initialized adapter on s (keeping the base model's parameters fixed). We then fine-tune the trained adapter on t.

For target task fine-tuning, we simulate a low-resource setup by limiting the maximum number of training examples on t to 1000. This choice is motivated by the observation that smaller target tasks benefit the most from sequential fine-tuning while at the same time revealing the largest performance variances (Phang et al., 2018; Vu et al., 2020). Low-resource setups thus reflect the most beneficial application setting for our transfer learning strategy and also allow us to more thoroughly study different transfer relations.
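The two-stage procedure can be sketched with the AdapterHub adapter-transformers library; this is a minimal illustration (API names follow its recent releases; the training loops themselves are elided), not our exact training code:

```python
# Sketch of two-stage sequential adapter fine-tuning with the
# adapter-transformers (AdapterHub) library; training loops elided.
from transformers import AutoAdapterModel

model = AutoAdapterModel.from_pretrained("roberta-base")

# Stage 1: train a randomly initialized adapter on intermediate task s,
# keeping the pre-trained base model frozen.
model.add_adapter("s", config="pfeiffer")  # configuration of Pfeiffer et al. (2021a)
model.train_adapter("s")                   # freeze all non-adapter weights
# ... standard fine-tuning loop on the full intermediate dataset ...
model.save_adapter("./adapters/s", "s")

# Stage 2: load the trained adapter and fine-tune it on target task t,
# limited to at most 1000 training examples.
name = model.load_adapter("./adapters/s")
model.train_adapter(name)
# ... fine-tuning loop on the low-resource target dataset ...
```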

Results
Figure 2 shows the relative transfer gains and Table 1 lists the absolute scores of all intermediate and target task combinations for RoBERTa; the corresponding transfer results for BERT are listed in Table 10 of the Appendix. We observe large variations in transfer gains (and losses) across the different combinations. Even though larger variances may be explained by a higher task difficulty (see 'No Transfer' in Table 1), they also illustrate the heterogeneity and potential of sequential fine-tuning in our adapter-based setting. At the same time, we find several cases of transfer losses, with up to 60% lower performance (see Figure 2), potentially occurring due to catastrophic forgetting. Overall, for RoBERTa, 243 (53%) transfer combinations yield positive transfer gains whereas 203 (44%) yield losses. The mean of all transfer gains is 2.3%. However, of our eleven target tasks, only five benefit on average (see 'Avg. Transfer' in Table 1). This illustrates the high risk of choosing the wrong intermediate tasks. Avoiding such hurtful combinations and efficiently identifying the best ones is necessary; evaluating all combinations is inefficient and often not feasible.

We further find that the best performing intermediate tasks for BERT and RoBERTa overlap considerably, as illustrated in Figure 1, with transfer performances correlating with a Spearman correlation of 0.94 when averaged over all settings, and 0.68 when averaged per target task.

Methods for the Efficient Selection of Intermediate Tasks
We now present different model selection methods and, later in §5, study their effectiveness in our setting outlined above. We group the different methods based on the assumptions they make with regard to the availability of intermediate task data D_S and intermediate models M_S. Access to both can be expensive when considering large pre-trained model repositories with hundreds of tasks.

Metadata: Educated Guess
A setting in which there is access neither to the intermediate task data D_S nor to models M_S trained on that data can be regarded as an educated guess scenario. The selection criterion can then only rely on metadata available for an intermediate task dataset.
Dataset Size. Under the assumption that more data implies better transfer performance, the selection criterion denoted as Size ranks all intermediate tasks in descending order by the training data size.
Task Type. Under the assumption that similar objective functions transfer well, we pre-select the subset of tasks of the same type. This approach may be combined with a random selection of the remaining tasks, or with ranking them by size.
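Both criteria reduce to simple sorting over dataset metadata; a minimal sketch with illustrative task entries:

```python
# Illustrative metadata for a few intermediate tasks.
tasks = [
    {"name": "mnli",  "type": "classification", "train_size": 392702},
    {"name": "squad", "type": "qa",             "train_size": 87599},
    {"name": "cola",  "type": "classification", "train_size": 8551},
]

# Size: rank all intermediate tasks by training set size, descending.
by_size = sorted(tasks, key=lambda t: t["train_size"], reverse=True)

# Task type: pre-select same-type tasks, then rank each group by size.
def type_then_size(tasks, target_type):
    same = [t for t in tasks if t["type"] == target_type]
    other = [t for t in tasks if t["type"] != target_type]
    rank = lambda ts: sorted(ts, key=lambda t: t["train_size"], reverse=True)
    return rank(same) + rank(other)

print([t["name"] for t in type_then_size(tasks, "classification")])
# -> ['mnli', 'cola', 'squad']
```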

Intermediate Task Data
With an abundance of available datasets and the continuous development of new LMs, fine-tuned versions for every task-model combination are not (immediately) available. The following methods thus leverage the intermediate task data D_S without requiring the respective fine-tuned models M_S. They embed the intermediate and target datasets and rank intermediate tasks by the similarity of their dataset embeddings to that of the target task.

Text Embeddings (TextEmb). Following Vu et al. (2020), we represent a dataset by averaging the hidden representations of the pre-trained LM over its training examples.

Sentence Embeddings (SEmb). Analogously, we embed all training examples with a pre-trained sentence embedding model and average them to obtain a single dataset representation.

Intermediate Models

The following methods instead assume access to the models M_S trained on the intermediate tasks, but not necessarily to the intermediate task data itself.

Few-Shot Fine-Tuning (FSFT). Fine-tuning all available intermediate task models on the entire target task is infeasible. As an alternative, we can train the models for a few steps on the target task to approximate their final performance. After N steps on the target task, we rank the intermediate models based on their respective transfer performance.
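A minimal sketch of the FSFT ranking loop in plain PyTorch; the model and dataloader handles and the accuracy-based evaluate helper are generic stand-ins rather than our actual training code:

```python
import torch

@torch.no_grad()
def evaluate(model, eval_loader):
    """Accuracy of a classification model on held-out target data."""
    model.eval()
    correct = total = 0
    for batch in eval_loader:            # batches are dicts of tensors
        labels = batch.pop("labels")
        preds = model(**batch).logits.argmax(dim=-1)
        correct += (preds == labels).sum().item()
        total += labels.numel()
    return correct / total

def fsft_score(model, train_loader, eval_loader, n_steps=50, lr=1e-4):
    """Fine-tune an intermediate model for n_steps on the target task
    and return its approximate target performance."""
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
    model.train()
    for step, batch in enumerate(train_loader):
        if step >= n_steps:
            break
        loss = model(**batch).loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
    return evaluate(model, eval_loader)

# Rank intermediate models by their few-shot target performance, e.g.:
# scores = {s: fsft_score(load_intermediate(s), train_dl, eval_dl) for s in S}
# ranking = sorted(scores, key=scores.get, reverse=True)
```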
Proxy Models. Following Puigcerver et al. (2021), we leverage simple proxy models to obtain a performance estimate of each trained model M_S on the target dataset D_T. Specifically, we experiment with k-nearest neighbors (kNN), with k = 1 and Euclidean distance, and logistic/linear regression (linear) as proxy models. For both, we first compute $h^{M}_{x_i}$, the token-wise averaged output representation of M_S, for each training input $x_i \in D_T$. Using these, we define $D_T^M = \{(h^{M}_{x_i}, y_i)\}_i$ as the target dataset embedded by M_S. In the next step, we apply the proxy model on $D_T^M$ and obtain its performance using cross-validation. By repeating this process for each intermediate task model, we obtain a list of performance scores which we leverage to rank the intermediate tasks.
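The proxy-model step amounts to cheap cross-validated classification on the embedded target dataset; a sketch with scikit-learn, using random stand-in embeddings in place of the $h^{M}_{x_i}$ representations:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

def proxy_score(X, y, proxy="knn"):
    """5-fold cross-validated accuracy of a cheap proxy model on the
    target dataset embedded by one intermediate model."""
    if proxy == "knn":
        clf = KNeighborsClassifier(n_neighbors=1, metric="euclidean")
    else:  # "linear"
        clf = LogisticRegression(max_iter=1000)
    return cross_val_score(clf, X, y, cv=5).mean()

# Stand-in data: X would hold the h_{x_i}^M representations of the
# target inputs, y the corresponding target labels.
X, y = np.random.randn(200, 768), np.random.randint(0, 2, 200)
print(proxy_score(X, y, proxy="knn"), proxy_score(X, y, proxy="linear"))
```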

Intermediate Model and Task Data
Task Embeddings (TaskEmb). Given the model weights θ and the joint distribution of task features and labels $P_\theta(X, Y)$, we can define the FIM as the expected covariance of the gradients of the log-likelihood w.r.t. θ:

$$F_\theta = \mathbb{E}_{(x, y) \sim P_\theta(X, Y)} \left[ \nabla_\theta \log P_\theta(x, y) \, \nabla_\theta \log P_\theta(x, y)^\top \right]$$

We follow the implementation details given in Vu et al. (2020). For a dataset D and a model M fine-tuned on D, we compute the empirical FIM based on D's examples. The task embeddings are the diagonal entries of the FIM.
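A minimal sketch of the diagonal empirical FIM under the common approximation of taking labels from the data (the training loss serves as the negative log-likelihood); in our setup the trainable adapter parameters would be the relevant subset:

```python
import torch

def task_embedding(model, loader, max_batches=100):
    """Diagonal of the empirical FIM: per-parameter mean of squared
    gradients of the log-likelihood, flattened into one vector."""
    params = [p for p in model.parameters() if p.requires_grad]
    fim = [torch.zeros_like(p) for p in params]
    model.train()
    n = 0
    for i, batch in enumerate(loader):
        if i >= max_batches:
            break
        model.zero_grad()
        model(**batch).loss.backward()   # loss = negative log-likelihood
        for f, p in zip(fim, params):
            if p.grad is not None:
                f.add_(p.grad.detach() ** 2)  # squared gradient = diag. FIM term
        n += 1
    return torch.cat([(f / n).flatten() for f in fim])
```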
Few-Shot Task Embeddings (FS-TaskEmb). We also leverage task embeddings in our few-shot scenario outlined above (see FSFT), where we fine-tune intermediate models for a few steps on the target dataset. With very few training instances, the accuracy scores of FSFT (alone) may not be reliable indicators of the final transfer performance.
As an alternative, we compute the TaskEmb similarity of each intermediate model before and after training N steps on the target task. We then rank all intermediate models in decreasing order of this similarity.
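A sketch of this scoring, reusing the task_embedding helper from the sketch above; the few fine-tuning steps mirror the FSFT loop:

```python
import copy
import torch
import torch.nn.functional as F

def fs_taskemb_score(model, target_loader, n_steps=50, lr=1e-4):
    """Cosine similarity between the task embedding of an intermediate
    model before and after a few fine-tuning steps on the target task;
    models whose embedding moves least are ranked first."""
    emb_before = task_embedding(model, target_loader)
    tuned = copy.deepcopy(model)
    optimizer = torch.optim.AdamW(tuned.parameters(), lr=lr)
    tuned.train()
    for step, batch in enumerate(target_loader):
        if step >= n_steps:
            break
        tuned(**batch).loss.backward()
        optimizer.step()
        optimizer.zero_grad()
    emb_after = task_embedding(tuned, target_loader)
    return F.cosine_similarity(emb_before, emb_after, dim=0).item()
```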

Experimental Setup
We evaluate the approaches of §4, each having the objective of ranking the intermediate adapters s ∈ S with respect to their performance on t ∈ T when applied in a sequential adapter training setup. We leverage the transfer performance results of our 462 experiments obtained in §3 for this ranking task.

Hyperparameters
If not otherwise mentioned, we follow the experimental setup as described in §3. We describe method-specific hyperparameters in the following.

SEmb. We use Sentence-(Ro)BERT(a)-base models, fine-tuned on NLI and STS tasks, in concordance with the respective target model type.
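A sketch of the SEmb ranking with the sentence-transformers library; the model name is one of the public NLI+STS-trained BERT-base variants, and target_texts / intermediate_texts are assumed lists of raw input strings:

```python
import numpy as np
from sentence_transformers import SentenceTransformer

# One of the public BERT-base models fine-tuned on NLI and STS data.
sbert = SentenceTransformer("bert-base-nli-stsb-mean-tokens")

def dataset_embedding(texts, max_examples=1000):
    """Mean sentence embedding over (a sample of) a dataset's texts."""
    return sbert.encode(texts[:max_examples]).mean(axis=0)

def cosine(a, b):
    return float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Rank intermediate tasks by similarity to the target dataset:
# target_emb = dataset_embedding(target_texts)
# scores = {s: cosine(dataset_embedding(texts), target_emb)
#           for s, texts in intermediate_texts.items()}
# ranking = sorted(scores, key=scores.get, reverse=True)
```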
FSFT. We fine-tune each intermediate adapter on the target task for one full epoch and rank them based on their target task performances.

Proxy Models. For both kNN and linear, we obtain performance scores with 5-fold cross-validation on each target task. The architectures slightly vary across task types. For classification, regression, and multiple-choice target tasks, proxy models predict the label or answer choice; for sequence tagging target tasks, they predict the label of each token.

TaskEmb. We perform standard fine-tuning of randomly initialized adapter modules within the pre-trained LM to obtain task embeddings.

FS-TaskEmb. We follow the setup of FSFT, training for one epoch (50 update steps).

Metrics
We compute the NDCG (Järvelin and Kekäläinen, 2002), a widely used information retrieval metric that evaluates a ranking with attached relevances (which correspond to our transfer results of §3). Furthermore, we calculate Regret@k (Renggli et al., 2020), which measures the relative performance difference between the top k selected intermediate tasks and the optimal intermediate task:

$$\mathrm{Regret@}k = \frac{O(S, t) - M_k(S, t)}{O(S, t)}$$

where T(s, t) is the performance on target task t when transferring from intermediate task s, O(S, t) denotes the expected target task performance of an optimal selection, and M_k(S, t) is the highest performance on t among the k top-ranked intermediate tasks of the tested selection method. We take the difference between both measures and normalize it by the optimal target task performance to obtain our final relative notion of regret.

Experimental Results

Table 2 shows the results when selecting among all available intermediate tasks for BERT and RoBERTa. As expected, the Random and Size baselines do not yield good rankings when selecting among all intermediate tasks.
Access to only D_S or M_S. These methods typically perform better than our baselines. TextEmb and SEmb perform on par in most cases (note that the used SBERT model is trained on NLI and STS-B tasks, which are included in our set of intermediate and target tasks, respectively; a direct comparison between TextEmb and SEmb for the respective classification tasks is thus difficult). While FSFT outperforms the other approaches in most cases, it comes at the high cost of downloading and fine-tuning all intermediate models for a few steps. This can be prohibitive if we consider many intermediate tasks. If we have access to TextEmb or SEmb information of the intermediate task (i.e., individual vectors distributed as part of a model repository), these techniques yield similar performances at a much lower cost.

Access to both D_S and M_S. Assuming the availability of both intermediate models and intermediate data is the most prohibitive setting. Surprisingly, we find BERT and RoBERTa to behave considerably differently, especially evident for QA tasks. As shown by Vu et al. (2020), TaskEmb performs very well for BERT; however, we find that the results of this gradient-based approach do not translate to RoBERTa. While these approaches perform best or competitively for all task types using BERT, they considerably underperform all methods when leveraging pre-trained RoBERTa weights. Here, the two much simpler domain embedding methods outperform the TaskEmb method based on the FIM.

Summary. We find that simple indicators such as domain similarity are suitable for selecting intermediate pre-training tasks for both BERT and RoBERTa based models. Our evaluated methods are able to efficiently select the best performing intermediate tasks with a Regret@3 of 0.0 in many cases. We also find that combining domain and task type match indicators often yields the best overall results, outperforming computationally more expensive methods (see the pre-ranking analysis below and Table 5 of the Appendix). Our results thus show that the selection methods are able to effectively rank the top tasks with relative certainty, considerably reducing the number of necessary experiments.

Analysis

Computational Costs. Table 3 estimates the computational costs of each transfer source selection method. Complexity shows the required data passes through the model; we neglect computations related to embedding similarities and proxy models as they are cheap compared to model forward/backward passes. For the embedding-based approaches, we assume pre-computed embeddings for all intermediate tasks. For TaskEmb, we only train an adapter on the target task for e epochs.
In addition to the complexity, we calculate the required multiply-accumulate computations (MACs) for 42 intermediate tasks and one target task with 1000 training examples, each with an average sequence length of 128; we record MACs with the pytorch-OpCounter package. Following our experimental setup in §5, we set e = 15 for TaskEmb and e = 1 for FSFT/FS-TaskEmb. We find that embedding-based methods require two orders of magnitude fewer computations compared to fine-tuning approaches. The difference may be even larger when we consider more intermediate tasks.

(Table 3: computational costs of each transfer source selection method; columns: Method, Complexity, MACs.)
Since fine-tuning approaches do not yield gains that would warrant the high computational expense (see §6), we conclude that SEmb has the most favorable trade-off between efficiency and effectiveness.
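MAC estimates of this kind can be reproduced with the pytorch-OpCounter package (imported as thop); a minimal sketch for a single forward pass, with illustrative batch and sequence dimensions:

```python
import torch
from thop import profile
from transformers import AutoModel

model = AutoModel.from_pretrained("roberta-base")
input_ids = torch.randint(0, model.config.vocab_size, (1, 128))  # seq. length 128

# profile() counts the MACs of one forward pass; a backward pass is
# commonly approximated as roughly twice the forward cost.
macs, params = profile(model, inputs=(input_ids,), verbose=False)
print(f"{macs / 1e9:.1f} GMACs per forward pass, {params / 1e6:.1f}M parameters")
```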
SEmb Model Dependency. We compare different pre-trained sentence-embedding model variants to identify the extent to which SEmb is invariant to such changes. We experiment with BERT and RoBERTa variants of sizes Distill, Base, and Large, and present results for RoBERTa tasks in Table 4; the full results can be found in Table 7 of the Appendix. We find that all variants perform comparably, demonstrating that SEmb is a computationally efficient, model-type invariant method for selecting beneficial intermediate tasks.
BERT vs. RoBERTa TaskEmb Space. To better understand the TaskEmb performance differences between BERT and RoBERTa models, we visualize the respective embedding spaces using t-SNE in Figure 3. We find that BERT embeddings are clustered much more closely in the vector space than RoBERTa embeddings. While TaskEmbs of BERT also tend to be located in the proximity of related tasks, TaskEmbs of RoBERTa are distributed further apart. This can result in worse performance due to the curse of dimensionality. Overall, our results and analysis suggest that TaskEmb, unlike SEmb, considerably depends on the chosen base model.
Within- and Across-Type Transfer. Our experimental setup includes tasks of four different types, i.e., Transformer prediction head structures: sequence classification/regression, multiple choice, extractive question answering, and sequence tagging. Figure 4 compares the relative transfer gains within and across these task types for RoBERTa. We see that within-type transfer is consistently stronger across all target tasks. We find the largest differences between within-type and across-type transfer for the extractive QA target tasks. These observations may be partly explained by the homogeneity of the included QA intermediate tasks; they overwhelmingly focus on general reading comprehension across multiple domains, with paragraphs from Wikipedia or the web as contexts. Tasks of other types more distinctly focus on individual domains and scenarios.
Overall, we find a negative across-type transfer gain (i.e., a loss) for 8 out of 11 tested target tasks (on average). This suggests that a task type match between intermediate and target task is a strong indicator for transfer success. Thus, in the following, we evaluate variants of all methods presented in §4 that prefer intermediate tasks of the same type as the target task.

Pre-Ranking by Task Types. We implement a simple mechanism to ensure that tasks with the same type as the target task are always ranked before tasks of other types during intermediate task selection: given a task selection method, we first rank all tasks of the same type at the top before ranking tasks of all other types below, as sketched after this paragraph. Results for applying this mechanism to all presented task selection methods are given for BERT and RoBERTa in Table 5 of the Appendix.
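The mechanism is a stable partition of an existing ranking; a minimal sketch (task names illustrative):

```python
def pre_rank_by_type(ranking, task_types, target_type):
    """Move same-type tasks to the top, preserving the relative order
    produced by the underlying selection method."""
    same = [t for t in ranking if task_types[t] == target_type]
    other = [t for t in ranking if task_types[t] != target_type]
    return same + other

types = {"mnli": "classification", "squad": "qa", "cola": "classification"}
print(pre_rank_by_type(["mnli", "squad", "cola"], types, "qa"))
# -> ['squad', 'mnli', 'cola']
```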
We find that even though the Random and Size baselines do not yield good rankings when selecting among all intermediate tasks (cf. Table 2), the scores considerably improve when preferring tasks of the same type. In general, we see almost consistent improvements across all task selection methods for both BERT and RoBERTa when implementing pre-ranking by task types. Considering all target tasks and all methods, preferring intermediate tasks of the same type yields improved NDCG scores in 77 of 99 cases.

Further Analysis. We further find that embedding-based approaches are sample efficient, while FSFT approaches are not (§D). We also report results for combining ranking approaches with rank fusion, which does not yield consistent improvements over the individual approaches presented before (§E).

Conclusion
In this work we have established that intermediate pre-training can yield gains in adapter-based setups; however, around 44% of all transfer combinations result in decreased performance. We have consolidated several existing and new methods for efficiently identifying beneficial intermediate tasks.
Experimenting with different model types, we find that the previously proposed best performing approaches for BERT do not translate to RoBERTa.
Overall, efficient embedding-based methods, such as those relying on pre-computable sentence representations, perform better than, or often on par with, more expensive approaches. The best methods achieve a Regret@3 of less than 1% on average, demonstrating that they are effective at efficiently identifying the best intermediate tasks.

A Tasks
Our experiments cover a diverse set of 53 different tasks, broadly divided into the four task types sequence classification/regression, multiple choice, extractive question answering, and sequence tagging. Motivated by previous work, we first select tasks that are either part of widely used benchmarks (Wang et al., 2018, 2019b; Talmor and Berant, 2019) or have previously been applied successfully in sequential transfer setups (Sap et al., 2019; Liu et al., 2019a; Pruksachatkun et al., 2020; Vu et al., 2020). Additionally, we include other recent challenging tasks that fall under the four defined task types (e.g., Bhagavatula et al., 2020; Rogers et al., 2020) and tasks that extend the range of included dataset sizes and task domains. In general, we focus on tasks with publicly available datasets, e.g., via HuggingFace Datasets. Our full set of tasks is split into 42 intermediate tasks, presented in Table 8, and 11 target tasks, presented in Table 9.

B Transfer training details
For all our experiments, we use the PyTorch implementations of BERT and RoBERTa in the HuggingFace Transformers library (Wolf et al., 2020) as the basis. The adapter implementation is provided by the AdapterHub framework (Pfeiffer et al., 2020a) and integrated into the Transformers library.
In light of the number and variety of different tasks used, we do not perform extensive hyperparameter tuning on each training task. We mostly adhere to the hyperparameter recommendations of the Transformers library and Pfeiffer et al. (2021a) for adapter training. Specifically, we train all adapters for a maximum of 15 epochs, with early stopping after 3 epochs without improvement on the validation set. We use a learning rate of $10^{-4}$ and batch sizes between 4 and 32, depending on the size of the dataset. These settings apply to the adapter training on each intermediate task as well as the subsequent fine-tuning on the target dataset. Additionally, since performance on the low-resource target tasks can be unstable, we perform multiple random restarts (five for RoBERTa and three for BERT) for all training runs on the target tasks, reporting the mean of all restarts. The final scores on each task are computed on the respective test sets if publicly available, otherwise on the validation sets.
Results for RoBERTa are shown in Table 1 and results for BERT are shown in Table 10.
C Metrics for transfer source selection

C.1 NDCG

Following Vu et al. (2020), we compute the Normalized Discounted Cumulative Gain (NDCG) (Järvelin and Kekäläinen, 2002), a widely used information retrieval metric that evaluates a ranking with attached relevances. The NDCG is defined via the Discounted Cumulative Gain (DCG), which represents a relevance score for a set of items, each discounted by its position in the ranking. The DCG of a ranking R, accumulated at a particular rank position p, can be computed as:

$$\mathrm{DCG}_p(R) = \sum_{i=1}^{p} \frac{rel_i}{\log_2(i + 1)}$$

In our setting, R refers to a ranking of intermediate tasks where the relevance rel_i of the intermediate task with rank i is set to the mean target performance when transferring the adapter trained on this intermediate task, i.e., rel_i ∈ [0, 100]. We always evaluate the full ranking of intermediate tasks, thus we set p = |S|.
The NDCG finally normalizes the DCG of the ranking predicted by the task selection method (R_pred) by the DCG of the perfect ranking produced by the empirical transfer results (R_true), i.e., NDCG(R_pred) = DCG(R_pred) / DCG(R_true). An NDCG of 100% indicates a perfect ranking.

C.2 Choice of metrics
Our selection of evaluation metrics combines two measures that evaluate the quality of the full ranking (NDCG) and the top selections of each method (Regret). We prefer this combination over various other common evaluation metrics. We experimented with classical correlation measures such as Spearman rank correlation, finding that they give a poor indication of the overall quality of a selection method. The Spearman correlation is agnostic to the location within the ranking, thus penalizing mismatches at the bottom of the ranking with the same weight as mismatches at the top. In our setting, the top ranks are more important, making the NDCG, which is biased towards correct rankings at the top, a better fit. Renggli et al. (2020) further discuss the limitations of correlation as an evaluation metric for task selection. Vu et al. (2020) use the average predicted rank ρ of the source task with the best target performance as an additional metric. However, this metric does not account for the real target performance difference between the top-ranked source tasks across different methods. As a simple example, assume two selection methods A and B assign the top performing source task s_max to the same average rank. Further, A ranks a different source task on top which performs nearly on par with s_max, while B predicts a much weaker source task on top. In this case, we clearly would want to prefer method A over method B. Unlike ρ, our choice of regret as an evaluation metric considers these differences.
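A sketch of both metrics as used here, assuming relevances maps each intermediate task to its (mean) transfer performance on the target task and ranking is the order predicted by a selection method:

```python
import math

def ndcg(ranking, relevances):
    """NDCG over the full ranking, with transfer performance as relevance."""
    def dcg(order):
        return sum(relevances[t] / math.log2(i + 2)   # ranks are 1-indexed
                   for i, t in enumerate(order))
    ideal = sorted(ranking, key=lambda t: relevances[t], reverse=True)
    return dcg(ranking) / dcg(ideal)

def regret_at_k(ranking, relevances, k=3):
    """Relative gap between the best top-k pick and the optimal task."""
    optimal = max(relevances.values())
    best_top_k = max(relevances[t] for t in ranking[:k])
    return (optimal - best_top_k) / optimal

rel = {"a": 80.0, "b": 75.0, "c": 60.0}
print(ndcg(["b", "a", "c"], rel), regret_at_k(["b", "a", "c"], rel, k=1))
```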

D Sample Efficiency
Embedding-based approaches. Intermediate pre-training can have a larger impact on small target tasks. We therefore analyze and compare the effectiveness of embedding-based approaches with only 10, 100, and 1000 target examples. Figure 5 plots the results for all feature embedding methods when applied to intermediate task selection for RoBERTa. We find that the quality of the rankings can decrease substantially in the smallest setting with only 10 target examples. SEmb is a notable exception, achieving results close to those with the full 1000 examples (73% vs. 74.9% NDCG). With that, SEmb consistently performs above all other methods in all settings.

Few-shot approaches. We experiment with N ∈ {5, 10, 25, 50} update steps for the fine-tuning methods FSFT and FS-TaskEmb. Results for RoBERTa are shown in Figure 6. While, unsurprisingly, the performance of both methods improves consistently with the number of fine-tuning steps, FS-TaskEmb produces superior rankings at earlier checkpoints but is outperformed by FSFT in the long run. The results indicate that fewer than 25 update steps do not provide sufficient evidence to reliably predict the best intermediate tasks.

E Rank Fusion
Vu et al. (2020) use the Reciprocal Rank Fusion algorithm (Cormack et al., 2009) to aggregate the rankings of TextEmb and TaskEmb. We further experiment with various combinations of ranks produced by methods of different categories, e.g., Size + SEmb. Table 6 shows the results for a selection of all possible method combinations when applied to intermediate task selection for RoBERTa.
In a few cases, fusing improves performance over the single-method performances of all included methods (e.g., TaskEmb+TextEmb). However, in most cases, rank fusion performance is either roughly on par with that of the best included single method (e.g., SEmb+TaskEmb) or even hurts task selection performance, sometimes significantly (e.g., Size+SEmb). Thus, while adding computational overhead to the task selection process, fusing does not yield better performance in general.
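Reciprocal Rank Fusion scores each task by summing reciprocal ranks across the fused rankings; a minimal sketch following Cormack et al. (2009), with the customary constant k = 60 (task names illustrative):

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse rankings via score(t) = sum over rankings of 1 / (k + rank(t))."""
    scores = {}
    for ranking in rankings:
        for rank, task in enumerate(ranking, start=1):
            scores[task] = scores.get(task, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# e.g., fusing a Size ranking with an SEmb ranking:
print(reciprocal_rank_fusion([["mnli", "squad", "cola"],
                              ["squad", "cola", "mnli"]]))
# -> ['squad', 'mnli', 'cola']
```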

F SEmb Model Dependency
The full results of our experiments with sentence-embedding model variants can be found in Table 7. Experiments were conducted on RoBERTa transfer results.

(Table 9: Overview of target tasks used in our experiments, grouped by task type. Notes: we use the version provided in MultiQA (Talmor and Berant, 2019); instead of performing full dependency parsing, we only label each token in a sentence with a label corresponding to the dependency relation to its head, as this task can be modeled directly as a sequence tagging task.)