Evidence > Intuition: Transferability Estimation for Encoder Selection

With the increasing availability of large pre-trained language models (LMs) in Natural Language Processing (NLP), it becomes critical to assess their fit for a specific target task a priori, as fine-tuning the entire space of available LMs is computationally prohibitive and unsustainable. However, encoder transferability estimation has received little to no attention in NLP. In this paper, we propose to generate quantitative evidence to predict which LM, out of a pool of models, will perform best on a target task without having to fine-tune all candidates. We provide a comprehensive study on LM ranking for 10 NLP tasks spanning the two fundamental problem types of classification and structured prediction. We adopt the state-of-the-art Logarithm of Maximum Evidence (LogME) measure from Computer Vision (CV) and find that it positively correlates with final LM performance in 94% of the setups. In the first study of its kind, we further compare transferability measures with the de facto standard of human practitioner ranking, finding that evidence from quantitative metrics is more robust than pure intuition and can help identify unexpected LM candidates.


Introduction
Advances in Deep Learning-based NLP and CV build on expressive representations from encoder models pre-trained on massive corpora. Downstream models make use of latent information in these representations to extract relevant features for the task at hand. Within this paradigm, deciding which pre-trained encoder to use in any task-specific architecture is crucial; however, training a model using each encoder candidate is infeasible. In the absence of prior heuristics (e.g., via related work), the choice of encoder has therefore prevailingly been based on practitioner intuition rather than quantitative evidence.
The authors contributed equally to this work.
In NLP, prior work has examined the different yet related task of performance prediction (Xia et al., 2020a; Ye et al., 2021), surveyed and categorized LMs (Xia et al., 2020b), and used probing to predict LM performance specifically for dependency parsing (Müller-Eberstein et al., 2022b), but has yet to extensively investigate how to rank the increasingly large number of pre-trained LM encoders across various tasks and domains. Preliminary work by You et al. (2021) shows that the LogME estimator holds promise, including first steps towards encoder selection in NLP. With their main focus being on CV, however, they evaluate only a limited set of tasks and models for NLP and use self-reported benchmark scores instead of running controlled experiments, which should include, e.g., the variance across initializations, domains, and fine-tuning strategies (Section 2). As such, we seek to answer: How well can we estimate the transferability of pre-trained LMs to specific NLP tasks? To do so, we contribute:
• The broadest encoder selection study in NLP to date, on 10 domain-diverse classification and structured prediction tasks (Section 3);
• An extensive evaluation and analysis across multiple dimensions of variation, including seven general vs. domain-specific LMs, [CLS] vs. mean representations, and head vs. full model fine-tuning (Section 4);
• A study with NLP experts, comparing the prevailing ranking of LMs by human intuition with LogME's empirical evidence (Section 5);
• Guidelines for applying and interpreting transferability measures in NLP (Section 6), and an open-source toolkit for efficient, task-adaptive LM pre-selection.
DATASET                          TASK                            TRAIN / DEV   |Y|  METRIC
CLASSIFICATION
AGNews (Zhang et al., 2015)      Topic Classification            84K / 12K      4   micro-F1
Airline (Crowdflower, 2020)      Sentiment Analysis              10K / 1.5K     3   micro-F1
SciERC (Luan et al., 2018)       Relation Classification         1.9K / 275     7   macro-F1
MNLI (Williams et al., 2018)     Natural Language Inference      393K / 20K     3   micro-F1
QNLI (Rajpurkar et al., 2016)    Q&A/Natural Language Inference  105K / 5.4K    2   micro-F1
RTE (Giampiccolo et al., 2007)   Natural Language Inference      2.5K / 3K      3   micro-F1
STR.PRED.

Transferability Estimation
Transferability estimation aims to quantify the ability of a model to transfer knowledge learned from one task to another (Eaton et al., 2008; Sinapov et al., 2015). Formally, given a pool of L pre-trained LMs {ϕ_l}_{l=1}^L and a dataset D, we calculate a predictive score S_l(D) for each ϕ_l which ideally correlates with the model's final performance P_l(D). S_l(D) is computed without fine-tuning ϕ_l on D such that the optimal ϕ*_l can be chosen from a large model pool at a low computational cost.
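This selection procedure can be sketched as a simple loop: embed the dataset once per candidate encoder, score the frozen features, and sort. The `encode` and `score` callables below are placeholders, not the paper's code; they stand in for a real embedding function and a measure such as LogME.

```python
def rank_encoders(encoders, encode, score, texts, labels):
    """Rank candidate encoders by a transferability score S_l(D), best first.

    No encoder is fine-tuned: each model only produces frozen features,
    which are scored against the labels. `encode` and `score` are
    hypothetical stand-ins for an embedding function and, e.g., LogME.
    """
    scored = []
    for name, model in encoders.items():
        features = encode(model, texts)   # frozen embeddings, no fine-tuning
        scored.append((score(features, labels), name))
    # Highest predictive score first.
    return [name for _, name in sorted(scored, reverse=True)]
```

The encoder at the head of the returned list is the candidate ϕ*_l one would actually fine-tune.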
The CV community has begun to explore methods for encoder pre-selection and ranking through metrics such as LogME and the Log Expected Empirical Prediction (LEEP; Nguyen et al., 2020). These are widely-used state-of-the-art methods in CV. Recent work introduced the Gaussian Bhattacharyya Coefficient (GBC; Pándy et al., 2021) and Optimal Transport based Conditional Entropy (OTCE; Tan et al., 2021), the exploration of which we leave for future work. In NLP, however, related work focuses on choosing a task rather than an LM encoder for transfer (Vu et al., 2020; Padmakumar et al., 2022), leaving the ranking of encoders an unexplored question.
LogME LogME measures the suitability of all encoded dataset features F ∈ R^{|D|×h} (e.g., embeddings with dimensionality h) for predicting all scalar labels y ∈ R^{|D|} via the probability density p(y|F). As this density is intractable, it is estimated by mapping F → y using a linear transformation w; this is akin to training a linear probe with optimal parameters w* and using the likelihood p(y|F, w*) as a proxy for feature suitability. Because a simple linear model will overfit the training data, it is beneficial to obtain the marginal likelihood, or evidence, by integrating over all possible values of w: p(y|F) = ∫ p(y|F, w) p(w) dw. To once again make this computation tractable, You et al. (2021) reformulate it as an efficient, iterative evidence maximization problem in which both w and y are drawn from lightly parametrized, isotropic Gaussian distributions. The normalized logarithm of the maximized evidence (LogME) can then be used as S_l(D) to rank encoder models directly.
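Concretely, the maximization can be carried out as a fixed-point iteration over the two Gaussian precision parameters α (prior on w) and β (observation noise), using an SVD of the features so that each step is cheap. The following is a minimal NumPy sketch of this procedure for a single scalar target vector, not the authors' reference implementation; multi-class labels would be handled by averaging the score over one-hot label columns.

```python
import numpy as np

def logme(F, y, tol=1e-3, max_iter=100):
    """Sketch of LogME evidence maximization for features F (n, d), targets y (n,)."""
    n, d = F.shape
    u, s, _ = np.linalg.svd(F, full_matrices=False)
    sigma = s ** 2                      # eigenvalues of F^T F
    z = u.T @ y                         # y projected onto the left-singular basis
    z2 = z ** 2
    res_out = y @ y - z @ z             # part of y outside the span of F
    alpha, beta = 1.0, 1.0
    for _ in range(max_iter):
        gamma = (beta * sigma / (alpha + beta * sigma)).sum()
        m2 = (beta ** 2 * sigma * z2 / (alpha + beta * sigma) ** 2).sum()   # ||m||^2
        res = (alpha ** 2 * z2 / (alpha + beta * sigma) ** 2).sum() + res_out  # ||Fm - y||^2
        alpha_new, beta_new = gamma / m2, (n - gamma) / res
        converged = (abs(alpha_new - alpha) / alpha < tol
                     and abs(beta_new - beta) / beta < tol)
        alpha, beta = alpha_new, beta_new
        if converged:
            break
    # Evidence at the optimized alpha, beta.
    m2 = (beta ** 2 * sigma * z2 / (alpha + beta * sigma) ** 2).sum()
    res = (alpha ** 2 * z2 / (alpha + beta * sigma) ** 2).sum() + res_out
    logdet = np.log(alpha + beta * sigma).sum() + (d - len(sigma)) * np.log(alpha)
    evidence = (n * np.log(beta) + d * np.log(alpha) - n * np.log(2 * np.pi)
                - beta * res - alpha * m2 - logdet) / 2
    return evidence / n                 # normalized LogME, usable as S_l(D)
```

Features that linearly predict the labels well yield a higher score than features unrelated to them, which is exactly the ordering used for ranking.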
NLP Setting LogME has shown promise for CV, and an initial study on the GLUE benchmark (Wang et al., 2018) indicates the same for NLP (You et al., 2021). However, NLP setups differ notably across tasks. We adapt and apply LogME extensively to a wide range of NLP settings in order to identify empirically grounded guidelines.
In particular, we investigate variations concerning the task, instance granularity, domain, and tuning strategy. First, compared to most image classification tasks, NLP tasks are subject to differences in granularity, i.e., classification (C) versus structured prediction (SP). Furthermore, it is less clear than for individual images which representation best captures the full language input (Mosbach et al., 2020). Therefore, for C setups we experiment with two representations: the [CLS]/<s> token versus the mean over sequence subwords.
Second, depending on differences in the data domain, NLP practitioners are often faced with a pool of domain-adapted LMs in addition to more general-purpose encoders; the correct choice may not be immediately apparent.
Finally, the best performance in NLP is often achieved via full fine-tuning, while CV models usually do not fine-tune the encoder (Peters et al., 2019). It is therefore crucial to investigate whether the predictive power of S_l(D) holds when it is computed on untuned F while P_l(D) is based on fully fine-tuned representations.

Experimental Setup
Applying seven architecturally and domain-diverse pre-trained LMs with up to four configurations each to 10 datasets and a wide variety of tasks, we investigate LogME's predictive power for transferability estimation in NLP, for a total of 280 setups. We refer to Table 1 for our detailed set of tasks.

Model Setups
The model setup follows the same structure for each task: a pre-trained LM encoder and a 3-layer perceptron head, following Tenney et al. (2019). The input to the latter is either the [CLS] token or the mean over sequence subwords for C tasks, or the mean over token subwords for SP tasks. While it is common in CV to keep the encoder frozen and only fine-tune the task-specific head, we also evaluate full model fine-tuning, as is more common in NLP (Peters et al., 2019). Considering these variations (frozen vs. fine-tuned, and [CLS] vs. mean), we obtain up to four setups per C task and two setups per SP task. Each experiment is run with five random seeds. Details for reproducibility can be found in Appendix A.
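The two pooling variants can be illustrated as follows, assuming the encoder's hidden states and an attention mask are already available as arrays (a sketch, not the experimental code):

```python
import numpy as np

# `hidden` is (batch, seq_len, h): subword states from a frozen encoder.
# `mask` is (batch, seq_len): 1 for real subwords, 0 for padding.

def cls_pool(hidden):
    """Representation of the first position ([CLS] for BERT-style, <s> for RoBERTa-style)."""
    return hidden[:, 0, :]

def mean_pool(hidden, mask):
    """Mean over non-padding subword states only."""
    m = mask[:, :, None].astype(hidden.dtype)
    return (hidden * m).sum(axis=1) / m.sum(axis=1)
```

Masking before averaging matters: without it, padding positions would drag the mean towards the padding embedding and distort the features that LogME scores.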
Evaluation Following You et al. (2021), we evaluate LogME's power to rank LMs according to their final performance using two correlation coefficients: Pearson's ρ and weighted Kendall's τ_w (Vigna, 2015), both in [−1, 1]. Kendall's τ_w further allows us to estimate the probability that a higher-ranked LM actually performs better by computing (τ_w + 1)/2.
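Both coefficients are available in `scipy.stats` (`pearsonr` and `weightedtau`). For transparency, the sketch below implements ρ directly, together with a simplified additive hyperbolic weighting (one of the variants discussed by Vigna, 2015), ranking by final performance so that disagreements among the top-performing models count more.

```python
import numpy as np

def pearson(x, y):
    """Pearson correlation coefficient between two score lists."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    xc, yc = x - x.mean(), y - y.mean()
    return (xc @ yc) / np.sqrt((xc @ xc) * (yc @ yc))

def weighted_tau(scores, perf):
    """Simplified O(n^2) weighted Kendall's tau between predicted scores
    and final performances; pairs involving top performers weigh more."""
    scores, perf = np.asarray(scores, float), np.asarray(perf, float)
    rank = np.argsort(np.argsort(-perf))          # 0 = best final performance
    num = den = 0.0
    n = len(scores)
    for i in range(n):
        for j in range(i + 1, n):
            w = 1.0 / (rank[i] + 1) + 1.0 / (rank[j] + 1)  # hyperbolic weight
            num += w * np.sign((scores[i] - scores[j]) * (perf[i] - perf[j]))
            den += w
    return num / den
```

Given τ_w from this function, (τ_w + 1) / 2 is the (weighted) probability that the LM ranked higher by the score is indeed the better choice.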

Analysis of Results
Our results across all setups are consolidated in Figure 1 and Figure 2 (C: blue, SP: beige). The left of each figure plots performance using frozen LM embeddings against LogME scores, while on the right, full LM fine-tuning is applied. Figure 1 shows the results of using mean-pooled embeddings in both C/SP settings. For frozen embeddings, we obtain ρ > 0.8 on 8/10 tasks and τ_w > 0.7 on 6/10 tasks, indicating a strong relationship between model performance and LogME. After fine-tuning, we observe a general reduction in ρ and τ_w (most pronounced on CrossNER and EN-EWT); however, overall correlations remain positive to a significant degree.
Overall, LogME correlates positively with final performance in 30/32 cases. In more detail, LogME reaches τ_w > 0.41 in 20/32 setups, meaning that selecting a higher-ranked model is the better choice 71% of the time. LogME both identifies intuitive, domain-specific matches (e.g., Twitter-RoBERTa performing well on Airline Twitter) and finds cases that may be unintuitive, such as DistilBERT's occasionally high performance on CrossNER and JobStack. This finding holds across C, SP, and domains, as well as across different input representations. For the latter, we note that, surprisingly, even the untuned [CLS]/<s> representation seems to contain useful information, with performance comparable to mean pooling.
Comparing frozen embeddings with full fine-tuning, we notice that, as expected, model performance improves, but in general, LogME's predictive power decreases.
The fully fine-tuned model makes predictions on updated representations, such that decreases in predictive performance are inevitable unless the initial LM already represents a local optimum for the task at hand. This fact is crucial for NLP practitioners, for whom full fine-tuning is the standard practice.
Taking these factors into account, LogME's efficiency is especially beneficial, as it offers an 86× speedup over full model fine-tuning (You et al., 2021), and its positive correlation in 94% of our evaluated setups indicates that it is an effective score for transferability estimation in NLP.

Human Performance
Given the lack of prior work examining transferability estimation of pre-trained LM encoders, the most common method for encoder selection today is practitioner intuition. As such, we conduct a small-scale study with 12 NLP practitioners and ask them to perform the same ranking as in Section 3. Despite having access to model details and associated papers, this task is difficult even for experts. While LogME's τ_w falls in the range [−0.20, 1.00], human rankings fall into the wider range [−0.54, 1.00], indicating higher uncertainty. Similarly, we observe that human correlations are negative three times as often as LogME's. Additionally, LogME provides a continuous scale for comparing models, while human rankings offer no indication of relative performance differences. Human rankings are also less accurate for tasks without an associated domain-specific model (e.g., news, the mixture of genres in EWT). Moreover, even when domains are clear (e.g., Twitter, science), LogME tends to be more accurate than the predictions of most human participants. Finally, the high variance between practitioners and the fact that no single person was an expert in all setups further reinforce the necessity of quantitative transferability scores.

Conclusion
We show the value of transferability estimation for selecting high-performing LMs prior to full model fine-tuning, in experiments covering the two fundamental NLP settings of classification and structured prediction. By adopting the state-of-the-art LogME scoring method, we are able to rank LMs on a continuous scale which correlates with final performance, with the better encoder being chosen in 71% of cases. Additionally, we identify NLP-specific guidelines for transferability estimation: in particular, predicting the best LM for tasks/domains which deviate greatly from an encoder's pre-training setup and require large amounts of full fine-tuning may require larger pre-selections of LMs due to the higher uncertainty of the scoring methods. Finally, our human study showed that practitioners frequently misjudge the performance of LMs even on domain-specific tasks. As such, transferability quantification methods provide valuable evidence over intuition.

Limitations
A key limitation that practitioners should consider is that, while LogME is viable for the quantitative transferability estimation of LM encoders, there is a noticeable drop in predictive accuracy after full model fine-tuning. We attribute this to the misalignment between the frozen representations of the encoder, to which LogME is applied, and the representations after fine-tuning. As stated in Section 4, unless the untuned LM already constitutes a local optimum for the task at hand, task-specific shifts in its parameters and representations are inevitable. This similarly applies to cases where the untuned representations differ substantially from what a fully fine-tuned model uses during training. Specifically, for the relation classification task of SciERC, the input given to the model is augmented with special tokens delimiting the entities involved in the relation (Baldini Soares et al., 2019), which are unknown to the untuned model and thus to the representations on which LogME is computed. Furthermore, for EN-EWT we suspect that dependency labeling is a more fundamental task solvable with high accuracy by most LMs, especially after fine-tuning, as reflected in micro-F1 scores between 93 and 95. This is mirrored by work on probing untuned LMs, which identifies high levels of inherent dependency information (Tenney et al., 2019; Müller-Eberstein et al., 2022a). Such sensitivity to representational shifts is not exclusive to LogME: in preliminary experiments, we examined LEEP (Nguyen et al., 2020) as an alternative predictive score S_l(D). Its original use was to rank the transferability of a classifier trained on one dataset to a new task, leaving the ranking of pre-trained LMs for future work. LEEP has so far only been applied to CV tasks, but we apply it to LM ranking on the collection of NLP tasks above. Our initial experiments achieved low and unintuitive correlations between LEEP's S_l(D) and P_l(D). We speculate that this is due to the absence of a normalizing factor over the number of source classes, i.e., the high number of embedding dimensions in our case (see Equation 2 in Nguyen et al., 2020). While it would further be valuable to investigate methods beyond LEEP and LogME, as mentioned in Section 2, we leave their evaluation on NLP to future work. At the time of writing, the former two were the most extensively explored in CV, in addition to the original LogME work containing an initial study showing promise for NLP.
Finally, our human ranking study in Section 5 was limited by the number of practitioners with a publication record whom we could contact confidentially. However, the group still constituted a diverse set across seniority, gender, and cultural background. A larger group would cover a broader range of backgrounds and might produce different rankings. However, as the surveyed group already displayed high variance, overall predictive performance is unlikely to be significantly higher.
Keeping these limitations in mind, correlations do remain mostly positive for LogME, and its scores are well suited to high-dimensional embedding spaces, such that it offers a predictive and efficient measure for quantifying transferability compared to human practitioner intuition.

Figure 1: Results of Mean Pooling (µ). We plot the models' LogME scores against their task-specific performances on each dataset based on mean pooling the token embeddings (left: frozen embeddings, right: full model fine-tuning). Task types are indicated in specific colors: light blue for C and beige for SP. Further reported are the Pearson correlation coefficient (ρ) and weighted Kendall's tau (τ_w).

Table 1: Datasets. Indicated are the 10 datasets used in this study, distinguished between the two NLP problem types C and SP, covering a wide variety of tasks and domains. C tasks cover AGNews (news articles), Twitter Airline Sentiment (Airline; Twitter feedback), SciERC (AI proceedings), MNLI (speech, (non-)fiction, government), QNLI (Wikipedia), and RTE (Wikipedia, news). Within the SP tasks, we experiment on the English Web Treebank (EWT; social media, reviews, emails), CrossNER (news, scientific Wikipedia), and JobStack (Stack Overflow job ads). For each task, we report the TRAIN/DEV split, label space, and task-specific performance metric.

Table 2: Exact Results of Classification Tasks. We indicate the LogME score of each model (LANGUAGE MODEL) and its performance on a wide variety of datasets (DATASET) in different settings (FROZEN, TUNED), either taking the mean-pooled token representations (µ) or the representation of the [CLS] token. Given the LogME scores and the performance metrics, we calculate the Pearson correlation coefficient (ρ) and the weighted Kendall's tau (τ_w).

Table 3: Exact Results of Structured Prediction Tasks. We indicate the LogME score of each model (LANGUAGE MODEL) and its performance on a wide variety of datasets (DATASET) in different settings (FROZEN, TUNED), taking the mean-pooled token representations (µ). Here we do not take the representation of the [CLS] token, as this has no meaning for the structured prediction task. Given the LogME scores and the performance metrics, we calculate the Pearson correlation coefficient (ρ) and the weighted Kendall's tau (τ_w).