How to Determine the Most Powerful Pre-trained Language Model without Brute Force Fine-tuning? An Empirical Survey

Transferability estimation has attracted great attention in the field of computer vision. Researchers try to estimate, at low computational cost, the performance of a model when transferred from a source task to a given target task. Considering the effectiveness of such estimations, the natural language processing community has also begun to study similar problems for the selection of pre-trained language models. However, there is no comprehensive comparison between these estimation methods yet. Moreover, the differences between vision and language scenarios make it doubtful whether previous conclusions hold across fields. In this paper, we first conduct a thorough survey of existing transferability estimation methods that are able to find the most suitable model, and then conduct a detailed empirical study of the surveyed methods on the GLUE benchmark. Through qualitative and quantitative analyses, we demonstrate the strengths and weaknesses of existing methods and show that H-Score generally performs well, with superior effectiveness and efficiency. We also outline the remaining difficulties of accounting for training details, applicability to text generation, and consistency with specific metrics, which shed light on future directions.


Introduction
Recent advances in the Natural Language Processing (NLP) community are heavily built on the effectiveness of Pre-trained Language Models (PLMs), especially large ones (LLMs) (Zeng et al., 2023; OpenAI, 2022; Touvron et al., 2023; Wang et al., 2023). As the number of available PLMs continually grows, a critical question arises: "Which PLM yields the best performance on a given downstream task?". The fine-tuning result on a task usually varies across different PLMs, and this variation becomes more pronounced in low-resource scenarios (Bassignana et al., 2022). Essentially, the key to such model selection is to figure out the transferability between the model and the target task. Pioneering works conducted fine-tuning on every candidate model in a brute-force manner (Phang et al., 2018; Zamir et al., 2019). Though the true fine-tuning performance can be obtained in this way, the expensive parameter optimization is practically prohibitive (Wolf et al., 2020). Thus, there is an urgent need to quantify transferability at a low computational cost. To this end, Transferability Estimation (TE), an essential task of Transfer Learning (TL), has emerged as a key challenge, with several solutions proposed initially in the Computer Vision (CV) field (Agostinelli et al., 2022). Recently, some of these remarkable approaches have also been applied to NLP tasks and show promising results on PLM selection (Bassignana et al., 2022; Vu et al., 2022).
Despite the great number of surveys established for TL and PLMs (Niu et al., 2020; Plested and Gedeon, 2022; Guo et al., 2022), there is no comprehensive survey on TE yet, especially for the purpose of PLM selection. Therefore, this paper aims to fill this gap by providing a comprehensive and well-structured summary of recent progress. To ensure comprehensive coverage, a multi-stage approach was employed to identify and select the studies included in this review. First, an extensive literature search was carried out using online databases such as Google Scholar and DBLP. The search terms were carefully chosen to capture the key concepts and themes related to TE and PLMs. After retrieving an initial pool of nearly 100 articles, a thorough screening of titles, abstracts, and keywords was conducted to exclude irrelevant studies, leading to a final selection of 20 studies that met the predetermined inclusion criteria.
Based on this research, we present a method taxonomy. As shown in Fig. 1, according to the need for training on the target task, we divide TE methods into: (1) Model Similarity-based Methods, which assume that inter-model similarity reflects transferability and require a model trained on the target task (Dwivedi and Roig, 2019); (2) Training-free Methods, which accelerate the estimation process by computing metrics free of target model training to examine the compatibility of the PLM's feature space with the target dataset (Ding et al., 2022). We then conduct a qualitative analysis of applicability and provide empirical results on the GLUE benchmark (Wang et al., 2019) to reveal specific strengths and weaknesses of existing methods. We show that model similarity-based methods have the advantage of applicability to different target tasks, while training-free methods have the advantage of fast estimation. Methods that simulate the dynamics of fine-tuning generally perform better. Besides, we analyze factors that can affect estimation effectiveness and efficiency, including task type, sample size, feature dimension, target model, and sample affinity function. The empirical observations demonstrate that H-Score (Bao et al., 2019) generally shows the desired usability. Based on these investigations, we further exhibit some under-explored aspects to shed light on future directions.

Related Work
Transfer Learning. Training robust supervised models from scratch is non-trivial, especially in low-resource scenarios (Jin et al., 2023). Aiming at transferring knowledge from a source task to a target task, TL can achieve superior performance on the target dataset while spending far less time and using far less data (Niu et al., 2020). Despite the good number of surveys available for TL (Ruder et al., 2019; Niu et al., 2020; Alyafeai et al., 2020; Iman et al., 2022), these works mainly focus on "what to transfer?" and "how to transfer?", which describe specific transfer approaches. To the best of our knowledge, there is no comprehensive survey on TE yet, which seeks to answer the question of "when to transfer?". This work is expected to fill this gap by shedding light on how to appropriately choose TE methods for PLM practitioners.

Transferability Estimation. To avoid exhaustive attempts on all pairs of source and target tasks, TE provides efficient heuristics to exhibit the best-performing source task at a minor cost (Agostinelli et al., 2022). Originating in the field of CV, a great number of TE approaches, including model-similarity-based methods (Dwivedi and Roig, 2019), label-comparison-based methods (Tran et al., 2019), and source-feature-based methods (Ding et al., 2022), have been proposed in the past few years. To adapt such techniques to PLM selection for NLP tasks, Bassignana et al. (2022) found that the predictions of LogME positively correlate with the true performances of candidate PLMs, and Vu et al. (2022) exhibited that the model similarity computed by soft prompts reflects the transfer performance across different models. Building on this remarkable research, we further review the TE methods and provide a comprehensive empirical study of them for PLM selection.

Pre-trained Language Models. From BERT (Devlin et al., 2019) to LLaMA (Touvron et al., 2023), significant efforts have been put into scaling PLMs into LLMs, with abilities such as performing arithmetic and answering questions emerging simultaneously (Schaeffer et al., 2023). Nevertheless, training and fine-tuning LLMs, or even small PLMs, requires substantial computational resources, which can limit accessibility for researchers and developers with limited resources, even with the help of parameter-efficient tuning (Hu et al., 2022). Based on these considerations, the efficient utilization of PLMs is still a problem worth studying. We thus focus on the selection of PLMs in this work, aiming to release the computing resources needed for exhaustive fine-tuning.

Task Formulation

Given a target dataset D with N samples and L candidate PLMs {ϕ_i}_{i=1}^L, each ϕ_i can encode a sample to the pre-trained feature ϕ_i(x) (usually the [CLS] embedding), and the true performance T_i(D) can be measured by certain evaluation metrics after fine-tuning ϕ_i on D with careful tuning of hyper-parameters. A TE approach should produce a score S_i(D) for each ϕ_i to approximate the true fine-tuning performance T_i(D). Intuitively, a well-designed method should return {S_i(D)}_{i=1}^L that correlates well with {T_i(D)}_{i=1}^L under an acceptable computational burden, such that the top-performing PLM can be determined rapidly.

Model Similarity-based Methods
To avoid brute-force fine-tuning, model similarity-based methods are designed based on the assumption that a high similarity between two models correlates with a high degree of transferability between the tasks bound to the models. To this end, one model ψ fine-tuned on the target task, i.e., the target model, is required in order to compute its similarity to each candidate PLM. The time consumption of fine-tuning can therefore be significantly reduced to 1/L of that of the brute-force approach, with only a minor extra cost of model similarity computation. Currently, the sample features output by models are mainly used to measure inter-model similarity. The goal is thus to design a similarity function that maximizes the correlation between fine-tuning performances and the similarities between the pre-trained features {ϕ(x_i)}_{i=1}^N and target features {ψ(x_i)}_{i=1}^N. In terms of the similarity computation mechanism, existing functions fall into sample-wise similarity functions and graph-wise similarity functions.

Sample-wise Similarity Functions The main idea is to directly compute the similarity between features across models. Under the Direct Similarity Estimation (DSE) framework (Luo et al., 2022), Vu et al. (2020) compute the affinity between the mean features as the model similarity, i.e., A(Σ_i ϕ(x_i)/N, Σ_i ψ(x_i)/N), where A is a sample affinity function such as the Euclidean or cosine distance, while Luo et al. (2022) utilize the averaged sample affinities Σ_i A(ϕ(x_i), ψ(x_i))/N.
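The two sample-wise variants can be sketched in a few lines of NumPy. The function names are ours, and cosine similarity stands in for the generic affinity A:

```python
import numpy as np

def cosine_affinity(u, v):
    # Cosine similarity between two feature vectors.
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def dse_mean_features(phi, psi):
    # Vu et al. (2020): affinity between the mean-pooled features of the
    # candidate PLM (phi) and the target model (psi), each of shape (N, d).
    return cosine_affinity(phi.mean(axis=0), psi.mean(axis=0))

def dse_mean_affinities(phi, psi):
    # Luo et al. (2022): average the per-sample affinities across models.
    return float(np.mean([cosine_affinity(p, q) for p, q in zip(phi, psi)]))
```

Both scores are bounded by the affinity function's range, so a model identical to the target model scores maximally.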

Training-free Methods
Although model similarity-based methods only need to fine-tune on the target task once, they still incur a considerable computational cost. Therefore, training-free methods try to directly compare the pre-trained features {ϕ(x_i)}_{i=1}^N with the true target labels {y_i}_{i=1}^N using cheap metrics to further save estimation time. According to whether they directly approximate the fine-tuning loss, the metrics can be divided into class separability metrics and loss approximation metrics.
Class Separability Metrics These metrics intuitively examine whether pre-trained features are easy to separate according to their target labels, assuming that well-separated pre-trained features result in the desired fine-tuning performance. Some of these metrics directly measure the separability of static pre-trained features. For example, MSC (Meiseles and Rokach, 2020) uses the mean intra-cluster distance and the mean nearest-cluster distance to quantify the clustering quality of pre-trained features over target classes. Similarly, Puigcerver et al. (2021) rank the candidate PLMs by the test accuracy of kNN on pre-trained features via leave-one-out cross-validation. PARC (Bolya et al., 2021) first computes the pairwise affinities between the pre-trained features of each pair of target samples, which are then compared with the pairwise label affinities to quantify the source feature space's fitness on the target dataset. GBC (Pándy et al., 2022) uses the Bhattacharyya coefficient to measure the inter-class overlap of pre-trained features, where higher overlap means poorer separability. To further consider the fine-tuning dynamics by assuming that the pre-trained features can be adjusted by an extra linear transformation, Kumari et al. (2022) train a cheap Logistic Regression (LR) model on the pre-trained features and use LR's test accuracy to estimate how well the linearly transformed pre-trained features fit their target classes.
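The LR-based probe just described can be sketched as follows. This is a minimal version assuming pre-extracted [CLS] features; cross-validated accuracy stands in for a held-out test accuracy, and the function name is ours:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def lr_separability(features, labels, folds=5):
    # Train a cheap logistic regression on frozen pre-trained features and
    # use its cross-validated accuracy as the transferability score.
    clf = LogisticRegression(max_iter=1000)
    return float(cross_val_score(clf, features, labels, cv=folds).mean())
```

The linear probe implicitly assumes fine-tuning can only apply a linear transformation on top of the frozen features, which is exactly the dynamic-consideration assumption noted above.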
Loss Approximation Metrics Backed by solid theoretical proofs, these metrics try to directly approximate the fine-tuning loss, which correlates well with the fine-tuning performance. Inspired by Euclidean information geometry, H-Score (Bao et al., 2019) approximates the optimal log-loss by the inter-class variance and feature redundancy, which characterize the asymptotic error probability of using pre-trained features to estimate target labels. Ibrahim et al. (2022) then propose the regularized H-Score, which further shrinks the error that occurs when inverting high-dimensional features using a pseudo-inverse. NLEEP (Li et al., 2021) first uses a Gaussian mixture model to attach to each pre-trained feature a posterior distribution over Gaussian components, then computes the likelihood from the posterior distribution to the target label to approximate that from the pre-trained feature to the target label. TransRate (Huang et al., 2022) approximates the correlation between pre-trained features and target labels by Mutual Information (MI), which has been proven to provide an upper bound and a lower bound on the log-likelihood. There are also metrics that involve fine-tuning dynamics. SFDA (Shao et al., 2022) simulates the dynamics by projecting the pre-trained features using Fisher Discriminant Analysis (FDA) to increase class separability. It then approximates the log-likelihood by Bayes classification over the projected features and adds a self-challenging module to further measure the ability of the pre-trained models on hard samples.
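As an illustration, the plain H-Score can be computed from frozen features and labels as below. This is a common NumPy formulation, tr(cov(F)⁺ · cov(E[F|Y])); the pseudo-inverse guards against singular covariance matrices, which is precisely the failure mode the regularized variant addresses:

```python
import numpy as np

def h_score(features, labels):
    # H-Score (Bao et al., 2019): trace of pinv(cov(F)) @ cov(E[F|Y]).
    # Large inter-class variance relative to overall feature covariance
    # (low redundancy) yields a high score.
    f = features - features.mean(axis=0)
    cov_f = np.cov(f, rowvar=False)
    g = np.zeros_like(f)
    for c in np.unique(labels):
        mask = labels == c
        g[mask] = f[mask].mean(axis=0)  # class-conditional mean E[F|Y=c]
    cov_g = np.cov(g, rowvar=False)
    return float(np.trace(np.linalg.pinv(cov_f, rcond=1e-10) @ cov_g))
```

Well-separated classes should score higher than the same features paired with shuffled labels, since shuffling collapses the class-conditional means toward the global mean.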
To avoid the over-fitting problem of maximum likelihood estimation, LogME (You et al., 2021) instead approximates the marginalized likelihood of labels given pre-trained features over all possible linear transformations. More recently, motivated by learning theory, PACTran (Ding et al., 2022) minimizes the PAC-Bayesian upper bound on the log-loss.

Qualitative Analysis
To examine the applicability of each method, we qualitatively compare them in Table 1 from the following perspectives: (1) Task Agnostic: the method does not require a certain target task type; (2) Dynamic Consideration: the fine-tuning dynamics of pre-trained features are considered; (3) Free of Training: the method does not need fine-tuning on the target task. Moreover, since some methods (i.e., DDS, kNN, MSC, PARC, LFC) need to compute an affinity graph, which can be very time-consuming when there are too many samples, we limit the maximum number of samples available to them to 10k. We run each method 5 times with different random seeds and report the mean results. For implementation details, please refer to Appendix B.
Evaluation To measure the deviation of TE methods' predictions from the true fine-tuning performances, we use MRR and the mean Spearman's rank correlation (µ_ρ) over all GLUE tasks. MRR reveals the average ranking of the best-performing PLM, and µ_ρ evaluates the overall correlation between the predicted score list {S_i(D)}_{i=1}^L and the true performance list {T_i(D)}_{i=1}^L. Besides, the time consumption is measured by the mean training time (µ_tt), which records the time of target model training, and the mean estimating time (µ_et), which records the wall-clock time of computing the estimation value; both are averaged over all GLUE tasks. Note that we omit the time of sample feature encoding since it is the same across all methods.
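For concreteness, the two effectiveness measures can be sketched as follows for a single task. This NumPy-only sketch computes Spearman's rho as the Pearson correlation of ranks and assumes no tied scores; the function names are ours:

```python
import numpy as np

def reciprocal_rank(scores, true_perf):
    # Where does the truly best PLM land in the score-sorted list?
    # MRR averages this reciprocal rank over tasks.
    best = int(np.argmax(true_perf))
    order = np.argsort(scores)[::-1]  # descending by predicted score
    rank = int(np.where(order == best)[0][0]) + 1
    return 1.0 / rank

def rank_correlation(scores, true_perf):
    # Spearman's rho as the Pearson correlation of rank vectors (no ties).
    def ranks(x):
        order = np.argsort(x)
        r = np.empty(len(x))
        r[order] = np.arange(len(x))
        return r
    s, t = np.asarray(scores), np.asarray(true_perf)
    return float(np.corrcoef(ranks(s), ranks(t))[0, 1])
```

A method whose scores rank the candidates in exactly the true order obtains a reciprocal rank of 1.0 and a rho of 1.0 on that task.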

Quantitative Analysis
Effectiveness and Efficiency The overall metric scores of all methods are reported in Table 3.

Effect of Feature Dimensions If a method performs best when conducted on pre-trained features of the original dimension, there is no need to employ dimensionality reduction and tune the reduced dimension, which can further save estimating time. As shown in Figure 3, we list the performance variation of the top-8 performing training-free methods across different feature dimensions. It is observed that H-Score and the regularized H-Score preferably enjoy the original dimensions, because the original feature space helps them approximate the feature redundancy better, while the others all achieve their best results on smaller dimensions. kNN and PARC need to measure sample affinities, which may encounter the curse of dimensionality in high-dimensional settings; they thus perform better when the feature dimension is small and dimensionality reduction does not lose too much of the original information. For Logistic, LogME, SFDA, and PACTran, which assume that the pre-trained features can be linearly transformed, eliminating redundant feature dimensions can result in better estimation results.
Sensitivity to Different Target Models Although model similarity-based methods only need to train one target model, there are actually rich choices of PLMs to serve as the target model. An ideal method should produce similar results regardless of which target model is used, so that we can save the time required to try different target models.
To examine the sensitivity to the target model, we run the model similarity-based methods with different target models and show the results in Table 3. We observe different behaviors: DSE performs stably, while DDS is very sensitive to the type of target model. Since the main difference between DSE and DDS is that the former computes inter-sample affinities across models while the latter compares affinity graphs across models, this observation suggests that the affinity graph from the target model may not reflect the target task mechanism well, and that directly measuring the affinity between features across models is preferable.

Effect of Sample Affinity Function
As introduced in Section 3, the implementations of DSE, DDS, kNN, MSC, LFC, and PARC require certain sample affinity functions. We try the Euclidean, cosine, and correlation distances for the above methods to examine whether different functions affect the methods, and report the results in Table 4. Compared to the Euclidean distance, the cosine and correlation distances conduct extra normalization operations and thus require more estimating time. However, except when applied to DSE, the cosine and correlation distances generally exhibit performance superior to the Euclidean distance. This observation suggests that the norm of the feature vector may result in an anisotropic feature space and should not be taken into account when measuring sample affinity, which is also supported by Su et al. (2021), who suggest that normalization can alleviate the anisotropy problem of PLMs.
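The three affinity functions differ only in how much normalization they apply, which the following sketch makes explicit; the correlation distance is simply the cosine distance of mean-centered vectors:

```python
import numpy as np

def euclidean_dist(u, v):
    # Sensitive to both direction and magnitude of the feature vectors.
    return float(np.linalg.norm(u - v))

def cosine_dist(u, v):
    # Normalizes away vector norms before comparison.
    return 1.0 - float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def correlation_dist(u, v):
    # Centers each vector by its own mean, then applies cosine distance,
    # removing both offset and scale from the comparison.
    return cosine_dist(u - u.mean(), v - v.mean())
```

A quick way to see the difference: the correlation distance is invariant to any positive affine transform of a vector, whereas the Euclidean distance is not.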

Conclusion and Future Directions
This paper reviews the recent advances in TE that can be applied to PLM selection and presents a method taxonomy based on a thorough analysis. Moreover, comprehensive qualitative and quantitative comparisons between different approaches are provided to help understand their applicability in a number of aspects. We hope this survey can help researchers and practitioners choose the desired PLM via appropriate TE methods. Although much effort has been made, as surveyed, some directions still deserve further investigation: (1) How to make the estimation approach aware of fine-tuning strategies and experimental hyper-parameters? The fine-tuning strategy usually needs to be determined under an acceptable computational burden, i.e., full fine-tuning (optimizing all model parameters) or parameter-efficient tuning (optimizing part of the model parameters). The actual fine-tuning strategy affects not only the training time and computation consumption but also the loss landscape of the PLM, which results in different target task performance (Bassignana et al., 2022). However, current approaches usually consider only one strategy, and their effectiveness cannot be guaranteed in other situations. Therefore, making the estimation adaptable to different fine-tuning strategies is worth further exploration. Besides, even if the best PLM can be accurately selected, one still needs an exhaustive search of training hyper-parameters to produce the desired fine-tuning performance. It is also interesting to consider other important hyper-parameters, such as the learning rate and temperature, when estimating transferability.
(2) How to adapt TE methods to text generation tasks? Although model similarity-based methods do not assume the type of target task, since they only rely on sample features, they neglect the information of the target label; thus the mapping from input space to output space is not well captured, and the corresponding task cannot be fully understood. However, taking label information into consideration for text generation tasks is challenging, since the length of the output text varies and the one-to-many issue exists (Bao et al., 2020; Zheng et al., 2021; Zhao et al., 2023). Since current LLMs conduct all tasks in the form of text generation and the number of LLMs is continually increasing, TE methods tailored for text generation are urgently needed.
(3) How to make estimation results consistent with specific evaluation metrics? In our experiments, the TE methods are asked to correlate with just one evaluation metric per GLUE dataset, e.g., accuracy for QNLI. However, some tasks may have diverse metrics, e.g., NDCG and R@1 for ranking tasks; sometimes one may focus on only one of these metrics, and the variations of these metrics are not necessarily consistent, such that the TE method's predictions can be confusing in these cases. Therefore, how to make TE methods aware of the metric of interest is another direction worth exploring.

Limitations
This work provides a comprehensive summary of existing TE methods. However, limited by our experimental conditions, we examine the surveyed methods in a toy experimental setting in which the following problems remain: (1) Only small-scale PLMs form the candidate pool; the effectiveness of TE methods in selecting the best-performing LLM still needs to be verified given the current popularity of LLMs. (2) Since most existing TE methods only support target tasks of the classification type, we adopt the GLUE benchmark as the evaluation dataset, while TE performance on regression, structured prediction, and generation tasks remains under-explored.
A Fine-tuning Results on GLUE

B Implementation Details
Our experimental machine contains an Intel(R) Xeon(R) Silver 4110 CPU @ 2.10GHz and an NVIDIA GeForce RTX 3090 24G GPU. For the implementation of TE methods, since some methods' performance heavily depends on the number of feature dimensions, we reduce the feature dimension to [16, 32, 64, 128, 256, 512, 768] by PCA for each method to find its most suitable dimension. For DSE, RSA, kNN, MSC, LFC, and PARC, which need to compute affinities between sample features, we also try different sample affinity functions, including the cosine, Euclidean, and correlation distances. The implementation details of the surveyed methods are as follows:

DSE Between the averaged sample affinities (Luo et al., 2022) and the affinity between the mean features (Vu et al., 2020), we found that the former performs better. DSE achieves its best results when DeBERTa is taken as the target model, the Euclidean distance is used to measure the sample affinity, and the number of feature dimensions is 768. The corresponding computation is as Eq. 1.

S(D) = (1/N) Σ_{i=1}^{N} A(ϕ(x_i), ψ(x_i))    (1)
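The dimension-tuning procedure described above can be sketched as follows; `scorer` is any TE metric operating on a feature matrix, and the helper name is ours:

```python
import numpy as np
from sklearn.decomposition import PCA

def sweep_dimensions(features, scorer, dims=(16, 32, 64, 128, 256, 512)):
    # Hypothetical helper: re-evaluate a TE scorer on PCA-reduced copies of
    # the pre-trained features, mirroring the dimension tuning used in the
    # experiments. Returns a dict mapping dimension -> score.
    out = {}
    for d in dims:
        if d < min(features.shape):
            out[d] = scorer(PCA(n_components=d).fit_transform(features))
    return out
```

In practice one would pick, per method, the dimension whose scores correlate best with true fine-tuning performance across tasks.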
DDS Among a number of instances of the DDS framework (Dwivedi et al., 2020), RSA (Dwivedi and Roig, 2019) shows the best performance when DeBERTa is trained as the target model, the correlation distance is used, and the number of feature dimensions is 512. Specifically, the pre-trained features and target features are first processed by z-score normalization. Then the pre-trained affinity graph and target affinity graph are computed with the correlation distance. Finally, the lower triangular adjacency matrices of the two graphs are compared by the Spearman correlation coefficient to give the transferability score.
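The RSA pipeline just described (z-scoring, correlation-based affinity graphs, Spearman comparison of the lower triangles) can be sketched as below. This minimal version assumes no tied affinities and compares correlation similarities rather than distances, which leaves the Spearman coefficient unchanged since distance = 1 - similarity is monotone:

```python
import numpy as np

def rsa_score(phi, psi):
    # RSA sketch: build a sample-by-sample correlation affinity graph for
    # each model, then rank-correlate the lower-triangular entries.
    def zscore(x):
        return (x - x.mean(axis=0)) / (x.std(axis=0) + 1e-8)

    def affinity_graph(x):
        return np.corrcoef(zscore(x))  # (N, N) inter-sample correlations

    def ranks(v):
        order = np.argsort(v)
        r = np.empty(len(v))
        r[order] = np.arange(len(v))
        return r

    tril = np.tril_indices(phi.shape[0], k=-1)
    a = affinity_graph(phi)[tril]
    b = affinity_graph(psi)[tril]
    return float(np.corrcoef(ranks(a), ranks(b))[0, 1])  # Spearman via ranks
```

Identical feature matrices produce identical affinity graphs and hence the maximal score of 1.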
MSC We use silhouette_score from scikit-learn to implement MSC, which exhibits the best performance when the cosine distance is used and the number of feature dimensions is 256.
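With scikit-learn this reduces to a one-liner; the cosine metric below mirrors the best-performing configuration reported above, and the wrapper name is ours:

```python
import numpy as np
from sklearn.metrics import silhouette_score

def msc_score(features, labels):
    # MSC via the silhouette score: contrasts the mean intra-cluster distance
    # with the mean nearest-cluster distance over target classes.
    return float(silhouette_score(features, labels, metric="cosine"))
```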
kNN We use KNeighborsClassifier from scikit-learn and use the test accuracy of leave-one-out cross-validation to quantify the transferability. We tune k in [1, 3, 5, 7]; the method exhibits the best performance when k = 5, the correlation distance is used, and the number of feature dimensions is 64.
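A sketch of the leave-one-out kNN probe with scikit-learn, using the default Euclidean metric for brevity (the best configuration above uses the correlation distance instead):

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score, LeaveOneOut

def knn_score(features, labels, k=5):
    # Leave-one-out kNN accuracy on frozen pre-trained features: each sample
    # is classified by its k nearest neighbors among all other samples.
    clf = KNeighborsClassifier(n_neighbors=k)
    return float(cross_val_score(clf, features, labels, cv=LeaveOneOut()).mean())
```

Leave-one-out is affordable here because kNN has no training phase beyond storing the features.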
PARC The computation process of PARC is similar to that of RSA, except that the target affinity graph is replaced by the affinity graph derived from the samples' one-hot labels. We use the publicly available implementation, and the best performance is achieved when the correlation distance is used and the number of feature dimensions is 512.
GBC It first models each target class of samples with a Gaussian distribution parameterized by the in-class pre-trained feature vectors. Then the Bhattacharyya distances between every pair of different classes are aggregated to measure the inter-class overlap, as in Eqs. 2 and 3, where µ_c and Σ_c denote the per-class mean and covariance, and Σ = (Σ_{c_1} + Σ_{c_2})/2:

D_B(c_1, c_2) = (1/8)(µ_{c_1} - µ_{c_2})^T Σ^{-1} (µ_{c_1} - µ_{c_2}) + (1/2) ln(|Σ| / sqrt(|Σ_{c_1}| |Σ_{c_2}|))    (2)

S(D) = -Σ_{c_1 ≠ c_2} exp(-D_B(c_1, c_2))    (3)

C TE Methods' Performances on GLUE Tasks
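A diagonal-covariance sketch of GBC, a common simplification of the full computation; variable names are ours:

```python
import numpy as np

def gbc_score(features, labels):
    # GBC sketch with per-class diagonal Gaussians: sum the Bhattacharyya
    # coefficients exp(-D_B) over all class pairs; a larger coefficient means
    # more inter-class overlap, so the negated sum serves as the score.
    classes = np.unique(labels)
    stats = [(features[labels == c].mean(axis=0),
              features[labels == c].var(axis=0) + 1e-8) for c in classes]
    overlap = 0.0
    for i in range(len(classes)):
        for j in range(i + 1, len(classes)):
            (m1, v1), (m2, v2) = stats[i], stats[j]
            v = (v1 + v2) / 2.0
            # Bhattacharyya distance between two diagonal Gaussians.
            db = (((m1 - m2) ** 2 / v).sum() / 8.0
                  + 0.5 * np.log(v / np.sqrt(v1 * v2)).sum())
            overlap += np.exp(-db)
    return float(-overlap)
```

Well-separated classes drive each pairwise distance up and the coefficients toward zero, so the score approaches its maximum of 0.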

Figure 1: Diagram of transferability estimation methods. Based on the pre-trained features of target samples output by the candidate PLM, model similarity-based methods and training-free methods estimate the transferability by inter-model similarity and by the compatibility between pre-trained features and target labels, respectively.

Figure 3: The performance variation on GLUE tasks of training-free methods when the pre-trained features' dimensions are reduced to different lengths by PCA.

Table 2: Type and metric of GLUE tasks.

Task Agnostic Only the model similarity-based methods satisfy this property. However, the inter-model similarity is only aware of sample features and does not consider the output space of the task.

Dynamic Consideration Since fine-tuning appropriately adjusts the representation of pre-trained features to adapt to the target task, the training dynamics are also a key factor. To date, only Logistic, LogME, SFDA, and PACTran assume that the pre-trained features can be adjusted by a linear transformation, while the fine-tuning process can be more diverse, e.g., adapter tuning (Houlsby et al., 2019) and prompt tuning (Lester et al., 2021); how to simulate different dynamics is under-explored.

The candidate PLMs follow You et al. (2021): bert-base-uncased and bert-base-cased (Devlin et al., 2019), roberta-base (Liu et al., 2019), their distilled versions distilbert-base-uncased, distilbert-base-cased, and distilroberta-base (Sanh et al., 2019), as well as ALBERT (albert-base-v2), DeBERTa (deberta-base), and ELECTRA (electra-base-discriminator), whose fine-tuning performances are shown in Appendix A.

Table 3: The performance of TE methods on the GLUE benchmark, where MRR and the mean Spearman coefficient µ_ρ on each type of task (Sen. for single-sentence classification, Para. for paraphrase, and Infer. for inference) and on all tasks (Overall) are reported. Moreover, the mean training time µ_tt and mean estimating time µ_et are listed to show method efficiency, where "-" means not applicable. For model similarity-based methods, the subscript indicates the type of target model, e.g., DSE_ALBERT means DSE implemented with ALBERT as the target model.

Table 4: The performance comparison of methods on the GLUE benchmark when different affinity functions (Euclidean distance, cosine distance, and correlation distance) are employed, where the MRR score, mean Spearman coefficient µ_ρ, and mean estimating time µ_et are reported.

Table 5: The best fine-tuning performances of candidate PLMs on the GLUE dev sets as reported on HuggingFace, where the metric is the Matthews Correlation Coefficient (MCC) for CoLA and accuracy for the other datasets.

Table 6: The reciprocal rank scores, Spearman correlation coefficients, and estimating time of TE methods on each GLUE dataset.