Predicting Fine-Tuning Performance with Probing

Large NLP models have recently shown impressive performance in language understanding tasks, typically evaluated by their fine-tuned performance. Alternatively, probing has received increasing attention as a lightweight method for interpreting the intrinsic mechanisms of large NLP models. In probing, post-hoc classifiers are trained on "out-of-domain" datasets that diagnose specific abilities. While probing language models has led to insightful findings, these analyses appear disjointed from the development of the models. This paper explores the utility of probing deep NLP models to extract a proxy signal widely used in model development: the fine-tuning performance. We find that the accuracies of only three probing tests suffice to predict the fine-tuning performance with errors 40% - 80% smaller than baselines. We further discuss possible avenues where probing can empower the development of deep NLP models.


Introduction
Large-scale neural models have recently demonstrated state-of-the-art performance in a wide variety of tasks, including sentiment detection, paraphrase detection, linguistic acceptability, and entailment detection (Devlin et al., 2019; Radford et al., 2019; Peters et al., 2018). Developing systems for these tasks usually involves two stages: a pre-training stage, where the large neural models gain linguistic knowledge from weak supervision signals in massive corpora, and a fine-tuning stage, where the models acquire task-specific knowledge from labeled data. The fine-tuning results are widely used to benchmark the performances of neural models and refine the models' development procedures.
However, these fine-tuning results are summary statistics and do not paint the full picture of deep neural models (Ethayarajh and Jurafsky, 2020; Bender and Koller, 2020). As researchers are increasingly concerned with interpreting the intrinsic mechanisms of deep neural models, many data-driven assessment methods have been developed. These assessments usually follow the route of compiling a targeted dataset and running post-hoc analyses. To date, one of the most popular interpretation methods is probing. To probe a neural model, one trains a post-hoc predictor to recover labels from the representations the model computes. Probing analyses of deep neural models have revealed low-dimensional syntactic structures (Hewitt and Manning, 2019), common-sense knowledge (Petroni et al., 2019), and (to some extent) human-like abilities, including being surprised upon witnessing linguistic irregularity (Li et al., 2021) and reasoning about space and time (Aroca-Ouellette et al., 2021).
From the viewpoint of data-driven assessments, both fine-tuning and probing can reveal the abilities of deep neural networks, but they steer in different directions:

In-domain vs. out-of-domain. Fine-tuning uses in-domain data: we evaluate the models on the same distributions as those in deployment. Probing, however, uses out-of-domain data: instead of simulating the deployment environment, the targeted datasets focus on diagnosing specific abilities.

Inclusive vs. specific. In fine-tuning, edge cases should be included, so that unexpected behavior after deployment can be minimized (Ribeiro et al., 2020) and the fine-tuning results can be stable (Zhang et al., 2021). On the contrary, probing datasets are more specialized, so smaller datasets suffice.

High performance vs. faithful interpretation. While fine-tuning methods are mainly studied from an algorithmic perspective to enhance the performance of language models, probing methods aim at assessing the faithfulness of language models. To fulfill the former objective, fine-tuning is accompanied by efforts in pre-training, collecting more data, building better representations, and exploring novel model architectures (He et al., 2021; Sun et al., 2021; Wang et al., 2021b; Jiang et al., 2020). Conversely, the latter goal is pursued by borrowing inspiration from a variety of other sources, including psycholinguistic assessment protocols (Futrell et al., 2019; Li et al., 2022a), information theory (Voita and Titov, 2020; Pimentel and Cotterell, 2021; Zhu and Rudzicz, 2020), and causal analysis (Slobodkin et al., 2021; Elazar et al., 2021).
In short, probing assessments are more specialized (and therefore more flexible) and less computationally expensive. In contrast, the performance scores of fine-tuning assessments are more relevant to the design and training of deep neural models. Can probing be used in the development of deep neural models? This question involves two aspects:
• Feasibility: Are probing results relevant to model development?
• Operation: How should probing analyses be set up to obtain these useful results?
This paper attempts to answer both. For feasibility, we show that a crucial feedback signal in model development, the fine-tuning performance, can be predicted via probing results, indicating a positive answer to the feasibility question.
For operation, we run extensive ablation studies to simplify the probing configurations, leading to some heuristics for setting up probing analyses. We start with a battery of probing tasks and evaluate the utilities both task-wise and layer-wise (§5.2 - §5.3). We then reduce the number of probing configurations, showing that as few as 3 configurations can predict fine-tuning results with RMSEs between 40% and 80% smaller than the control baseline (§5.5). To further answer the operation question, we run ablation studies on different probing configurations, including probing methods (§5.6) and the number of data samples (§5.7). We also analyze the uncertainty of the results (§5.8). Our analysis shows the possibility of using probing in developing high-performance deep neural models.

Related Work
Performance prediction Xia et al. (2020) proposed a framework that predicts task performance using a collection of features, including the hyperparameters of the model and the percentage of text overlap between the source and target datasets. Srinivasan et al. (2021) extended this framework to a multilingual setting. Ye et al. (2021) considered the reliability of performance, an idea similar to that of Dodge et al. (2019). This paper differs from the performance-prediction literature in the set of features (we use the probing results as features) and, more importantly, in that we aim to show that probing results can improve the interpretability of the development procedures of large models.

Out-of-domain generalization
The out-of-domain generalization literature provides a variety of methods to improve the performance of out-of-domain classification. We defer to Wang et al. (2021a) for a summary. Gulrajani and Lopez-Paz (2020) ran empirical comparisons of many algorithms, and some theoretical analyses bound the performance of out-of-domain classification (Li et al., 2022b; Minsker and Mathieu, 2019). In our setting, the probing and fine-tuning datasets can be considered different domains, but rather than improving the out-of-domain performance, our analysis predicts it. A similar setting was presented in Kornblith et al. (2019), which studied the correlation between performance on ImageNet and the performance of transfer learning on a variety of image domains. Our setting focuses on text domains and uses specialized, small-sized probing datasets.
Probing, and the utility of LODNA The probing literature reveals various abilities of deep neural models, as summarized by Rogers et al. (2020); Manning et al. (2020); Belinkov (2021); Pavlick (2022). There have been some discussions on the utility of probing results. Baroni (2021) argued that these linguistic-oriented deep neural network analyses (LODNA) should treat deep neural models as algorithmic linguistic theories; otherwise, LODNA has limited relevance to theoretical linguistics. Recent LODNA literature has drawn interesting findings by comparing the mechanisms by which algorithms and humans respond to external stimuli, including the relative importance of sentences (Hollenstein and Beinborn, 2021). Probing results, when used jointly with evidence from datasets, can also predict the inductive bias of neural models (Lovering et al., 2021; Immer et al., 2021). As we show, probing results can explain the variance in, and even predict, the fine-tuning performance of neural NLP models.

Fine-tuning and probing
Multiple papers have explored the fine-tuning and probing paradigms. Probing is used as a post-hoc method to interpret linguistic knowledge in deep neural models during pre-training (Liu et al., 2019a), fine-tuning (Miaschi et al., 2020; Mosbach et al., 2020; Durrani et al., 2021; Yu and Ettinger, 2021; Zhou and Srikumar, 2021), and other stages of model development (Ebrahimi et al., 2021). From a performance perspective, probing can sometimes yield higher performance metrics (e.g., accuracy) than fine-tuning (Liu et al., 2019a; Hall Maudslay et al., 2020), and fine-tuning can benefit from additional data (Phang et al., 2018). We take a different perspective, considering how probing and fine-tuning results relate to each other and, more importantly, how probing signals can help in developing large neural models.

Methods
We present the overall analysis method and evaluation metric in this section. §5 elaborates on the detailed experimental settings.
Predicting fine-tuning performance A deep neural model M can be fine-tuned on task t to achieve performance A_t. Let S ∈ R^N be the test accuracies of probing classifications on model M, using N configurations. For example, a deep neural model M = RoBERTa can be fine-tuned to reach performance A_t = 0.85 on the t = RTE task. With post-hoc classifiers applied to the 12 layers of M, we can probe for 12 test accuracies on a probing task (e.g., detecting past vs. present tense), which constitute S.

To find patterns across a diverse collection of models, we regress over K models (described in §4.3). The collected probing results {S^{(k)}}_{k=1}^{K} can be used to predict the fine-tuning performances {A_t^{(k)}}_{k=1}^{K} via linear regression. Formally, this procedure optimizes N + 1 parameters θ ∈ R^{N+1} so that the regression error

\[
\mathrm{RMSE}(\theta) = \sqrt{\frac{1}{K}\sum_{k=1}^{K}\left(A_t^{(k)} - \theta^\top \big[S^{(k)}; 1\big]\right)^2}
\]

is minimized. This procedure has closed-form solutions implemented in various scientific computing toolkits (e.g., R and scipy). The minimum reachable RMSE is therefore

\[
\mathrm{RMSE} = \min_{\theta} \mathrm{RMSE}(\theta).
\]

RMSE reduction While the RMSE can evaluate the quality of this regression, it is insufficient for measuring the informativeness of S because of the discrepancy among the fine-tuning tasks t. Suppose we have two tasks, t_1 and t_2, where the probing results S support high-precision regressions with RMSE = 0.01 on both. On t_1, however, even features drawn from random distributions might suffice to reach RMSE = 0.02, while on the more difficult task t_2, random features could at best reach RMSE = 0.10. The probing results S are then more useful for t_2 than for t_1, but the RMSE itself does not capture this difference.
Considering this, we should further adjust against a baseline: the minimum reachable RMSE using random features,

\[
\mathrm{RMSE}_c = \min_{\theta} \sqrt{\frac{1}{K}\sum_{k=1}^{K}\left(A_t^{(k)} - \theta^\top \big[R^{(k)}; 1\big]\right)^2},
\]

where the random features R^{(k)} are drawn from N(0, 0.1). Overall, the reduction from the baseline is computed as

\[
\mathrm{RMSE\_reduction} = \frac{\mathrm{RMSE}_c - \mathrm{RMSE}}{\mathrm{RMSE}_c}.
\]

In the experiments, all RMSE and RMSE_c values follow 5-fold cross-validation. We report the RMSE_reduction as the score that measures the utility of S.
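To make the metric concrete, here is a minimal sketch of the computation, assuming the probing accuracies are collected in a (K, N) array and interpreting the 0.1 in N(0, 0.1) as a standard deviation; the function names and the use of scikit-learn are our illustrative choices, not the paper's released code.

```python
# A minimal sketch of the RMSE-reduction metric described above.
# Assumptions: `probing_feats` is a (K, N) array of probing accuracies,
# `finetune_acc` is a (K,) array of fine-tuning accuracies.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_predict

def cv_rmse(features, targets, seed=0):
    """5-fold cross-validated RMSE of a linear regressor (closed-form fit)."""
    folds = KFold(n_splits=5, shuffle=True, random_state=seed)
    preds = cross_val_predict(LinearRegression(), features, targets, cv=folds)
    return float(np.sqrt(np.mean((targets - preds) ** 2)))

def rmse_reduction(probing_feats, finetune_acc, seed=0):
    """(RMSE_c - RMSE) / RMSE_c, where RMSE_c uses random control features."""
    rmse = cv_rmse(probing_feats, finetune_acc, seed)
    rng = np.random.default_rng(seed)
    random_feats = rng.normal(0.0, 0.1, size=probing_feats.shape)
    rmse_c = cv_rmse(random_feats, finetune_acc, seed)
    return (rmse_c - rmse) / rmse_c
```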
Evaluation tasks and datasets

Fine-tuning tasks
We consider 6 binary classification tasks in GLUE (Wang et al., 2019) as fine-tuning tasks:
• RTE consists of a collection of textual entailment challenges. Given two sentences, the model decides whether one sentence entails the other.
• COLA (Warstadt et al., 2019) requires the model to determine whether a sentence is linguistically acceptable.
• MRPC (Dolan and Brockett, 2005) requires the model to identify whether a pair of sentences are paraphrases.
• SST2 (Socher et al., 2013) asks the model to output the sentiment positivity of movie reviews.
• QNLI contains questions and answers parsed from SQuAD (Rajpurkar et al., 2016). The model decides whether the answer actually answers the question.
• QQP tests whether the model can correctly determine whether a pair of Quora questions are synonymous.

Probing tasks
We use 7 probing tasks from SentEval (Conneau and Kiela, 2018), which can be approximately grouped into two categories:
• Syntactic: bigram shift (BShift) and tree depth (TreeDepth).
• Semantic: past vs. present tense (Tense), subject number (SubjNum), object number (ObjNum), semantic odd-man-out (SOMO), and coordination inversion (CoordInv).
These probing tasks span a range of linguistic abilities. In general, layers closer to the inputs (lower layers) of BERT contain more surface-level information, whereas higher layers contain more syntactic and semantic information (Tenney et al., 2019; Jawahar et al., 2019), though the actual location of different linguistic features may vary (Miaschi et al., 2020). The SentEval datasets are usually hundreds of times larger than what would suffice to support statistically significant comparisons (Zhu et al., 2022), so we randomly sample 1,200 data points per class, corresponding to around 1% of the original SentEval data.
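The per-class subsampling can be as simple as the following sketch; the `examples` layout and the seed are illustrative stand-ins for the SentEval loading code.

```python
# A hypothetical sketch of the per-class subsampling of a SentEval task.
# `examples` is assumed to be a list of (sentence, label) pairs.
import random
from collections import defaultdict

def subsample_per_class(examples, n_per_class=1200, seed=0):
    by_label = defaultdict(list)
    for sentence, label in examples:
        by_label[label].append((sentence, label))
    rng = random.Random(seed)
    sampled = []
    for items in by_label.values():
        sampled.extend(rng.sample(items, min(n_per_class, len(items))))
    rng.shuffle(sampled)  # avoid blocks of identical labels
    return sampled
```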

Pre-trained Language Models
We use several of the most widely used pre-trained language models for fine-tuning and probing, referring to them by their names on the HuggingFace Model Hub.

roberta-base (Liu et al., 2019b) pre-trains BERT (Devlin et al., 2019) on over 160GB of English corpora, using improved techniques including dynamic masking, large mini-batches, and masked language modeling without next-sentence prediction.
xlm-roberta-base (Conneau et al., 2020) is pre-trained on 2.5TB of Common Crawl data covering over 100 languages. The multilingual sources improve transferability across languages while compromising only a little accuracy on the English GLUE tasks (compared to the monolingual RoBERTa).
albert-base-v2 (Lan et al., 2020) shares parameters across layers and decomposes the vocabulary matrices into smaller matrices. These parameter-reduction techniques lower the computational resource requirements, allowing model pre-training to scale up further.
microsoft/deberta-base (He et al., 2021) uses separate attention vectors to model the content and the position of each word. During fine-tuning, DeBERTa adds adversarial perturbations to the normalized embeddings.
xlnet-base-cased (Yang et al., 2019) models different permutation orders of the contexts during pre-training. XLNet additionally uses attention to keep track of previous states, allowing the model to process contexts extending beyond fixed lengths.
Corrupted models. To increase the diversity of models, we corrupt the language models by running masked language modeling (MLM) fine-tuning on scrambled Wikipedia for 500, 1k, 2k, 4k, and 6k steps. This "model augmentation" procedure does not apply to XLNet, because scrambling the corpus produces a permutation of the context, which XLNet already models. In total, there are 25 language models, each containing 12 layers.
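The exact scrambling recipe was given in a footnote that is not reproduced here; one plausible reading, sketched below, permutes the tokens within each Wikipedia sentence before MLM training. Whether the scrambling operates at the word or subword level is our assumption.

```python
# A sketch of one plausible corruption recipe: permute the word order of
# each Wikipedia sentence, then continue MLM training on the result.
import random

def scramble(sentence: str, rng: random.Random) -> str:
    tokens = sentence.split()  # word-level scrambling is an assumption
    rng.shuffle(tokens)
    return " ".join(tokens)

rng = random.Random(0)
print(scramble("the quick brown fox jumps over the lazy dog", rng))
```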

Fine-tuning methods
For all fine-tuning classifications, we use the AutoModelForSequenceClassification framework from HuggingFace Transformers (Wolf et al., 2020). The model is trained with an AdamW optimizer, a collection of initial learning rates, and a batch size of 4. Since the GLUE tasks do not publicize the test-set labels, we use the best dev-set performance as the fine-tuning result. For reproducibility, we fix the random seed to 42 in PyTorch (Paszke et al., 2019). Additional details, including runtime and computational resources, are in Appendix A.
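A minimal sketch of this setup follows; the learning rate is a stand-in for the grid described above, and the toy single-example batch replaces the GLUE data loader.

```python
# A minimal sketch of the fine-tuning setup (a single toy training step).
# The learning rate and the toy batch are illustrative stand-ins.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

torch.manual_seed(42)  # fixed seed, as in the paper

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModelForSequenceClassification.from_pretrained(
    "roberta-base", num_labels=2)
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

batch = tokenizer(["an example sentence"], return_tensors="pt")
labels = torch.tensor([1])

model.train()
loss = model(**batch, labels=labels).loss
loss.backward()
optimizer.step()
optimizer.zero_grad()
```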
Figure 1: Fine-tuning performance. The color coding reflects the number of corruption steps on scrambled Wikipedia, with 0 corresponding to the "vanilla" language models.

Probing methods
There are many methods to probe a neural network. In this paper, we use a post-hoc classifier to predict a target (the "probing task target") from the representations of the first token (CLS). We run through a collection of scikit-learn (Pedregosa et al., 2011) classifiers, choose the best one by dev accuracy, and take its test accuracy as the probing result S. Additional details, including runtime and computational resources, are in Appendix A.
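A sketch of this selection loop is below; the three candidate classifiers are examples (MLP-20 and RandomForest-100 are named in §5.6), not the paper's full list.

```python
# A sketch of the probing step: fit several scikit-learn classifiers on
# cached CLS representations, pick the best by dev accuracy, and report
# its test accuracy as the probing result S.
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier

def probe(X_train, y_train, X_dev, y_dev, X_test, y_test):
    candidates = [
        LogisticRegression(max_iter=1000),
        MLPClassifier(hidden_layer_sizes=(20,)),   # "MLP-20"
        RandomForestClassifier(n_estimators=100),  # "RandomForest-100"
    ]
    best_clf, best_dev = None, -1.0
    for clf in candidates:
        clf.fit(X_train, y_train)
        dev_acc = clf.score(X_dev, y_dev)
        if dev_acc > best_dev:
            best_clf, best_dev = clf, dev_acc
    return best_clf.score(X_test, y_test)
```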

Fine-tuning performance
As an exploratory analysis, Figure 1 plots the distributions of the GLUE fine-tuning performances. The additional corruption steps cause more significant fine-tuning performance drops on RTE and MRPC than on the other tasks. Moreover, on QQP, the dev accuracies of roberta-base (and its corrupted variants) are above 0.90, while most other models reach around 0.80.

Which probing task is most informative?
We start by testing the predictability of the results from only one probing task. For each probing task, we concatenate the 12 layer-wise probing results as features and predict the fine-tuning performance using linear regression. Table 1 shows the percentage of RMSE reduction from the baseline, using all layers from one probing task. There is no definitive answer to "which probing task best predicts all fine-tuning tasks" but, depending on the linguistic abilities each task targets, there are some regularities. For example, the 'number counting' probing tasks do not predict the fine-tuning performance on RTE, the textual entailment recognition task. On the other fine-tuning tasks (COLA, QNLI, MRPC, SST2, QQP), however, each probing task shows positive RMSE reduction, signaling the ability to predict fine-tuning performance.

Which layers are the most indicative?
In the regression experiments of §5.2, we treated each feature as equally important. However, a one-way ANOVA shows that some layers are more indicative than others, as Table 2 shows. For example, the probing results of tree_depth (at layer 1) and object_number (at layer 1) explain significant variance on all fine-tuning tasks.
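The paper does not spell out the exact ANOVA setup; one simple reading, sketched below, tests each (task, layer) feature separately against the fine-tuning performance, using scikit-learn's univariate F-test for regression in that role.

```python
# A sketch of a per-feature significance check. We assume the one-way
# ANOVA is applied feature by feature; f_regression computes a univariate
# F statistic and p-value for each feature against the target.
import numpy as np
from sklearn.feature_selection import f_regression

def significant_features(probing_feats, finetune_acc, alpha=0.05):
    f_scores, p_values = f_regression(probing_feats, finetune_acc)
    return np.flatnonzero(p_values < alpha)  # indices of (task, layer) features
```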
Note that the layers with the most predictive power should not be confused with those containing the richest linguistic knowledge. The former corresponds to the probing results that explain the most variance, while the latter corresponds to the probing results with the highest accuracies.

Only one layer per probing task
Instead of probing all 12 layers, could using the probing results from only one layer per probing task be beneficial? Following Table 2, we use the layers shown to explain significant portions of variance for the most fine-tuning tasks. The results are also included in Table 1. When reducing the number of features to around half (12 to 7), "one layer per probing task" achieves larger RMSE reductions on RTE and SST2. However, the results on the other fine-tuning tasks indicate that alternative feature-selection methods might help find a more predictive feature set.

Can we predict with only 3 features?
This experiment further reduces the number of features while maximizing the RMSE reductions. We iterate through all possible 3-feature combinations of the 12 × 7 = 84 probing features for each fine-tuning task and report the largest RMSE reduction in predicting the fine-tuning performance.
Table 1 shows the results and the corresponding features. The prediction with three features reduces the most RMSE on RTE, SST2, QNLI, and QQP. On COLA and MRPC, the RMSE reductions by the top three features are at most 6% smaller than those of the best previous configurations, which involved many more probing features.
The results show the utility of probing: it is possible to predict the fine-tuning performance by probing with as few as three configurations (each configuration using one probing task on one layer).
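The search described above is small enough to run exhaustively (84 choose 3 is 95,284 combinations); here is a sketch, reusing rmse_reduction() from the Methods sketch.

```python
# A sketch of the exhaustive 3-feature search over the 84 probing features
# (12 layers x 7 tasks), reusing rmse_reduction() defined earlier.
from itertools import combinations

def best_three_features(probing_feats, finetune_acc):
    best_score, best_triple = float("-inf"), None
    for triple in combinations(range(probing_feats.shape[1]), 3):
        feats = probing_feats[:, list(triple)]
        score = rmse_reduction(feats, finetune_acc)
        if score > best_score:
            best_score, best_triple = score, triple
    return best_score, best_triple
```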

Ablation: probing configuration
To further simplify the probing procedure, we run an ablation study. Instead of probing with a battery of post-hoc classifiers (as described in §4.5), we test whether the probing results from each individual classifier can reproduce the findings of §5.2 - §5.5.
Table 3 shows the maximum RMSE reductions using different choices of probes. A perhaps surprising finding is that the probes selected by the "highest-accuracy" criterion do not always produce the most valuable results. To predict fine-tuning performance, directly specifying the probing method as MLP-20 or RandomForest-100 may instead be preferable.
As a side note, among the 48 results presented in Table 3, only 9 are not achieved by the "best-3-features" method (including the 2 shown in Table 1). This contrast emphasizes the importance of feature selection when configuring probes.

Ablation: dataset size
The findings in §5.2 - §5.6 show that as few as 1,200 samples per class (around 1% of the total data) are sufficient to produce useful findings. What if we further reduce the sizes of the probing datasets? Here, we repeat §5.2 and §5.5 with probing results from only 400 samples per class. While we can still reduce RMSE with only 400 samples per class, the probing results are generally not as useful as those from 1,200 samples. Among the 48 configurations, the probing results from 400 samples have worse RMSE reductions in 11 configurations, but better in 5. Detailed results are included in Table 4.

Uncertainty analysis
Our method compares the maximum RMSE reductions against the baseline RMSE_c (regressing from features drawn from a Gaussian), which may be affected by the random seeds. Here we describe an error analysis on the baseline regressor results of §5.2 - §5.5. We run N = 100 Monte Carlo simulations on each configuration of regression from 3, 7, and 12 features, respectively, record the RMSE_c, and analyze the uncertainty. We use the variation of RMSE_c (as measured by Std(RMSE_c)) relative to its scale (as measured by Mean(RMSE_c)) to describe the uncertainty. As shown in Table 5, the uncertainty remains relatively stable across the choice of regression tasks but increases with the number of features. This result favors using fewer probing results as features.
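A sketch of this simulation, reusing cv_rmse() from the Methods sketch:

```python
# A sketch of the Monte Carlo estimate of the baseline uncertainty:
# re-draw the random control features 100 times and report
# Std(RMSE_c) / Mean(RMSE_c), reusing cv_rmse() defined earlier.
import numpy as np

def baseline_uncertainty(finetune_acc, n_models, n_features, n_sims=100):
    rmses = []
    for seed in range(n_sims):
        rng = np.random.default_rng(seed)
        random_feats = rng.normal(0.0, 0.1, size=(n_models, n_features))
        rmses.append(cv_rmse(random_feats, finetune_acc, seed))
    rmses = np.asarray(rmses)
    return rmses.std() / rmses.mean()  # relative uncertainty of RMSE_c
```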
Note that these uncertainty values are nontrivial. Take COLA as an example: to regress the fine-tuning performance, a 3-feature setting achieves a 75.66% RMSE reduction relative to RMSE_c, but RMSE_c itself has 5.46% uncertainty. This translates to around 7.22% uncertainty for the RMSE-reduction results (Table 1).
Could we reduce the uncertainty by using alternative evaluation metrics, such as the raw RMSE or the percentage of explained variance (ExplVar), instead of introducing a control task? In addition to adjusting for dataset artifacts, the control task provides a baseline for understanding utility. While the RMSE is always positive and ExplVar is almost always above 90%, the RMSE reduction provides a clearer picture of the utility of the probing features.
Can the probing results distinguish the originating language models?
Since the 25 models come from only 5 language models (RoBERTa, XLM, ALBERT, DeBERTa, XLNet), one may wonder whether the "model augmentation" procedure "shuffles" the language models sufficiently: if it does, it should be hard to distinguish the originating language models. We use 5-class logistic regression from scikit-learn. For every combination of three features, we compute the accuracy following 5-fold cross-validation. Across all combinations of 3 features, the probing features reach an accuracy 0.0027 higher (sd = 0.0109) than random features, which is statistically significant. However, the maximum reachable accuracy is 0.08, whereas even a trivial predictor that always outputs "RoBERTa" has an expected accuracy of 0.24 (6 of the 25 models derive from RoBERTa). The small accuracies show that our "model augmentation" procedure (§4.3) produces sufficiently distinct models.
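A sketch of this check for one feature triple follows; the integer label encoding is an assumption, and plain (unstratified) KFold is used because one model family has a single member.

```python
# A sketch of the model-identification check: 5-fold cross-validated
# logistic regression predicting which of the 5 base models a probing
# feature triple came from. `model_labels` holds indices 0..4 per model.
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

def identification_accuracy(feature_triple, model_labels):
    clf = LogisticRegression(max_iter=1000)
    folds = KFold(n_splits=5, shuffle=True, random_state=0)
    return cross_val_score(clf, feature_triple, model_labels, cv=folds).mean()
```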

Discussion
Can probing results generalize to non-classification tasks? All fine-tuning and probing tasks in this paper are text-based classification problems. In the interpretable-NLP literature, probing analyses also apply to other categories of deep neural networks, including translation (Belinkov et al., 2017; Zhang and Bowman, 2018). If probing shows that a translation model's representations contain both rich syntactic information (as illustrated by high probing scores in BShift, TreeDepth, etc.) and rich semantic information (as illustrated by high probing scores in Tense, SubjNum, etc.), then we will not be surprised to observe that this model achieves a high BLEU score. That said, the extent to which probing results remain predictive of BLEU scores needs further analysis, which we leave for future work.
Probing is computation-friendly Compared to fine-tuning, probing evaluations require less computation. Fine-tuning the 6 GLUE tasks takes around 30 GPU hours in total, while probing the 7 tasks (on all 12 layers) takes 0.7 GPU hours to cache the representations and 1.3 CPU hours to probe. We elaborate on the computational budgets in Appendix A. Probing is far more efficient because it does not change the parameters of the neural model: we only need one pass through the model to cache the representations, whereas fine-tuning needs gradients to update the parameters. There are methods to reduce the computation costs of fine-tuning: momentum can accelerate convergence (Kingma and Ba, 2015; Dozat, 2016); memory usage can be reduced (Gomez et al., 2017; Behrmann et al., 2019); empirically, limiting precision can also accelerate optimization (Shin et al., 2021); specially designed structures, including Adapters (Houlsby et al., 2019) and LoRA (Hu et al., 2022), are effective as well; and prefix tuning and prompt tuning are lightweight alternatives to fine-tuning (He et al., 2022; Le Scao and Rush, 2021; Li and Liang, 2021), with Liu et al. (2021b) summarizing many prompt-related approaches. Even so, probing remains competitive in terms of computational time.
Fine-tuning tasks need more specifications Currently, the most popular leaderboards for natural language understanding consist of fine-tuning tasks. A criticism of these leaderboards is underspecification (D'Amour et al., 2020): the short task descriptions can hardly be inclusive enough to specify the precise abilities required to complete the tasks. To better understand the underspecification problem, researchers have recently developed probing datasets (McCoy et al., 2019; Warstadt et al., 2020). Probing results on these datasets have been (indirectly) used in developing deep neural models; performance prediction is a more direct application.

Fine-grained evaluations improve transparency
Leaderboard tasks should be customized to their users (Ethayarajh and Jurafsky, 2020). The diversity of probing datasets offers flexible choices to NLP researchers, supporting diversified considerations for the consumers. Some recently proposed fine-grained leaderboards allow researchers to answer questions like "where does model A outperform model B" (Ma et al., 2021; Narayan et al., 2021; Ruder et al., 2021; Liu et al., 2021a). The probing literature can provide many more datasets for building diverse leaderboards.

Incorporating probing in model development
While model developers are already busy, probing can still benefit the development of large models, mostly through a multi-dimensional feedback mechanism. The probing datasets introduce targeted knowledge that complements the training datasets of large NLP models. During development, model developers can use the probing evaluation scores to select good checkpoints from which to resume or proceed.

Conclusion
This paper shows that probing results can be relevant to developing deep NLP models by predicting a proxy signal, the fine-tuning performance. We show that as few as three probing accuracy scores suffice to predict fine-tuning results with RMSEs 40% - 80% smaller than baselines, which can dramatically improve the efficiency of deep learning pipelines. Based on several ablation studies, we recommend MLP-20 and RandomForest-100 over other probing methods and show that probing results from as few as 400 samples per class can still be predictive. Probing analyses provide rich resources, and we show that their results are closely related to fine-tuning performance. We call for further applications of probing in the development of deep NLP models.

Limitations
Evaluating large language models using only perplexity is uni-dimensional, while evaluations using fine-tuning tasks require modifying many parameters and are hence more costly than probing. Our paper aims at paving a path towards multi-dimensional evaluations of models within a computational budget, so instead of providing a fixed recipe (fixing the probing datasets and specifying which layers to probe), we provide a general framework and use experiments to show the informativeness and potential utility of the probing results. While probing results are shown to be informative in our experiments, many other methods (e.g., LoRA and prefix tuning) could optimize similar numbers of parameters; the empirical verification of these methods is left to future work. Similarly, the problem settings considered in this paper are all classification problems. We believe the approach should generalize to other settings (e.g., BLEU scores on sequence tasks), yet the empirical verification is left to future work.

Table 1: RMSE reduction from the baseline. A larger value indicates that the probing results are more indicative of the fine-tuning performance; a small (or even negative) value means the probing results are not informative compared to random features. The bold-font configurations are those with the highest RMSE reductions for predicting each fine-tuning task (i.e., within each column).

Table 3: Maximum RMSE reductions using different probing configurations. The bold-font numbers are the maximum values in each column.

Table 4: RMSE reduction from the baseline, using probing results with 400 data samples per class. The colored results differ from the results with 1,200 data samples (Table 1), for better or worse, by more than the estimated uncertainty margins in §5.8, i.e., 5% and 15% for 3 and 12 features, respectively.