An Empirical Study on Hyperparameter Optimization for Fine-Tuning Pre-trained Language Models

The performance of fine-tuning pre-trained language models largely depends on the hyperparameter configuration. In this paper, we investigate the performance of modern hyperparameter optimization (HPO) methods on fine-tuning pre-trained language models. First, we study and report three HPO algorithms' performance when fine-tuning two state-of-the-art language models on the GLUE dataset. We find that, given the same time budget, HPO often fails to outperform grid search for two reasons: insufficient time budget and overfitting. We propose two general strategies and an experimental procedure to systematically troubleshoot HPO's failure cases. By applying the procedure, we observe that HPO can succeed with more appropriate search-space and time-budget settings; however, in certain cases overfitting remains. Finally, we make suggestions for future work. Our implementation can be found at https://github.com/microsoft/FLAML/tree/main/flaml/nlp/


Introduction
In recent years, deep learning and pre-trained language models (Devlin et al., 2019; Liu et al., 2019; Clark et al., 2020; He et al., 2021) have achieved great success in the NLP community. It has become common practice for researchers and practitioners to fine-tune pre-trained language models on downstream NLP tasks. For example, the HuggingFace transformers library (Wolf et al., 2020) is ranked No. 1 among the most starred Python NLP libraries on GitHub.1
As with other deep learning models, the performance of fine-tuning pre-trained language models largely depends on the hyperparameter configuration. A different setting of the hyperparameters may cause a significant performance drop, turning a state-of-the-art model into a poor one. Methods for tuning hyperparameters can be categorized as (1) traditional approaches such as manual tuning and grid search, and (2) automated HPO methods such as random search and Bayesian optimization (BO). Manual tuning requires a large amount of manual effort, whereas grid search often suffers from low efficiency due to the exponential increase in time cost with the number of hyperparameters. Automated HPO methods were proposed to overcome these disadvantages. Recently, automated HPO methods have also become increasingly popular in the NLP community (Zhang and Duh, 2020; Dodge et al., 2019). For example, Bayesian optimization (BO) (Zhang and Duh, 2020) and population-based training (Jaderberg et al., 2017) have both proven helpful for improving the performance of the transformer model (Vaswani et al., 2017) for neural machine translation. The HuggingFace library also added native support for HPO in a recent update (version 3.1.0, Aug 2020).
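For concreteness, the native support mentioned above is exposed through a single entry point, `Trainer.hyperparameter_search`. The sketch below is only a minimal configuration illustration, not the setup used in this paper; the model name, trial count, and the `train_ds`/`val_ds` dataset variables are illustrative assumptions and must be prepared by the user.

```python
# Hypothetical sketch of HuggingFace's native HPO entry point (transformers >= 3.1.0).
# Model name, trial count, and the dataset variables are illustrative assumptions.
from transformers import AutoModelForSequenceClassification, Trainer, TrainingArguments

def model_init():
    # hyperparameter_search re-instantiates the model for every trial
    return AutoModelForSequenceClassification.from_pretrained(
        "google/electra-base-discriminator"
    )

trainer = Trainer(
    model_init=model_init,
    args=TrainingArguments(output_dir="hpo_out"),
    train_dataset=train_ds,   # assumed to be prepared elsewhere
    eval_dataset=val_ds,      # assumed to be prepared elsewhere
)

best_run = trainer.hyperparameter_search(
    direction="maximize",     # maximize the validation metric
    backend="ray",            # Ray Tune backend, as used in this paper
    n_trials=8,
)
```

This configuration sketch is not runnable as-is (it needs real datasets and GPUs); it is included only to show where HPO plugs into a typical fine-tuning script.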
With this improved support, users can now easily access a variety of HPO methods and apply them to their fine-tuning tasks. However, the effectiveness of this step is less well understood. To bridge this gap, we conduct an experimental study on fine-tuning pre-trained language models using the HuggingFace library. The study is motivated by the following research questions. First, can automated HPO methods outperform traditional tuning methods such as grid search? Second, on which NLP tasks do HPO methods work better? Third, if HPO does not work well, how can we troubleshoot the problem and improve its performance?
To answer these questions, we start with a simple initial study (Section 4) examining the performance of three HPO methods on two state-of-the-art language models on the GLUE dataset. The time budget for HPO in the initial study is set to be the same as that of grid search. Results of the initial study show that HPO often fails to match grid search's performance. The reasons for HPO's failures are twofold: first, the same budget as grid search may be too small for HPO; second, HPO overfits the task. With these observations, we propose two general strategies for troubleshooting the failure cases in HPO, as well as an overall experimental procedure (Figure 1). By applying the procedure (Section 5), we find that by controlling overfitting with a reduced search space and using a larger time budget, HPO outperforms grid search in more cases. However, the overfitting problem persists on certain tasks even when we search only over the learning rate and batch size. Finally, we make suggestions for future work (Section 7).
The main contributions of this work are:
• We empirically study the performance of three HPO methods on two pre-trained language models on the GLUE benchmark;
• We design an experimental procedure which proves useful for systematically troubleshooting HPO failures in fine-tuning;
• We report and analyze the execution results of the experimental procedure, which sheds light on future work.

Definition of HPO on Language Model Fine-Tuning
Given a pre-trained language model, a fine-tuning task, and a dataset containing D_train, D_val, D_test, the goal of a hyperparameter optimization algorithm is to find a hyperparameter configuration c such that, when trained under configuration c, the model's performance on the validation set D_val is optimized. Formally, the goal of HPO is to find

c* = arg max_{c ∈ S} f(c, D_train, D_val),

where S is called the search space of the HPO algorithm, i.e., the domain from which the hyperparameter values can be chosen. The function f(·, ·, ·) is called the evaluation protocol of HPO, which is defined by the specific downstream task. For example, many tasks in GLUE define f as the validation accuracy. If a task has multiple protocols, we fix f to one of them.2 After finding c*, the performance of HPO is evaluated using the performance of the model trained with c* on the test set D_test.
To fairly compare the performance of different HPO algorithms, the above optimization problem is defined with a constraint on the maximum running time of the HPO algorithm, which we call the time budget of the algorithm, denoted as B. Under budget B, the HPO algorithm can try a number of configurations c_1, c_2, …, c_n. The process of fine-tuning with configuration c_i is called a trial. Finally, we call the process of running an HPO algorithm A once an HPO run.
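The notions above (trial, HPO run, and budget B) can be made concrete with a small sketch. The sampler and the evaluation protocol f below are toy stand-ins, not the paper's actual fine-tuning loop:

```python
import random
import time

def hpo_run(sample_config, evaluate, budget_seconds):
    """One HPO run: try trials c_1..c_n until the time budget B is spent,
    then return the configuration with the best validation score."""
    best_config, best_score = None, float("-inf")
    deadline = time.monotonic() + budget_seconds
    while time.monotonic() < deadline:
        c = sample_config()   # one trial's hyperparameter configuration
        score = evaluate(c)   # evaluation protocol f on D_val (a toy proxy here)
        if score > best_score:
            best_config, best_score = c, score
    return best_config, best_score

# Toy stand-ins: search over the learning rate only; the score peaks at lr = 1e-4.
sample = lambda: {"lr": 10 ** random.uniform(-5, -3)}
f = lambda c: -abs(c["lr"] - 1e-4)
best, score = hpo_run(sample, f, budget_seconds=0.05)
```

In the real setting, `evaluate` is a full fine-tuning run followed by validation, so the budget admits only a handful of trials rather than thousands.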

Factors of the Study
In this paper, we conduct an empirical study to answer the research questions in Section 1. First, can automated HPO methods outperform grid search? The answer depends on multiple factors: the NLP task on which HPO and grid search are evaluated, the pre-trained language model being fine-tuned, the time budget, the search spaces for grid search and the HPO algorithm, and the choice of HPO algorithm. To provide a comprehensive answer, we would need to enumerate multiple settings for these factors. However, it is infeasible to enumerate all possible settings for each factor; for instance, there exist unlimited choices for the search space. To accomplish our research within reasonable computational resources,3 for each factor we only explore the most straightforward settings. For example, the search space for grid search is set to the default grid configuration recommended for fine-tuning (Table 1), and the search space for HPO is set to a straightforward relaxation of the grid configuration. We explain the settings for each factor in detail below.

Pre-trained Language Models. In this paper, we focus on two pre-trained language models: the Electra-base model (Clark et al., 2020) and the RoBERTa-base model (Liu et al., 2019). Electra and RoBERTa are among the best-performing models on the GLUE leaderboard as of Jan 2021.4 Another reason for choosing these two models is that they both provide a simple search space for grid search, which we find helpful for designing our HPO search space. We use both models' implementations from the transformers library (Wolf et al., 2020) (version 3.4.0). Among the available sizes of RoBERTa and Electra (large, base, small), we choose the base size, because the large models do not fit into our 2-hour budget.5 With the 2-hour time constraint, we prune tasks where grid search takes longer than two hours: for Electra, QQP is pruned, whereas for RoBERTa, SST, QNLI, QQP, and MNLI are pruned.
Search Space for Grid Search and HPO. It is generally difficult to design an HPO search space from scratch. In our problem, this difficulty is further amplified by the limited computational resources. Fortunately, most papers on pre-trained language models recommend one or a few hyperparameter configurations for fine-tuning, and we use them as the configurations for grid search. For HPO, the performance depends on the choice of search space; e.g., it takes more resources to explore a large space than a smaller space close to the best configuration. Due to the time budget limits, we focus on a small space surrounding the recommended grid search space.6

As the performance of HPO depends on the time budget, to compare grid search and HPO we first conduct an initial study in which the time budget of HPO is set to the same as that of grid search. For the rest of this paper, we use aGST to denote a time budget of a× the running time of grid search. Table 3 shows the experimental results on Electra and RoBERTa using 1GST. For each (HPO method, NLP task) pair, we repeat the randomized experiments 3 times and report the average scores. We analyze the results in Section 4.1.

6 The grid search spaces in Table 1 are from Table 7 of Electra and Table 10 of RoBERTa. For Electra, we fix the hyperparameters for Adam; we skip the layer-wise learning rate decay because it is not supported by the HuggingFace library. While Electra's original search space for the learning rate is [3e-5, 5e-5, 1e-4, 1.5e-4], we have skipped the learning rate 5e-5 in our experiment.
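The relationship between the two spaces can be sketched as follows: grid search enumerates the recommended discrete values, while the HPO space is a continuous relaxation surrounding them. The values below are illustrative, not the exact configurations of Table 1:

```python
import itertools
import math
import random

# Illustrative grid in the style of the recommended fine-tuning configurations.
grid = {
    "learning_rate": [3e-5, 1e-4, 1.5e-4],
    "batch_size": [16, 32],
}

# Grid search: the Cartesian product of the discrete values.
grid_configs = [dict(zip(grid, values)) for values in itertools.product(*grid.values())]

def sample_hpo_config():
    """HPO space: a continuous relaxation surrounding the grid values
    (log-uniform learning rate spanning the grid's range)."""
    lo, hi = min(grid["learning_rate"]), max(grid["learning_rate"])
    return {
        "learning_rate": math.exp(random.uniform(math.log(lo), math.log(hi))),
        "batch_size": random.choice(grid["batch_size"]),
    }
```

Note that every grid configuration lies inside the HPO space, which is why grid search's performance is a natural baseline for HPO in this study.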

Analysis of the Initial Results
Electra. Comparing the performance of grid search and HPO in Table 3, we can make the following findings. First, HPO fails to match grid search's validation accuracy on the following tasks: RTE, STS-B, SST, and QNLI. On certain tasks such as QNLI and RTE, grid search outperforms HPO by a large margin. Considering that the grid search space is a subspace of the HPO space, this result shows that with the same time budget as grid search (i.e., approximately 3 to 4 trials), it is difficult to find a configuration which works better than the recommended configurations. Indeed, with 3 to 4 trials, it is difficult to explore the search space. Although ASHA and BO+ASHA both search over more trials by leveraging early stopping (Li et al., 2020), the trial numbers are still limited (the average trial numbers for the experiments in Table 3 can be found in Table 6 of the appendix). Second, among the tasks where HPO outperforms grid search's validation accuracy, there are 2 tasks (WNLI, MRPC) where the test accuracy of HPO is lower than that of grid search; that is, the HPO algorithm overfits the validation dataset. Overfitting in HPO generally happens when the accuracy is optimized on a limited number of validation data points and cannot generalize to unseen test data (Feurer and Hutter, 2019). Zhang et al. (2021) also found that fine-tuning pre-trained language models is prone to overfitting when the number of trials is large, though they do not compare HPO and grid search. Finally, by searching over more trials, ASHA and BO+ASHA slightly outperform random search in validation accuracy, but their test accuracy is often outperformed by random search.

RoBERTa. From RoBERTa's results in Table 3, we can see that the average validation accuracy of HPO outperforms grid search on all tasks except CoLA. It may look like HPO is more effective; however, most of the individual runs in Table 3 overfit.
As a result, HPO for fine-tuning RoBERTa is also prone to overfitting compared with grid search. The complete lists of the overfitting cases in Table 3 can be found in Table 8 and Table 9 of Appendix A.3.

A General Experimental Procedure for Troubleshooting HPO Failures
Since Table 3 shows that HPO cannot outperform grid search using 1GST and is prone to overfitting, we propose two general strategies to improve HPO's performance. First, we increase the time budget for HPO so that HPO can exploit the space with more trials. Second, to control overfitting, we propose to reduce the search space. More specifically, we propose to fix the values of certain hyperparameters to their default values in the grid configuration (Table 1). The reason is that overfitting can be related to certain hyperparameter settings of the model. For example, it was shown in ULMFit (Howard and Ruder, 2018) that using a non-zero warmup step number can help reduce overfitting. Intuitively, a larger search space is more prone to overfitting. For example, with a warmup search space of (0, 0.2), the warmup steps of the best trial found by HPO may be much smaller or larger than the steps used by grid search. Other hyperparameters related to overfitting in fine-tuning include the learning rate (Smith and Le, 2017), the batch size, and the dropout rates (Srivastava et al., 2014; Loshchilov and Hutter, 2018, 2019).
Our proposed procedure for troubleshooting HPO failures is depicted in Figure 1. Starting from the full search space and 1GST, we test the HPO algorithm a few times. If any overfitting is observed, we reduce the search space and go back to testing the HPO algorithm again. On the other hand, if no overfitting is observed but HPO does not outperform grid search, we increase the time budget and again go back to testing the HPO algorithm. We continue this procedure until one of the following conditions is met: first, HPO successfully outperforms grid search; second, the search space cannot be further reduced, in which case we conclude that HPO overfits the task; third, the time budget cannot be further increased under a user-specified threshold, in which case whether HPO can outperform grid search remains to be determined for this specific task. The procedure is instantiated with two components: a list of candidate time budgets and a strategy for reducing the search space. For the first component, we use a relatively small list of time budget options, {1GST, 4GST}. For the second component, it is difficult to guarantee a reduction in overfitting by fixing a specific hyperparameter to its grid search values; when choosing the hyperparameter to fix, we refer to the configurations of the best trials which cause the HPO results to overfit.
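The control flow of the procedure can be summarized as a loop. The predicates below are placeholders for the manual checks described in the text (detecting overfitting and comparing against grid search), so this is only a sketch of Figure 1's logic:

```python
def troubleshoot(run_hpo, spaces, budgets):
    """Sketch of the troubleshooting procedure. `spaces` is ordered from full
    to minimal, `budgets` from small to large (e.g., [1, 4] in units of GST).
    `run_hpo(space, budget)` returns (outperforms_grid, overfits) booleans."""
    si, bi = 0, 0
    while True:
        outperforms, overfits = run_hpo(spaces[si], budgets[bi])
        if overfits:
            if si + 1 == len(spaces):
                return "HPO overfits task"   # space cannot shrink further
            si += 1                          # reduce the search space
        elif outperforms:
            return "HPO succeeds"
        else:
            if bi + 1 == len(budgets):
                return "to be determined"    # budget cannot grow further
            bi += 1                          # increase the time budget
```

For example, an HPO stub that overfits on the full space but succeeds on a reduced space terminates with "HPO succeeds" after one space reduction.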

Choosing the Hyperparameter to Fix
Electra. To decide which hyperparameter to fix, we examine the best trial's configuration for the overfitting HPO runs (compared with the grid search performance). If there is a pattern in a certain hyperparameter across these configurations (e.g., warmup ratio below 0.1 for Electra), then by fixing that hyperparameter to its grid search value, we can exclude the other values which may be related to overfitting. We apply this analytical strategy to the initial Electra results in Table 3. Among the 72 runs, 9 runs overfit compared with grid search. For each such run, we list the hyperparameter configuration of the best trial in Table 8 of Appendix A.3. For Electra, we skip showing the weight decay in Table 8, because the HPO configuration is never smaller than the grid configuration and thus does not affect the result of the analysis. For comparison, we also list the hyperparameter values of the best trial in grid search. To improve the readability of Table 8, we use 4 different colors (defined in Appendix A.3) to denote the comparison between the values of the best trial in HPO and the values of the best trial in grid search. From Table 8, we observe that the warmup ratios are often significantly lower than 0.1. We skip the analysis of the learning rate because its search space (log-uniform over (2.99e-5, 1.51e-4)) cannot be further reduced without losing coverage of the grid configurations or continuity; we also skip the weight decay because no trial's value can be smaller than 0. Following this empirical observation, we hypothesize that fixing the warmup ratio to 0.1 can help reduce overfitting in Electra. We use S_full to denote the original search space and S_-wr to denote the search space obtained by fixing the warmup ratio to 0.1. If HPO overfits in both S_full and S_-wr, the procedure reduces the search space to the minimal continuous space S_min containing the grid search space, which searches over the learning rate only.
RoBERTa. We apply the same analytical strategy to the RoBERTa results in Table 3 and show the hyperparameters of the best trials in Table 9. For RoBERTa, we propose to fix the values of two hyperparameters at the same time: the warmup ratio and the hidden dropout. We denote the search space after fixing them as S_-wr-hdo. If HPO overfits in both S_full and S_-wr-hdo, the procedure reduces the search space to S_min, which contains the learning rate and batch size only.

Execution Results of the Procedure
In this section, we apply the troubleshooting procedure to the initial HPO results from Table 3 and observe the execution paths. In Table 10 and Table 11 of Appendix A.4, we list the full execution results of the procedure for random search and random search + ASHA. Tables 10 and 11 include only the tasks where HPO does not succeed in the initial study. In them, we show the validation and test accuracy of the three repetitions of HPO runs as well as their average score.
An Example of Executing the Procedure. In Table 4, we show an example of applying the procedure to random search for Electra on RTE. In round 0, the validation and test accuracies of all three repetitions are lower than grid search's. This implies RS needs more time budget, so we increase the budget (marked as ↑res) for RS from 1GST to 4GST. After the increase, overfitting is detected in the 1st repetition of round 1 (validation accuracy = 84.5, test accuracy = 74.6). We thus reduce the search space (marked as ↓space) from S_full to S_-wr. In round 2, the 1st repetition still shows (weak) overfitting: RS has the same validation accuracy as grid search (84.1), a smaller test accuracy (76.1), and a smaller validation loss (RS's validation loss = 0.8233, grid search's = 0.9517). We thus continue reducing the search space to S_min, and overfitting is detected again in the 1st repetition of round 3 (validation accuracy = 84.8, test accuracy = 75.3). After round 3, the search space cannot be further reduced, so we classify this case as 'HPO overfits task'.

Table 4: An example of executing the experimental procedure applied to random search for Electra on RTE. The grid search accuracy is denoted using the blue bold font. An HPO run is highlighted in dark grey if it overfits and in medium grey if it overfits weakly.
We analyze the execution results in Tables 10 and 11 jointly as follows.
Effects of Reducing the Search Space. From the two tables, we observe that reducing the search space can be effective for controlling overfitting. In WNLI (Electra), both algorithms outperform grid search after reducing the search space once. In WNLI (RoBERTa), ASHA outperforms grid search after reducing the search space twice. We observe a similar trend in MRPC (Electra), SST (Electra), RTE (RoBERTa), and CoLA (RoBERTa). However, for these cases, overfitting persists even after we reduce the search space twice, i.e., when using the minimal search space.
Effects of Increasing the Time Budget. Observing the cases of increased budget in Tables 10 and 11, we see that this strategy is generally effective for improving the validation accuracy. After increasing the time budget, in STS-B (Electra) all HPO methods outperform grid search's validation and test accuracy; in SST (Electra-RS) and CoLA (RoBERTa), HPO outperforms grid search in validation accuracy only. In RTE (Electra) and QNLI (Electra), however, the increase is not enough to bridge the gap with grid search, so HPO remains behind. For RTE (Electra), SST (Electra), QNLI (Electra), and CoLA (RoBERTa), overfitting occurs after increasing the time budget from 1GST to 4GST. After reducing the search space, we still observe overfitting in most cases.
Comparisons between RS and ASHA. Comparing the results of random search and ASHA in Tables 10 and 11, we find that before increasing the budget, RS rarely outperforms ASHA in validation accuracy; however, after the budget of both RS and ASHA increases to 4GST, the best validation accuracy of RS consistently outperforms ASHA's, i.e., in all of RTE (Electra), STS-B (Electra), SST (Electra), and QNLI (Electra). That is, the increase in the time budget leads to a more significant increase in validation accuracy for RS than for ASHA. This result may have two causes. First, at 1GST, ASHA already samples a larger number of trials (Appendix A.2), which may be sufficient to cover its search space; RS, on the other hand, cannot sample enough trials, so increasing the time budget is more helpful. Second, ASHA may make mistakes by pruning a good trial that shows bad performance at the beginning.

Summary of the Main Findings
In Table 5, we list the final execution results for each task for Electra and RoBERTa. Our main findings can be summarized as follows. After increasing the time budget and reducing the search space, HPO compares to grid search in the following ways: (1) in 3 cases (i.e., CoLA (Electra), STS-B (Electra), and MNLI (Electra)), HPO outperforms grid search using the full search space, where STS-B needs more budget; (2) in 4 cases (i.e., WNLI (Electra), WNLI (RoBERTa), MRPC (RoBERTa), and STS-B (RoBERTa)), HPO succeeds after reducing the search space; (3) in the other 7 cases, HPO cannot outperform grid search even after increasing the time budget and reducing the search space. This result shows that when searching in a continuous space surrounding the recommended grid configurations, it can be difficult for existing automated HPO methods (e.g., random search, ASHA, Bayesian optimization) to outperform grid search (with manually tuned grid configurations recommended by the language model) within a short amount of time; even if we can identify a configuration with a good validation score, most likely the test score is still worse than grid search's.

The execution of all experiments in Tables 10 and 11 took 6.8 days on 4 V100 GPUs. This is in contrast to the cost of enumerating all 5 factors in Section 3, which would be 16 days on 4 V100 GPUs.
A Caveat on the Results in Table 5. For all study results in this paper (i.e., Table 3, Table 10, and Table 11), we have repeated each HPO run three times. Therefore, if a case succeeds in Table 5, it is because no overfitting is detected in the 3 repetitions; if we ran more repetitions, the risk of overfitting could increase. In addition, all results are evaluated under transformers version 3.4.0 and Ray version 1.2.0. If these versions change, the results in Table 5 may change.
An Analysis of the Relation between Overfitting and the Train/Validation/Test Split. As overfitting indicates a negative correlation between the validation and test accuracy, one hypothesis is that overfitting is caused by a distribution difference between the validation and test sets. We thus compare HPO runs using the original GLUE split and a new split which uniformly partitions the train/validation/test data. The results can be found in Appendix A.5.
Related Work

Automated Hyperparameter Optimization
Hyperparameter optimization methods for generic machine learning models have been studied for a decade (Feurer and Hutter, 2019; Bergstra et al., 2011; Bergstra and Bengio, 2012; Swersky et al., 2013). Prior to that, grid search was the most common tuning strategy (Pedregosa et al., 2011). It discretizes the search space of the concerned hyperparameters and tries all the values in the grid, and it can naturally take advantage of parallelism. However, the cost of grid search increases exponentially with the number of hyperparameters. A simple yet surprisingly effective alternative is to use random combinations of hyperparameter values, especially when the objective function has a low effective dimension, as shown in (Bergstra and Bengio, 2012). Bayesian optimization (BO) (Bergstra et al., 2011; Snoek et al., 2012) fits a probabilistic model to approximate the relationship between hyperparameter settings and their measured performance, and uses this probabilistic model to decide where in the space to acquire the next function value, while integrating out uncertainty. Since the training of deep neural networks is very expensive, new HPO methods have been proposed to reduce the cost required. Early stopping methods (Karnin et al., 2013; Li et al., 2017, 2020) stop training unpromising configurations at a low fidelity (e.g., number of epochs) by comparing against other configurations trained at the same fidelity. Empirical study of these methods is mostly focused on vision or reinforcement learning tasks; there has been little work focusing on NLP models. ASHA was evaluated on an LSTM model proposed in 2014 (Zaremba et al., 2014). In (Wang et al., 2015), the authors empirically studied the impact of a multi-stage algorithm for hyperparameter tuning. In (Zhang and Duh, 2020), a look-up table was created for hyperparameter optimization of neural machine translation systems.
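The early-stopping idea behind methods such as ASHA can be illustrated with plain successive halving (shown synchronously for clarity; ASHA itself is asynchronous). The scoring function below is a toy stand-in for training at a given fidelity:

```python
def successive_halving(configs, evaluate, min_fidelity=1, eta=2, rounds=3):
    """Evaluate all configs at a low fidelity (e.g., epochs), keep the top
    1/eta, and repeat at eta times the fidelity. `evaluate(config, fidelity)`
    returns a validation score at that fidelity."""
    fidelity = min_fidelity
    survivors = list(configs)
    for _ in range(rounds):
        ranked = sorted(survivors, key=lambda c: evaluate(c, fidelity), reverse=True)
        survivors = ranked[: max(1, len(ranked) // eta)]  # prune the bottom half
        fidelity *= eta
    return survivors[0]

# Toy stand-in: score improves with fidelity and peaks at lr = 1e-4.
score = lambda lr, fid: fid - abs(lr - 1e-4) * 1e4
best = successive_halving([3e-5, 1e-4, 1.5e-4, 5e-5], score)
```

The pruning step is where the cost saving comes from, and also where the risk noted later in this paper arises: a trial that ranks poorly at low fidelity may have ranked well at full fidelity.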
In BlendSearch (Wang et al., 2021), an economical blended search strategy was proposed to handle heterogeneous evaluation cost in general, and its effectiveness was demonstrated in fine-tuning the transformer model Turing-NLRv2.7 Some existing work has addressed overfitting in HPO (Lévesque, 2018) or neural architecture search (Zela et al., 2020). For HPO, cross validation can help alleviate overfitting when tuning SVMs (Lévesque, 2018), but it is rarely applied in deep learning due to its high computational cost. For neural architecture search (Zela et al., 2020), the solution also cannot be applied to our case due to the difference between the two problems.

Fine-tuning Pre-trained Language Models
As fine-tuning pre-trained language models has become a common practice, existing works have studied how to improve the performance of the fine-tuning stage. Among them, many have focused on improving the robustness of fine-tuning. For example, ULMFit (Howard and Ruder, 2018) shows that an effective strategy for reducing catastrophic forgetting in fine-tuning is to use the slanted triangular learning rate scheduler (i.e., a non-zero number of warmup steps). Other strategies for controlling overfitting in fine-tuning include freezing a part of the layers to reduce the number of parameters and gradually unfreezing the layers (Peters et al., 2019), adding a regularization term to the fine-tuning objective (Jiang et al., 2020), and multi-task learning (Phang et al., 2018). Applying these techniques may reduce overfitting in our experiments; however, our goal is to compare grid search and HPO, and if these techniques are helpful, they are helpful to both. To simplify the comparison, we thus focus on fine-tuning the original model. Meanwhile, the performance of fine-tuning can differ significantly with different choices of the random seed (Dodge et al., 2020). To remove the variance from the random seed, we have fixed all random seeds to 42, although HPO could be used to search for a better random seed. Zhang et al. (2021) identify the instability of fine-tuning the BERT model in the few-sample cases of GLUE (i.e., RTE, MRPC, STS-B, and CoLA). Similar to our work, they also found that overfitting increases when searching over more trials; however, they did not compare grid search with HPO. There are also many discussions on how to control overfitting by tuning hyperparameters (in manual tuning), e.g., the learning rate (Smith and Le, 2017), batch size, and dropout rates (Srivastava et al., 2014; Loshchilov and Hutter, 2018, 2019), which may help with designing an HPO search space that overfits less.

7 msturing.org

Conclusions, Discussions and Future Work
Our study suggests that for the problem of fine-tuning pre-trained language models, it is difficult for automated HPO methods to outperform manually tuned grid configurations with a limited time budget. However, it is possible to design a systematic procedure to troubleshoot and improve HPO's performance. We find that setting the search space appropriately per model and per task is crucial. Automating that setting for different models and tasks would be beneficial for achieving the goal of automated HPO for fine-tuning. For example, one may consider automatically mining the patterns in Tables 8 and 9 to identify the hyperparameters that likely cause overfitting. Further, for the tasks that remain unsuitable for HPO, other means of reducing overfitting are required. One possibility is to optimize a different metric during HPO as a less overfitting-prone proxy of the target metric on test data. Previous work has shown that the random seed is crucial to the performance of fine-tuning (Dodge et al., 2020). Fine-tuning also benefits from ensembling or selecting a few of the best-performing seeds (Liu et al., 2019). It would be interesting future work to study HPO's performance when the random seed is added to the search space.
In our study, the simple random search method stands strong against the more advanced BO and early stopping methods. This suggests room for researching new HPO methods specialized for fine-tuning: a method that can robustly outperform random search with a small resource budget would be useful.
It is worth mentioning that although we find HPO sometimes underperforms grid search, the grid search configurations we study are the default ones recommended by the pre-trained language models for fine-tuning, and therefore may already be extensively tuned. We cannot conclude that HPO is unhelpful when such manual tuning has not been done. How to leverage HPO methods in that scenario is an open question.

A.1 HPO Checkpoint Settings
In this paper, we report the validation and test accuracy of the best checkpoint (in terms of validation accuracy) of the best trial instead of the last checkpoint of the best trial. While the default setting in Ray Tune uses the last checkpoint, when fine-tuning pre-trained language models without HPO, the best checkpoint is more widely used than the last checkpoint. To further study the difference between the two settings, we compare the validation and test accuracy of grid search under both settings using Electra on three tasks: WNLI, RTE, and MRPC. The results show that the validation and test accuracy of the best checkpoint of the best trial are both higher than those of the last checkpoint of the best trial. We therefore propose and advocate reporting the best checkpoint of all trials when using HPO to fine-tune pre-trained language models. The checkpoint frequencies in our experiments are set to 10 per epoch for the larger tasks (SST, QNLI, and MNLI) and 5 per epoch for the smaller tasks (WNLI, RTE, MRPC, CoLA, and STS-B); the lower frequency for the smaller tasks reduces the performance drop caused by frequent I/O within a short time.
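The reporting rule described above (best checkpoint by validation accuracy rather than the last checkpoint) amounts to the following selection; the accuracy values are made up for illustration:

```python
def best_checkpoint(checkpoints):
    """Pick the checkpoint with the highest validation accuracy.
    `checkpoints` is a list of (step, val_acc, test_acc) tuples."""
    return max(checkpoints, key=lambda ckpt: ckpt[1])

# Illustrative trial: validation accuracy peaks mid-training, then degrades,
# so the last checkpoint is not the best one.
trial = [(100, 0.81, 0.79), (200, 0.86, 0.83), (300, 0.84, 0.82)]
step, val_acc, test_acc = best_checkpoint(trial)  # selects the step-200 checkpoint
```

When validation accuracy is non-monotonic over training, the two reporting rules can differ substantially, which is why the choice matters for comparing HPO methods.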

A.2 Number of Trials Searched by HPO
In Table 6, we show the number of trials searched by each HPO algorithm in the initial comparative study (Table 3).

A.3 Choosing the Hyperparameter to Fix
The hyperparameters of the best trials in the overfitting runs are shown in Table 8 and Table 9. We use colors to denote the comparison with the hyperparameter value in grid search: dark grey if the value is higher than in grid search; light grey if the value is lower.

A.4 Execution Results of Procedure
In Table 10 and Table 11, we show the execution results of applying the experimental procedure to Electra and RoBERTa, respectively.

A.5 An Analysis on Overfitting and Train/Validation/Test split
In this paper, we have observed that HPO tends to overfit when the number of trials/time budget increases; in other words, the higher the validation score, the lower the test score. One hypothesis for this phenomenon is that the validation set has a different distribution than the test set. Since GLUE is a collection of NLP datasets from different sources, it is unclear whether the validation and test sets of all GLUE tasks share the same distribution. To observe whether HPO still overfits under a uniformly random split, we perform the following experiment: we merge the training and validation folds of QNLI in GLUE, randomly shuffle the merged data, and re-split it into train/validation/test with the proportion 8:1:1. We then run random search, rank all trials by validation accuracy, and examine the Pearson correlation coefficient between the top-4 trials' validation and test accuracies.

Table 11: The execution results of applying the procedure to RoBERTa. Each task's grid search accuracy is denoted using the blue bold font. An HPO run is highlighted in dark grey if it overfits and in medium grey if it overfits weakly. The average of 3 repetitions is highlighted in light grey if it outperforms grid search's validation and test accuracy. For STS-B we only report the Spearman correlation; for MRPC we only report the accuracy.
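The correlation check in this analysis can be reproduced with a small helper. The accuracy values below are made up for illustration and are not the paper's measurements:

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Made-up top-4 trials, ranked by validation accuracy: a negative coefficient
# between the two columns would indicate the overfitting pattern discussed above.
val_acc  = [0.934, 0.931, 0.930, 0.928]
test_acc = [0.905, 0.911, 0.913, 0.916]
r = pearson(val_acc, test_acc)
```

In this illustrative case, validation accuracy decreases while test accuracy increases across the ranked trials, so the coefficient is negative.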