How Predictable Are Large Language Model Capabilities? A Case Study on BIG-bench

We investigate the predictability of large language model (LLM) capabilities: given records of past experiments using different model families, numbers of parameters, tasks, and numbers of in-context examples, can we accurately predict LLM performance on new experiment configurations? Answering this question has practical implications for LLM users (e.g., deciding which models to try), developers (e.g., prioritizing evaluation on representative tasks), and the research community (e.g., identifying hard-to-predict capabilities that warrant further investigation). We study the performance prediction problem on experiment records from BIG-bench. On a random train-test split, an MLP-based predictor achieves an $R^2$ score greater than 95%, indicating the presence of learnable patterns within the experiment records. We then formulate the problem of searching for "small-bench," an informative subset of BIG-bench tasks from which the performance on the full set can be maximally recovered. We find a subset as informative as BIG-bench Hard for evaluating new model families, while being $3\times$ smaller. Additionally, we find competitive subsets by clustering task representations learned by our MLP-based predictor and selecting tasks close to cluster centroids, highlighting the importance of task diversity in constructing "small-bench."


Introduction
Large language models (LLMs) have revolutionized natural language processing (NLP) research. Typically, when researchers introduce a new set of LLMs, they release them in various sizes and conduct extensive evaluation on different tasks, while also considering different experiment configurations, such as prompting strategies and the number of in-context examples (Black et al., 2021; Zhang et al., 2022; Touvron et al., 2023).1 Given the combinatorially large space of possible experimental configurations, running all possible experiments for a new set of LLMs is impractical. This begets a critical question: to what extent can we predict the capabilities of an LLM in a given experimental setting?

1 Code can be found at https://github.com/INK-USC/predicting-big-bench.

Figure 1: Overview. We study the problem of (1) predicting LLM performance on new experiment configurations; (2) searching for a subset of tasks which is most informative for predicting performance on remaining tasks when evaluating a new model family.
Studying this problem helps address various practical issues. For LLM users, a performance prediction model could offer guidance for experiment design and decision-making by answering questions such as, "What model scale and how many shots are necessary to attain satisfactory performance for my task?" For LLM developers and the research community, a performance prediction model could lead to insights into LLM capabilities by identifying which capabilities are hard to predict and require further investigation, and which capabilities are highly correlated and may be deprioritized during evaluation to save budget.
We investigate the predictability of LLM capabilities on the BIG-bench (Srivastava et al., 2023) evaluation suite, as it includes a vast collection of experiment records. BIG-bench is a collaborative initiative aimed at "prob[ing] large language models and extrapolat[ing] their future capabilities." It has extensively evaluated various state-of-the-art LLMs on a diverse set of tasks contributed by the community. We gather and carefully filter these records, yielding a total of 56k+ records which we use as the "dataset" for our analysis.
We first formulate the problem of performance prediction given experiment configurations such as model family, model scale, task, and the number of in-context examples used. We compare various matrix completion, tree-based, and neural network methods in a random train-test split scenario (§3). Further, we design and experiment with various data splits, representing different types of generalization challenges, to simulate practical situations researchers may face (§4).
We then consider the problem of searching for "small-bench," a compact and informative subset of BIG-bench. This subset should allow for maximum recovery of performance on the complete BIG-bench, enabling efficient evaluation of new LLMs while maintaining evaluation generality. We formulate this as a subset search problem (§5) and empirically compare various search methods and clustering-based subset construction methods, along with widely-adopted subsets such as BIG-bench Lite and BIG-bench Hard (Suzgun et al., 2023).
Our key findings are summarized as follows:
1. LLMs' performance on BIG-bench follows predictable patterns. In the default random train-test split scenario, our best predictor, an MLP model, achieves an RMSE lower than 0.05 (i.e., on average mis-predicts by < 0.05 when the range is [0, 1]) and an R^2 greater than 95% (i.e., explains more than 95% of the variance in the target variable).
2. The predictor's performance depends on the assumptions about the train-test distribution. In a more challenging setting where we hold out the Cartesian product of complete model families (all model scales) and complete tasks (all numbers of shots), the predictor's performance decreases (R^2: 95% → 86%).
3. Performance on emergent tasks (Wei et al., 2022a) is not entirely unpredictable. In general, performance on emergent tasks is harder to predict than on non-emergent tasks. In specific scenarios (e.g., when a related emergent task is present in the training set) our model can accurately predict emergent abilities.
4. BIG-bench Lite and BIG-bench Hard (Suzgun et al., 2023), two subsets of BIG-bench commonly used for evaluating new models, are suboptimal if the goal is to recover the performance on remaining tasks. We are able to find a subset that is as informative as BIG-bench Hard while being 3× smaller by using randomized search.
5. Task diversity and task value are critical factors in constructing "small-bench." By clustering task representations learned by the MLP-based predictor and selecting tasks close to cluster centroids, we obtain competitive "small-bench" candidates. This strategy is further improved by incorporating task value information.

Related Work
Scaling Laws and Emergent Abilities. Pre-training scale is critical to language model capabilities. Research on scaling laws (Kaplan et al., 2020; Rae et al., 2021; Hoffmann et al., 2022) aims to characterize the relationship between pre-training compute, corpus size, model size, and the test log-likelihood loss. Our work can be loosely considered as an extension to scaling laws, with three notable distinctions: (1) we focus on predicting downstream task performance; (2) we use model scale along with other experiment configuration information; (3) we mainly experiment with machine learning methods instead of explicit power laws. In the same vein, recent work has studied the effect of scale in a "pre-train then fine-tune" paradigm (Tay et al., 2022) and has explored non-monotonic scaling laws for complex scaling behaviors (Caballero et al., 2023). Another important observation about scale is that very large language models exhibit emergent abilities (Wei et al., 2022a), which are described as "unpredictable." In this work we empirically examine this claim and quantify the prediction errors under various assumptions.
Benchmarking for LLMs. Along with the development and scaling of LLMs, there are continuing efforts to create benchmarks that assess the capabilities of these models. One general trend for these benchmarks is transitioning from single-task (Bowman et al., 2015; Rajpurkar et al., 2016), to multi-task (Wang et al., 2018, 2019), and finally to massively multi-task (Hendrycks et al., 2021; Srivastava et al., 2023). However, due to budget or API constraints, models are typically evaluated on only a subset of the full range of available benchmarks. The selection is often made arbitrarily by the models' developers, making it challenging to compare models in a fair and holistic way (see Liang et al. 2023, Fig. 4). In response to this issue, we study the "small-bench" problem and hope it offers insights on efficient benchmarking of LLMs.
Performance Prediction. NLPERF (Xia et al., 2020) is a pilot work on performance prediction in NLP, focusing on bilingual and cross-lingual tasks. It demonstrates the potential of selecting an informative subset of tasks for evaluation, which inspired our work on searching for "small-bench."

Performance Prediction on BIG-bench

Problem Definition
In this section, we focus on the problem of learning from experiment records of large language models, and predicting the performance of an unseen combination of experiment settings.
Notations. We use L to denote the model families in consideration (e.g., PaLM, GPT-3). We use T to denote the diverse collection of tasks in consideration (e.g., 2-digit subtraction, emoji_movie). Formally, an experiment record is defined by the following values:
• Model family l ∈ L
• Number of parameters n_param
• Task t ∈ T
• Number of in-context examples n_shot
• Normalized performance y ∈ [0, 1]
Our goal is to learn a regression model f that predicts ŷ based on (l, n_param, t, n_shot).
Data Splits. We obtain a large set of experiment records D = {(l, n_param, t, n_shot, y)} and split them into three non-overlapping subsets, D_train, D_dev, D_test. By default, we use random splitting and adopt 10-fold cross validation. In subsequent sections of this paper, we also use other splitting strategies for controlled analysis (§3.4 and §4).
Evaluation. We report root mean square error (RMSE) and the coefficient of determination ($R^2$) on the test set D_test. RMSE is defined as $\sqrt{\frac{1}{|D_{test}|}\sum_{(l, n_{param}, t, n_{shot}, y) \in D_{test}} (y - \hat{y})^2}$. In the main paper, we focus on RMSE and $R^2$ scores as they are widely accepted evaluation metrics for regression problems. In Appendix C.1 we introduce two alternative metrics based on rank correlation and discuss our findings.
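As a concrete reference, the two metrics can be computed as follows. This is a minimal sketch using NumPy; the function and variable names are ours, not from the released code.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean square error over a set of experiment records."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return float(1.0 - ss_res / ss_tot)

# Small check on toy values in the [0, 1] performance range.
y, y_hat = [0.2, 0.4, 0.6, 0.8], [0.21, 0.39, 0.62, 0.78]
```

Note that $R^2$ compares the squared error against the variance of the targets, so a constant predictor at the target mean scores exactly 0.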

Data
We construct our dataset D from the BIG-bench repository. We design a series of filtering criteria (e.g., excluding tasks where all models have 0 accuracy, excluding tasks with <100 examples), which are detailed in Appendix A. After filtering, our dataset has 56,143 experiment records. We list high-level statistics about this dataset in Table 1.
We would like to highlight that this dataset covers diverse tasks and models. According to Srivastava et al. (2023), tasks in BIG-bench cover "problems from linguistics, childhood development, math, common-sense reasoning, biology, physics, social bias, software development, and beyond." We refer the readers to Fig. 3.

Matrix Completion. We reduce the experiment records to a 2D user-item matrix, where each model serves as a "user" and each task setting as an "item." (b) Compared to (a), an additional vector q_u is learned for each user u, and p_i for each item i. The term q_u^⊤ p_i is expected to model the interaction between user u and item i. (c) Model-model kNN: first find the top k models most similar to u, then aggregate the performance of these k models on item i using weighted averaging. (d) Task-task kNN: similar to model-model kNN, but find similar tasks and aggregate performance on those tasks instead.
Trees. We use two common tree-based methods that can directly learn to make predictions ŷ from the input (l, n_param, t, n_shot): (e) Random Forest (Breiman, 2001) and (f) Gradient Boosted Trees. We additionally train (g) an MLP that takes the featurized configuration (see Appendix B) as input and directly predicts ŷ.
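As an illustration of how such predictors can be trained, here is a minimal scikit-learn sketch. The features and targets below are synthetic stand-ins for the featurized (l, n_param, t, n_shot) records, not the paper's actual data, and the hyperparameters are ours.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
X = rng.random((500, 10))          # stand-in featurized configurations
y = 0.5 * X[:, 0] + 0.3 * X[:, 1] + rng.normal(0.0, 0.02, 500)  # stand-in performance

models = {
    "random_forest": RandomForestRegressor(n_estimators=100, random_state=0),
    "gbt": GradientBoostingRegressor(random_state=0),
    "mlp": MLPRegressor(hidden_layer_sizes=(64, 64), max_iter=2000, random_state=0),
}
for name, model in models.items():
    model.fit(X[:400], y[:400])                        # fit on a "train" slice
    print(name, round(model.score(X[400:], y[400:]), 3))  # held-out R^2
```

All three regressors consume the same flat feature vectors, which is why a single featurization step (Appendix B) suffices for the whole model family.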

Results and Analysis
Trees and MLPs achieve strong performance.
We experiment with the prediction models mentioned above and present their performance in Fig. 2. (1) Tree-based methods and the MLP outperform matrix completion methods by a large margin. We hypothesize that the 2D user-item simplification may cause loss of information on the input space. For example, the value of n_param is merely used to distinguish different "users," and does not contribute to the computation of ŷ directly. (2) Gradient boosted trees and the MLP are the strongest among all compared models, both achieving RMSE < 0.05 (i.e., on average mis-predicting by < 0.05) and R^2 > 0.95 (i.e., more than 95% of the variance in y is explained). This suggests that learnable patterns exist in LLM experiment records: LLM performance on unseen configurations can be predicted to a considerable extent in the current setting.
Performance varies on different test groups (Fig. 3). To gain a more fine-grained understanding of the predictions, we group D_test examples according to features such as n_shot, n_param, and model family l, and then compute R^2 on each of these test groups. We use the MLP model predictions and present the results in Fig. 3 using dark blue bars. In terms of n_shot, we find that it is harder to predict zero-shot performance than 2- or 3-shot performance. In terms of the model family l, we believe the three BIG-G models (T=0, T=1, sparse) are easier to predict because their pre-training pipelines are similar. For n_param, we group all models into four buckets; for example, bucket 1 contains the smallest 25% of models. We observe a trend that the performance of larger models is harder to predict. We also group D_test according to whether the task t is an emergent task (see Appendix E in Wei et al. 2022a). Our predictor achieves an R^2 score of 0.94 on the emergent group and 0.95 on the non-emergent group. This suggests that, in general, emergent abilities are indeed harder to predict.
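This per-group analysis is mechanical: compute R^2 separately within each slice of the test set. A toy sketch with pandas (column names and values are illustrative, not from the released records):

```python
import pandas as pd
from sklearn.metrics import r2_score

test = pd.DataFrame({
    "n_shot": [0, 0, 0, 2, 2, 2],
    "y":      [0.10, 0.30, 0.50, 0.20, 0.60, 0.90],
    "y_hat":  [0.15, 0.26, 0.48, 0.22, 0.57, 0.88],
})

# R^2 computed separately within each n_shot group.
per_group = {
    shot: r2_score(g["y"], g["y_hat"])
    for shot, g in test.groupby("n_shot")
}
print(per_group)
```

Because the variance denominator differs per group, group-wise R^2 values are not directly comparable across groups, a caveat the Limitations section returns to.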
Multi-group training is helpful (Fig. 3). We further conduct a set of controlled experiments by only training on examples from the test group of interest (e.g., examples with n_shot = 0). We name these single-group experiments, as opposed to the multi-group experiments done in previous sections, where the predictor is trained on all groups. Notably, in all settings, multi-group R^2 is always larger than single-group R^2. There are limited observations of Gopher models in the training set, and they benefit from multi-group learning significantly (R^2 increases from 0.74 to 0.87). This reaffirms the claim that LLM performance exhibits shared patterns across various settings.

Some groups benefit more from multi-group training; some are intrinsically harder to predict (Fig. 3). Our controlled experiments also allow us to distinguish between two factors: whether a group is intrinsically harder to predict, or whether it benefits more from multi-group learning. In Fig. 3(a), results suggest that the n_shot = 0 group is not necessarily harder to predict than n_shot = 2 and n_shot = 3 on its own (indicated by the single-group bars), but the latter two benefit more from multi-group learning. Typically, when evaluating LLMs on a task, there is a huge performance boost when going from zero-shot to one-shot, and the performance improves more stably when more shots become available. It is easier to predict 3-shot performance when given 0-, 1-, and 2-shot performance than to predict 0-shot performance when given 1-, 2-, and 3-shot performance. This partly explains why the 0-shot group does not benefit much from multi-group training.
In Fig. 3(b), BIG-G T=0 and BIG-G T=1 benefit from multi-group learning more than the GPT-3 model family, resulting in higher R^2 scores in the multi-group setting. In Fig. 3(c), we observe that larger models tend to be intrinsically more challenging to predict. This observation is more significant when the single-group training set size is controlled to be 1000, where we observe a clear trend that groups consisting of larger models achieve lower R^2 scores.
Examining the most and least predictable tasks (Fig. 4). We further group D_test examples according to the task t they belong to, to identify the most and least predictable tasks. The five most predictable tasks (Fig. 4 Top) include qa_wikidata and linguistic_mappings, tasks marked as having high linearity7 in their scaling behavior (Srivastava et al. 2023, Sec 3.4). This observation is reasonable because high linearity typically implies predictability. However, modified_arithmetic, a task marked as having high breakthroughness (Srivastava et al., 2023) and emergence (Wei et al., 2022a), is also considered highly predictable in our setting. Our hypothesis is that the predictor is able to infer this by learning from experiment records with similar configurations: if breakthroughness is observed with other models or other numbers of shots, the trained predictor is able to infer the trend for a new experiment configuration. We believe it is still challenging to predict breakthroughness/emergence in more restricted settings, e.g., based on task meta-data or input text alone.
For the five tasks with the lowest R^2 scores (Fig. 4 Bottom), we manually examined their scaling behavior (Fig. 10) and indeed found these curves to be surprising. For future work, it will be interesting to investigate the underlying reasons, and to identify common characteristics shared among these tasks.

Creating Challenging Train-Test Splits
Previously, we randomly split D into D_train, D_dev, and D_test, i.e., we randomly sampled (l, n_param, t, n_shot) combinations. This is a relatively easy setting; for example, when the model family l, number of parameters n_param, and task t are kept the same, it may be easy for a model to predict performance at n_shot = 2 when the records for n_shot = 1 and n_shot = 3 appear in D_train. A more challenging data split would ensure that the combinations of (l, n_param, t) in the test set are completely unseen in D_train, so that the model is required to predict for all possible n_shot values. Taking it a step further, one may want to make predictions on an unseen configuration of (l, t) for all possible values of n_param and n_shot.

7 Linearity measures "the extent to which performance improves reliably with scale;" breakthroughness measures "the extent to which a model is able to learn a task only once the model grows beyond a critical scale." See Srivastava et al. (2023) Appendix B for the formal definitions.

Train-Test Split Settings
To simulate these use cases, we design and compare model performance on three additional settings (L2.1, L2.2, L3).
• L1: Random (l, n_param, t, n_shot) split, used in §3
• L2.1: Group by (l, n_param, t)
• L2.2: Group by (l, t, n_shot)
• L3: Group by (l, t)

For example, in L3, we first group all experiment records in D according to (l, t), then create D_train, D_dev, and D_test by randomly splitting the groups.
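The group-based splits above can be sketched with scikit-learn's `GroupShuffleSplit`, shown here for L3; the record fields and values are illustrative stand-ins, not the paper's data.

```python
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

records = pd.DataFrame({
    "l":      ["PaLM", "PaLM", "GPT-3", "GPT-3", "PaLM", "GPT-3"],
    "t":      ["task_a", "task_a", "task_a", "task_b", "task_b", "task_b"],
    "n_shot": [0, 3, 0, 0, 1, 2],
    "y":      [0.4, 0.6, 0.3, 0.5, 0.7, 0.6],
})
groups = records["l"] + "|" + records["t"]   # one group per (l, t) combination

splitter = GroupShuffleSplit(n_splits=1, test_size=0.25, random_state=0)
train_idx, test_idx = next(splitter.split(records, groups=groups))

# Every (l, t) combination lands entirely in train or entirely in test.
assert set(groups.iloc[train_idx]).isdisjoint(set(groups.iloc[test_idx]))
```

The same pattern yields L2.1 and L2.2 by changing which columns define the group key.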
Additionally, we make the L3 setting even more challenging by holding out one entire subset, L_test × T_test. Specifically, we first select L_test, a subset of model families ⊆ L, and T_test, a subset of tasks ⊆ T. After this, D_test is defined as {(l, n_param, t, n_shot, y) | l ∈ L_test, t ∈ T_test}. This corresponds to a practical scenario where a new model family is developed, and researchers want to take a "sneak peek" at the full picture of its capabilities: evaluate on a subset of tasks (i.e., T_train = T \ T_test), and predict the model's performance on the remaining tasks (i.e., T_test). We refer to this as the "L3 Composition" setting.

Results and Analysis
Main Results. Results for four representative models in these settings are visualized in Fig. 5. We observe that as the settings become more challenging (L1 → L2 → L3 → L3 Composition), performance gradually decreases and the standard deviation increases. Another important observation is that though the MLP and gradient boosted trees are comparable in the L1 setting, MLPs are less sensitive to the increased difficulty (both the performance decrease and the standard deviation are smaller).
Sample Prediction Results in L3. In Fig. 7 we visualize predictions on four sample (l, t) combinations by the MLP model: two achieving high R^2 scores and two achieving low R^2 scores. We have two high-level observations: (1) Predictions are more accurate on (l, t) combinations which have observations on similar tasks t′ or similar model families l′ in D_train. (2) Over-estimation is a common type of mistake made by our trained prediction models. We observe several cases of "false positives" of emergent abilities. Due to space limits, we defer further discussion to §C.2.

Searching for "small-bench"
There has been a recent emphasis on assessing the generality of large language models, which involves evaluating these models on numerous tasks and scenarios. However, it will be extremely expensive to conduct all these experiments every time a new model is developed in the future. Extending from the held-out L_test × T_test setting in §4.1, in this section we formulate and study the problem of searching for "small-bench": Can we find a subset of BIG-bench tasks, such that when a new model family is evaluated on it, the performance on the remaining tasks can be maximally recovered? In the following we give a formal definition of this problem (§5.1), construct "small-bench" candidates using different search algorithms and strategies (§5.2), and present our findings (§5.3).

Problem Definition
Our goal is to find T_train, a subset of all tasks T, that is selected and used for evaluating new model families L_test. We use b to represent the evaluation budget, i.e., |T_train| = b. We use T_test = T \ T_train to denote the tasks whose performance we wish to recover. The problem of finding the optimal T_train^{(b)*} can be formulated as

$T_{train}^{(b)*} = \arg\max_{T_{train} \subseteq T,\, |T_{train}| = b} R^2(T_{test} \times L_{test})$,

where the R^2 score is obtained when a predictor is trained on the remaining experiment records, as previously done in §4.
Evaluation. Ideally the optimal T_train should allow us to predict the performance of any new model family, without overfitting to a specific held-out model family. To evaluate this, we adopt nested cross-validation on L during evaluation of a selected T_train. Specifically, given that |L| = 6, we create 6 × 5 = 30 different ways to hold out one model family as L_dev and one model family as L_test. We then train 30 prediction models and report the average of the 30 R^2(T_test × L_test) scores.
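The 30-fold scheme is simple to enumerate: every ordered pair of distinct model families serves once as (L_test, L_dev). A sketch, using the six family names mentioned in the paper (the exact spellings here are our paraphrase of Table 2):

```python
from itertools import permutations

families = ["BIG-G T=0", "BIG-G T=1", "BIG-G sparse", "GPT-3", "Gopher", "PaLM"]

# Each ordered pair holds out one family for test and a different one for dev.
folds = list(permutations(families, 2))
assert len(folds) == 30   # 6 choices for L_test x 5 remaining choices for L_dev
```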
Baselines. (a) BIG-bench Lite (Srivastava et al., 2023): a subset of BIG-bench for cheaper evaluation, proposed in the original BIG-bench paper; |T_train| = 42 for BIG-bench Lite. (b) BIG-bench Hard (Suzgun et al., 2023): a subset of BIG-bench containing challenging tasks that cannot be solved with direct in-context learning but can be improved with chain-of-thought prompting (Wei et al., 2022b).

Search-based. Best of 5000: for each budget b, randomly sample 5000 candidate T_train sets and select the one achieving the highest R^2(L_dev × T_test).
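The Best of 5000 procedure amounts to randomized subset search. A minimal sketch, where `score_fn` is a toy placeholder standing in for the expensive step of training a predictor and computing R^2(L_dev × T_test):

```python
import random

def best_of_n(tasks, b, score_fn, n_trials=5000, seed=0):
    """Sample n_trials subsets of size b; keep the highest-scoring one."""
    rng = random.Random(seed)
    best_subset, best_score = None, float("-inf")
    for _ in range(n_trials):
        candidate = frozenset(rng.sample(tasks, b))
        score = score_fn(candidate)      # stand-in for R^2(L_dev x T_test)
        if score > best_score:
            best_subset, best_score = candidate, score
    return best_subset, best_score

tasks = [f"task_{i}" for i in range(30)]
toy_score = lambda s: sum(int(t.split("_")[1]) % 2 == 0 for t in s)  # toy objective
subset, score = best_of_n(tasks, b=8, score_fn=toy_score, n_trials=500)
```

With a real scoring function, each trial requires retraining the predictor, which is why the paper restricts the search to a single fixed fold.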
Note that these search algorithms optimize R^2(T_test × L_dev) during search, to ensure that T_test × L_test is held out for evaluation. Additionally, to make the search computationally tractable, we only use 1 fixed fold from the 30 folds during search. We discuss the impact of these experimental decisions in Appendix C.5.
Clustering-based. We hypothesize that a good "small-bench" should be diverse (covering the task space comprehensively while avoiding redundancy by excluding similar tasks) and representative (each selected task providing informative insights for recovering the performance of other tasks). To validate this, we use the following methods to construct T_train. (f) k-means: We extract the task representations learned by our MLP models in §3. We apply k-means clustering to these representations, group them into b clusters, and then select the task closest to the centroid of each cluster. (g) k-means + Task Value: We first calculate the task value for each task in T by aggregating its contributions from the Best of 5000 search history. For example, if a task is present in 20 trials out of the 5000, its task value will be the average of the R^2 scores from those 20 trials. We then incorporate this information into k-means clustering, by selecting the task closest to the centroid among the tasks that are in the top 25% by value globally.
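The k-means construction can be sketched as follows. The embeddings here are random stand-ins for the task representations learned by the MLP, and the names are ours:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
task_names = [f"task_{i}" for i in range(50)]
task_emb = rng.normal(size=(50, 16))     # stand-in learned task representations

b = 8                                    # evaluation budget |T_train|
km = KMeans(n_clusters=b, n_init=10, random_state=0).fit(task_emb)

# For each cluster, select the task nearest its centroid.
selected = []
for c in range(b):
    members = np.where(km.labels_ == c)[0]
    dists = np.linalg.norm(task_emb[members] - km.cluster_centers_[c], axis=1)
    selected.append(task_names[members[np.argmin(dists)]])
```

The task-value variant only changes the candidate pool: restrict `members` to tasks in the top 25% by value before taking the nearest-to-centroid one.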

Results and Discussion
We visualize the results of all compared methods in Fig. 6, and make the following key observations. BIG-bench Hard and Lite are sub-optimal for recovering performance on remaining tasks. An 8-task subset found by Best of 5000 and randomly-sampled 16-task subsets can match the 24-task BIG-bench Hard for this goal. We further examine the results and find several cases where BIG-bench Hard fails to represent the complete BIG-bench. For example, according to the full BIG-bench performance recovered from BIG-bench Hard, BIG-G T=1 2B is better than GPT-3 Large; however, according to the ground-truth BIG-bench performance, GPT-3 Large is better than BIG-G T=1 2B, which is captured more accurately when a 24-task small-bench candidate is used. See Table 3 for the details.
It is important to note that BIG-bench Hard was not specifically designed for our goal, and thus is not expected to be competitive on our problem. Yet it is surprising that it underperforms randomly-sampled subsets. As a general recommendation for evaluating newly-developed models, we suggest using the T_train subsets found by solving the optimization problem in §5.1. If there is a specific evaluation goal in mind (e.g., focusing on the frontier, as in the case of BIG-bench Hard), T_train should still be manually selected.
Greedy search is unstable and finds sub-optimal solutions. Search algorithms consistently outperform randomly sampled T_train sets; however, greedy search appears to be unstable, with occasional performance drops as the budget increases. Furthermore, at b = 42, it underperforms the Best of 5000 approach. We include additional results on other search algorithms, including beam search and simulated annealing, in §C.4, where we observe similar instability. One possible explanation is the complexity of the search space, where the greedy search algorithm cannot guarantee finding the optimal solution. The gaps between the search objective (L_dev × T_test in one fold) and the evaluation objective (L_test × T_test in 30 folds) could also contribute to this issue (§C.5).
Task diversity and task value are important factors for constructing "small-bench." Firstly, k-means is comparable to or surpasses Best of 5000, even though it is not explicitly optimized for the R^2 objective. This supports the notion that diversity is an important factor in constructing "small-bench." This finding also suggests that the MLP models for performance prediction produce meaningful task representations as a side product. Secondly, k-means + Task Value is comparable to or outperforms k-means, confirming that task value is another important factor for constructing "small-bench," complementing the diversity aspect.

Conclusion and Future Work
In this work, we began with the question, "How predictable are large language model capabilities?" and conducted a detailed case study on BIG-bench. We first formulated the machine learning problem of predicting performance given configurations such as model family, the number of parameters, task, and the number of in-context learning examples. Our strongest prediction model achieves an R^2 score greater than 95%, which suggests that past LLM experiment observations can be used to predict the performance of new experiment configurations. To address the problem of increasing evaluation cost on massively multi-task benchmarks, we introduced the problem of searching for an informative "small-bench." Results suggest that popular subsets such as BIG-bench Lite and BIG-bench Hard are not optimal for this purpose. Instead, subsets characterized by diversity and high task values offer competitive "small-bench" candidates, highlighting the importance of these two factors.
In closing, while our study primarily focused on the predictability of LLM capabilities, we hope to initiate discussions on the following broader topics.
Rethinking LLM Evaluation. Currently, there is a lack of consensus regarding evaluation practices for newly developed LLMs. Oftentimes new LLMs are evaluated on different sets of selected tasks, making it hard to compare different models and quantify the progress in LLM development. Moreover, task selection is often heuristic, following past practices, or chosen arbitrarily without principled justification. We anticipate more active discussion on establishing evaluation practices that assess LLM capabilities efficiently, reliably, and rigorously, and we hope our work provides useful insights towards this. Related to our efforts on searching for "small-bench," Perlitz et al. (2023) investigate the impact of benchmarking options on the trade-off between computational efficiency and reliability, and develop Flash-HELM, an efficient alternative to HELM (Liang et al., 2023). Vivek et al. (2023) propose Anchor Point Selection to select representative examples in the test set and reduce evaluation cost at the instance level.
Broadening observations on the LLM capability landscape. Complementary to BIG-bench, several ongoing initiatives, such as HELM (Liang et al., 2023), the Open LLM Leaderboard, and the EleutherAI LM Harness, are dedicated to systematically evaluating existing LLMs. Integrating insights from these initiatives into future work has the potential to enhance the accuracy of LLM performance prediction and deepen our understanding of LLM capabilities. Additionally, it would be intriguing to take into account recent advances such as chain-of-thought prompting (Wei et al., 2022b) and instruction tuning (Sanh et al., 2022; Ouyang et al., 2022), and systematically measure their effects on LLM capabilities.

Limitations
Limited to BIG-bench results. We choose BIG-bench for our study due to its extensive collection of experiment records. Though it offers considerable diversity in terms of tasks and models, several limitations exist. (1) Tasks: It is important to note that BIG-bench tasks are sourced from the research community and may not accurately reflect the actual distribution of tasks encountered by LLMs in real-world scenarios. Therefore, our study has limitations in terms of generalizing our conclusions to the real-world task distribution. (2) Models: Though we have made every effort to incorporate as many model families as possible, there are only 6 model families in our experiment record dataset derived from BIG-bench. Such scarcity introduces instability and increases the difficulty of investigating the "small-bench" problem.
Limited to publicly-available LLM meta-data. LLM capabilities depend on many factors beyond the model family l and number of parameters n_param used in this study. Factors such as pre-training stability, convergence, and pre-training corpus composition all play important roles. However, we often do not have access to this information. In this work, we assume that the input features (l, n_param) can capture such information implicitly during training. In the future, we believe our method can be expanded to include additional pre-training meta-data when it becomes available.
Limited to interpolation settings. Our experiments mainly concentrate on interpolation settings, where the combinations of (l, n_param, t, n_shot) are new in the test set, but each individual input element is seen at least once in the training set. As LLMs continue to grow in size, a very important aspect is predicting the performance of larger models in an extrapolation setting. We present some preliminary findings in this setting in §C.3.
Limitations in evaluation metrics. We use RMSE and R^2 as they are widely-used metrics for regression tasks. However, both metrics have their limitations for our problem, especially in the context of conducting group-wise comparisons (§3.4). RMSE does not account for the variance in the target variable: a low RMSE value for a task may be solely due to the fact that the task performance is relatively insensitive to different experiment settings. On the other hand, while the R^2 score accounts for variance, it creates discrepancies when conducting group-wise comparisons since the denominator used to compute R^2 differs for each group. To get a more comprehensive picture of our prediction model, we introduce task-average Pearson Correlation and Kendall Rank Correlation for evaluation and discuss our findings in Appendix C.1.
3. Remove BIG-bench subtasks whose performance on the preferred metric is zero for all models.
4. Keep experiments whose preferred metric is in exact_str_match, multiple_choice_grade, rougeLsum. This keeps 93% of all records before this step.
5. Remove entries that aggregate results from multiple subtasks as the performance of a task.
6. Remove subtasks with fewer than 100 examples, because small sample sizes may lead to large variance during evaluation.

We present a summary of the 6 model families in these records in Table 2.

B Featurization
In the main body of the paper we use the abstraction of (l, n_param, t, n_shot) to describe experiment configurations. In the actual training of tree-based models and MLPs, we modify how these features are represented. In particular, the input features contain the following:
1. l is converted into binary features for each model family, e.g., is_PaLM.
2. (l, n_param) is converted into binary features for each model, e.g., is_PaLM_535B.
3. There are 6 numerical features for the number of parameters: the number of total parameters, the number of non-embedding parameters, and the number of FLOP-matched non-embedding parameters, along with the natural log of these three values.
4. t is converted into binary features for each task, e.g., is_code_line_description.
5. m is a binary feature for the preferred metric associated with the task t, e.g., is_exact_str_match. BIG-bench defines a preferred metric for each task, so in the abstraction m is covered by t.
6. n_shot is directly used as an input feature.
For the numerical features (6 features for the number of parameters and 1 feature for the number of shots), we use StandardScaler in the sklearn library to normalize them. Additionally, we normalize the performance value y to be in the range [0, 1]. exact_str_match and multiple_choice_grade already satisfy this constraint; the reported rougeLsum values are in the range [0, 100] and are multiplied by 0.01 to form our dataset.
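As a concrete illustration, the featurization can be sketched as follows. The vocabularies and the record below are hypothetical stand-ins (the real BIG-bench lists are much larger), and the per-model binaries as well as the dataset-wide standardization are omitted for brevity:

```python
import math

# Hypothetical vocabularies; real BIG-bench has 6 families and 313 subtasks.
FAMILIES = ["BIG-G", "GPT", "PaLM"]
TASKS = ["code_line_description", "emoji_movie"]
METRICS = ["exact_str_match", "multiple_choice_grade", "rougeLsum"]

# Illustrative experiment record (parameter counts are made up).
record = {
    "family": "PaLM",
    "n_param_total": 535e9,          # total parameters
    "n_param_nonemb": 530e9,         # non-embedding parameters
    "n_param_flop_matched": 532e9,   # FLOP-matched non-embedding parameters
    "task": "code_line_description",
    "metric": "exact_str_match",
    "n_shot": 2,
}

def featurize(r):
    feats = [1.0 if r["family"] == f else 0.0 for f in FAMILIES]  # is_<family>
    feats += [1.0 if r["task"] == t else 0.0 for t in TASKS]      # is_<task>
    feats += [1.0 if r["metric"] == m else 0.0 for m in METRICS]  # is_<metric>
    counts = [r["n_param_total"], r["n_param_nonemb"], r["n_param_flop_matched"]]
    feats += counts + [math.log(c) for c in counts]  # 3 raw counts + natural logs
    feats.append(float(r["n_shot"]))                 # n_shot used directly
    return feats  # numeric columns would then be standardized across the dataset
```

In practice the standardization statistics are fit with StandardScaler on the training split only and reused for dev and test.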

C.1 Additional Evaluation Metrics for Performance Prediction
In addition to RMSE and the R^2 score, two common metrics for evaluating regression models, we introduce two new metrics: Task-average Pearson Correlation and Task-average Kendall Rank Correlation. The usage of Pearson Correlation and Kendall Rank Correlation is inspired by Liu et al. (2023), and we further adapt them by averaging across tasks. Concretely, we first group the test set D_test into |T| groups based on the associated task t of each example. Within each group, we compute the Pearson Correlation or Kendall Rank Correlation between the predicted and the actual performance. Finally, the average of these correlation values across all tasks T is reported as the task-average correlation. We report these numbers along with RMSE and R^2 in Table 6.
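A minimal pure-Python sketch of the task-average correlation computation (using the tau-a variant of Kendall's correlation; the record layout is illustrative):

```python
import math
from collections import defaultdict

def pearson(x, y):
    # Sample Pearson correlation between two equal-length sequences.
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def kendall_tau(x, y):
    # Kendall tau-a: (concordant - discordant) / number of pairs.
    n = len(x)
    c = d = 0
    for i in range(n):
        for j in range(i + 1, n):
            s = (x[i] - x[j]) * (y[i] - y[j])
            if s > 0:
                c += 1
            elif s < 0:
                d += 1
    return (c - d) / (n * (n - 1) / 2)

def task_average(records, metric):
    # records: iterable of (task_id, y_true, y_pred) triples.
    groups = defaultdict(list)
    for t, y, yhat in records:
        groups[t].append((y, yhat))
    scores = [metric([p[0] for p in g], [p[1] for p in g]) for g in groups.values()]
    return sum(scores) / len(scores)
```

A library implementation (e.g., scipy.stats.kendalltau, which also handles ties) would normally replace the hand-rolled correlations; the grouping-then-averaging step is the part specific to our evaluation.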
Generally, prediction models with higher R^2 scores exhibit higher rank correlation. Exceptions emerge when closely comparing tree-based models and MLP models. While MLP models are comparable to or outperform tree-based models in terms of R^2 score, tree-based models tend to outperform MLP models in terms of task-average (rank) correlation. Our further investigation reveals that tree-based models make more predictions with large absolute errors, which are penalized heavily by RMSE and R^2 (both involve squaring the errors), whereas rank correlation is less sensitive to such errors.
Throughout our paper, we primarily experiment with MLP models due to their faster runtime and the value of their learned representations. In practice, we recommend selecting methods based on the final goal: MLP models for more accurate prediction of exact values; tree-based methods for more accurate ranking of different experiment settings.

C.2 Sample Predictions in the L3 setting
In Fig. 7 we visualize the MLP model's predictions on four sample (l, t) combinations. The left two are cases achieving high R^2 scores; the right two are cases achieving low R^2 scores. For (l, t) combinations achieving high R^2 scores, we observe that either a combination (l, t') exists in D_train such that t and t' are similar (e.g., t and t' are two sub-tasks from the same BIG-bench task), or (l', t) exists in D_train such that l' and l are similar (e.g., both l and l' are from the three BIG-G model families). Our interpretation is that the learned predictor captures model family and task similarities and therefore predicts more accurately when l or t has a similar counterpart in the training set.
For (l, t) combinations achieving low R^2 scores, we observe several cases of overestimated performance, i.e., predicting "false positives" of emergent abilities. The selection of these combinations is largely a consequence of using R^2 as the selection criterion: they have a small total variance in the denominator of the R^2 score, so any overestimation results in an extremely negative R^2. Nevertheless, these qualitative results suggest that overestimation is a common type of mistake made by our trained prediction model.

C.3 Performance Prediction in Extrapolation Settings
In the input space of our problem, n_shot and n_param are numerical features, so it is possible to test extrapolation capabilities along these two inputs. We create three settings (S1/S2/S3, defined in the "Relaxing Constraints" paragraph) for testing the model's extrapolation capabilities. Results are shown in Table 4. Our observations: (1) Extrapolation in terms of n_shot is promising, achieving an R^2 greater than 0.9. However, extrapolation to increased model size remains challenging; this is closely related to the observation that emergent abilities are difficult to predict (Ganguli et al., 2022; Wei et al., 2022a). (2) Performance in S2 is consistently better than in S1. Note that the only difference between these two settings is the D_dev used for model selection. This suggests that the model overfits and fails to extrapolate well in the strict S1 setting, and that leaking some information about the test distribution (as in S2 and S3) can greatly help improve prediction accuracy.
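For concreteness, an extrapolation split on n_shot can be constructed by thresholding the feature rather than splitting at random. Field names, y values, and the cutoff below are illustrative stand-ins:

```python
# Illustrative records; only the split logic matters here.
records = [
    {"family": "PaLM", "n_param": 8e9, "n_shot": s, "task": "t1", "y": 0.1 * s}
    for s in [0, 1, 2, 3, 5]
]

SHOT_CUTOFF = 3  # assumed threshold: train on fewer shots, test on more
d_train = [r for r in records if r["n_shot"] < SHOT_CUTOFF]
d_test = [r for r in records if r["n_shot"] >= SHOT_CUTOFF]

# S1 carves the dev set out of d_train (pure extrapolation); S2 instead carves
# it out of d_test, indirectly leaking the test shot distribution to model selection.
dev_s1 = d_train[:1]
dev_s2 = d_test[:1]
```

An analogous split on n_param (training on small models, testing on the largest ones) yields the model-size extrapolation setting.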

C.4 Additional Results on "small-bench" search
We experiment with two additional search algorithms for "small-bench" search: (1) Randomized Beam Search: similar to regular beam search, except that we maintain a beam of size q = 4 and enumerate a randomly selected 1/q fraction of the task candidates at each step. This keeps the search runtime equivalent to that of greedy search.
(2) Simulated Annealing (Černý, 1985): initialize with a seed T_train; at time t, search in the neighborhood of the T_train from time t − 1, occasionally allowing uphill moves (i.e., moves toward a worse solution). Results are visualized in Fig. 8. Similar to greedy search, these two methods face the optimization challenges discussed in §5.3 and may lead to sub-optimal solutions.
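The simulated-annealing variant can be sketched as follows. The objective passed in is a toy stand-in for the R^2-recovery score, and the linear cooling schedule is an assumption, not the paper's exact configuration:

```python
import math
import random

def simulated_annealing(all_tasks, budget, score, steps=300, t0=1.0, seed=0):
    """score(subset) -> float, higher is better (stand-in for R^2 recovery)."""
    rng = random.Random(seed)
    current = set(rng.sample(all_tasks, budget))  # seed T_train
    cur_s = score(current)
    best, best_s = set(current), cur_s
    for step in range(steps):
        temp = t0 * (1 - step / steps) + 1e-9  # assumed linear cooling schedule
        # Neighbour of the previous T_train: swap one task out, one task in.
        out = rng.choice(sorted(current))
        inn = rng.choice(sorted(set(all_tasks) - current))
        cand = (current - {out}) | {inn}
        cand_s = score(cand)
        # Always accept improvements; occasionally accept uphill (worse) moves.
        if cand_s >= cur_s or rng.random() < math.exp((cand_s - cur_s) / temp):
            current, cur_s = cand, cand_s
            if cur_s > best_s:
                best, best_s = set(current), cur_s
    return best, best_s
```

As the temperature decays, uphill moves become rare and the search settles into a local optimum, which matches the sub-optimality observed in Fig. 8.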
C.5 Gaps between the "small-bench" search objective and evaluation objective
In §5.3, we observe that the performance of greedy search is unstable and hypothesize that this is partly due to gaps between the search objective and the evaluation objective.
Gap between T_dev and T_test. To simulate the scenario where the prediction model must make predictions for an unseen model family, we ensure that the search algorithm optimizes on T_dev × L_test and hold out T_test × L_test for evaluation. This creates a dev-test shift which may affect the search results.

Gap between 1 fold and 30 folds. Due to runtime concerns, we only launch search algorithms on 1 fold, while at evaluation time we use all 30 folds. The 30 folds, derived from 6 distinct model families, may exhibit significant variation. Consequently, a search result obtained on 1 fold may overfit to that specific fold and be less optimal across all 30 folds.

D Reproducibility
D.1 Hyperparameters and Training Details
For all the following methods and for each data-split setting (L1/L2.1/L2.2/L3/L3 Composition), we select hyperparameters based on dev performance on the first fold, choosing the best of 100 random combinations drawn from pre-defined lists.
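A sketch of this selection procedure; the search space below is a hypothetical placeholder (the actual per-method grids differ), and dev_score stands in for training a model and scoring it on the first fold's dev set:

```python
import random

# Hypothetical search space; the real grids are method-specific.
SPACE = {
    "hidden_dim": [64, 128, 256, 512],
    "lr": [1e-2, 3e-3, 1e-3, 3e-4],
    "dropout": [0.0, 0.1, 0.2],
    "batch_size": [32, 64, 128],
}

def sample_configs(space, n, seed=0):
    """Draw n distinct random combinations from the pre-defined lists."""
    rng = random.Random(seed)
    seen, configs = set(), []
    while len(configs) < n:
        cfg = tuple((k, rng.choice(v)) for k, v in sorted(space.items()))
        if cfg not in seen:  # avoid evaluating duplicate combinations
            seen.add(cfg)
            configs.append(dict(cfg))
    return configs

configs = sample_configs(SPACE, 100)
# best = max(configs, key=dev_score)  # dev_score: train + evaluate on fold-1 dev
```

Random search over a fixed budget is a common alternative to full grid search when the grid is combinatorially large.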

G.5 k-means
The numbers reported in Fig. 6 are the average of 5 runs: we ran the k-means algorithm with 5 different random initializations. In the following we list the "small-bench" candidates from 1 run. The same applies to the "k-means + Task Value" results.
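A minimal sketch of this selection strategy, using a from-scratch Lloyd's k-means over task representations and picking the task nearest each centroid (the embeddings below are hypothetical; in the paper they come from the trained MLP):

```python
import random

def dist2(a, b):
    # Squared Euclidean distance between two vectors.
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, iters=50, seed=0):
    # Plain Lloyd's algorithm; a library (sklearn.cluster.KMeans) would do the same.
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k), key=lambda c: dist2(p, centroids[c]))
            clusters[j].append(p)
        for c, members in enumerate(clusters):
            if members:
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return centroids

def pick_small_bench(task_embs, budget, seed=0):
    # task_embs: {task_name: vector}; pick the task closest to each centroid.
    names = sorted(task_embs)
    centroids = kmeans([task_embs[n] for n in names], budget, seed=seed)
    picked = [min(names, key=lambda n: dist2(task_embs[n], c)) for c in centroids]
    return sorted(set(picked))  # may be < budget if centroids share a nearest task
```

Picking one representative per cluster enforces task diversity directly, which is the property highlighted in §5.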

Figure 3: R^2 Score when Grouped by n_shot, l, and n_param. Example: Multi-group n_shot = 0 means training on the complete D_train (containing all n_shot values) and evaluating on the n_shot = 0 examples in D_test. Single-group (1000 Examples) n_shot = 0 means using 1000 n_shot = 0 examples in D_train as the training data.

Figure 5: Performance of Different Prediction Models on Challenging Train-Test Splits. As the setting becomes more challenging (L1 → L2 → L3 → L3 Composition), performance gradually drops and variance increases. MLP is the least sensitive to these changes.
|T_train| = 24 for BIG-bench Hard. (c) Random: for each budget b, randomly sample 5 T_train such that |T_train| = b. We report the mean and standard deviation of these 5 runs. Search Algorithms. (d) Greedy Search: based on the search result T_train^(b−1) at budget b − 1, enumerate all tasks not present in T_train^(b−1), and select the one task that achieves the highest R^2(T_test^(b) × L_test) to form T_train^(b) at budget b. (e) Best of 5000:
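The greedy procedure in (d) can be sketched as follows, with the scoring function as a toy stand-in for the R^2-recovery objective:

```python
def greedy_search(all_tasks, budget, score):
    """Forward selection: grow the task subset one step at a time.

    score(subset) is a proxy for R^2 on recovering the held-out tasks
    (higher is better).
    """
    selected = set()
    for _ in range(budget):
        # Enumerate every task not yet selected; keep the one that helps most.
        candidates = set(all_tasks) - selected
        best_task = max(candidates, key=lambda t: score(selected | {t}))
        selected.add(best_task)
    return selected
```

Each budget-b result reuses the budget-(b−1) result, so the whole curve in Fig. 6 costs one pass of nested enumeration rather than an independent search per budget.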

F Scaling Behavior of Most/Least "Predictable" Tasks (§3.4)

See Table App.3 in Srivastava et al. (2023) for an overview. We also made our best effort to incorporate all available model families in BIG-bench. The six model

Table 1: Statistics of BIG-bench experiment records after filtering. † "Model" is defined by model family l and n_param, e.g., PaLM 535B. ‡ The 313 subtasks fall under the 134 tasks. For simplicity we disregard the task-subtask hierarchy in this study; in the remainder of this paper, "tasks" refers to BIG-bench subtasks.
"Small-bench" Search Results.X-axis: size of "small-bench" (T train ), i.e., number of tasks selected for evaluating a new model family.Y-axis: R 2 score on recovering performance of remaining tasks.The complete BIG-bench will be at (313, 1.0) in this figure.Takeaways: (1) BIG-bench Lite and Hard are sub-optimal for recovering performance on remaining tasks; (2) Task diversity and task value are important for constructing effective "small-bench" candidates.

Table 2: Summary of model families included in this study. These model families offer considerable diversity.

Table 3: Using BIG-bench Hard and "small-bench" to recover performance and compare models. In this example, BBH is less informative in recovering performance on the remaining tasks, and thus the comparison is less accurate.
Relaxing Constraints by leaking 10% of D_test. Pure extrapolation can be extremely challenging. To better contextualize the results and understand limitations, we compare model performance in three slightly different settings. The first setting (S1) holds out 10% of D_train as the dev set for hyperparameter selection and early stopping; this corresponds to the pure extrapolation setting, as no information about D_test is available at training time. The second setting (S2) holds out 10% of D_test as the dev set, so information about D_test is indirectly leaked to model training. The third setting (S3) leaks 10% of D_test during training and is mainly for reference. More specifically, we split the original D_train into 90% D'_train and 10% D_dev1, and split the original D_test into 90% D'_test and 10% D_dev2.

Table 5: Runtime (Training+Evaluation) of Performance Prediction Models.

L2.1/L2.2/L3: Random Splitting at Different Granularity. To ensure that the results are comparable across settings as much as possible, we use 10-fold cross-validation for all these settings, similar to the practice in L1. This ensures that for every fold, the sizes of D_train/D_dev/D_test are consistent across settings, and each example in D appears in D_test exactly once. In this case, the only changing variable is the data-splitting strategy. We first split D into 10 disjoint subsets, and then rotate which ones serve as D_dev and D_test. To save computation budget, hyperparameter selection is done on the D_dev of the first fold.

Table 6: Full Results of Performance Prediction (Matrix Completion, Trees, Neural Network).