Flatness-Aware Prompt Selection Improves Accuracy and Sample Efficiency

With the growing capabilities of large language models (LLMs), prompting has become the dominant way of interacting with them. This has motivated the development of strategies for automatically selecting effective language prompts. In this paper, we introduce prompt flatness, a new metric to quantify the expected utility of a language prompt. This metric is inspired by flatness regularization in statistical learning, which quantifies the robustness of a model to perturbations of its parameters. We provide theoretical foundations for this metric and its relationship with other prompt selection metrics, offering a comprehensive understanding of existing methods. Empirically, we show that combining prompt flatness with existing metrics improves both performance and sample efficiency. Our metric outperforms previous prompt selection metrics with an average increase of 5% in accuracy and 10% in Pearson correlation across 6 classification benchmarks.


Introduction
Manually "engineering" prompts for large language models (LLMs) has been shown to lead to tremendous performance gains and has been a subject of intense study in recent years (Schick and Schütze, 2021a; Reynolds and McDonell, 2021; Mishra et al., 2022). However, prompt engineering can be challenging because it is difficult to determine the effectiveness of a prompt solely from its raw text form. Consequently, this process is typically carried out manually, which can be laborious and time-intensive. In particular, LLMs may produce vastly different predictive distributions for two seemingly comparable prompts, despite their semantic similarity (Mishra et al., 2022).
♡ Equal contribution. The code is accessible here: https://github.com/shadowkiller33/flatness.
Figure 1: We show that prompt flatness is an effective indicator of a prompt's performance on an LLM. For example, if two prompts p1, p2 incur the same loss on an LLM parameterized by θ0, i.e., L(p1, θ0) = L(p2, θ0), we find that the one inducing a flatter loss landscape over the LLM parameters (p1, in this visualization) is better.
In response to such difficulties, recent works propose metrics for automatic prompt selection. Notably, Sorensen et al. (2022) introduce Mutual Information (MI) to quantify the shared information between predictions and inputs. Further, Chen et al. (2022) introduce Sensitivity (SEN) to quantify model receptiveness to textual perturbations of the input prompts. Despite such metrics' empirical effectiveness, the underlying principles that enable them are not well understood.
This motivates the following questions: (RQ1) What makes the existing methods for prompt selection effective? (RQ2) How are these existing methods connected? (RQ3) Are there any new metrics complementary to the existing ones?
To address the questions above, we study existing methods from an optimization perspective. The objective L(p, D, θ0) quantifies the performance of an LLM (parameterized by θ0) on labeled data D with a prompt p appended to the dataset inputs. Prompt selection is in effect an optimization over L(p, D, θ0) as a function of the choice of p. The challenge is that, in practice, little labeled data D is available (Perez et al., 2021), which makes L(.) an unreliable measure for selecting effective prompts. We show that the existing prompt selection metrics MI and SEN (Sorensen et al., 2022; Chen et al., 2022) approximate the objective function L and therefore act as its surrogates. This addresses (RQ1) and (RQ2) above.
Additionally, to address (RQ3) we borrow ideas from statistical learning on flatness-aware optimization (Hochreiter and Schmidhuber, 1994; Keskar et al., 2017). We introduce Prompt Flatness (PFLAT), a metric that quantifies L's sensitivity to small perturbations in the LLM's parameters, conditioned on a prompt (see Figure 1 for intuition).
Our results indicate that prompts with higher flatness generally lead to better accuracy.
Our formal derivations also show that PFLAT is distinct from and complementary to prior metrics such as MI and SEN. Our empirical results (§3) on six classification benchmarks and four different model sizes confirm our theoretical intuition. For example, combining PFLAT and MI improves downstream performance by 6% accuracy over prompts selected by MI only. Similarly, combining PFLAT and SEN improves downstream performance by 9% accuracy over prompts selected by SEN only. Additionally, using PFLAT substantially improves sample efficiency, an important feature in low-resource scenarios.
In summary, our contributions are: (a) we propose a formal optimization framework that unifies several existing prompt selection metrics such as MI and SEN; (b) enabled by our formalism, we introduce PFLAT, a metric for selecting prompts that is more robust to the LLM's parametric perturbations; (c) we conduct comprehensive experiments whose results demonstrate the effectiveness of our method for prompt selection.

Prompt Selection via Flatness
We start by introducing the necessary background and the notational convention ( §2.1), then introduce our proposed metric, PFLAT ( §2.2), followed by a discussion of its relation to other existing prompt selection metrics ( §2.3).

Background and Setup
Notation. We cast prompt selection into an optimization problem. We are provided with a pretrained language model f with parameters θ ∈ R^m, which maps each natural language input instance x to f_θ(x) ∈ [0, 1]^{|V|}, a distribution over the label set V. We are also given input-output pairs D = {(x, y)}, where y is a one-hot label.
Prompt selection. Given a language model f, we seek to minimize the following empirical risk, also called the prompt loss in this paper:

L(p, D, θ) = (1/|D|) Σ_{(x,y)∈D} ℓ(f_θ(p • x), y),    (1)

where p • x is the string combination of a prompt p with input x, and ℓ is an appropriate loss, such as cross-entropy, that quantifies the gap between the gold label y and the predicted distribution f_θ(p • x).
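As a concrete illustration, the prompt loss above can be sketched in a few lines, assuming a hypothetical model function f that returns a probability distribution over labels (the model, prompt format, and toy data here are illustrative stand-ins, not the paper's actual setup):

```python
import numpy as np

def prompt_loss(f, prompt, data):
    """Empirical risk L(p, D, theta): average cross-entropy between the
    model's predicted distribution f(p . x) and the gold label y."""
    losses = []
    for x, y in data:
        probs = f(prompt + " " + x)       # predicted distribution over the label set V
        losses.append(-np.log(probs[y]))  # cross-entropy against a one-hot label y
    return float(np.mean(losses))
```

Prompt selection then amounts to evaluating this loss for each candidate prompt and keeping the minimizer.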
In the classic machine learning literature, it is customary to minimize the empirical risk L(p, D, θ) with respect to the parameters of the underlying model θ. However, recent developments in LLMs (Radford et al., 2019; Brown et al., 2020) have given rise to an alternative that optimizes over the choice of prompt p: given a collection of natural language prompts P that are "engineered" by domain experts (Schick and Schütze, 2021b,a; Mishra et al., 2022), select the prompt p ∈ P that minimizes the empirical risk.

Prompt Selection via Flatness
Our work draws inspiration from classic machine learning, where studies have demonstrated that using loss flatness in model selection leads to improved performance and generalization (Foret et al., 2020; Baldassi et al., 2020; Zheng et al., 2021; Stutz et al., 2021; Andriushchenko and Flammarion, 2022). In this prior literature, the optimization is performed with respect to the model parameters θ. Conversely, in the modern NLP literature, the parameters of LLMs are fixed once they are pre-trained, and further optimization is achieved through input prompts. As a result, it remains to be seen whether the findings from the classic machine learning literature translate to prompting LLMs.
Robust prompt selection objective. We start with the formal definition of flatness. Specifically, the goal is to select prompts that are robust to parameter perturbations:

min_p max_{∥ϵ∥<r} L(p, D, θ0 + ϵ),    (2)

where ϵ is a small perturbation added to the model parameters θ0. The inner optimization quantifies the worst-case loss upon a small perturbation of the model parameters from their default values, where the perturbations are contained within a small norm ball, ∥ϵ∥ < r. The overall objective is a minimax optimization (Zheng et al., 2021; Stutz et al., 2021; Baldassi et al., 2020), i.e., selecting the best prompt p with the smallest worst-case loss under small perturbations. Note that this is a strict generalization of the standard prompt selection objective in Equation 1.
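The inner maximization in this objective can be approximated crudely by random search over the perturbation ball; the following sketch (with an arbitrary toy loss in place of the LLM loss) illustrates the structure of the minimax objective:

```python
import numpy as np

def worst_case_loss(loss_fn, theta, r, n_samples=100, seed=0):
    """Monte-Carlo approximation of the inner max in the robust objective:
    the largest loss found among random perturbations with ||eps|| < r."""
    rng = np.random.default_rng(seed)
    worst = loss_fn(theta)  # eps = 0 lies inside the ball
    for _ in range(n_samples):
        eps = rng.normal(size=theta.shape)
        eps *= r * rng.uniform() / np.linalg.norm(eps)  # random point inside the ball
        worst = max(worst, loss_fn(theta + eps))
    return worst
```

A robust selection procedure would pick the prompt whose worst-case loss is smallest; in practice this expensive search is replaced by cheaper surrogates.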
Flatness definition. Since Equation 2 is a nontrivial saddle-point optimization problem, previous work (Zhao et al., 2022; Zhang et al., 2023b) has approximated it with the gradient norm of the loss function:

F(p, D, θ) = ∥∇_θ L(p, D, θ)∥,

where F(p, D, θ) is an analytical definition of the flatness of the loss function L(.). Intuitively, it quantifies how resilient the loss is against small perturbations in the parameter space θ.
The calculation of F requires (1) gradient computation of the loss L and (2) ground-truth labels, which may not be available. To circumvent these challenges, we introduce an approximation of F.
An efficient surrogate for flatness. Here we provide an approximate definition of flatness (F in Equation 5) that does not depend on instance labels. Our new metric, PFLAT, quantifies the change in the LLM's confidence values upon perturbations of its parameters:

PFLAT(p, D_X, θ) = E_{x∈D_X} E_{ϵ∼N(0,σ²)} [ ∥f_{θ+ϵ}(p • x) − f_θ(p • x)∥ ],    (6)

where the perturbations ϵ are sampled from a Gaussian distribution N(0, σ²) whose variance σ² determines the perturbation magnitude, and D_X = {x} refers to the input instances only (no labels). Intuitively, higher PFLAT means higher sensitivity to perturbations of the model parameters, indicating that the given input prompt, instances, and model parameters have formed a sharper minimum. The formal connection between PFLAT and F is deferred to Appendix B.
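A label-free estimate of PFLAT can be sketched as follows, using a toy softmax model in place of an LLM (the model and its parameterization are illustrative assumptions, not the paper's implementation):

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def pflat(logits_fn, theta, inputs, sigma=1e-2, n=5, seed=0):
    """Estimate PFLAT: the average change in the model's output distribution
    when Gaussian noise eps ~ N(0, sigma^2) is added to the parameters.
    Uses unlabeled inputs only."""
    rng = np.random.default_rng(seed)
    base = [softmax(logits_fn(theta, x)) for x in inputs]
    diffs = []
    for _ in range(n):  # n Monte-Carlo perturbations
        eps = rng.normal(0.0, sigma, size=theta.shape)
        pert = [softmax(logits_fn(theta + eps, x)) for x in inputs]
        diffs.append(np.mean([np.linalg.norm(p - b) for p, b in zip(pert, base)]))
    return float(np.mean(diffs))
```

A larger value indicates a sharper minimum around θ for the given prompt and inputs; as discussed later, a handful of samples already gives a stable estimate.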
Although the precise computation of PFLAT demands numerous Gaussian samples, in practice approximating it with a few samples suffices for a reasonable PFLAT estimate. We demonstrate this in the experiments (§4).
Putting it together. Incorporating our PFLAT metric (Equation 6) into the robust prompt selection objective (Equation 4), we get the following:

L̃(p) = L(p, D, θ) + α · PFLAT(p, D_X, θ),    (7)

where α is a scalar hyperparameter. In our experiments, we select the prompt with the smallest L̃ and show that such prompts have better quality than those selected by MI or SEN only. For emphasis, this equation shows that for robust prompt selection according to L̃, it is not enough to use PFLAT alone; it should be used in conjunction with L or its approximations (discussed in the next section). We show this point empirically in Section 3. The only reason our metric is not fully zero-shot is that the hyperparameter α has to be selected using a few examples from a held-out set.
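The combined objective then reduces to a one-line selection rule; in this sketch, loss_fn can be the prompt loss itself or a surrogate such as MI or SEN (the function names are illustrative, not the paper's API):

```python
def select_prompt(prompts, loss_fn, flatness_fn, alpha):
    """Pick the prompt minimizing loss(p) + alpha * PFLAT(p),
    i.e., the combined robust selection objective."""
    scores = {p: loss_fn(p) + alpha * flatness_fn(p) for p in prompts}
    return min(scores, key=scores.get)
```

Note how α = 0 recovers pure loss-based (or surrogate-based) selection, while larger α weights flatness more heavily.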

Relation to Prior Prompt Metrics
We show that prompt selection through existing methods such as MI (Sorensen et al., 2022) and SEN (Chen et al., 2022) is approximately equivalent to minimizing the prompt loss L(p, D, θ) in Equation 5; therefore, they can be viewed as surrogates for L(.). Formally, we characterize the gap between the prompt loss and its surrogates (e.g., MI and SEN), which is determined by the difference (e.g., KL divergence) between the model's predictions and the ground-truth labels.
Mutual Information. Sorensen et al. (2022) propose to pick prompts that maximize the mutual information between model input and prediction.
Proposition 1. Mutual information MI(p, D, θ) is a surrogate for the prompt loss L(p, D, θ), with a gap given by the KL divergence between the model's prediction and the gold label, up to a constant c = H(f_θ(x • p)) that does not depend on the prompt p (see Appendix A).
Sensitivity. Given a prompt p, Chen et al. (2022) utilize the sensitivity of the model's prediction to textual perturbations of p.
Proposition 2. Sensitivity SEN(p, D, θ) is a surrogate for the prompt loss L(p, D, θ), with a gap expressed through an expectation E_{p′} (an average) over different choices of perturbed prompts p′ of the 0-1 loss ℓ01. The detailed analyses are deferred to Appendix A. These derivations show that selecting prompts based on MI and SEN approximately selects the prompts with the smallest prompt loss, which reveals their connection and explains why they are effective for prompt selection.
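The idea behind SEN can be sketched as the rate at which predictions flip under prompt perturbations; the perturbation functions here are placeholders for the actual perturbation scheme of Chen et al. (2022):

```python
def sensitivity(predict, prompt, perturb_fns, inputs):
    """Fraction of inputs whose predicted label changes when the prompt is
    perturbed, averaged over a set of perturbation functions."""
    rates = []
    for perturb in perturb_fns:
        perturbed = perturb(prompt)
        flips = [predict(perturbed, x) != predict(prompt, x) for x in inputs]
        rates.append(sum(flips) / len(inputs))
    return sum(rates) / len(rates)
```

A low-sensitivity prompt is one whose predictions are stable under such perturbations, which, per Proposition 2, correlates with a low 0-1 prompt loss.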
Complementarity to PFLAT. A corollary of Propositions 1 and 2 is that prompt-selection metrics such as MI (Sorensen et al., 2022) and SEN (Chen et al., 2022) are surrogates for the prompt loss and are therefore complementary to PFLAT for the purpose of robust prompt selection (Equation 2). To see this, it is enough to go back to Equation 7, which shows how robust prompt selection decomposes into PFLAT and L. Finally, as shown, L is approximated by SEN and MI, which concludes the argument.

Experiments
We conduct extensive experiments to assess the effectiveness of prompt selection metrics.
Implementation. We prepare 20 human-written instructions (written by the authors; included in Appendix F), each appended with random demonstrations for each task. The number of demonstrations is set to 5, which matches the setting in Sorensen et al. (2022) for a fair comparison. We use MI, SEN, PFLAT, and their combinations for comparison. The results are averaged over three random seeds. We estimate PFLAT (Equation 6) via 5 random Gaussian perturbations of the LLM parameters, with variance σ² set to 1e-4. Later, we assess the influence of this estimation (§4.4).

Evaluation metrics. We use two metric families:
Correlation with accuracy: The first category measures the alignment between prompt selection metrics (including our proposed metric) and the downstream accuracy of each prompt. This evaluation contrasts the relative quality of prompts, as measured by their test accuracy, against their prompt-selection metric scores. Specifically, for each prompt, we compute the prompt selection metric score (e.g., MI or MI+PFLAT, which uses only task inputs) and the prompt's accuracy on the test set. Given a collection of such paired numbers, we compute their correlation. A high correlation indicates that the prompt-selection metric can serve as a "surrogate" (proxy) for selecting the most accurate prompt, bypassing the direct maximization of accuracy, which often demands extra held-out labeled data.
Ranking evaluation: Since correlations are sensitive to outliers (Anscombe, 1973), we additionally use metrics for best-performing prompt retrieval. Specifically, we use NDCG@1 (Järvelin, 2000), NDCG@3, and Rate. NDCG is a common metric for ranking quality in information retrieval; here, we take the prompts' performance as their quality score. Denoting the prompt selected by a metric (e.g., highest MI or lowest SEN) as p, Rate is defined as follows:

Rate = Acc(p) / Acc(p_o),    (8)

where p_o refers to the prompt that achieves the best performance on the task. Intuitively, Rate reflects the performance of the selected prompt compared to the best prompt, and it is a real-valued number between 0 and 1. A larger Rate corresponds to a better selected prompt p.
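The Rate metric is straightforward to compute given an accuracy oracle for each candidate prompt (a minimal sketch; acc stands in for test-set accuracy):

```python
def rate(acc, selected, candidates):
    """Rate = Acc(selected) / Acc(best prompt in the pool), a value in [0, 1]."""
    best = max(acc(p) for p in candidates)
    return acc(selected) / best
```

Rate equals 1 exactly when the metric retrieves the best prompt in the pool.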
Flatness is complementary to MI and SEN. The correlation results are in Figure 2 (detailed numbers in Appendix C). Figure 2 (first row) shows that correlations are higher for MI+PFLAT and SEN+PFLAT than for metrics without PFLAT. In other words, combining an existing metric (MI or SEN) with flatness yields a more effective prompt selection metric that correlates better with test accuracy.
We find similar results in the ranking evaluation illustrated in Figure 3 (full results in Appendix D). In all benchmarks, metrics incorporating flatness generally surpass those without it, highlighting the importance of utilizing prompt flatness in the prompt selection process.

Further Analysis

Continuous Prompt Selection
In addition to text-form (discrete) prompts, we also test the effectiveness of flatness for continuous prompt optimization (also known as "prefix-tuning"). As with the earlier results, introducing flatness into prefix-tuning also improves model performance.
Experimental setup. We follow the prefix-tuning setup of Li and Liang (2021) and consider three text classification benchmarks in our experiments: SST-2 (Socher et al., 2013), AGNews (Zhang et al., 2015), and SNLI (Bowman et al., 2015). We use GPT2-medium as the model and set the prefix length to 10 tokens for all prefix-tuning experiments. We train for 30 epochs on SST-2 and 25 epochs on AGNews and SNLI, as suggested in Yang and Liu (2021).

Implementation of flatness-aware prefix-tuning.
To introduce flatness into prefix-tuning, we leverage the sharpness-aware optimizer SAM (Foret et al., 2020). We use Adam (Kingma and Ba, 2015) as the optimizer in the counterpart without flatness. Both cases use the same learning rate of 5e-5.
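For reference, a single SAM update on a generic differentiable loss looks as follows (a minimal sketch of the two-step procedure of Foret et al. (2020), not the exact prefix-tuning training loop):

```python
import numpy as np

def sam_step(theta, grad_fn, lr=0.1, rho=0.05):
    """One sharpness-aware minimization (SAM) step:
    1) move to the first-order worst-case point theta + rho * g / ||g||,
    2) apply the gradient computed there to the original parameters."""
    g = grad_fn(theta)
    eps = rho * g / (np.linalg.norm(g) + 1e-12)  # worst-case perturbation direction
    g_adv = grad_fn(theta + eps)                 # gradient at the perturbed point
    return theta - lr * g_adv
```

In the prefix-tuning experiments only the prefix parameters would be updated this way; the Adam counterpart simply applies grad_fn(theta) directly.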
Results. As shown in Table 1, prefix-tuning with flatness achieves better performance than without flatness. These results show that flatter continuous prompts bring better performance, matching our conclusions on discrete prompts.

Influence of Model Size
We investigate the effect of model size on our methods. As shown in Figure 4, as the model size increases, the gap between paired metrics (e.g., MI vs. MI+PFLAT), measured in terms of Rate, generally increases, indicating a growing gain from adding PFLAT to existing prompt selection metrics for larger models.

Impact on Sample Efficiency
If enough labeled data is available, a reasonable approach to prompt selection is to use the accuracy of each prompt on a labeled development set (we name this baseline "acc"). Thus, a natural practical question arises: how does our method compare to prompt selection based on the accuracy on a limited number of labeled examples? To perform the comparison, we select N labeled examples from the AGNews dataset and evaluate Rate (Equation 8) for both the "acc" baseline and our methods (MI/SEN + PFLAT).
Based on the results in Figure 5, we observe that with little data available, our methods select a far better prompt than the "acc" baseline, allowing performance gains in low-data scenarios. This can be attributed to the fact that when the dataset is small, there may be a significant distribution shift between the development and test sets. Our metrics, however, provide signals beyond the labeled data and are thus more resilient to such distribution shifts. Unsurprisingly, as the data size grows, the gap between our methods and the "acc" baseline shrinks, since the distribution-shift issue is mitigated by increasing the size of the development set. In conclusion, our metrics are more advantageous than development-set accuracy for prompt selection in low-resource scenarios.

Estimation of PFLAT
In our implementation of PFLAT (Equation 6), two factors affect the estimate: the sampling number N and the perturbation size σ². We explore their effects in this section.
As noted earlier, we compute prompt flatness by sampling ϵ from a zero-mean Gaussian distribution N(0, σ²). Since the computational cost of this estimate is proportional to the sample size N, the choice of N is crucial for efficiency. Figure 7 shows the results of an experiment examining the trade-off between N and estimation quality. The results indicate that N ≈ 5 is sufficient to provide reliable estimates of PFLAT.
Likewise, we investigate the impact of σ² on the estimate. The results in Figure 6 (a, b) indicate that the optimal perturbation size is around 1e-4. When the perturbation size increases beyond 1e-4, the estimation error also increases.

Related Work
Prompt selection and engineering. The performance of LLMs is highly sensitive to their prompt prefix, including the ordering of demonstrations (Lu et al., 2022) and the framing of the instructions (Mishra et al., 2022). This has motivated work on prompt selection, such as the methods discussed in this work (Chen et al., 2022; Sorensen et al., 2022). Beyond quantifying prompts' effectiveness, the literature has explored alternative ways to address LLMs' brittleness, such as chain-of-thought prompting (Kojima et al., 2022), LLM self-consistency (Wang et al., 2022a), and complexity (Fu et al., 2022). Our optimization-based framework does not cover these classes of prompt engineering, which we hope future work will address.
Algorithmic prompt generation. Several prior works focus on generating effective prompts for solving a given task with an LLM. Examples are RLPrompt (Deng et al., 2022), GrIPs (Prasad et al., 2023), and Tempera (Zhang et al., 2023a). While these works primarily focus on generating high-performing prompts for prompt-tuning, our goal is to identify effective prompts from a pool of candidates for in-context learning. In Appendix E, we conduct a comparative analysis of the in-context learning performance of prompts generated by these approaches. The results reveal that prompts deemed suitable for fine-tuning exhibit sub-optimal in-context learning performance.
Besides, generating prompts inevitably involves model tuning via setups such as reinforcement learning, which incurs additional computational cost. More importantly, the quality of the generated prompts depends on the task's domain: when confronted with out-of-domain (OOD) tasks, these approaches tend to generate nonsensical prompts.
Continuous prompts. Beyond language (discrete) prompts, we show that our results also apply to continuous prompts. In contrast to manually crafting discrete prompts, one can optimize continuous prompts in embedding space, yielding better results (Lester et al., 2021; Li and Liang, 2021; Zhang et al., 2022; Gu et al., 2022; Lang et al., 2022; He et al., 2022). Despite the higher accuracy, continuous prompt optimization is only applicable to LLMs whose parameters are publicly accessible. Moreover, there is no evidence that continuous prompts are interpretable (Khashabi et al., 2022), making it challenging to transfer insights from prompts that work well on one task to another.
Flatness-aware language modeling. Previous works (Liu et al., 2023; Mehta et al., 2021) showed that flatness-aware optimization can enhance the generalization of LLMs during pre-training, even when the training loss is the same. Na et al. (2022) demonstrated that flatness-aware training increases the compression rate, and Wang et al. (2022b) showed the advantages of flatness in training encoder-only models.

Model calibration and robustness analysis.
Model calibration focuses on adjusting LLMs' predictions to reflect human uncertainty (Holtzman et al., 2021; Zhao et al., 2021; Jiang et al., 2022). Calibration is related to our work, as a well-calibrated LLM's confidence could be used for prompt selection. However, calibration algorithms have remained domain/task-specific so far, restricting their applicability to the problem discussed in this paper.

Conclusion
We developed a theoretical framework for prompt selection techniques that merges prompt loss and flatness, enabling the integration of previous studies to elucidate their distinctions and efficacy.
Through extensive experimentation, we demonstrated the effectiveness of our proposed flatness-based metric when used in conjunction with existing ones. Our research offers valuable insights and directions for future investigations into effective prompt engineering.

Limitation
The limitations of this study can be outlined as follows: (1) our paper assesses the methods on classification tasks, but they can potentially be applied to generation tasks in the future; (2) our framework presumes that all candidate prompts in the provided collection are coherent and fluent for the intended task, despite possibly yielding varying results; (3) our approach is not entirely zero-shot, since it still requires a small labeled development set for adjusting the α hyperparameter.

Ethical Considerations
To the best of our knowledge, the paper does not pose any immediate ethical concerns.
divergence. Overall, MI can be formulated as follows:

MI(D, p) = H(E) − L(p, D, θ)    (12)

Equation 12 shows that maximizing mutual information is equivalent to minimizing the prompt loss to a certain degree, indicating that MI serves as a surrogate for the prompt loss.
Sensitivity. Sensitivity (SEN) reflects how much the model output changes given small perturbations of the input. SEN first creates a perturbed prompt set P from a prompt p by changing the demonstration order σ and adding a perturbation ϵ to the prompt instruction I. We direct readers to the original paper (Chen et al., 2022) for details of how such prompt sets are created. Sensitivity is first defined on a single test sample x; naturally, this sample-level metric extends to the dataset level, where the SEN of prompt p is its average over the test samples D. Rewriting this formula shows that SEN can be regarded as a surrogate for the prompt loss L, where L(p, D) is a 0-1 loss instead of the cross-entropy loss used in MI's derivation. Therefore, minimizing SEN is partially equivalent to minimizing the prompt loss, explaining why a low-sensitivity prompt achieves better performance, as empirically verified by Chen et al. (2022). Generally, the gap between the prompt loss L and the two surrogates (MI and SEN) is determined by the distance (i.e., KL divergence) between the model's predicted distribution f_θ(x • p) and the ground-truth label. When f_θ(x • p) is identical to the ground-truth label, MI and SEN become perfect surrogates for the prompt loss L.

B On the approximation gap of flatness and F
This section details the approximation gap between PFLAT and F. First, we recall the definitions of PFLAT and F; the approximation gap can then be obtained through Equation 16 and Equation 17. When the model's confidence is identical to the ground-truth labels, PFLAT is a precise approximation of F.

Figure 2: Results of the correlation evaluation across six datasets and their average (AVG). First row: SEN vs. SEN+PFLAT and MI vs. MI+PFLAT show that flatness brings consistent improvements over existing metrics. Bottom left: PFLAT vs. MI+PFLAT shows that flatness does not perform well when applied alone, as expected. Bottom right: the MI+SEN vs. MI comparison shows that combining SEN and MI brings limited improvement.
In Figure 2, SEN+PFLAT, MI+PFLAT, or MI generally outperforms PFLAT alone; without the prompt loss, prompt flatness by itself is insufficient to reflect prompt quality. These results stress the importance of combining prompt loss and flatness.

Figure 4: Rate (reflecting the ability to select better prompts) computed for prompt selection across four model sizes, evaluated on the AGNews dataset. Combining prompt loss and flatness (MI+PFLAT or SEN+PFLAT) is consistently better than MI/SEN alone across different model types. More detailed results are deferred to Appendix D.

Figure 5: For development sets of varying sizes n = 16, 32, …, 512, the devset-accuracy-based method (green line) selects prompts based on their accuracy on n devset samples. Our metrics (MI+PFLAT and SEN+PFLAT) also use n samples and achieve better performance in low-resource scenarios (n < 500).

Figure 6: (a) Rate of SEN+PFLAT as the perturbation size varies. The optimal ϵ is around 1e-5; as ϵ grows, the performance of SEN+PFLAT continues to degrade. (b) Rate of MI+PFLAT as the perturbation size varies. The trend for MI+PFLAT is similar to (a).
Figure 7:
Table 1: Performance of prefix-tuning with and without flatness, reported as mean (standard deviation). Leveraging flatness in continuous prompt tuning improves performance. The stronger number for each dataset is marked in bold.

Table 2: Pearson (Pr) and Spearman (Spr) correlation between prompts' performance and the metrics of various methods. Overall, flatness-based metrics obtain higher correlations. Red marks the best performance.