Robust Prompt Optimization for Large Language Models Against Distribution Shifts

Large Language Models (LLMs) have demonstrated significant ability in various Natural Language Processing (NLP) tasks. However, their effectiveness is highly dependent on the phrasing of the task prompt, which has led to research on automatic prompt optimization using labeled task data. We reveal that these prompt optimization techniques are vulnerable to distribution shifts such as subpopulation shifts, which are common for LLMs in real-world scenarios such as customer review analysis. In this light, we propose a new problem of robust prompt optimization for LLMs against distribution shifts, which requires that the prompt optimized over the labeled source group simultaneously generalize to an unlabeled target group. To solve this problem, we propose the Generalized Prompt Optimization framework, which incorporates unlabeled data from the target group into prompt optimization. Extensive experimental results demonstrate the effectiveness of the proposed framework, with significant performance improvement on the target group and comparable performance on the source group.


Introduction
LLMs have gained significant attention for their remarkable performance in a broad range of Natural Language Processing (NLP) tasks (Ouyang et al., 2022; Chung et al., 2022; Brown et al., 2020; Touvron et al., 2023). This success has led to a shift in the paradigm of solving NLP tasks, moving away from training task-specific deep models towards developing task-specific strategies to effectively utilize LLMs (Wei et al., 2022; Kojima et al., 2022; Wang et al., 2022a; Ye et al., 2023b). In the new paradigm, the prompt becomes a crucial factor in ensuring the effectiveness of an LLM on an NLP task, since even slight variations in prompt phrasing can largely affect LLM output (Reynolds and McDonell, 2021; Gao et al., 2021), making prompt optimization a promising research direction.
Existing research has explored automatic prompt optimization methods to eliminate the manual effort of identifying effective prompts for a given task. These methods can be gradient-based or gradient-free, depending on the availability of model gradients. Gradient-based methods optimize the prompt by calculating its gradients through the LLM (Schick and Schütze, 2021b,a; Hu et al., 2022). Gradient-free methods update prompts based on LLM outputs, using techniques such as an iterative search-and-select over the prompt space (Zhou et al., 2023; Prasad et al., 2022; Pryzant et al., 2023). This work focuses on gradient-free prompt optimization, as LLMs are evolving into black-box API services (Sun et al., 2022).
Current gradient-free prompt optimization methods ignore distribution shifts (Wang et al., 2023), where the data an LLM serves may differ from the labeled data used for prompt optimization. Real-world NLP applications often encounter distribution shifts, such as new user groups with distinct linguistic habits in customer review analysis. It is unclear whether prompts hinder the robustness of LLMs against distribution shifts. To answer this question, we conduct experiments with the representative gpt-3.5-turbo-0301 model and prompts optimized by APE (Zhou et al., 2023) over paired data groups with distribution shifts. Results on 30 pairs of data groups from six tasks show the risk of significant performance gaps under certain distribution shifts.
Based on this finding, we propose a new robust prompt optimization problem, which aims to optimize task-specific prompts with consideration of performance on both source and target groups under different distributions. Given an NLP task such as sentiment analysis, our problem setting has a labeled source group, similar to the conventional prompt optimization setting, and an unlabeled target group. We keep the target group unlabeled because distribution shifts happen over time in practice; labeling each newly arriving target group would cause unnecessary labor costs and latency. Accordingly, the main challenge in solving this robust prompt optimization problem is incorporating unlabeled data into prompt optimization.
To this end, we propose the Generalized Prompt Optimization (GPO) framework to obtain a task-specific prompt for both source and target groups. To jointly consider the two groups in prompt optimization, the key lies in labeling the target group in an automatic and reliable manner by adapting knowledge from the labeled source group. Towards this goal, we leverage the strong zero-shot labeling ability of LLMs, together with prompt ensembling to enhance labeling robustness. Experimental results on three tasks demonstrate the effectiveness of our framework in improving performance on the target group while preserving comparable performance on the source group. To sum up, our contributions are threefold:
• We reveal the robustness issue of prompt optimization against distribution shifts and propose a new robust prompt optimization problem.
• We propose the Generalized Prompt Optimization framework, which generates robust prompts considering both labeled and unlabeled data.
• We conduct extensive experiments on three NLP tasks, validating the rationality and effectiveness of our proposed framework.

Preliminary Experiments
Prompt optimization aims to find the best prompt p that can instruct LLMs to predict the output y based on the concatenation of p and the task input x, where x, y, and p are all sequences of tokens. Formally, given an NLP task with a dataset {(x, y)} following a distribution P, the goal is to obtain

p* = argmax_{p ∈ Z} E_{(x,y)∼P} [r(LLM([p; x]), y)],

where Z denotes the prompt optimization space and r is the evaluation metric comparing the LLM output with the ground-truth output y, e.g., accuracy. Existing studies usually leverage gradient-based or gradient-free methods to automatically optimize the prompts. Since LLMs are evolving into black-box API services, gradient-free methods become increasingly important. However, they ignore distribution shifts between training and testing data. In this light, we conduct controlled experiments to answer the following research question: Are prompts optimized by existing gradient-free methods robust to distribution shifts?
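As a rough illustration of the search-and-select loop used by gradient-free methods, the following toy sketch scores candidate prompts, keeps the best, and expands them with simple mutations. The scoring function and mutation rule are stand-ins of our own; APE actually resamples and evaluates prompts with an LLM.

```python
def optimize_prompt(candidates, eval_fn, rounds=3, keep=2):
    """Iterative search-and-select over a discrete prompt space.

    candidates: initial list of prompt strings (e.g. drafted by an LLM).
    eval_fn: maps a prompt to a score r on held-out labeled data.
    Each round keeps the best prompts and expands them with a simple
    string mutation (a stand-in for LLM-based resampling).
    """
    pool = list(candidates)
    for _ in range(rounds):
        pool.sort(key=eval_fn, reverse=True)
        survivors = pool[:keep]
        mutated = [p + " Think step by step." for p in survivors]
        pool = survivors + mutated
    return max(pool, key=eval_fn)

# Toy scoring: prefer prompts that mention the task explicitly.
score = lambda p: p.count("sentiment") + 0.1 * p.count("step")
best = optimize_prompt(
    ["Classify the text.", "Classify the sentiment of the text."], score)
```

In the real setting, `eval_fn` would call the black-box LLM on validation examples and compute the metric r.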

Evaluation Protocol
We conduct controlled experiments between a pair of data groups with distribution shifts, i.e., a source group {(x_s, y_s)} following a distribution P_s, and a target group {(x_t, y_t)} with a distribution P_t, where P_t ≠ P_s. We intend to examine whether the prompt p_s optimized on the source group can generalize to the target group. Specifically, given p_s and a prompt p_t optimized on the target group, we compare their performance on the target group.
Datasets. We select 16 datasets from six popular NLP tasks, where each pair of groups under the same task is treated as source and target groups. Following recent out-of-distribution (OOD) research (Yang et al., 2022), we take each dataset as a group and regard the different backgrounds and topics across datasets as distribution shifts. For the sentiment analysis task, we adopt Yelp (Zhang et al., 2015), Flipkart (Vaghani and Thummar, 2023), IMDB (Maas et al., 2011), and Amazon (Zhang et al., 2015), which cover different topics. For the natural language inference task, we utilize MNLI (Williams et al., 2018) and ANLI (Nie et al., 2020), an adversarial dataset for MNLI. For textual entailment, we use RTE (Wang et al., 2018) and its OOD dataset HANS (McCoy et al., 2019). For commonsense QA, we use SocialIQA (Sap et al., 2019), PIQA (Bisk et al., 2020), and OpenbookQA (Mihaylov et al., 2018), which focus on different types of commonsense knowledge. For multi-turn dialog reasoning, we use DSTC7 (Gunasekara et al., 2019), Ubuntu Dialog (Lowe et al., 2015), and MuTual (Cui et al., 2020). Besides, for the numerical QA task, we use the samples of two different answer types (i.e., numerical values and text spans) in DROP (Dua et al., 2019) as two groups. See Appendix A.1 for details.
Experimental Setup. We adopt APE (Zhou et al., 2023), an effective gradient-free prompt optimization method, for prompt generalization analysis. To highlight the effect of prompts, we conduct experiments under the zero-shot setting without in-context examples. For the backbone LLM, we leverage gpt-3.5-turbo-0301 by calling the OpenAI API. For all classification tasks (all tasks except DROP), we use accuracy as the evaluation metric. For DROP, we utilize its standard evaluation metric, F1. We sample data to report averaged results. More implementation details can be found in Appendix A.2.

Experimental Results
Demonstration of the Generalization Performance Gap. Table 1 shows the tasks without a large generalization gap between the performance of prompts p_s and p_t, and Table 2 shows the tasks with large gaps (accuracy gap > 8.0) on some groups. The row headers refer to the source groups for prompt optimization, while the column headers show the target groups used to test the optimized prompts. The generalization performance gap between p_s and p_t can be observed by comparing the values in the same column.
From the tables, we can observe: 1) The generalization performance gap may not exist for previously studied OOD and adversarial groups (see Table 1), including the groups of the natural language inference and textual entailment tasks. This is possibly attributed to the strong generalization ability of LLMs. 2) However, for some data groups in Table 2, such as the sentiment analysis datasets (e.g., Flipkart and Yelp), the commonsense QA datasets with different topics (e.g., PIQA and OpenbookQA), and the DROP groups with different answer types, there are still significant generalization performance gaps, demonstrating the existence of the generalization issue of prompt optimization. 3) Surprisingly, the prompt p_s optimized from the source group does not always perform worse than the prompt p_t optimized on the target group. In Table 2(b), p_s from OpenbookQA performs even better than p_t for SocialIQA. Besides, for DROP in Table 2(c), p_s from Spans also performs better than p_t from Number. In the following section, we explore the reasons for the above three observations.

Exploration of the Factors Affecting Prompt Robustness. Based on the above observations, we further explore two research questions. Q1: Why do the prompts optimized on source groups perform differently on a target group? Q2: Why does the prompt optimized on the source group perform even better than the prompt optimized on the target group in some cases?
For Q1, we conjecture that the varied performance gaps are attributed to different distribution shifts between the source and target groups. To verify this, we examine two metrics measuring two kinds of distribution shifts: 1) the label shift, measured by the KL divergence between the label distributions, and 2) the input similarity, quantified by the n-gram similarity of the input corpora of the two groups. Their detailed implementation is illustrated in Appendix A.3. We show the results of the sentiment analysis task as an example in Table 3. We can observe that the smallest label distribution shift and the largest input similarity in Table 3 generally coincide with the best generalization performance on each target group in Table 2, indicating a correlation between distribution shifts and generalization performance. Nevertheless, the two metrics cannot perfectly explain the performance on all tasks (cf. Appendix A.3). Therefore, Q1 remains a challenging research question, requiring further exploration in future work.
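As a concrete sketch of the label-shift metric, the KL divergence between the empirical label distributions of the two groups can be computed as below. The smoothing constant and the direction of the divergence are assumptions of ours; the paper's exact estimator is described in its Appendix A.3.

```python
import math
from collections import Counter

def label_kl(source_labels, target_labels, eps=1e-9):
    """KL(P_s || P_t) between empirical label distributions,
    one illustrative way to quantify label shift."""
    classes = set(source_labels) | set(target_labels)
    ps = Counter(source_labels)
    pt = Counter(target_labels)
    ns, nt = len(source_labels), len(target_labels)
    kl = 0.0
    for c in classes:
        p = ps[c] / ns + eps  # smoothed source probability of class c
        q = pt[c] / nt + eps  # smoothed target probability of class c
        kl += p * math.log(p / q)
    return kl
```

A value near zero indicates matching label distributions; larger values indicate a stronger label shift.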
For Q2, we conjecture that the outstanding generalization performance arises because a source group with large diversity covers heterogeneous patterns in the target group, leading to a more robust prompt p_s than p_t. To explore this, we measure the heterogeneity of the source and target groups by calculating the percentage of unique n-grams in each group, and the percentage of n-grams of the target group covered by the source group. For illustration, we present the results of the commonsense QA task in Table 4. From Table 4(a), we can observe that OpenbookQA has the most diverse input according to the n-gram statistics. Moreover, OpenbookQA covers a large proportion of the n-grams of SocialIQA and PIQA. These partly explain the superiority of the prompts optimized on OpenbookQA (see Table 2).
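A minimal sketch of the coverage metric, assuming character 4-grams (the paper also reports word 1-grams, computed with scikit-learn; the helper names here are ours):

```python
def char_ngrams(texts, n=4):
    """Set of character n-grams over a corpus."""
    grams = set()
    for t in texts:
        grams |= {t[i:i + n] for i in range(len(t) - n + 1)}
    return grams

def coverage(source_texts, target_texts, n=4):
    """Fraction of the target group's n-grams already present in the
    source group; higher values suggest the source group is diverse
    enough to cover the target group's patterns."""
    src = char_ngrams(source_texts, n)
    tgt = char_ngrams(target_texts, n)
    return len(src & tgt) / len(tgt) if tgt else 0.0
```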

Robust Prompt Optimization
In this section, we first formulate the robust prompt optimization problem and then propose the GPO framework to enhance the robustness of optimized prompts.

Problem Definition
To enhance the generalization ability of prompts, we propose a robust prompt optimization problem. Specifically, given an NLP task such as sentiment analysis, it aims to optimize a task-specific prompt for data groups with different distributions. We consider the popular scenario where a labeled source group G_s = {(x_s, y_s)} following a distribution P_s and the inputs {x_t} of an unlabeled target group G_t = {(x_t, y_t)} ∼ P_t (P_t ≠ P_s) are available, while the labels {y_t} are unseen during prompt optimization. The objective becomes utilizing G_s = {(x_s, y_s)} and {x_t} to optimize a task-specific prompt robust to samples from either P_s or P_t.
Reasons for Access to the Unlabeled Target Group. In real-world deployment, LLMs continually encounter testing data with distribution shifts. Collecting the input features {x_t} of the target group is feasible. For example, when using LLMs as web services to solve user queries for certain NLP tasks, it is easy to collect extensive user queries as unlabeled target groups. However, labeling {x_t} may be time-consuming and costly, and thus we intend to optimize robust prompts without the labels of the target group.
A Task-Specific Prompt vs. One Prompt for Each Group. To tackle the generalization issue of optimized prompts, an intuitive approach is to optimize a separate prompt for each data group, yet this simplistic approach faces several limitations in real scenarios. In real-world deployment, it not only requires additional computation costs to construct more prompts, but also needs to accurately classify each testing sample into the appropriate group of the same distribution, resulting in increased computation costs, latency, and new challenges for precise group classification. Furthermore, the collected source group data cannot cover all potential target groups, and the prompts optimized on the source groups may inevitably be tested on examples from previously unseen groups. Thus, we aim to improve the generalization ability of one task-specific prompt across different groups.

GPO Framework
To obtain a robust prompt for both the source and target groups, it is natural to jointly consider G_s and G_t for prompt optimization. However, G_t lacks the labels {y_t} that are commonly required by gradient-free optimization methods (refer to Table 5 for the inferior results without labeling). Given the impressive zero-shot labeling capabilities of LLMs, we propose to utilize LLMs to label {x_t}. Considering that noisy labels may damage the quality of optimized prompts, we further present two strategies to improve labeling accuracy.
As illustrated in Figure 2, we first propose a Meta Prompt to instruct LLMs to acquire knowledge from the labeled source group and generate a series of prompts. Thereafter, we utilize a prompt ensemble labeling strategy that applies the generated prompts to an LLM for precise labeling of {x_t}. In detail, we derive a three-step framework that performs the labeling with the two strategies and then conducts joint prompt optimization, as shown in Figure 2.
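The prompt ensemble labeling step can be sketched as follows, where `llm` stands in for the black-box API call and consistency filtering is folded into a single threshold check. The function names and the toy LLM are illustrative assumptions, not the paper's implementation.

```python
from collections import Counter

def ensemble_label(x, prompts, llm, threshold=0.6):
    """Label one unlabeled target input with an ensemble of prompts.

    llm(prompt, x) -> predicted label string (a stand-in for the
    black-box API call). Returns the majority label if at least
    `threshold` of the prompts agree; otherwise None, so the example
    is filtered out of the pseudo-labeled set.
    """
    votes = Counter(llm(p, x) for p in prompts)
    label, count = votes.most_common(1)[0]
    return label if count / len(prompts) >= threshold else None

# Toy LLM: prompts mentioning "sentiment" answer correctly.
fake_llm = lambda p, x: "positive" if "sentiment" in p else "neutral"
prompts = ["sentiment A", "sentiment B", "generic C"]
```

With a higher threshold, fewer but more reliably labeled target examples survive for joint optimization.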

Setup
Datasets. We experiment with GPO on three tasks: sentiment analysis, commonsense QA, and DROP. For each task, we select a pair of groups with a generalization performance gap as the source and target groups, and withhold the labels of the target groups.
Compared Methods. We adopt the following baselines: 1) APE; 2) APO (Pryzant et al., 2023), the state-of-the-art gradient-free prompt optimization method for LLMs; 3) APE-ut, a naive generalization solution that incorporates the unlabeled target group input into APE; 4) the Upper Bound, which represents the performance of the prompt optimized by APE on the target group data with ground-truth labels; and 5) our proposed GPO. We also show the results of simple human-written prompts that are general for the task, and their revised versions by PromptPerfect (https://promptperfect.jina.ai), an automatic prompt engineering website.
Evaluation Protocol. We utilize two testing strategies: Top 1 and Ensemble. Top 1 uses the single optimized prompt with the best validation performance, while Ensemble labels with all K obtained prompts and accepts the output on which most prompts agree. We utilize the same N-shot data as the preliminary experiments and also report results averaged over five runs. More implementation details are illustrated in Appendix A.4.

Performance Comparison
Compare to Generated Prompts. From Table 5, we can observe the following: 1) GPO achieves superior performance on all target groups under both Top 1 and Ensemble testing, validating its effectiveness. However, there is still space for improvement towards the Upper Bound on all tasks, showing the challenge of the generalization problem. 2) GPO achieves comparable source group performance on all tasks, showing that its improvement on the target group does not come at the cost of source group performance.
Compare to Human-written Prompts. From Table 6, we further observe that GPO outperforms the human-written prompts and PromptPerfect on the sentiment analysis and commonsense QA tasks. However, on the most difficult task, DROP, GPO underperforms the human-written prompts. This is potentially because inaccurate labels for Spans hinder prompt optimization. Similarly, PromptPerfect also fails to improve the human-written prompt for DROP.
Ablation Study. We study the effect of prompt ensemble labeling and joint prompt optimization by evaluating two modifications of GPO: (1) setting the consistency threshold to 0, denoted as w/o cons; and (2) additionally removing the target group training data during the final prompt generation, denoted as w/o cons+t-train. From Table 7, we can observe that: 1) In all cases except Flipkart with Top 1 evaluation, GPO performs better than w/o cons on the target groups, showing the effectiveness of the consistency threshold. 2) Among the three tasks, DROP shows a larger improvement from w/o cons to GPO on both source and target groups than the other two tasks. We hypothesize that this discrepancy is related to the different degrees of improvement in labeling accuracy brought by the consistency threshold, which will be further discussed in Section 4.4. 3) Comparing w/o cons and w/o cons+t-train, removing the target group training data benefits the Top 1 results on the source group but harms the Ensemble results on the target groups; it has less effect on the target group Top 1 results since both methods still use the target group validation data.

In-depth Analysis
Analysis of the Effect of the Consistency Threshold. To further reveal the effect of the consistency threshold, we first show the labeling accuracy on the target group training and validation data for GPO and w/o cons in Table 8. We can observe that applying the consistency threshold improves the labeling accuracy for all target groups. Examining the relationship between this labeling accuracy improvement and the performance difference between GPO and w/o cons in Table 7, we find that for Flipkart and OpenbookQA, where the labeling accuracy is already high under w/o cons, further improving the labeling accuracy via the consistency threshold is unlikely to achieve a large performance gain. Conversely, in the case of Spans with low labeling accuracy, even a minor improvement can result in significant performance gains. To further explore the connection between labeling accuracy and target group performance, we conduct an experiment in which we manually assign incorrect labels to varying proportions (0%, 50%, and 90%) of the target training and validation data. The results are illustrated in Figure 3. As the percentage of incorrect labels increases, the overall performance on the target group generally decreases, emphasizing the importance of labeling accuracy for achieving effective generalization.
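The controlled label-noise experiment can be reproduced in spirit with a helper like the following, which flips a chosen fraction of labels to a different class; the exact sampling scheme is an assumption of ours.

```python
import random

def corrupt_labels(labels, frac, label_set, seed=0):
    """Flip `frac` of the labels to a different class, mimicking the
    controlled-noise experiment (0%, 50%, 90% wrong labels)."""
    rng = random.Random(seed)
    out = list(labels)
    idx = rng.sample(range(len(out)), int(frac * len(out)))
    for i in idx:
        # Replace with a uniformly chosen *different* class.
        out[i] = rng.choice([c for c in label_set if c != out[i]])
    return out
```

Running the optimization pipeline on such corrupted labels at increasing `frac` values traces out the accuracy curve in Figure 3.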
GPO with Different Backbone LLMs. We also conduct experiments with GPO using different backbone LLMs, including Vicuna 7B and 13B (Chiang et al., 2023), which are notable smaller-sized LLMs, and GPT-4 (OpenAI, 2023). Table 9 shows the generalization results on Flipkart with Yelp as the source group for APE and GPO with different backbone LLMs. Due to the small sizes of the Vicuna models, generating the exact sentiment label as the answer can be challenging; therefore, we extract the sentiment labels from their outputs before calculating accuracy. The results show that there is room for enhancing the generalization performance of APE across various LLMs, and GPO consistently outperforms APE in all cases. Notably, when applying GPO to the smaller Vicuna 7B model, there is a significant improvement that allows it to reach the same performance level as the Vicuna 13B model. Across LLMs, the smaller-sized Vicuna models achieve relatively worse performance, and the powerful GPT-4 achieves the best performance with GPO.

Related Work
Generalization Ability and Robustness of LLMs.
Researchers have been investigating the generalization ability and robustness of LLMs since their recent breakthrough. LLMs like ChatGPT have shown significant improvement on out-of-distribution (OOD) and adversarial tasks (Wang et al., 2023), although they are still imperfect (Chen et al., 2023). Some LLMs still rely on shortcuts and spurious correlations (Tang et al., 2023; Stolfo et al., 2022). Moreover, LLMs remain vulnerable to adversarial perturbations and achieve inconsistent results (Wang et al., 2023; Ye et al., 2023a; Liang et al., 2022). Additionally, LLMs demonstrate high sensitivity to the prompt (Reynolds and McDonell, 2021; Zhu et al., 2023) and the selection of in-context examples (Liu et al., 2022; Rubin et al., 2022). Lastly, instruction tuning allows LLMs to generalize to novel tasks (Ouyang et al., 2022; Wang et al., 2022b,a). In contrast, we specifically focus on the generalization issue of prompt optimization under distribution shifts within one task.
Prompt Optimization. Obtaining effective prompts for applying LLMs to NLP tasks is a popular research area. Prompt tuning methods (Li and Liang, 2021; Lester et al., 2021; Qin and Eisner, 2021; Gu et al., 2022) learn soft continuous vectors as prompts in the LLM input using gradients from the task objective. Recent studies have also focused on gradient-free prompt optimization for black-box LLMs, such as reinforcement learning-based methods (Zhang et al., 2023; Deng et al., 2022; Diao et al., 2022), search-based methods (Brown et al., 2020; Prasad et al., 2022; Pryzant et al., 2023), and other gradient-free optimization techniques such as evolutionary algorithms (Sun et al., 2022) and boosting (Hou et al., 2022). Among them, the state-of-the-art methods leverage the power of LLMs themselves for prompt optimization, e.g., prompt generation and evaluation by an LLM (APE (Zhou et al., 2023)) and prompt editing following critiques (APO (Pryzant et al., 2023)); we mainly compare with these. Notably, while some previous work on prompt tuning has addressed generalization across tasks and models (Su et al., 2022; Vu et al., 2021; Qin et al., 2023) and domain adaptation (Tam et al., 2022; Guo et al., 2022), this paper specifically focuses on the generalization issue of gradient-free prompt optimization.

Conclusion
In this paper, we revealed the generalization issue of prompt optimization for LLMs under distribution shifts. We observed that the prompt optimized on the source data group may suffer a performance drop on a target group with distribution shifts. We performed an initial analysis aimed at identifying the factors that correlate with the varied generalization performance across groups, including label distribution shift and input distribution similarity. To enhance the generalization ability of LLMs, we proposed a Generalized Prompt Optimization framework to jointly consider the source and target groups for robust prompt optimization.
Experimental results validated the effectiveness of the proposed framework in boosting the robustness of prompts on the source and target groups. In future work, we plan to study prompt generalization to unseen target groups without available inputs {x_t}, and to explore prompt generalization with in-context examples from different groups.

Limitations
A.1 Dataset Details
For Yelp and Flipkart, we assign the review scores of 0 and 1 as negative, 3 as neutral, and 4 and 5 as positive. For multi-turn dialog reasoning, we select the instances of MuTual within 5 dialog turns and those of Ubuntu and DSTC7 within 7 dialog turns, and reduce the number of choices to 4 for all three datasets. We show an example of the LLM input for each task in Table 11, and the dataset statistics in Table 10.

A.2 Additional Implementation Details for Preliminary Experiments
APE performs prompt optimization by iteratively generating and selecting prompts leveraging an LLM. The input similarity quantifies the n-gram similarity of the input corpora of the two groups. Suppose we sample M inputs from the source and target groups respectively, denoted as x_s = {x_s1, ..., x_sM} and x_t = {x_t1, ..., x_tM}. We calculate Spearman's rank-order correlation ρ(V_s, V_t), where V_s and V_t denote the ranked bag-of-words vectors of x_s and x_t over the vocabulary of x_t.
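Since the input similarity is a Spearman rank-order correlation over bag-of-words vectors, a self-contained pure-Python version (equivalent in spirit to `scipy.stats.spearmanr`) might look like the following; average-rank tie handling is an assumption.

```python
def rankdata(v):
    """Average ranks (1-based); tied values share the mean rank."""
    order = sorted(range(len(v)), key=lambda i: v[i])
    ranks = [0.0] * len(v)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and v[order[j + 1]] == v[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(u, v):
    """Spearman's rank-order correlation: Pearson correlation on ranks.
    Assumes neither input is constant (nonzero rank variance)."""
    ru, rv = rankdata(u), rankdata(v)
    n = len(ru)
    mu, mv = sum(ru) / n, sum(rv) / n
    cov = sum((a - mu) * (b - mv) for a, b in zip(ru, rv))
    su = sum((a - mu) ** 2 for a in ru) ** 0.5
    sv = sum((b - mv) ** 2 for b in rv) ** 0.5
    return cov / (su * sv)
```

Here `u` and `v` would be the bag-of-words count vectors of the two corpora over the target vocabulary.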
Calculation of Q2 Metrics. We sample the same number of inputs from SocialIQA, PIQA, and OpenbookQA, and denote the input corpora as x_1, x_2, and x_3. Firstly, we calculate the proportion of unique n-grams for each group against the number of all n-grams over the three corpora,

d_i = |n-gram(x_i)| / |n-gram({x_1, x_2, x_3})|,

where n-gram(·) returns the set of unique n-grams and the braces denote mixing the inputs. Secondly, we hypothesize that a source group that already covers a larger proportion of the n-grams of the target group may promote better generalization, and we calculate the n-gram coverage between the source group x_s and target group x_t as

c(x_s, x_t) = |n-gram(x_s) ∩ n-gram(x_t)| / |n-gram(x_t)|.

For both metrics, n-gram(·) is computed over both word 1-grams and character 4-grams using scikit-learn.
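The diversity metric is sketched here for word 1-grams (the helper names are ours; the paper extracts n-grams with scikit-learn, and character 4-grams are handled analogously):

```python
def word_unigrams(texts):
    """Set of word 1-grams over a corpus."""
    return {w for t in texts for w in t.split()}

def diversity(group, corpora):
    """Share of the pooled corpora's unique 1-grams contributed by one
    group, mirroring the verbal definition of d_i above."""
    pooled = word_unigrams([t for c in corpora for t in c])
    return len(word_unigrams(group)) / len(pooled)
```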
Q1 Metrics for More Tasks. Table 12 and Table 13 show the two Q1 metrics for the commonsense QA and dialog tasks. Linking these results with the generalization performance in Table 1 and Table 2, we have the following observations. 1) For each target group of the commonsense QA task, the largest input similarity coheres with the best generalization performance, but the smallest label distribution shift does not correlate with the best generalization performance. 2) For the dialog groups, the zero label distribution shifts and the close input similarities cohere with the subtle generalization performance differences on each target group. 3) The evaluation metrics cannot be compared across target groups or across tasks. For example, the source group SocialIQA performs better on PIQA than on OpenbookQA (cf. Table 2), but its input similarity is higher with OpenbookQA. Also, MuTual has a smaller input similarity with Ubuntu (0.56) but better generalization performance (74.7) than PIQA generalizing to SocialIQA (input similarity 0.57, generalization performance 68.9) (cf. Section 2). These findings reveal the benefits and limitations of the Q1 metrics.
• PromptPerfect: For sentiment analysis: Your task is to perform a sentiment analysis on a given input text and provide a single word indicating whether the sentiment is positive, negative, or neutral. The input text may contain any language or style of writing. Please ensure that your analysis takes into account the overall tone and context of the text. Your response should be concise and clear, providing a single word that accurately reflects the sentiment of the input text. If there are multiple sentiments present in the text, please choose the one that best represents the overall feeling conveyed by the author. Please note that your analysis should take into account all relevant factors, such as tone, language use, and content. Your response should also be flexible enough to allow for various types of input texts.
For commonsense QA: Please choose the best answer for the following multiple choice question.
Choose the one answer that best fits the given scenario. Please provide only the single letter (a, b, c, or d) as labels.
For DROP: Your task is to answer a numerical question based on a given context involving numerical reasoning. Please provide a direct answer to the question, which can be a numerical value or a short string. Please note that your response should be concise and directly answer the question. The question may involve various numerical data, such as percentages, averages, or counts. You should focus on identifying the relevant information and providing a clear and accurate answer. Additionally, please ensure that your response is flexible enough to allow for various relevant and creative answers based on the context provided.

A.5 Case Study
We present a case study showing the best prompt among the five runs for sentiment analysis and DROP in Table 14. We can observe that the optimized prompt for a single group often contains group-specific background information (underlined in the original table), which may hinder robust prompt generalization. On the contrary, the optimized prompts of GPO are more general and thus perform well on both groups. Note that for Spans, the optimized prompt is also general enough and thus can generalize well to Number, as shown in Table 2.
Yelp: Provide feedback on various experiences, such as dining, shopping, and service. The output format is a sentiment analysis, where the input is analyzed to determine whether the experience was positive, negative, or neutral. The output is a single word indicating the sentiment of the experience.
Flipkart: Provide a sentiment analysis of customer reviews. The input consists of a customer review of a product, and the output is a binary classification of the sentiment as either positive or negative.
GPO: Provide a sentiment analysis of a given text. The output format is a single word indicating whether the sentiment is positive, negative, or neutral.
Number: Answer a specific question based on a given context. The output format is a numerical value that directly answers the question asked.
Spans: Answer a specific question based on a given context. The output format is a single word or phrase that directly answers the question asked.
GPO: Answer questions based on given context information. The output format is a numerical value or a single word answer.

A.6 Study on the Impact of the Number of Candidate Prompts
We examine the effect of varying the number of candidate prompts K on GPO performance for the 36-shot sentiment analysis task, testing K ∈ {3, 6, 9, 12, 18}. The results on the target group Flipkart are shown in Table 15. We observe that the generalization performance stabilizes once K reaches a certain value (6 in this case), indicating that generating more prompts is unlikely to yield significant further improvements.

Figure 1: Illustration of prompt optimization under distribution shifts. Existing prompt optimization solutions aim to improve LLM performance on the training data, while it is unclear whether the optimized prompt can generalize to testing data of the same task but with distribution shifts.

Figure 3: Target group performance under different percentages of wrong labels. The blue dotted line indicates the labeling accuracy of GPO as in Table 8.

Table 1: Results for tasks without a large generalization performance gap across groups.

Table 2: Results for tasks with a significant generalization performance gap across groups. Bold font indicates the largest value for each column.

Table 3: Results for (a) label distribution shifts and (b) input similarity of the sentiment analysis datasets. Bold font indicates the least distribution shift for each column.

Table 4: Evaluation of (a) the n-gram diversity, (b) the word 1-gram coverage ratio, and (c) the character 4-gram coverage ratio of the commonsense QA datasets, used to study the even higher generalization performance. Bold font indicates the largest value for each column.
2. Prompt Ensemble Labeling. Following Wang et al. (2022a), we set a consistency threshold T ∈ [0, 1] and only accept the labeled examples for which more than T percent of the prompts agree on the label. Eventually, we obtain a filtered labeled set G*_t for the target group.
3. Joint Prompt Optimization. Finally, we mix G_s and G*_t to run APE for joint prompt optimization and obtain the final optimized prompt. As G*_t may have fewer samples than G_s after filtering with T, we perform random upsampling on G*_t to match the data number of G_s before running APE. A brief illustration of APE can be found in Appendix A.2.
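The upsampling in the joint optimization step can be sketched as follows (the function name and seeding are illustrative assumptions):

```python
import random

def upsample(filtered, target_size, seed=0):
    """Randomly upsample the filtered labeled target set G*_t so that
    it matches the source group size before running joint APE."""
    rng = random.Random(seed)
    if len(filtered) >= target_size:
        return list(filtered)
    extra = [rng.choice(filtered) for _ in range(target_size - len(filtered))]
    return list(filtered) + extra
```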

Table 5: Results of the compared methods. Bold font indicates the best performance for each column.

Table 6: Performance comparison of the human-written prompts, PromptPerfect, and the more effective testing strategy of GPO (Top 1 or Ensemble, denoted as GPO best). Bold font indicates the best performance for each column.

Table 7: Ablation study. Bold font and underline indicate the best and second-best results, respectively.

Table 8: The labeling accuracy comparison on the target group training and validation data for GPO and w/o cons. The results for Spans here are accuracy instead of F1.

Table 9: Performance comparison of APE and GPO on Flipkart with different backbone LLMs.
Firstly, this work studies the generalization ability of prompts while ignoring the effect of other LLM inputs such as in-context examples. The choice of in-context examples might also affect the robustness of LLMs. Future work can look into the generalization issue of the prompt in combination with in-context examples. Secondly, this work assumes the availability of the inputs {x_t} of the target group. It is under-explored how to achieve generalized prompt optimization for completely unseen groups without {x_t}. To improve robustness on these groups, we believe it is helpful to extend this work toward robust prompt optimization over multiple heterogeneous groups. Thirdly, we acknowledge that the scope of our research is limited to black-box LLMs capable of understanding instructions, where gradient-free prompt optimization by instructing an LLM is a suitable choice. Smaller LMs without instruction understanding abilities, e.g., BERT (Devlin et al., 2019) and T5 (Raffel et al., 2020), are generally not black-box, and it is more advantageous to utilize gradient-based prompt optimization methods for them.

Table 11: Dataset examples for each task (e.g., a Yelp review input: "Goldberg offers everything I look for in a general practitioner. He's nice and easy to talk to without being patronizing; he's always on time in seeing his patients..."). The output for classification tasks is one of the labels, while for Number the output is a string of a numerical value.

Table 14: Case study of the prompts optimized by APE from a single source group versus by GPO.

Table 15: Generalization performance of GPO on Flipkart with different numbers of candidate prompts K.