MAPO: Boosting Large Language Model Performance with Model-Adaptive Prompt Optimization

Prompt engineering, as an efficient and effective way to leverage Large Language Models (LLMs), has drawn a lot of attention from the research community. Existing research primarily emphasizes the importance of adapting prompts to specific tasks, rather than specific LLMs. However, a good prompt is not solely defined by its wording; it is also bound to the nature of the LLM in question. In this work, we first quantitatively demonstrate that different prompts should be adapted to different LLMs to enhance their capabilities across various downstream tasks in NLP. Then we propose a novel model-adaptive prompt optimizer (MAPO) that optimizes the original prompts for each specific LLM in downstream tasks. Extensive experiments indicate that the proposed method can effectively refine prompts for an LLM, leading to significant improvements across various downstream tasks.


Introduction
Advancements in Large Language Models (LLMs) have ushered in a transformative era in natural language processing, showcasing their remarkable capabilities across a wide range of tasks (OpenAI, 2023; Bubeck et al., 2023). While these models possess human-like comprehension and response abilities, their performance is heavily influenced by the quality of prompts. As can be observed in Fig. 1, answers from different LLMs vary widely when they are provided with the same task-specific prompts. Therefore, it is necessary to generate prompts that are most suitable for each LLM, thereby enhancing its performance on downstream tasks.
A common practice for prompt optimization is to rely on human expertise (White et al., 2023; Jiang et al., 2022; Zamfirescu-Pereira et al., 2023). While effective, such approaches are costly and unscalable. Hence, there has been a lot of effort to streamline the prompt optimization process in automated or semi-automated ways, including prompt retrieval (Ma et al., 2023; Zhou et al., 2022), prompt generation from scratch (Pang et al., 2023), and prompt editing (Gao et al., 2020; Pryzant et al., 2023; Deng et al., 2022). For example, in prompt retrieval, Ma et al. (2023) propose a greedy search strategy to identify near-optimal prompts for in-context learning; in prompt generation from scratch, Pang et al. (2023) introduce SharpT, which learns a shared latent space and generates soft prompts using a lightweight prompt generator; in prompt editing, some approaches rely on reinforcement learning or LLM-based feedback for prompt optimization (Deng et al., 2022; Zhou et al., 2022).
However, the aforementioned research primarily emphasizes the importance of adapting prompts to specific tasks, rather than specific LLMs. The latter, although very important, has not been studied to date in NLP. The only relevant work so far has been done on multi-modal large models, which automatically optimizes prompts using reinforcement learning to generate images based on text (Hao et al., 2022). That work underscores the concept of "model-preferred prompts" or "model-specific prompts", emphasizing the need for a systematic method to automatically align user intentions with the specific prompt preferences of each model. Therefore, in this paper, we propose a novel Model-Adaptive Prompt Optimization (MAPO) approach for LLMs in NLP. Given the lack of effective training signals, we first establish a so-called warm-up dataset to obtain candidate prompts from an oracle LLM, and then model the prompt optimization problem with reinforcement learning. Specifically, we first generate candidate prompts and search for the optimal prompts to establish a warm-up dataset. After that, we combine Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL) to optimize original prompts for each specific LLM in various downstream tasks. Moreover, we perform joint learning with Proximal Policy Optimization (PPO) (Schulman et al., 2017) and RRMF (note that RRMF is inspired by RRHF (Yuan et al., 2023)) to further improve the performance of RL. We conduct extensive experiments which validate the robustness and generalization of the proposed MAPO. To sum up, our main research question revolves around identifying the optimal prompt suited to each model. Our contributions are threefold:
• We are the first to quantitatively show that different prompts should be adapted to different Large Language Models (LLMs) in order to enhance their performance across various NLP downstream tasks.
• We introduce a novel approach called the Model-Adaptive Prompt Optimizer (MAPO), specifically designed to optimize the original prompts for each particular LLM in downstream tasks.
• The experiments show that our proposed MAPO model exhibits greater robustness and generalization and also achieves superior performance in a variety of downstream tasks.

Empirical study
In this section, we conduct an empirical study on three LLMs (BLOOM-7B (Scao et al., 2022), GPT-J-6B (Wang and Komatsuzaki, 2021), and LLaMA-7B (Touvron et al., 2023)) to evaluate their separate performance on question-answering (QA), classification, and generation tasks with the same task-specific prompts. We use nine datasets from P3 (Sanh et al., 2021) covering three downstream tasks, with details in Appendix E. P3 is a widely used prompt benchmark which contains original prompts and the corresponding ground-truth answers. We adopt F1 score, accuracy, and ROUGE-L for the QA, classification, and generation tasks, respectively. The visualization results are shown in Fig. 2. From the violin plot, we observe significant variations in distribution among different LLMs in each task. For example, in the generation task, the results of all three models are distributed within the range of 0 to 0.5, but there are still differences in the specific distribution patterns. Moreover, the medians, means, and other statistical measures also differ greatly among the three LLMs in each downstream task. Therefore, we consider that finding the optimal prompt for each specific LLM on each task is meaningful, as it can help enhance the LLMs' performance on various downstream tasks.

Methods
Based on the above empirical study, we present MAPO, a model-adaptive prompt optimization approach for LLMs. It takes the original prompt as input and generates an optimized prompt that elicits better outputs from an LLM. The framework of MAPO is shown in Fig. 3.

Warm-up Dataset Establishment
We first establish a warm-up dataset as the training dataset for prompt optimization.
Generating Candidate Prompts. The original prompts are from the nine above-mentioned datasets in P3 (Sanh et al., 2021). We generate 1,000 candidate prompts per prompt using GPT-3.5. The generated candidate prompts should maintain a semantic meaning similar to the original prompt but may have different expressions. To achieve this, we use the following instruction as input for GPT-3.5 to generate candidates: "Please rewrite the given text 'original prompt' while keeping the semantic meaning unchanged." Some candidate prompts are shown in Appendix A.
Searching for the Optimal Prompt. To determine which candidate prompt is optimal for an original prompt, we compare the match degree, i.e., the similarity between the output generated using a candidate prompt and the ground-truth output. The purpose is to identify the candidate prompt that produces an output most similar to the ground truth. When a ground-truth output is not available, the output of a stronger LLM, such as GPT-3.5, is regarded as the ground truth. Specifically, we first input the original prompt P and each candidate prompt into an LLM, respectively, for inference and obtain the corresponding outputs. Next, we compare the match degree with specified evaluation metrics. We adopt F1 score, accuracy, and ROUGE-L (Lin, 2004) for the QA, classification, and generation tasks, respectively. Based on these metrics, we iterate the search process and find the optimal prompt P_o for an LLM in downstream tasks. The warm-up dataset consists of a collection of prompt pairs (referred to as {P, P_o}), whose distribution is shown in Table 1.
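The search step above can be sketched as follows. This is a minimal sketch, not the actual implementation: `llm_generate` is a hypothetical stand-in for querying the LLM, and a simple token-level F1 serves as the match degree for QA (accuracy and ROUGE-L play the same role for the other tasks).

```python
from collections import Counter

def f1_match(prediction: str, reference: str) -> float:
    """Token-level F1 between a model output and the ground truth (match degree for QA)."""
    pred, ref = prediction.lower().split(), reference.lower().split()
    common = Counter(pred) & Counter(ref)   # multiset intersection of tokens
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

def search_optimal_prompt(candidates, ground_truth, llm_generate):
    """Return the candidate prompt whose LLM output best matches the ground truth."""
    scored = [(f1_match(llm_generate(p), ground_truth), p) for p in candidates]
    return max(scored)[1]   # candidate with the highest match degree
```

In practice the loop runs over all 1,000 candidates per original prompt, and the winning candidate becomes P_o in the warm-up pair {P, P_o}.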

Prompt Optimizer Construction
The prompt optimizer seeks to refine the initial prompt (P) into an optimized prompt (P_o) tailored to a particular LLM. This refinement process entails altering the structure or wording of P to produce P_o, which is more suitable for the LLM in subsequent tasks.

Supervised Fine-tuning
We begin by employing the warm-up dataset to conduct supervised fine-tuning (SFT) with an LLM across multiple downstream tasks. The objective of SFT is to enhance the LLM's capacity to generate responses that align with its preferences, utilizing annotated data. Prior research conducted by Ramamurthy et al. (2022) supports the notion that employing SFT prior to reinforcement learning (RL) leads to improved outcomes. Furthermore, to differentiate between specific tasks during training, we incorporate a brief instruction preceding the input, such as "This is a ... (generative/question-answering/classification) task.".
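A minimal sketch of how a warm-up pair {P, P_o} might be formatted as an SFT example with the task instruction prepended; the field names and task tags here are illustrative assumptions, not the paper's exact format.

```python
# Illustrative task tags matching the instruction template in the text.
TASK_TAGS = {"gen": "generative", "qa": "question-answering", "cls": "classification"}

def build_sft_example(task: str, original_prompt: str, optimal_prompt: str) -> dict:
    """Format one warm-up pair {P, P_o} as an SFT example, prefixed with the task instruction."""
    instruction = f"This is a {TASK_TAGS[task]} task. "
    return {"input": instruction + original_prompt, "target": optimal_prompt}
```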

Building Reward Model
Next, we construct a reward model to learn the effectiveness of prompts based on the preferences of different LLMs. This approach is motivated by the fact that discriminative annotation through sorting incurs significantly lower costs compared to generating annotations for answers. Initially, we obtain a ranking sequence for an LLM in a specific downstream task. We sort the outputs generated by the candidate prompts {P_1, P_2, ..., P_{k-1}, P_k} alongside the original prompt P, using the same evaluation metric as described in Sec. 3.1. This sorting process yields a ranking sequence {P_1, P_2, ..., P, P_{k-1}, P_k}. Prompts to the left of P exhibit poorer inference results, while prompts to the right demonstrate better results. Next, we employ the ranking sequence to train a reward model. We utilize the same LLM as in the SFT process and replace the softmax layer with a linear layer to construct the reward model. The reward model takes a prompt as input and produces a scalar score indicating the quality of the prompt. We form pairwise ranking pairs by combining prompts from the ranking sequence and employ Pairwise Ranking Loss for training, as illustrated below:

loss(θ) = − (1 / C(K, 2)) · E_{(x, y_w, y_l) ∼ D} [ log σ( r_θ(x, y_w) − r_θ(x, y_l) ) ],

where x represents the original prompt, y_w and y_l denote the higher-scoring and lower-scoring prompts, respectively, in the corresponding ranking pair, r_θ represents the scalar output of the reward model, D is the set of ranking pairs, and K denotes the number of candidate prompts. Through this process, based on the outputs generated by the LLM with the given prompt, the reward model learns to assign higher scores (rewards) to better prompts and lower scores (rewards) to inferior prompts, thus imitating an LLM's preferences.
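The pairwise ranking loss can be sketched numerically as follows. This is a minimal sketch assuming `scores` are reward-model outputs ordered from best-ranked to worst-ranked prompt; each ordered pair (winner, loser) contributes −log σ(r_w − r_l), averaged over all K-choose-2 pairs.

```python
import math
from itertools import combinations

def pairwise_ranking_loss(scores):
    """Average pairwise ranking loss over a ranking sequence of prompt scores.

    `scores` must be ordered best-to-worst, so every pair produced by
    combinations() is (winner, loser). Each pair contributes
    -log(sigmoid(r_w - r_l)); a well-separated ranking gives a small loss.
    """
    pairs = list(combinations(scores, 2))  # (r_w, r_l) with r_w ranked above r_l
    loss = sum(-math.log(1.0 / (1.0 + math.exp(-(rw - rl)))) for rw, rl in pairs)
    return loss / len(pairs)
```

A correctly ordered score sequence yields a lower loss than the reversed one, which is the signal that drives the reward model toward the LLM's prompt preferences.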

Reinforcement Learning
Subsequently, we employ Reinforcement Learning (RL) to further fine-tune LLMs. RL is used to adjust the bias in the reward model's scoring, since the distribution of generated prompts might change during the SFT process. The primary objective of optimization is to maximize the scores of prompts generated by the LLM after SFT (referred to as the SFT model), as evaluated by the reward model. To achieve this, we utilize a combination of Proximal Policy Optimization (PPO) (Schulman et al., 2017) and RRMF (note that RRMF is inspired by RRHF (Yuan et al., 2023)) for joint learning.
Policy Optimization. This step aims to optimize the RL policy to improve the performance of the RL model. We first adopt the datasets shown in Table 3 to construct environment-action pairs. The environment refers to the original prompt, and the action represents the prompt generated by the LLM without instruction tuning. We pass the environment-action pairs to the reward model to obtain rewards. In this process, we introduce an actor model, which is the LLM, and a frozen model, which is the SFT model with its parameters frozen during the RL training process. The frozen model serves as a benchmark to evaluate whether the updated actor model has advantages over it. We then calculate the policy gradient loss (i.e., the actor's loss) based on the importance ratio and the reward (r + γV_next − V_cur), and calculate the value loss (i.e., the critic's loss) by comparing the predicted value V_pred with the true value (r + V_next), as follows:

L_actor = − ( P_{π_a}(t) / P_{π_f}(t) ) · ( r + γV_next − V_cur ),
L_critic = ( V_pred − ( r + V_next ) )².

Here, P_{π_a}(t) / P_{π_f}(t) represents the ratio of probabilities (i.e., the importance ratio) of generating the same token under the actor model and the frozen model, (r + γV_next − V_cur) represents the reward of the current step, V_pred denotes the predicted value, and (r + V_next) denotes the true value.
Next, we maximize the mathematical expectation of the reward model, aiming to consistently generate prompts that the LLM perceives as the best in the RL-trained SFT model (referred to as the RL model). We feed prompts x generated by the SFT model based on the datasets shown in Table 3 (i.e., D) into the RL model π_ϕ^RL to obtain an optimized prompt y; y changes every time the RL model is updated. We then input (x, y) into the reward model r_θ and calculate a score (i.e., a reward), which represents the real-time feedback from the reward model. The loss function is defined as follows:

L_reward = − E_{(x, y) ∼ D} [ r_θ(x, y) ].

Finally, we combine the above loss functions to optimize the RL policy from multiple perspectives. The final loss function is defined as:

L_PPO = α_1 L_actor + α_2 L_critic + α_3 L_reward,

where α_1, α_2, and α_3 represent the optimal weights of each loss, which are determined through experiments (the same applies below).
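A minimal numerical sketch of combining the three losses described above for one step; PPO clipping, batching, and token-level bookkeeping are omitted, and the argument names are illustrative rather than the paper's implementation.

```python
def ppo_losses(ratio, r, v_cur, v_next, v_pred, reward_score,
               gamma=0.99, alphas=(1.0, 1.0, 1.0)):
    """Weighted sum of actor, critic, and reward-maximization losses (one step).

    advantage   = r + gamma * V_next - V_cur   (reward of the current step)
    actor loss  = -ratio * advantage           (policy-gradient loss, ratio = P_pi_a / P_pi_f)
    critic loss = (V_pred - (r + V_next))**2   (value loss against the true value)
    reward loss = -reward_score                (maximize the reward model's score)
    """
    advantage = r + gamma * v_next - v_cur
    actor = -ratio * advantage
    critic = (v_pred - (r + v_next)) ** 2
    reward_term = -reward_score
    a1, a2, a3 = alphas
    return a1 * actor + a2 * critic + a3 * reward_term
```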
SFT Approximating. This step aims to maintain similarity between the RL model and the SFT model. When the RL model undergoes parameter updates, it leads to variations in the generated prompt y based on the given prompt x. If there is a significant discrepancy between the RL model and the SFT model, it can result in inaccurate estimation of scores by the reward model. To address this issue, we measure the distance between the prompts generated by the RL model and the SFT model using Kullback-Leibler (KL) divergence.
The objective is to minimize the KL divergence, and the loss function is defined as follows:

L_KL = KL( π_ϕ^RL(y|x) ‖ π^SFT(y|x) ),

where π_ϕ^RL(y|x) and π^SFT(y|x) represent the distributions over prompts generated by the RL model and the SFT model, respectively.
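The KL term can be sketched as follows over explicit next-token probability lists on a shared vocabulary; in practice it is estimated from log-probabilities of sampled prompt tokens rather than full distributions.

```python
import math

def kl_divergence(p_rl, p_sft):
    """KL(pi_RL || pi_SFT) between two discrete distributions over the same vocabulary.

    Zero when the RL model matches the SFT model exactly; grows as the
    RL model's token distribution drifts away from the SFT model's.
    """
    return sum(p * math.log(p / q) for p, q in zip(p_rl, p_sft) if p > 0)
```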
Next, we borrow the idea from RRHF (Yuan et al., 2023) but adapt it to focus on "model feedback" instead of "human feedback". We name it Ranking Responses from Model Feedback (RRMF). Specifically, we calculate the likelihood probability of the LLM during SFT and align this probability with the score of the reward model. To optimize this objective, we employ supervised learning with a rank loss, defined as follows:

L_rank = Σ_{r_i < r_j} max(0, p_i − p_j),

where p_i is the conditional log probability, which represents the reward of each optimized prompt y_i, and r_i represents the reward model score r_θ(x, y_i). We also incorporate the cross-entropy loss introduced by RRHF (Yuan et al., 2023) to learn the generated prompt y'_i with the highest reward r'_i, as follows:

L_ce = − Σ_t log P( y'_{i,t} | x, y'_{i,<t} ).

Finally, we combine the above loss functions for SFT approximating.

Generalization Maintaining. This step addresses the issue of catastrophic forgetting by ensuring that an LLM performs well not only on specific tasks but also on general NLP tasks. To achieve this, we follow a similar approach as outlined in InstructGPT (Ouyang et al., 2022). During the pre-training phase, we sample 10% of the data from general NLP tasks in GLUE (Wang et al., 2018) and the SuperGLUE benchmark (Wang et al., 2019), which are considered representative, as indicated in Table 3. The objective of pre-training is to generate outputs that are as good as or better than the original ones based on the original prompts. The original prompts are taken from Natural Instructions (Wang et al., 2022b). The loss function is as follows:

L_pretrain = − E_{x ∼ D_pretrain} [ log π(x) ],

where D_pretrain represents the selected datasets for pre-training.

Joint Learning. Finally, we perform joint learning with all of the above-mentioned loss functions.

Experiments

In this section, we conduct experiments with three popular LLMs, BLOOM (7B), GPT-J (6B), and LLaMA (7B), on different downstream tasks to validate the effectiveness of MAPO.

Experimental Setups
The experiments are executed on 4 Nvidia A100 GPUs.

Main Results
The main results are shown in Table 2. We observe that performance increases evidently among all LLMs during SFT. We then utilize MAPO for further optimization. Notably, our method applies not only to LLMs but also to smaller LMs. We analyze the possible reasons as follows: MAPO employs both SFT and RL to optimize LLMs. In fact, the SFT process is not specific to LLMs; fine-tuning smaller models is feasible and common, requiring fewer computational resources. RL is a widely used algorithm across applications and model scales, and small models require less computational and storage resources, making RL more feasible on them.
We also compare the performance of MAPO against the above-mentioned SOTA prompt optimization baselines. We use three LLMs, including BLOOM, GPT-J, and LLaMA, to replace the BART model used in Table 3 for verification on the nine datasets in Table 2, as shown in Table 4. Since SFT in LLMs is equivalent to fine-tuning pretrained language models, we directly list the SFT results in the Fine-tuning row. Apart from fine-tuning, we also freeze the LLMs and only modify the prompts for inference on downstream tasks. According to the experimental results, the performance of almost all baselines, except RLPrompt, does not exceed that of Fine-tuning/SFT, and some do not even outperform the original LLMs. This highlights the importance of SFT in LLMs. When we add RL, as in the case of RLPrompt, the performance on downstream tasks surpasses that of SFT, indicating the significance of RL for prompt optimization. Moreover, using our proposed MAPO method to optimize the prompt further improves performance over RLPrompt, except in a very few cases, such as using BLOOM for movie classification tasks. These experimental results demonstrate that the MAPO method proposed in this study makes a substantial contribution to improving performance and accuracy in downstream tasks.
Moreover, we conduct experiments to evaluate the domain-transfer performance of MAPO. The results are presented in Table 5 and Table 6, while the results of LLMs with original prompts are reported by Arora et al. (2022). Remarkably, we observe that each LLM, when using prompts optimized by MAPO, displays improved performance across various downstream tasks. Specifically, BLOOM exhibits the highest increase in performance compared with GPT-J and LLaMA. This experiment clearly demonstrates the significant domain-transfer capability of MAPO.

Ablation Study
The effect of RL compared with SFT. From the experiments (Table 3, Table 5, and Table 6), we can observe that the performance improvements gained solely from using SFT are less than half of those achieved by our proposed MAPO method, both on similar tasks and general NLP tasks. This clearly indicates the effectiveness of MAPO in optimizing model-adaptive prompts.
To further demonstrate that RL is necessary and how it compares to simply extending SFT with a larger warm-up dataset, we use various proportions of the warm-up dataset to progressively increase the SFT training data and then introduce RL, as shown in Table 17. Our findings consistently show that RL adds value beyond what is achieved by SFT alone across all proportions of the dataset. This affirms the effectiveness of RL irrespective of the SFT dataset size. However, as the proportion of the warm-up dataset increases, the margin of improvement from adding RL begins to decline. While one could hypothesize that adding RL to a very large SFT dataset might not result in as significant an improvement as it would for a smaller dataset, this observation actually underscores our method's suitability for low-resource scenarios.
Moreover, we have tried different numbers of epochs to see whether extended training time consistently improves SFT performance, as shown in Table 18. Extending the training time does not consistently lead to performance improvements for SFT; in some instances, the performance even declines. It is important to note that we save the best-performing models in real time during training, as the peak performance does not necessarily occur at the final epoch.
The effect of the warm-up dataset. As shown in Fig. 5 and Table 11, our study examines the effects of different proportions of the warm-up dataset on MAPO's performance. The results indicate that as the size of the warm-up dataset increases, performance typically improves. BLOOM is particularly sensitive, showing a pronounced growth trend. Conversely, GPT-J shows more gradual growth. LLaMA's performance reveals an inflection around 60%, suggesting that other factors also influence its performance. Even with reduced dataset sizes, the decrease in performance remains minimal, highlighting the method's suitability for low-resource tasks. We also conduct few-shot experiments on general NLP tasks with just 10% of the data and observe promising improvements. This underlines our method's adaptability and effectiveness in scenarios of data scarcity.
The effect of PPO and RRMF. To investigate the specific effects of PPO and RRMF during the RL process, we conduct separate experiments to evaluate the contributions of each component. The experimental results, depicted in Fig. 6 (with details provided in Table 10 in Appendix F), clearly demonstrate the important roles played by PPO and RRMF in enhancing the performance of MAPO. We propose the following explanations for these results: PPO focuses on reducing the dissimilarity between the RL model and the SFT model. RRMF aligns the scores from the reward model with the likelihood probabilities of an LLM. Both PPO and RRMF aim to assign higher probabilities to prompts that are more adaptable to the model.
The effect of randomness. We also incorporate randomness (e.g., temperature) during the generation process of the LLM. Given that our prompts do not require high creativity, we set a lower temperature range [0, 0.5] for generation, within which we aim to generate optimal prompts. To further investigate the impact of varying temperatures on the generated output, we conduct an additional set of experiments to assess the performance of the MAPO method under different randomness settings (temperature = 0, 0.2, 0.5, 0.8), as shown in Table 14, Table 12, Table 15, and Table 16. Each experiment group runs 5 times. Our findings reveal that a high-temperature setting (t = 0.8) tends to produce inferior prompts that lead to less accurate outputs for a specific task. Lower-temperature (t = 0.2) or greedy settings (t = 0) are likely to produce more accurate outputs that are closer to our optimal results. This suggests that in a task like prompt optimization, introducing a stable (low temperature) but slight degree of variability (non-zero temperature) yields the best results.

Case Study and Error Analysis
We conduct a case study to visualize the prompts optimized by MAPO, as shown in Table 7. Additional cases are included in Appendix G. We first observe that the majority of the original prompts undergo significant modifications after optimization through our MAPO method. Only about 10% of the generated prompt pairs remain completely unchanged. To further quantify these changes, we calculate a normalized edit distance. Given the varying lengths of different prompt pairs, we divide the edit distance by the average length of the two strings. This yields a value between 0 and 1, where 0 indicates identical strings and 1 indicates completely different strings. The average normalized edit distance over all prompt pairs is 0.67, demonstrating that most prompts experience substantial modifications.
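The normalized edit distance described above can be computed as follows; this is a sketch (Levenshtein distance divided by the average length of the two prompts), not the paper's exact implementation.

```python
def normalized_edit_distance(a: str, b: str) -> float:
    """Levenshtein distance divided by the average length of the two strings:
    0 for identical prompts, values near 1 for heavily rewritten ones."""
    m, n = len(a), len(b)
    dp = list(range(n + 1))  # dp[j] = distance between a[:i] and b[:j], rolled per row
    for i in range(1, m + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, n + 1):
            cur = dp[j]
            # deletion, insertion, or substitution (free if characters match)
            dp[j] = min(dp[j] + 1, dp[j - 1] + 1, prev + (a[i - 1] != b[j - 1]))
            prev = cur
    return dp[n] / ((m + n) / 2) if (m + n) else 0.0
```

Averaging this value over all {P, P_o} pairs yields the 0.67 figure reported above.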
Next, we provide a detailed examination of these modifications. In the QA task, BLOOM transforms active voice into passive voice, GPT-J utilizes the phrase "the term used" and substitutes "refer to" with "denote", while LLaMA adopts a more informal style by mentioning the "commonly used term". In the generation task, both BLOOM and GPT-J present similar prompts that emphasize topic coverage. LLaMA maintains the original sentence structure but modifies the subjects and replaces "decorate" with "adorn". In the classification task, all three LLMs rearrange the word order and offer additional details about the topic. Therefore, MAPO demonstrates its prompt optimization capabilities by adapting better prompts to specific tasks for different LLMs while preserving core information and adjusting tone or structure as necessary.
However, there are also some errors during prompt optimization, including prompts with incomplete sentences, prompts with improper prepositions or missing necessary parts, and prompts with ambiguous meanings. Therefore, there is ample room for improvement in MAPO to better adapt to different LLMs in downstream tasks.

Exploratory Analysis
We conduct an exploratory analysis to further investigate the patterns in optimized prompts, as shown in Fig. 7, Fig. 8, and Fig. 9. We extract the three most frequent words from the original prompt and investigate their distribution in the optimized prompt for each LLM, while either retaining high-frequency words in instructions (including sentence, topics, subjects, present, statement, discussed, mentioned, included, following) or removing them.
Taking the generation task (Fig. 7) as an example, when high-frequency words in instructions are retained, we observe that BLOOM retains a relatively higher proportion of the original prompts compared to GPT-J and LLaMA, while LLaMA retains the fewest. When these words are removed,

Gen
Original: What are the topics in the sentence: The furniture in the living room is decorated with flowers.
BLOOM: What topics does the sentence "The living room furniture is adorned with flowers" cover?
GPT-J: What topics are covered in the sentence "The living room furniture is adorned with flowers"?
LLaMA: Which subjects are present in the statement: Flowers adorn the furniture in the living room.

Class
Original: Reincarnated Napster Goes Public. Roxio bought the Napster name in a bankruptcy auction two years ago. Now, the company dumps its CD-burning software business to concentrate on selling and delivering music over the web under the Napster name. What label best describes this news article?
BLOOM: Roxio, the buyer of Napster's name in a bankrupt auction two years ago, has taken the reincarnated Napster public. The company has made the decision to abandon their CD-burning software business and focus solely on the distribution and sale of music online via the Napster name. What genre of news article would this story fall under?
GPT-J: Roxio, which acquired the Napster name through a bankruptcy auction two years ago, has taken Napster public again after rebranding it as a music delivery service. The company has divested its CD-burning software business and now focuses solely on selling music online under the Napster name. What category would this news article fall under?
LLaMA: The Napster name, which was purchased by Roxio in a bankruptcy auction two years ago, has now been resurrected with a public launch. Roxio has shifted its focus solely to the sale and distribution of music under the Napster name, leaving its CD-burning software business behind. What category would you assign to this news article?
Table 7: Original prompts and MAPO-optimized prompts for three LLMs in various downstream tasks.
we notice that BLOOM has a higher proportion of words like "man" and "view" in its optimized prompts, which are more related to humans. GPT-J has a higher proportion of words like "match", "grass", "bathroom", and "white", which suggests it focuses on specific scenes, objects, or themes. LLaMA has a higher proportion of words like "room", "close", and "playing", indicating its preferences for places and experiences. The variations observed in word distribution indicate that each LLM tends to emphasize different aspects during the optimization process. Drawing more precise conclusions will require further experiments.

Related Work
LLMs' prompt optimization involves prompt retrieval, prompt generation from scratch, and prompt editing. However, all the above-mentioned prompt optimization approaches aim to obtain task-specific prompts instead of model-specific ones. Different from these works, we are dedicated to optimizing prompts for LLMs within the NLP domain and achieve impressive performance.

Conclusions
The remarkable capabilities of LLMs have revolutionized NLP across various tasks. However, their performance heavily relies on the quality of prompts. In this work, we address the prompt optimization challenge by proposing a Model-Adaptive Prompt Optimization (MAPO) approach. Through extensive experiments, we demonstrate that MAPO can adapt to different LLMs by generating model-friendly prompts to enhance their capabilities across various downstream tasks. In future work, we aim to construct more fine-grained model-adaptive prompts that can adapt to the continuously evolving data encountered in real-world production environments. Additionally, we intend to enhance its applicability across a broad spectrum of linguistic contexts.

Limitations
It is important to acknowledge certain limitations of our approach. Firstly, the effectiveness of prompt optimization heavily relies on the availability and quality of the warm-up dataset. In cases where the dataset is limited or does not sufficiently cover the specific task, the performance gains from prompt optimization may be constrained. Additionally, MAPO requires extensive SFT and RL, which can be computationally expensive and time-consuming. This could limit the scalability of MAPO, especially when dealing with large-scale tasks or datasets. Despite these limitations, our study provides valuable insights into model-adaptive prompt optimization for LLMs and contributes to the ongoing efforts in improving the performance of these LLMs in practical applications.

A Candidate Prompts
For the nine datasets selected in P3, we present one prompt and the corresponding three candidate prompts for each dataset, as shown in Table 8.

B Training Details
We provide the training details as shown in Table 9.
Other hyper-parameters are set to default.

C Computational Cost
While the training phase is computationally intensive, the generation phase is relatively lightweight. Specifically, once the prompt-optimizing model MAPO is trained, the prompt generation process simply involves a feed-forward pass to generate the optimal prompt, instead of further optimization through SFT and RL, thus significantly reducing the computational complexity. We list the computational complexity during the training and inference phases.
Training Phase. During the training phase, a warm-up dataset is first established. This involves generating candidate prompts using a model like GPT-3.5. For each original prompt, 1,000 candidate prompts are generated. This leads to a time and space complexity of O(N × M), where N is the number of original prompts and M is the number of candidates per prompt. Subsequently, an optimal prompt is searched for, which involves comparisons among candidate prompts, yielding complexities of O(N × M) for both time and space. Building the prompt optimizer is the next stage. Supervised fine-tuning (SFT) has a time complexity of O(E × B × T), with E being the number of epochs, B the batch size, and T the number of model parameters. Its space complexity mainly arises from model parameters and gradients, which is O(T). For building the reward

OpenQA
Original: What's above the muscles and needs direct sunlight? Which is the correct answer? Options: ...
Candidate 1: What lies beyond the muscles and requires direct exposure to sunlight? Which is the correct answer? Options: ...
Candidate 2: What is located above the muscles and requires direct sunlight? Which is the correct answer? Options: ...
Candidate 3: Which body part requires direct sunlight and is located higher than the muscles? Which is the correct answer? Options: ...

CloseQA
Original: Q: What kind of relationship between glucagon and insulin is vital to managing fuel storage and consumption by body cells? A:
Candidate 1: Q: What is the essential connection between glucagon and insulin for regulating fuel storage and utilization in body cells? A:
Candidate 2: Q: In managing the storage and consumption of fuel by body cells, what is the crucial interrelation between insulin and glucagon? A:
Candidate 3: Q: What is the crucial connection between glucagon and insulin in regulating the storage and utilization of fuel by the cells in the body? A:

News
Original: Reincarnated Napster Goes Public Roxio bought the Napster name in a bankruptcy auction two years ago. Now, the company dumps its CD-burning software business to concentrate on selling and delivering music over the web under the Napster name. What label best describes this news article?
Candidate 1: Roxio, the buyer of Napster's name in a bankrupt auction two years ago, has taken the reincarnated Napster public. The company has made the decision to abandon their CD-burning software business and focus solely on the distribution and sale of music online via the Napster name. What genre of news article would this story fall under?
Candidate 2: Roxio, which acquired the Napster name through a bankruptcy auction two years ago, has taken Napster public again after rebranding it as a music delivery service. The company has divested its CD-burning software business and now focuses solely on selling music online under the Napster name. What category would this news article fall under?
Candidate 3: The Napster name, which was purchased by Roxio in a bankruptcy auction two years ago, has now been resurrected with a public launch. Roxio has shifted its focus solely to the sale and distribution of music under the Napster name, leaving its CD-burning software business behind. What category would you assign to this news article?

Movie
Original: writer/director joe carnahan's grimy crime drama is a manual of precinct cliches, but it moves fast enough to cover its clunky dialogue and lapses in logic. The sentiment expressed for the movie is
Candidate 1: The gritty crime drama by writer and director Joe Carnahan may rely heavily on familiar tropes and cliches of the genre, but its quick pace manages to distract from any awkward dialogue and illogical moments. The sentiment expressed for the movie is
Candidate 2: Joe Carnahan's gritty crime drama relies heavily on standard police procedures, yet its rapid pace compensates for any cumbersome dialogues and unreasonable plot holes. The sentiment expressed for the movie is
Candidate 3: Although writer/director Joe Carnahan's gritty crime drama contains numerous stereotypes within the precinct environment, its swift pace effectively masks its awkward dialogue and occasional lapses in logic. The sentiment expressed for the movie is

QASC
Original: If I tell you that Hydrogen bonds cause a tremendous force when a substance freezes, and ask you the question "hydrogen bonds cause a tremendous force when a substance does what", is the correct answer "strong"?
Candidate 1: If I were to inform you that when a substance freezes, Hydrogen bonds create a significant force and ask, "What term describes the force generated by Hydrogen bonds when a substance freezes?", would the appropriate response be "Powerful"?
Candidate 2: Suppose I inform you that the process of substance freezing is deeply influenced by Hydrogen bonds that generate an enormous force. Now, if I inquire, "What occurs when the substance undergoes this process?", would it be accurate to say that the force generated is "powerful"?
Candidate 3: Suppose I inform you that when a substance freezes, Hydrogen bonds result in a remarkable force, and inquire, "When a substance undergoes what, do hydrogen bonds cause a remarkable force?" Would it be accurate to respond with "robust"?

Topics
Original: What are the topics in the sentence: A bathroom with the toilet missing and the room fairly torn up.
Candidate 1: What are the subjects of the sentence: A torn-up room without a toilet.
Candidate 2: Which subjects are covered in the phrase "A bathroom that lacks a toilet and has a considerably damaged room"?
Candidate 3: What are the subjects mentioned in the statement: A torn up room that lacks a toilet in the bathroom?

Summary
Original: Sum up the following dialogue: Gordon: Did you see my car, bro? Gordon: <file_photo> Gordon: It's my first car ever! And I love it! :) Leo: Grats, bro! Leo: It looks awesome, I have to see it with my own eyes! Gordon: Are you home? Leo: Yeah Gordon: Look out of the kitchen window :) Leo: No shit :D Leo: Wait, I'm coming! Gordon: Waiting :D
Candidate 1: Sum up the following dialogue: Gordon asked, "Bro, have you seen my car?" and sent a file photo. He expressed his excitement, saying it's his first car ever and he loves it. Leo congratulated him, saying it looks awesome, and expressed his wish to see it in person. Gordon asked if he was home and told him to look out of the kitchen window. Leo was surprised and replied, "No shit :D" and said he was coming. Gordon eagerly waited for him.
Candidate 2: Sum up the following dialogue: Gordon inquires, "Hey bro, have you laid eyes on my car?" Gordon shares a photograph of his first vehicle and expresses his adoration for it with a smiley face. Leo congratulates him and expresses interest in seeing it in person. Gordon asks if Leo is home and instructs him to look out of the kitchen window. Leo is surprised and excited, responding with laughter and promising to come see it. Gordon waits patiently.
Candidate 3: Sum up the following dialogue: Gordon asked his brother if he had seen his car and sent a photo of it. He expressed his love for it as it was his first car ever. Leo congratulated him and expressed his desire to see the car in person. Gordon asked if he was at home and told him to look out of the kitchen window. Leo was surprised and excited and said he would be coming soon. Gordon waited for him to arrive.
Inference Phase. In the inference phase, an optimized prompt is generated from an original prompt using MAPO. The time complexity here is dominated by a single feed-forward operation, which is O(T). Almost no extra space is required, making the space complexity effectively O(1) for this phase.
We also calculate roughly how long a complete training run takes. For a LLaMA-7B model running on four A100 80GB GPUs, SFT on a high-scale task (such as the News classification task with 120,000 training examples) takes about 8 hours, RL takes about 12 hours, and the complete MAPO process takes roughly 20 hours in total. For a Bloom-7B model under the same hardware conditions, SFT takes about 5 hours, RL takes about 9 hours, and MAPO takes about 14 hours in total. For a GPT-J-6B model, SFT takes about 10 hours, RL takes about 16 hours, and MAPO takes about 26 hours in total.

D Baselines
We compare MAPO with several state-of-the-art (SOTA) prompt optimization baselines, including the following: • Finetuning (Devlin et al., 2018): Finetuning (few-shot) involves finetuning the entire language model with a classification head using a few-shot dataset.
• Soft Prompt (Qin and Eisner, 2021b; Li and Liang, 2021): Soft Prompt Tuning utilizes continuous embeddings as a variant of parameter-efficient transfer learning, replacing discrete prompts.
• Black-Box (Sun et al., 2022): Black-Box Tuning combines discrete and soft prompts, with the soft part trained using gradient descent and the discrete part optimized using a gradient-free tuner.
• Manual (Brown et al., 2020;Schick and Schütze, 2020;Sanh et al., 2021): Manual prompt achieves strong performance on various natural language understanding and natural language generation tasks without relying on training examples.
• In-Context (Brown et al., 2020): In-Context Demonstration randomly selects a training example and concatenates it with the input query.
• Instructions: Self-Instruction manually creates prompts for each task following Natural Instructions (Wang et al., 2022b), where the prompt is concatenated with the inputs.
• GrIPS (Prasad et al., 2022): GrIPS performs phrase-level editing on the instructions and selects the best one.
• TEMPERA (Zhang et al., 2023): TEMPERA is a test-time prompt editing method that uses reinforcement learning, efficiently leveraging prior knowledge and adapting to different queries, while providing an interpretable prompt for each query.
• AMA (Arora et al., 2022): AMA recursively reformats tasks and prompts using the LLM to effectively aggregate predictions across prompts using weak supervision.
For a fair assessment, we adopt the same experimental setup as LM-BFF (Gao et al., 2020) and RLPrompt (Deng et al., 2022). For every task, we take 16 training samples from each class to form our few-shot training dataset, giving 16 × |Y| training samples in total, where Y is the set of classes. Similarly, we pick 16 samples from each class to form our validation dataset. Beyond this standard setup, we also select n random examples from the training data to form an "in-context exemplar pool". For consistency, we repeat each experiment four times with different random seeds, then report the average results along with the typical variation between runs. We use RoBERTa-large (Liu et al., 2019) as our language model and base our initial instructions on Natural Instructions (Mishra et al., 2021). The in-context examples are randomly drawn from a separate set of 16 examples, which is disjoint from the few-shot dataset and also randomly sampled from the full training data. By comparing MAPO with these SOTA baselines, we gain insight into its performance and effectiveness across various downstream tasks.
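The few-shot construction above (16 samples per class, repeated over several seeds) can be sketched as follows; `sample_few_shot` is an illustrative helper, not the authors' released code.

```python
import random
from collections import defaultdict

def sample_few_shot(examples, k=16, seed=0):
    """LM-BFF-style few-shot sampling: draw k examples per class, giving
    k * |Y| training samples in total. `examples` is a list of
    (text, label) pairs; a different seed yields a different split."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for text, label in examples:
        by_class[label].append((text, label))
    few_shot = []
    for label in sorted(by_class):          # deterministic class order
        few_shot.extend(rng.sample(by_class[label], k))
    return few_shot
```

Running this with four different seeds and averaging the downstream metrics mirrors the repeated-experiment protocol described above.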
Specifically, for Table 3, the training data aligns with that used by TEMPERA (Zhang et al., 2023); that is, all experiments, including our own, use RoBERTa-large as the backbone for validating the downstream tasks. Because this setup follows the "few-shot" methodology elaborated above, we label Table 3 as "few-shot". For Tables 5 and 6, no training data is involved; the LM performs zero-shot inference, meaning all reported results are obtained without training on those datasets. The purpose is to demonstrate the generalization (domain-transfer) ability of our MAPO method. To further enhance performance on these datasets, additional training with labeled data from Tables 5 and 6 would be necessary.

F Additional Experiments
The performance of the reward model. We plot the performance of the reward model during the training of MAPO in Fig. 4. As training progresses, the reward model exhibits consistent growth and improvement, indicating that it is gradually becoming more proficient at downstream tasks. It successfully adapts to its environment, leading to improved outcomes and higher task completion rates, and can therefore serve as a discriminator of the quality of an optimized prompt.
MAPO's ability to maintain original capabilities. We further analyze whether MAPO preserves the original capabilities of the underlying model. We use a language model trained with MAPO, which can optimize prompts without losing its original capabilities, to modify prompts and accomplish downstream tasks. We consider the GLUE and SuperGLUE tasks representative, hence we use them as pre-training tasks. However, the improvements in Tables 5 and 6 are not significant, possibly due to the limited scope of our pre-training tasks; future work can explore a broader range of pre-training datasets, which may lead to more significant improvements across downstream tasks. Moreover, for Table 3, the training and validation data for SFT, RM, and RL differ from the data used for generalization, although they all come from Table 3. This is because we consider the GLUE and SuperGLUE tasks representative, hence we use them as pre-training tasks. Theoretically, a more diverse NLP dataset should be selected for this part, but we happened to choose this subset. To mitigate the impact on the results, we also run another test with two steps: using the optimized prompts generated by MAPO, then using the original RoBERTa-large model for inference. As shown in Table 3 (row "MAPO-w/o g"), the results do not show a significant decline (t-test p > 0.05). The use of data from Table 3 for generalization merely ensures that the prompt-optimized model retains its original capabilities on downstream tasks, rather than reflecting data leakage.

G Additional Cases
We list more cases whose prompts have been optimized by our proposed MAPO in Table 19, and analyze the differences among LLMs in detail as follows: • In SST-2, BLOOM and LLaMA both use phrases like "terrific flair" and "remarkable skill" to describe Khouri's ability, emphasizing positive sentiment. GPT-J uses the phrase "tremendous artistry," highlighting the artistic aspect, but does not convey the positive sentiment as strongly as BLOOM and LLaMA.
• In Yelp, BLOOM and LLaMA use phrases like "quality of the food is commendable" and "service provided is inconsistent" to provide a balanced assessment. GPT-J and the original version have the same wording, emphasizing the hit-or-miss nature of the service.
• In MR, BLOOM and LLaMA use phrases like "admirable endeavor" and "praiseworthy pursuit" to highlight the positive qualities of the venture. GPT-J and the original version use neutral language without explicitly conveying positive or negative sentiment.
• In CR, BLOOM, GPT-J, and LLaMA all express confusion or potential confusion regarding the positioning of the space key on a phone. The wording in BLOOM and LLaMA suggests that using a different key for text input is more common in phones, implying a deviation from the norm.
• In RTE, BLOOM and LLaMA emphasize the impact of the situation by using phrases like "somber site" and "distressing sight" when describing the washed-up marine animals. GPT-J and the original version provide more neutral descriptions without explicitly conveying the emotional aspect.
• In QNLI, BLOOM, GPT-J, and LLaMA all rephrase sentence 2 while maintaining the same overall meaning. The variations in wording are mainly stylistic, with BLOOM, GPT-J, and LLaMA using different synonyms to convey the same information.
• In SNLI, BLOOM, GPT-J, and LLaMA rephrase sentence 1 by adding details related to the slip-and-slide activity and the celebratory context. The variations in wording are mainly stylistic, enhancing the description of the baby's experience and the wetness.
• In MNLI, BLOOM, GPT-J, and LLaMA keep the same wording as the original sentence 1. The variations occur in sentence 2, with BLOOM and GPT-J emphasizing the need for interest rates to increase, while LLaMA focuses on the importance of boosting savings.
• In MRPC, BLOOM, GPT-J, and LLaMA all keep the same wording as the original sentences. The variations in the rephrased sentence 1 (BLOOM and LLaMA) emphasize the 15 percent drop in revenue, while GPT-J maintains a more neutral tone.

H Additional Exploratory Analysis
We further analyze the distribution of the top-3 words from the original prompts within the optimized prompts of different LLMs in the QA and classification tasks, as shown in Fig. 8 and Fig. 9, respectively. In the QA task, we observe minimal variation when considering whether to remove the instruction. After prompt optimization, BLOOM has a higher proportion of words like "contemporary", "french", "Methodist", "places", "education", "power", and "life" compared to the other two models. GPT-J has a higher proportion of words like "church", "time", "order", "early", and "year", indicating a focus on temporal and sequential aspects. LLaMA has a higher proportion of words like "earlier", "similar", "number", "song", and "property" compared to the other two models. In the classification task, we also observe minimal variation when considering whether to remove the instruction. After optimization, BLOOM has a higher proportion of the words "year" and "new" compared to the other two models. GPT-J has a higher proportion of words like "largest", "music", "national", "school", and "poland". LLaMA has a higher proportion of words like "increase", "government", "executive", "medical", "warsaw", and "parliament" compared to the other two LLMs. These findings strongly suggest that each LLM exhibits unique preferences and patterns in prompt optimization across different tasks. The observed variations in word distribution indicate the specific areas of focus and the semantic nuances that each LLM emphasizes during optimization. Additional experiments will contribute to a more comprehensive understanding of the prompt optimization dynamics exhibited by each LLM.
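The word-distribution analysis above can be sketched with a simple frequency computation; `top_word_share` is a hypothetical helper illustrating one plausible way to produce the statistics behind Fig. 8 and Fig. 9, not the authors' actual analysis script.

```python
import re
from collections import Counter

def top_word_share(original_prompt, optimized_prompts, k=3):
    """For the k most frequent words of an original prompt, compute each
    word's relative frequency in every LLM's optimized prompt.
    `optimized_prompts` maps a model name to its optimized prompt string."""
    tokenize = lambda s: re.findall(r"[a-z]+", s.lower())
    top = [w for w, _ in Counter(tokenize(original_prompt)).most_common(k)]
    shares = {}
    for model, prompt in optimized_prompts.items():
        counts = Counter(tokenize(prompt))
        total = sum(counts.values()) or 1   # avoid division by zero
        shares[model] = {w: counts[w] / total for w in top}
    return shares
```

Comparing the resulting per-model shares is what reveals the preference differences among BLOOM, GPT-J, and LLaMA described above.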

SST-2
In this task, you are given sentences from movie reviews. The task is to classify the sentiment of the sentence. Your answer must be in the form of the letters "positive", and "negative" respectively.

Yelp
In this task, you are given sentences from Yelp reviews. The task is to classify the sentiment of the sentence. Your answer must be in the form of the letters "positive", or "negative" respectively.

MR
In this task, you are given sentences from movie reviews. The task is to classify the sentiment of the sentence. Your answer must be in the form of the letters "positive", or "negative" respectively.

CR
In this task, you are given sentences from customer reviews. The task is to classify the sentiment of the sentence. Your answer must be in the form of the letters "positive", or "negative" respectively.

RTE
In this task, you're given a pair of sentences, sentence 1 and sentence 2. Your job is to choose whether the two sentences clearly agree (entailment)/disagree (not entailment) with each other. Your answer must be in the form of the letters Yes, and No respectively.

QNLI
You are given two sentences (Sentence1 and Sentence2). The task is to determine whether Sentence2 contains the answer to Sentence1. Your answer must be in the form of the letters Yes, and No respectively.

SNLI
In this task, you're given a pair of sentences, sentence 1 and sentence 2. Your job is to choose whether the two sentences clearly agree (entailment)/disagree (contradiction) with each other, or if this cannot be determined (neutral). Your answer must be in the form of the letters "Yes", "No", and "Maybe" respectively.

MNLI
In this task, you're given a pair of sentences, sentence 1 and sentence 2. Your job is to choose whether the two sentences clearly agree (entailment)/disagree (contradiction) with each other, or if this cannot be determined (neutral). Your answer must be in the form of the letters Yes, No, and Maybe respectively.

MRPC
In this task, you're given a pair of sentences, sentence 1 and sentence 2. Your job is to choose whether the two sentences clearly agree (entailment)/disagree (not entailment) with each other. Your answer must be in the form of the letters Yes, and No respectively.

Table 14 :
(↑) denotes the absolute performance increase achieved using MAPO-optimized prompts versus a frozen LLM, while (↑(%)) highlights the relative performance boost. Symbols ↑ and ↑(%) represent the average absolute and relative enhancements across all three LLMs, respectively. These enhancements pertain to specific downstream tasks, with "CLS" signifying classification and "GEN" indicating generation tasks. M, M-0.2, M-0.5, and M-0.8 correspond to using MAPO with temperature settings of 0, 0.2, 0.5, and 0.8, respectively.

Figure 1 :
Figure 1: Variance on answers from different LLMs (b) when they are given the same task-specific prompts (a).

Figure 2 :
Figure 2: The performance of different LLMs on task-specific prompts for three tasks: question-answering (a), classification (b), and generation (c). The results reveal significant variations across different LLMs' performance.

Figure 3 :
Figure 3: Framework of the proposed MAPO, including warm-up dataset establishment and prompt optimizer construction.
For prompt retrieval, for example, Ma et al. (2023) adopt greedy search to identify near-optimal prompts, and Zhou et al. (2022) introduce APE for automatic instruction selection. For prompt generation from scratch, Pang et al. (2023) introduce SharpT, which learns a shared latent space and generates soft prompts. White et al. (2023) describe a catalog of prompt engineering techniques. Zamfirescu-Pereira et al. (2023) investigate end-user prompt engineering using a prototype LLM-based chatbot design tool. Wang et al. (2022a) present Self-Instruct for improving the instruction-following capabilities of PLMs. For prompt editing, Gao et al. (2020) automatically select label words and generate templates. Pryzant et al. (2023) introduce APO, which uses "gradients" to provide critical feedback on the current prompt. Deng et al. (2022) propose RLPrompt, which is based on RL. Zhang et al. (2023) propose TEMPERA, which provides interpretable prompts for different queries. Prasad et al. (2022) introduce GrIPS, a gradient-free approach for improving task instructions for LLMs. Moreover, some research focuses on incorporating additional knowledge to enhance prompt editing. For example, Li et al. (2023) propose DSP to generate a "directional stimulus" for each input. Qin and Eisner (2021a) optimize a mixture of prompts using gradient descent to generate relational knowledge. Shin et al. (2020) develop AutoPrompt, a gradient-guided approach to find the best tokens in the prompt. Jiang et al. (2020) propose mining-based and paraphrasing-based methods to automatically generate diverse prompts. Furthermore, some research focuses on continuous prompt optimization instead of the discrete prompt optimization discussed above, such as the work of Zheng et al. (2023), Hambardzumyan et al. (2021), and Zhong et al. (2021).

Explan
Original: Question: What does a Christian do when they get what they needed? Options: ... The answer is "thank god" because
Candidate 1: Question: How should a Christian proceed after they have received what they required? Options: ... The answer is "thank god" because
Candidate 2: Question: When a Christian receives what they needed, what actions do they take? Options: ... The answer is "thank god" because
Candidate 3: Question: What should a Christian do upon receiving what they required? Options: ... The answer is "thank god" because

Table 8: One sample prompt and the corresponding three candidate prompts generated by GPT-3.5 for each selected dataset in P3.

For building the reward model, both time and space complexities are mainly O(N × M × log M). The reinforcement learning (RL) part requires O(E′ × B′ × T) time, where E′ is the number of epochs specific to RL, B′ is the batch size in RL, and T remains the number of model parameters; the space complexity is O(T). Summing these up, the total time complexity for the training phase becomes O(N × M × log M + E × B × T + E′ × B′ × T).
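The O(N × M × log M) term for reward-model construction comes from ranking each prompt's M candidates by task score. As a minimal sketch of the per-prompt step, one could sort the scored candidates and derive preference pairs; whether MAPO builds pairwise preferences exactly this way is an assumption here, and the snippet only illustrates where the M log M sorting cost arises.

```python
def build_reward_pairs(candidates_scored):
    """Sort one prompt's M scored candidates (the O(M log M) term) and emit
    (better, worse) preference pairs for reward-model training.
    `candidates_scored` is a list of (prompt_text, task_score) tuples."""
    ranked = sorted(candidates_scored, key=lambda x: x[1], reverse=True)
    return [(ranked[i][0], ranked[j][0])
            for i in range(len(ranked))
            for j in range(i + 1, len(ranked))]
```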

Figure 4 :
Figure 5 :
Figure 4: The performance of the reward model in three LLMs during the training process of MAPO.

Figure 6 :
Figure 6: The separate effects of PPO and RRMF during the process of RL in constructing MAPO.

Figure 7 :
Figure 8 :
Figure 7: The distribution of the three most frequent words, extracted from the original prompt, in the optimized prompts of different LLMs in the generation task. (a) retaining frequent words in the instruction, (b) removing frequent words in the instruction.

Figure 9 :
Figure 9: The distribution of the three most frequent words, extracted from the original prompt, in the optimized prompts of different LLMs in the classification task. (a) retaining frequent words in the instruction, (b) removing frequent words in the instruction.

Table 1 :
The amount of the warm-up dataset on various downstream tasks.

Table 3 :
The few-shot performance of SFT and MAPO against SOTA prompt-optimizing baselines in downstream tasks. F: Finetuning, C: Continuous prompt, D: Discrete prompt.

Table 5 :
Zero-shot domain-transfer performance based on BLOOM and GPT-J with original, SFT-optimized, and MAPO-optimized prompts. CLS: Classification, M: MAPO. (↑(%)) and ↑(%) represent the degree of improvement of MAPO-optimized prompts over the original prompts in each dataset and task, respectively (the same below).

Table 6 :
Zero-shot domain-transfer performance based on LLaMA with original, SFT-optimized, and MAPO-optimized prompts. RS: Commonsense Reasoning.
How do people in Mexico refer to the largest Presbyterian church?
BLOOM: What is the way in which the biggest Presbyterian church is referred to by individuals in Mexico?
GPT-J: What is the term used by Mexicans to denote the biggest Presbyterian church?
LLaMA: What is the commonly used term for the biggest Presbyterian church in Mexico?
Question: "Which happened earlier, the Chinese entered the war or President Truman dispatched the United States Seventh Fleet to the Taiwan Strait?". Context: "On 27 June 1950, ...". Answer:
Candidate 1: Question: "Did President Truman dispatch the United States Seventh Fleet to the Taiwan Strait before or after the Chinese entered the war?" Context: "On 27 June 1950, ...". Answer:
Candidate 2: Question: "Did the Chinese enter the war before President Truman dispatched the United States Seventh Fleet to the Taiwan Strait, or vice versa?" Context: "On 27 June 1950, ...". Answer:
Candidate 3: Question: "Did the Chinese enter the war first or did President Truman send the United States Seventh Fleet to the Taiwan Strait earlier?" Context: "On 27 June 1950, ...". Answer:

Table 9 :
Hyperparameters used for MAPO in all the tasks.

Table 10 :
The separate effects of PPO and RRMF, demonstrating the important roles both play in enhancing the performance of MAPO.

Table 11 :
Performance with different proportions of the warm-up dataset on various downstream tasks for three LLMs. Q: QA, C: classification, G: generation. ↓ denotes the absolute performance decline and ↓(%) the percentage of performance decline; D-↓ denotes the absolute data reduction and D-↓(%) the percentage of data reduction.

Table 12 :
The few-shot performance of MAPO with SOTA prompt optimizing baselines in downstream tasks.

Table 13 :
Natural Instructions of various downstream tasks.

Table 18 :
Performance with different numbers of epochs when training SFT. Best Performance means the best performance within 20 epochs; Best Epoch means the epoch corresponding to the best performance. We list the performance at epochs 1, 5, 10, 15, 20, and 50, and bold the performance metrics where a longer training run (epoch = 50) results in a decline in performance.