Survival of the Most Influential Prompts: Efficient Black-Box Prompt Search via Clustering and Pruning

Prompt-based learning has been an effective paradigm for large pretrained language models (LLMs), enabling few-shot or even zero-shot learning. Black-box prompt search has received growing interest recently for its distinctive properties of gradient-free optimization, proving particularly useful and powerful for model-as-a-service usage. However, the discrete nature and the complexity of combinatorial optimization hinder the efficiency of modern black-box approaches. Despite extensive research on search algorithms, the crucial aspect of search space design and optimization has been largely overlooked. In this paper, we first conduct a sensitivity analysis by prompting LLMs, revealing that only a small number of tokens exert a disproportionate amount of influence on LLM predictions. Leveraging this insight, we propose Clustering and Pruning for Efficient Black-box Prompt Search (ClaPS), a simple black-box search method that first clusters and prunes the search space to focus exclusively on influential prompt tokens. By employing even simple search methods within the pruned search space, ClaPS achieves state-of-the-art performance across various tasks and LLMs, surpassing the performance of complex approaches while significantly reducing search costs. Our findings underscore the critical role of search space design and optimization in enhancing both the usefulness and the efficiency of black-box prompt-based learning.


Introduction
Many of the recent astounding breakthroughs in artificial intelligence have revolved around pretrained large language models (LLMs). Though the capabilities of LLMs have advanced at a breakneck speed, modern LLMs are remarkably consistent in that they are almost invariably powered by Transformer-based architectures (Vaswani et al., 2017) trained with simple, self-supervised text completion on a large corpus. This is typically followed by fine-tuning and/or, more recently, prompting-based methods on specific tasks (Lyu et al., 2022; Kojima et al., 2022; Chen et al., 2023).

Figure 1: Our proposed method achieves the best anytime performance. Anytime test accuracy against wall-clock time for CLAPS (our proposed method) compared to other baselines in a few-shot learning setup on SST-2 with Flan-T5-base. Lines and shades denote the mean and standard deviation over 5 random seeds, respectively (single seed for CLAPS (greedy)).
Prompt-based learning is particularly appealing for modern LLMs due to its sample efficiency and flexibility compared to conventional fine-tuning, enabling few-shot or even zero-shot learning (Brown et al., 2020; Liu et al., 2023). It can be categorized into two types: soft and hard prompt tuning. Soft prompt tuning directly optimizes the embedding space of the model with the other model parameters frozen (Li and Liang, 2021; Lester et al., 2021, inter alia). Although these methods do not require full gradient updates as fine-tuning does, they still require parameter access and back-propagation through massive models, and are typically model- and/or task-specific.
Hard prompt tuning (HPT), on the other hand, is an emerging paradigm that directly searches for discrete tokens to be added to the text input. Hard prompts are more portable and more amenable to human interpretation, as they are actual tokens rather than abstract arrays in the embedding space (Shin et al., 2020). More importantly, unlike soft prompting, which invariably requires parameter access to LLMs due to the need to modify the embeddings, HPT is feasible even if the task LLM is only available as a 'black box', i.e., only the model outputs, but not information like parameters and gradients, are available. Indeed, methods leveraging reinforcement learning (RL) (Deng et al., 2022; Zhang et al., 2023) and gradient estimation (Diao et al., 2023) have recently been proposed to exploit this powerful property, particularly since many advanced LLMs (e.g., GPT-4 (OpenAI, 2023) and Bard) are increasingly made available in a model-as-a-service (MaaS) manner, under which parameter or gradient access is expensive or impossible. Thus, in this paper, we also focus on this practical but challenging black-box setup.
Despite the promising progress, one challenge plaguing the aforementioned black-box HPT approaches is the difficulty of the discrete and combinatorial optimization inherent to this problem when no gradient guidance is available: it is common for existing methods to require a large number of model queries, frequently on the order of O(10^3) or more, before convergence. While previous works have attempted to alleviate this problem by improving the search strategy, search space design has been largely overlooked. For example, previous works take the natural decision of using the entire tokenizer vocabulary as the search space (Deng et al., 2022), a convenient extension from soft prompt tuning. However, as we will show with an analysis of the search spaces of discrete prompts, such a practice is actually suboptimal and makes the optimization unnecessarily difficult. Similar to the phenomenon observed in related discrete optimization problems such as neural architecture search (Wan et al., 2022; Ru et al., 2020; Zhou et al., 2023b), we find the influence exerted by different tokens on the LLM, when prepended to the text queries as discrete prompts, to be highly non-uniform, with a small number of tokens (e.g., 0.1-1% of all tokens) exerting a disproportionate amount of influence. Meanwhile, the models are insensitive to or even harmed by the vast majority of the other, 'non-influential' tokens, which nevertheless act as nuisance variables during the search and substantially increase the optimization difficulty and resources required.
Inspired by these findings, we propose Clustering and Pruning for Efficient Black-box Prompt Search (CLAPS), a simple black-box search method that first clusters and prunes the search space to focus on this subset of influential tokens, followed by discrete prompt search on a few-shot objective. We find that after pruning, even the simplest search strategies (e.g., random or evolutionary search) can outperform state-of-the-art methods with much more complicated search strategies, often at a fraction of the search cost of these competing methods (e.g., CLAPS outperforms RLPrompt with only 2.8% of its cost measured in wall-clock time). In summary, we offer the following contributions:

1) We analyze the influence different tokens in the vocabulary exert on LLM predictions, and find that only a small fraction of tokens positively influence LLMs when used as discrete prompts.
2) We propose CLAPS, a black-box discrete prompt search method compatible with a few-shot learning setup, via a cluster-prune-then-search routine that focuses on a small set of influential tokens as discrete prompt candidates.
3) We then show that, while conceptually simple, CLAPS attains state-of-the-art performance, often at a very small fraction of the cost of competing methods, across more than 8 tasks with instruction-finetuned Flan-T5 models.

Preliminaries
Hard prompt tuning (HPT). As mentioned in §1, HPT aims to find discrete tokens to be concatenated directly to the test queries with the goal of maximizing task performance. Formally, HPT may be represented as an optimization problem:

p* = argmax_{p ∈ P} E_{(x_i, y_i) ∼ D} [ R(f(C(p, x_i)), y_i) ],   (1)

where {x_i, y_i} denotes a query-target pair, and p = {p_1, ..., p_K} are the additional tokens to be concatenated with the text query x; this is often referred to as the discrete prompt, whose optimization is the focus of HPT, and we use P to denote the prompt search space, the set of all possible discrete prompts. C(p, x_i) refers to the concatenation of p and a formatted query x_i:

C(p, x_i) = [p_1, ..., p_K, template(x_i)],   (2)

where template(·) denotes any human-designed pre-processing procedure that formats the raw query. f(C(p, x_i)) is the output probability distribution of the model given x_i over all possible classes Y (defined by the verbalizers), with ∑_{j=1}^{|Y|} f^{(j)}(C(p, x_i)) = 1; it is worth stressing again that under the black-box setup considered in this paper, the output probabilities are the only observation available to us, and we assume no access to other information, including but not limited to the model architectures, parameters, or gradients. Finally, R(·, ·) refers to a reward function given the model predictions and the ground-truth labels (an example is the negative cross-entropy loss). The goal of HPT is thus to find the optimal p* that maximizes this reward in expectation over some data-generating distribution D. Since the true data-generating distribution is always assumed to be latent, in practice we solve Eq. (1) via empirical risk minimization with a standard train-validation-test split.
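As a concrete illustration, the reward in Eq. (1) with the negative cross-entropy choice can be sketched as follows; `model`, `probs`, and the data format are placeholder names for whatever black-box interface returns the class distribution, not the paper's actual implementation:

```python
import math

def reward(probs, label):
    """Negative cross-entropy R(f(C(p, x)), y) for one example:
    the log-probability the model assigns to the gold class."""
    return math.log(probs[label])

def empirical_reward(model, prompt, data):
    """Empirical estimate of Eq. (1)'s expectation over a few-shot set.
    `model(prompt, x)` is assumed to return the output distribution over
    verbalizer classes, the only observable in the black-box setup."""
    return sum(reward(model(prompt, x), y) for x, y in data) / len(data)
```

Empirical risk minimization then amounts to maximizing `empirical_reward` over candidate prompts on the training shots.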
Search strategy and search space. Solving Eq. (1) is, in general, challenging, as it involves difficult combinatorial discrete optimization, and the gradients essential for standard first-order optimization are not available. A natural recourse, on which most previous works have focused, is developing better zeroth-order search strategies via, for example, reinforcement learning and Monte Carlo gradient estimation. The search space (i.e., P), on the other hand, is much less well-studied, despite the fact that its design has previously been shown to be one of the most important influencing factors in related discrete optimization problems. In HPT, the overall search space P can be decomposed as a Cartesian product over the search spaces of individual tokens:

P = P_1 × P_2 × ... × P_K,

which is in turn often designed heuristically; popular choices include the entire tokenizer vocabulary P_k = V (and thus |P| = |V|^K for a K-token discrete prompt) (Deng et al., 2022) or a subset of frequent n-grams from it (Diao et al., 2023). Given the exponential scaling w.r.t. the value of K, |P| is typically huge even for modest |P_k| and/or K.
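To make the scaling concrete, the following back-of-the-envelope calculation shows how quickly |P| = |V|^K grows, and how much a per-slot pruning to 1% of tokens shrinks it; the vocabulary size of 32,000 is an assumed illustrative figure (actual tokenizer sizes vary by model):

```python
V = 32_000        # assumed tokenizer vocabulary size; varies by model
K = 5             # number of prompt tokens

full = V ** K                   # |P| = |V|^K for the unpruned space
pruned = (V // 100) ** K        # keep the top-1% of tokens per slot

print(f"full: {full:.2e}, pruned: {pruned:.2e}, reduction: {full // pruned:.0e}")
```

With these numbers the per-slot reduction by a factor of 100 compounds to a 100^5 = 10^10 reduction in the overall space, matching the exponential reduction discussed in §3.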

Analyzing Prompt Search Spaces
General search spaces are highly redundant. We argue that, as in any other optimization problem, the search space in our case may also have a profound effect on both the search strategy and the downstream performance. As the HPT research community grows, we argue that a systematic study of search space design is crucial. As discussed, existing search spaces are often expensive and heuristically designed. However, a large search space is not necessarily well-designed: crucially, it is unknown whether all parts of P positively contribute to downstream task performance; P could simply be highly redundant, i.e., a large fraction of it might in fact be unimportant or even harmful, increasing complexity while acting as confounding factors that make the optimization in Eq. (1) unnecessarily hard.
To answer this question, we analyze the building blocks of the most general search space, where the individual tokens of the discrete prompts may be any token in the vocabulary V. To quantify the incremental influence of a token v ∈ V, we define:

∆R(v) = (1/N) ∑_{i=1}^{N} [ R(f(C(v, x_i)), y_i) − R(f(C(x_i)), y_i) ],   (3)

where we treat a token v as a single-token discrete prompt to be concatenated to text queries x_i, and its influence ∆R(v) is the change in reward compared to the case of a formatted input without any prompt token, C(x_i); N denotes the number of labeled samples randomly sampled from the training set of the target task (we use N = 16 throughout this paper), and we define R(·, ·) as the negative cross-entropy:

R(f(C(·, x_i)), y_i) = log f^{(y_i)}(C(·, x_i)).   (4)

We visualize the results of the above analysis on a representative task in Fig. 2, where we compute ∆R(v) for all tokens in the vocabulary, and we find the distribution of influence over the vocabulary of tokens to be heavily non-uniform, with a small fraction (roughly 1%, marked in green) of all tokens exerting a disproportionate amount of influence on the predictions of LLMs, whereas the vast majority of tokens either actively harm LLM predictions or exert negligible influence.
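The per-token influence measure ∆R(v) can be sketched as below; `model` again stands in for the black-box LLM interface and the function names are our illustrations, not the paper's code:

```python
import math

def avg_reward(model, prompt, data):
    # Mean negative cross-entropy (Eq. 4) of `prompt` over the N-shot set.
    return sum(math.log(model(prompt, x)[y]) for x, y in data) / len(data)

def token_influences(model, vocab, data):
    """Delta R(v) (Eq. 3): reward with v as a single-token prompt,
    minus the reward of the prompt-free formatted input."""
    base = avg_reward(model, (), data)
    return {v: avg_reward(model, (v,), data) - base for v in vocab}
```

Ranking the vocabulary by these scores yields the distribution visualized in Fig. 2.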
Search space pruning. The finding above means that it would be highly challenging for any search method to navigate the original search space, especially in a black-box setup: the method has to learn both to identify the small fraction of functioning tokens and to avoid the vast majority of unimportant or harmful ones. Instead, we propose to prune P_k by focusing only on the small fraction of the most influential tokens identified above. Given the Cartesian structure of P, this results in an exponential reduction of the overall search space: with a representative K = 5, if we retain the top-1% of tokens in terms of ∆R(v) given by Eq. 3, there is an O(10^10) reduction in |P|.
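Given the influence scores, the pruning step reduces to keeping the top fraction of tokens; a minimal sketch (the 1% threshold follows the paper, the function name is ours):

```python
def prune_vocab(influences, keep_frac=0.01):
    """Retain the top `keep_frac` of tokens ranked by Delta R(v).
    `influences` maps token -> incremental reward (Eq. 3)."""
    k = max(1, int(len(influences) * keep_frac))
    return sorted(influences, key=influences.get, reverse=True)[:k]
```

The returned token list defines the pruned per-slot space P_k used by the downstream search.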
To validate the effectiveness of the pruning procedure and confirm that the search space reduction does not sacrifice performance, we randomly sample 100 5-token discrete prompts from the reduced search space after the aforementioned pruning procedure and use their performances as an approximation of overall search space quality. We compare the results against samples drawn from 1) the original, unmodified search space (Vocab), 2) a reduced search space with P_k cut to 10% of the original but with randomly selected tokens (Random), and 3) a P_k consisting of frequent n-grams selected via pointwise mutual information as in Diao et al. (2023) (BDPL). We visualize the test accuracy distribution on the RTE task in Fig. 3, and we find our pruning to massively improve search space quality and reduce search difficulty compared to both random pruning and the pruning strategy proposed in BDPL; the latter does outperform Random and Vocab but is nevertheless outperformed by our strategy. Crucially, the fact that the median of the 100 randomly sampled discrete prompts already performs similarly to RLPrompt (Deng et al., 2022), a state-of-the-art method that features a much more complicated and expensive RL search strategy and a tailored reward function, highlights the extreme importance of search space design.

Efficient Black-Box Prompt Search via Clustering and Pruning
Inspired by the analyses presented in §3, we now present Efficient Black-Box Prompt Search via Clustering and Pruning, or CLAPS in short, with the overall procedure illustrated in Fig. 4 and Algorithm 1. At a high level, CLAPS utilizes a multi-step approach, combining the search space pruning proposed in §3 with an optional clustering step, which further reduces the computational cost, and a simple black-box prompt search routine. We describe the procedure in detail below.
Figure 4: Overview of CLAPS: (b) an optional clustering step selects a subset of diverse tokens that well-represent V; (c) we prune the tokens using the procedure described in §3 to retain a small fraction of influential (∼O(10^2)) tokens as the search space; (d) we finally perform black-box prompt search over the reduced search space to identify the final K-token discrete prompts.

Clustering. By default, CLAPS enumerates the tokens in V and obtains the influence score (Eq. 3) of each token by evaluating on a 16-shot training set. While this procedure, which requires O(10^4) model evaluations, can already be tractable, here we propose an additional optional step to accelerate our method further: instead of enumerating all tokens, we may use an unsupervised algorithm on the token embedding space to obtain a subset of diverse tokens V_c that well-represent V (illustrated in Fig. 4(b)). While alternative methods that explicitly optimize for diverse set selection exist, we opt for the simple greedy K-means++ (Arthur and Vassilvitskii, 2007) to generate V_c (we set |V_c| = 2000 unless otherwise stated). Formally, for each centroid e_c identified by K-means++, we collect the closest token in terms of its embedding ℓ2 distance:

v_c = argmin_{v ∈ V} ||e_v − e_c||_2,   (5)

where e_v denotes the embedding of token v. The size of the retained vocabulary |V_c| is a hyperparameter of the search algorithm (to be discussed in detail at the end of this section) and determines the number of model queries in the next stage: a smaller |V_c| leads to more aggressive reduction and improved query efficiency, but may incur some performance loss, as some influential tokens may be removed from the search space at this stage.
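The clustering step can be approximated with a simple farthest-point traversal of the embedding space; this is a stand-in for the greedy K-means++ seeding the paper uses, and the embedding dictionary and function names are illustrative:

```python
import math

def representative_tokens(embeddings, m):
    """Pick m diverse tokens whose embeddings cover the space: a
    farthest-point sketch of diversity-seeking seeding. Eq. (5)'s
    nearest-token mapping is trivial here, since the candidates are
    tokens themselves. `embeddings` maps token -> vector."""
    tokens = sorted(embeddings)
    chosen = [tokens[0]]                    # arbitrary deterministic seed
    while len(chosen) < m:
        # token farthest from its nearest already-chosen representative
        nxt = max((t for t in tokens if t not in chosen),
                  key=lambda t: min(math.dist(embeddings[t], embeddings[c])
                                    for c in chosen))
        chosen.append(nxt)
    return chosen
```

In practice one would run this (or true K-means++) over the full embedding matrix with m = |V_c| = 2000; the toy version above conveys only the diversity-selection idea.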
In our experiments, we set |V_c| = 2000 for all model and task combinations without further hyperparameter tuning; with the above procedure, the number of LLM queries at the pruning stage drops from O(10^4) to O(10^3). Empirically, as shown in §6, we find this additional procedure to reduce the cost by roughly 3/4 relative to full enumeration (i.e., no clustering) in terms of wall-clock time, at only a small performance impact. A sensitivity study of hyperparameters is also performed in §6.
Ranking and pruning. As illustrated in Fig. 4(c), we prune V_c (with clustering) or V (without clustering) using the procedure described in §3 to obtain the set of influential tokens for prompt search, V_it. The size of V_it is another hyperparameter, which in this case encodes greediness: a small |V_it| corresponds to a more greedy algorithm that only considers tokens that minimize the validation loss. However, as we empirically show in §6, combining the most influential tokens does not necessarily lead to the optimal prompt, and balancing greediness with the prompt search in the next stage leads to the optimal outcome. In this paper, we set |V_it| = 200 for all experiments without further model- or task-specific hyperparameter tuning.
Black-box prompt search. The final step of CLAPS, as illustrated in Fig. 4(d), is search. To demonstrate that CLAPS is search-method-agnostic, we consider three different search strategies in our experiments. To differentiate from previous work focusing on search strategies, we first consider a lightweight, basic evolutionary search algorithm with the following ingredients:

• Initialization: we initialize with a population of M uniformly sampled K-token discrete prompts from the pruned search space, and we evaluate the accuracy of each discrete prompt on a held-out, 16-shot validation set.
• Evolution: after evaluating all prompts in the population, at each search epoch we retain the top 10% of the population in terms of validation loss as seed prompts. We then generate M/2 prompts of the next population via crossover, where two randomly selected seed prompts exchange tokens to create a new offspring, and M/2 new prompts via mutation, where we swap a token in a seed prompt with another token from the (pruned) vocabulary with a fixed probability.
• Termination: at the end of the final search epoch, we simply return the prompt with the best validation loss seen so far as the final p*.

To demonstrate the versatility of CLAPS, we also consider two additional search strategies, namely greedy search and particle swarm optimization (Kennedy and Eberhart, 1995; Bonyadi and Michalewicz, 2017). The greedy algorithm is a commonly used baseline in combinatorial optimization: starting with an empty string p*_0 := ∅, at the (k+1)-th iteration, we iterate through the search space V_it (with |V_it| = 200, following the previous paragraph) and simply select the token that leads to the highest reward, conditioned on the partial prompt p*_{≤k} with k tokens already selected. More formally, the (k+1)-th token of p* is recursively selected by

p*_{k+1} = argmax_{v ∈ V_it} (1/N) ∑_{i=1}^{N} R(f(C([p*_{≤k}; v], x_i)), y_i),   (6)

and the algorithm terminates when all K tokens are selected. For the particle swarm optimizer, we use an adapted version of the algorithm described by Zang et al. (2020) to work in the discrete search space, and we refer the reader to Appendix A for further implementation details.
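The evolutionary loop above can be sketched as follows; the population size, number of epochs, and mutation rate are illustrative defaults rather than the paper's tuned settings, and `score` stands for the few-shot validation reward:

```python
import random

def evolve(score, vocab, K=5, pop=20, epochs=10, mut_p=0.2, seed=0):
    """Basic evolutionary prompt search over a pruned vocabulary.
    `score(prompt)` returns the few-shot validation reward (higher is
    better); prompts are K-tuples of tokens from `vocab`."""
    rng = random.Random(seed)
    population = [tuple(rng.choices(vocab, k=K)) for _ in range(pop)]
    best = max(population, key=score)
    for _ in range(epochs):
        # retain the top ~10% as seed prompts (at least 2, for crossover)
        seeds = sorted(population, key=score, reverse=True)[:max(2, pop // 10)]
        nxt = []
        for _ in range(pop // 2):        # crossover: mix tokens of two seeds
            a, b = rng.sample(seeds, 2)
            nxt.append(tuple(rng.choice(pair) for pair in zip(a, b)))
        for _ in range(pop - len(nxt)):  # mutation: random token swaps
            child = list(rng.choice(seeds))
            for i in range(K):
                if rng.random() < mut_p:
                    child[i] = rng.choice(vocab)
            nxt.append(tuple(child))
        population = nxt
        best = max(population + [best], key=score)
    return best
```

The returned prompt is the best candidate seen over all epochs, matching the termination rule above.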
It is worth noting that we only consider a small, representative, and by no means exhaustive set of search algorithms. CLAPS, which focuses on search space design, can be viewed as a meta-method compatible with any search strategy, including but not limited to the ones proposed in previous work, in a plug-and-play manner. It is therefore possible that combining CLAPS with a more advanced search method would lead to even stronger performance; we defer a thorough investigation to future work.
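The greedy baseline described earlier admits an even shorter sketch; `score` again denotes the few-shot reward, and the function is our illustration rather than the paper's exact code:

```python
def greedy_prompt(score, vocab, K=5):
    """Grow the prompt one token at a time, each step appending the
    token in the pruned vocabulary that maximizes the reward of the
    partial prompt (the recursion in Eq. 6)."""
    prompt = ()
    for _ in range(K):
        prompt += (max(vocab, key=lambda v: score(prompt + (v,))),)
    return prompt
```

Because each step conditions only on the tokens chosen so far, the algorithm makes exactly K·|V_it| reward evaluations, which is what makes it so cheap relative to population-based search.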

Related Work
Prompt learning. Prompt learning is a class of powerful methods for LLM adaptation and has become an efficient alternative to full model fine-tuning (Liu et al., 2023). Earlier methods (Li and Liang, 2021; Lester et al., 2021; Liu et al., 2022b) typically feature soft prompt tuning, where continuous prompts which modify the input embedding of an otherwise frozen LLM are optimized.
Other methods, such as parameter-efficient fine-tuning (PEFT) techniques (He et al., 2022), which only tune a small fraction of the model parameters (Houlsby et al., 2019; Hu et al., 2022), may also be regarded as soft prompt learning. While promising, a drawback of soft prompting methods is that, since the model-specific input embedding layers often need to be modified, these methods inevitably require internal model access. Furthermore, with a few exceptions like BBT (discussed in the next paragraph), many soft prompting methods still require back-propagation of gradients through massive models, which can be computationally expensive. In contrast to soft prompting, hard prompt learning learns discrete tokens: AutoPrompt (Shin et al., 2020) uses model gradients to select appropriate tokens automatically, but is nevertheless restricted to a 'white-box' setup.
Black-box prompt optimization. In contrast to the white-box methods discussed above, several methods have been proposed to tune discrete prompts in a black-box manner (i.e., without using internal knowledge of the pretrained LLM). Black-box tuning (BBT) and BBTv2 (Sun et al., 2022b,a) use gradient-free optimization to learn soft prompts that are projected back to the embedding/weight space and concatenated to the query embedding and/or weights. While not using model gradients, these methods nevertheless require access to the input embeddings of the task model itself, and hence are not black-box in the strictest sense. In the strictly black-box setup, methods using reinforcement learning (Deng et al., 2022; Zhang et al., 2023), discrete optimization (Prasad et al., 2023), and gradient estimation (Diao et al., 2023) have been proposed; we empirically compare against them in §6. Furthermore, as discussed in §4, CLAPS is fully orthogonal to this previous work, since these techniques focus on improving the search strategy. Several other works have focused on optimizing specific components of the prompt design, e.g., Rubin et al. (2022); Liu et al. (2022a); Wan et al. (2023a,b) focus on selecting in-context examples, and Zhou et al. (2023a) mitigate in-context bias via calibration. We argue that these methods are again orthogonal to our contributions and thus may offer combined benefits.

Experiments and Results
Evaluation data. We include a variety of tasks, ranging from single-sentence to multi-sentence classification and from monolingual to multilingual NLI datasets, to validate the performance of CLAPS across different levels of task difficulty.
We conduct experiments on the standard GLUE benchmark (Wang et al., 2018), including SST-2, RTE, QNLI, MNLI, MRPC, and QQP. Furthermore, we include AG's News (Zhang et al., 2015) and SNLI (Bowman et al., 2015), following previous hard prompt tuning papers (Deng et al., 2022; Zhang et al., 2023). In addition, we include XNLI (Conneau et al., 2018), a multilingual NLI task, as the most challenging unseen dataset, to reveal the potential of our method in different languages. For all tasks, we follow the standard few-shot setting (Perez et al., 2021), where 16 shots represent 16 examples per class for both the training and validation sets. Since the test labels for GLUE tasks are unavailable, following standard practice, we take validation shots from the training sets and treat the validation set as the test set.
Baselines. In the few-shot learning setup, we mainly compare CLAPS with gradient-free black-box baselines. Training details of all the methods in comparison are included in Appendix A.
• BDPL (Diao et al., 2023): BDPL first models prompt generation as sampling from a multi-dimensional categorical distribution, and uses Monte Carlo-estimated gradients to optimize the distribution parameters. The search space is a subset of V that appears as frequent n-grams in the task training corpus.
• RLPrompt (Deng et al., 2022): RLPrompt trains a policy network that generates discrete prompts (an MLP layer on top of a frozen, pretrained GPT-2 model) with a bespoke piece-wise reward. The search space is the whole vocabulary.
• Search and Prune & Search: we include these baselines both as ablation experiments and to directly gauge the impact of search space design on downstream task performance. The Search baseline applies the evolutionary (genetics) search algorithm described in §4 directly to the full, non-pruned vocabulary search space, without clustering or pruning. Prune & Search refers to CLAPS without the clustering step, where we prune the whole vocabulary and then run the genetics search.
Models. We explore the potential of CLAPS with instruction-finetuned models, testing on a wide range of challenging tasks with Flan-T5-base and Flan-T5-large, among the most powerful open-sourced models of their size (Chung et al., 2022). We refer readers to Appendix A for detailed hyperparameters and training setups.
Discussion of main results. We present the results on all tasks except XNLI in Table 1, and the XNLI results in Table 2. For CLAPS, we present the genetics and greedy search variants in the main text and show the results with particle swarm optimization in Appendix B. Across both sets of tasks, we find CLAPS (i) to consistently improve on the standard, no-prompt Manual baseline and (ii) to outperform the other prompting baselines across models and tasks. More specifically, CLAPS (genetics) outperforms RLPrompt by 0.6% and 1.8% on average for Flan-T5-base and Flan-T5-large, respectively. In addition, we find that when used with CLAPS, the greedy search algorithm, although straightforward, can be surprisingly strong across many experiments, except for XNLI with Flan-T5-large; this concretely shows that CLAPS can orthogonally benefit different suitable search algorithms. Furthermore, in contrast to other prompting baselines like BDPL and RLPrompt, which occasionally cause performance deterioration relative to Manual, CLAPS consistently improves over it. We hypothesize that this is precisely due to the stability of our approach, enabled by searching only over a pruned search space featuring positively influential tokens, whereas the competing methods may suffer from unstable and noisy gradient estimations and/or RL policies over a search space with more harmful sub-components.
Finally, we emphasize that CLAPS achieves state-of-the-art performance with rather naïve search strategies, in stark contrast to the competing methods, which are both much more complicated methodologically and often orders of magnitude more expensive. We argue this highlights that methods focusing on search space design warrant further investigation in future work.

Efficiency analysis.
We analyze the performance-cost trade-off of various methods on a representative task in Table 3, which highlights the much-enhanced practicality of CLAPS compared to the baselines: CLAPS is extremely storage-efficient, as it requires no additional parameters to be stored on the GPU; the only memory requirement is maintaining the task model in an inference-only (i.e., no gradient storage) mode. CLAPS also achieves the best trade-off between time efficiency and performance, as faster methods (FT and BDPL) perform much worse, whereas methods like RLPrompt perform better but are orders of magnitude slower. For fairness of comparison, we also perform additional experiments running BDPL longer than the default, but we find that doing so only brings marginal improvement over the Manual baseline, as illustrated in Fig. 1.
Examples of discovered discrete prompts. Table 4 presents examples of CLAPS-discovered prompts; interestingly, we often observe some interpretability even though CLAPS has not been explicitly tuned towards fluency. For example, in SST-2, a movie review sentiment-classification task, CLAPS picks 'review' as part of the best prompt. RTE and XNLI EN are both textual entailment tasks, and CLAPS again spontaneously discovers prompts that provide an instruction-like signal 'asking' the model to 'answer' the question. While the other prompts are less immediately interpretable, we hypothesize that they nevertheless act to tune the model embedding in an optimal direction for the target task. CLAPS does share some words with the competitive baseline RLPrompt, and these words (e.g., 'review' and 'answer') are usually 'influential' tokens identified by our pruning strategy, with significant impact on the model's prediction. With a similar or even better quality of prompts, CLAPS stands out by first establishing an efficient search space, saving substantial computation costs.
Ablation and sensitivity studies. In Fig. 5, we first study the performance impact of clustering by comparing CLAPS against Prune & Search: we find that in the tasks considered, clustering minimally affects performance but leads to a ∼75% speed-up in terms of wall-clock time. We also investigate the effect of different pruning strengths, and find that 1) pruning generally improves performance, 2) performance is rather insensitive to (reasonable) pruning strength, and 3) the threshold of 1% (corresponding to 99% in Fig. 5) is a generalizable choice across tasks. Finally, we conduct additional ablation experiments to test the robustness of CLAPS w.r.t. other hyperparameters, such as the number of clusters during clustering and the prompt length; readers are referred to Appendix B for details.

Conclusion
We first analyzed the search spaces in the general paradigm of hard prompt search. Inspired by the finding that only a small fraction of tokens exert a positive influence on prediction, we proposed CLAPS, an efficient black-box prompt search method based on clustering and pruning. CLAPS is methodologically simple, easy to implement, and cost-effective, and we showed that it achieves state-of-the-art performance on both monolingual and multilingual tasks with Flan-T5 models. CLAPS is a meta-method orthogonal to the search strategy, and we expect that more efficient and effective prompt search algorithms can be built on top of it. We hope future work will invest more effort into the important problem of search space design.

Limitations
We argue that CLAPS only serves as a first step towards the promising direction of better search space design and automation, and thus the room for improvement is ample. First, we have only considered a suite of natural language understanding (NLU) tasks that may be cast as classification, whereas prompting techniques for generative tasks are, in general, less developed. Second, we have only explored a token-based search space for hard prompts, as it is the most general, but alternative search spaces built on overall instruction templates and exemplars exist (such as the ones used in Zhang et al. (2023) and Prasad et al. (2023)). We hypothesize that since these search spaces are also often heuristically designed, the search space issues and the pruning procedure may apply to them as well, even though they are often claimed to be more interpretable; it would thus be interesting to extend our analysis and methodology to these alternative spaces.
Third, as we discussed in §4, while the present paper primarily focuses on the search space, it is possible to combine CLAPS with more advanced search methods for further potential gains: some promising strategies include reinforcement learning, as used in Deng et al. (2022) and Zhang et al. (2023), and sample-efficient zeroth-order algorithms that may operate directly over the token search spaces, such as recent advancements in Bayesian optimization over discrete and/or combinatorial variables (Baptista and Poloczek, 2018; Wan et al., 2021; Daulton et al., 2022).

B Additional Experimental Results
We report the main experimental results from Table 1 with standard deviations, together with one additional CLAPS variant based on particle swarm optimization, in Table 5. Comparing the three CLAPS search strategies, we find that, in absolute terms, the choice of search strategy does matter for task performance and is largely task-dependent. In relative terms, however, CLAPS improves almost all of the tasks with significantly enhanced efficiency, validating its orthogonality to the selected search algorithm.
We provide an ablation study that runs the CLAPS pipeline with different numbers of clusters in the clustering phase in Table 6. It reveals that 2,000 clusters is a good empirical trade-off point, saving cost while maintaining strong performance across tasks.
We then provide an ablation study with various prompt token lengths in Table 7. First, increasing the token length from 2 to 5 improves performance on both tasks, showing that a longer prompt can provide more expressive control and description when prompting the language model. Further increasing the token length from 5 to 10, however, decreases performance, which we attribute to the curse of dimensionality in derivative-free optimization.

C Prompt Template
We present the prompt templates of the tasks considered in Table 8.

Figure 2: Only a small fraction of tokens improve performance. Distribution of the incremental reward ∆R(v) (Eq. 3) evaluated on 16-shot RTE samples with Flan-T5 base. The top-{1, 5, 10}% tokens in terms of their incremental reward are highlighted in colors.

Figure 3: Pruning improves prompt search. Distribution of accuracy on RTE with Flan-T5 base, obtained by randomly sampling 100 5-token prompts from different vocabulary spaces. Random refers to a random vocabulary set; BDPL prunes a context-relevant vocabulary set based on task-dependent n-gram scores; Pruning indicates our reward-based pruning of the vocabulary space. RLPrompt denotes the final test accuracy achieved by RLPrompt (Deng et al., 2022) on this task.

Figure 4: Illustration of the CLAPS pipeline. Starting from (a) the original search space (in this case, the entire vocabulary V with |V| ∼ O(10^4), visualized via t-SNE plots of vector embeddings for illustration only), (b) we first perform the optional, unsupervised step of K-Means clustering to retain a fraction of representative tokens V_s with |V_s| ∼ O(10^3). We then (c) prune the tokens using the procedure described in §3 to retain a small fraction of influential (∼ O(10^2)) tokens as the search space. We finally perform (d) black-box prompt search over the reduced search space to identify the final K-token discrete prompts.
1: Input: Original token search space P_k (typically the entire vocabulary V); search space size to retain after clustering |V_c| (can be set to |V| if no clustering is required); search space fraction to retain after pruning α; discrete prompt length (in terms of # tokens) K.
2: Output: Optimized discrete prompt p*.
3: if |V_c| < |V| then
4:   [Cluster]: Perform K-Means clustering on the token embeddings and retain |V_c| representative tokens as the token search space P_k.
5: end if
6: [Prune]: Rank the tokens in P_k and only retain the top-α fraction of tokens in terms of ∆R(v) as the new token search space.
7: [Search]: Run black-box prompt search in the prompt search space P = ∏_{k=1}^{K} P_k to solve Eq. 1 and obtain an optimized discrete prompt p*.
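To make the cluster-prune-search pipeline concrete, the following is a minimal, self-contained sketch in NumPy. It is our own illustrative implementation, not the released CLAPS code: the function names are ours, a toy additive reward stands in for the LLM-based reward R(·) of Eq. 1, ∆R(v) is approximated as R([v]) − R([]) in the spirit of Eq. 3, and the search step uses simple greedy construction as one of the search strategies the method is orthogonal to.

```python
import numpy as np

def kmeans_representatives(embeddings, tokens, n_clusters, n_iters=20, seed=0):
    """Step (b): K-Means over token embeddings; keep the token nearest to
    each centroid as that cluster's representative (a small V_c)."""
    rng = np.random.default_rng(seed)
    centroids = embeddings[rng.choice(len(embeddings), n_clusters, replace=False)]
    for _ in range(n_iters):
        # Assign each token to its nearest centroid, then recompute centroids.
        dists = np.linalg.norm(embeddings[:, None] - centroids[None], axis=-1)
        assign = dists.argmin(axis=1)
        for c in range(n_clusters):
            members = embeddings[assign == c]
            if len(members):
                centroids[c] = members.mean(axis=0)
    reps = []
    for c in range(n_clusters):
        idx = np.where(assign == c)[0]
        if len(idx):
            best = idx[np.linalg.norm(embeddings[idx] - centroids[c], axis=-1).argmin()]
            reps.append(tokens[best])
    return reps

def prune_by_reward(tokens, reward_fn, alpha):
    """Step (c): keep the top-alpha fraction of tokens ranked by the
    incremental reward Delta R(v) = R([v]) - R([]) (stand-in for Eq. 3)."""
    base = reward_fn([])
    scored = sorted(tokens, key=lambda v: reward_fn([v]) - base, reverse=True)
    return scored[: max(1, int(alpha * len(scored)))]

def greedy_search(tokens, reward_fn, K):
    """Step (d): greedily grow a K-token prompt over the pruned space."""
    prompt = []
    for _ in range(K):
        prompt.append(max(tokens, key=lambda v: reward_fn(prompt + [v])))
    return prompt
```

In practice, `reward_fn` would query the black-box LLM on few-shot data, and the greedy step could be swapped for genetic, particle swarm, or any other gradient-free search without changing the clustering and pruning stages.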

Table 1: Accuracy on Flan-T5 base (Left) and Flan-T5 large (Right). We reproduce all baselines and report the mean over 5 random seeds for Flan-T5 base. For computation-expensive experiments, we report single-seed results for Flan-T5 large. The best and second-best results are marked in bold and ranked by color.

Table 2: Accuracy in 9 languages on XNLI with Flan-T5 base (Left) and Flan-T5 large (Right). Methods in the other languages show performance similar to Hindi and Swahili, with marginal improvements over random prediction, and are omitted from the table. We report single-seed RLPrompt results due to its computational cost on XNLI tasks. Refer to Table 1 for additional explanations.

Table 3: Comparing the efficiency of CLAPS with baselines in the few-shot learning setup with Flan-T5 base. We report the number of trainable parameters, the peak VRAM load, and the wall-clock time of the training phase for all methods. The pruning-phase time is included for CLAPS. Note that RLPrompt and BDPL are run under their respective default settings.

Table 4: Examples of CLAPS-discovered prompts compared to RLPrompt for a collection of tasks using Flan-T5 base. The CLAPS prompts are prepended to the formatted text queries, whose templates are listed in Appendix A. We defer thorough investigations to future work.

Table 5: Accuracy on Flan-T5 base with three different search algorithms for CLAPS. We reproduce all baselines and report the mean and standard deviation over 5 random seeds for Flan-T5 base. The best and second-best results are marked in bold and ranked by color.

Table 6: Performance of CLAPS (Genetics) with respect to the number of clusters in the clustering phase.

Table 8: Prompt templates for the prompt search experiments; we use the same templates for the BDPL and RLPrompt experiments. We use Template 1 for Flan-T5 base experiments. Due to the difficulty of SNLI/MNLI/XNLI, we evaluate Template 2, which includes a task instruction, for Flan-T5 large experiments, as Flan-T5 large demonstrates better in-context learning capability than smaller language models. All templates are manually created and fixed without iteration.