Universal Self-Adaptive Prompting

A hallmark of modern large language models (LLMs) is their impressive general zero-shot and few-shot abilities, often elicited through in-context learning (ICL) via prompting. However, while highly coveted and the most general, zero-shot performance in LLMs is still typically weaker due to the lack of guidance and the difficulty of applying existing automatic prompt design methods to general tasks when ground-truth labels are unavailable. In this study, we address this by presenting Universal Self-Adaptive Prompting (USP), an automatic prompt design approach specifically tailored for zero-shot learning (while compatible with few-shot). Requiring only a small amount of unlabeled data and an inference-only LLM, USP is highly versatile: to achieve universal prompting, USP categorizes a given NLP task into one of three possible task types and then uses a corresponding selector to select the most suitable queries and zero-shot model-generated responses as pseudo-demonstrations, thereby generalizing ICL to the zero-shot setup in a fully automated way. We evaluate USP with PaLM and PaLM 2 models and demonstrate performance that is considerably stronger than standard zero-shot baselines and often comparable to or even superior to few-shot baselines across more than 40 natural language understanding, natural language generation, and reasoning tasks.


Introduction
The recent advancements in large language models (LLMs) are among the most astonishing breakthroughs in artificial intelligence. The modern, massive attention-based (Vaswani et al., 2017) LLMs not only surpass humans and previous models in specific natural language processing tasks, but they have also demonstrated impressive general capabilities (Bubeck et al., 2023). Indeed, thanks to both the scaling of LLM sizes and advances in training and fine-tuning techniques (Brown et al., 2020; Sanh et al., 2021; Wei et al., 2021), one of the most prominent and impressive abilities of modern LLMs is their zero-shot generalizability in handling diverse and sophisticated tasks, even if the models have not been explicitly trained on them. Beyond zero-shot abilities, when a few demonstrations are available, LLMs can take advantage of the information in them via in-context learning (ICL) (Brown et al., 2020), leading to further improvements.
Such few-shot capabilities are often observed to improve as the LLMs scale (Brown et al., 2020; Wei et al., 2023). Along with careful prompting, in many cases, LLMs can perform similarly to, or even better than, fine-tuning, even though the latter is both more computationally expensive (due to gradient back-propagation) and more data-intensive. As such, in many scenarios, prompt-based learning has drastically reduced the barrier to the use of even the most massive LLMs.
Notwithstanding the breakthroughs, many open questions remain. While the zero-shot performances of LLMs are highly valued and widely used as a key yardstick of LLM capabilities (Chowdhery et al., 2022; Tay et al., 2022), LLMs still often show weaker performances and/or larger performance fluctuations in the zero-shot setting because of the lack of guidance or readily-available template solutions. While many automatic prompting methods have been proposed (refer to §4 for details), few existing works target the zero-shot setup, and heuristic manual prompt design is still often heavily relied upon (Reynolds and McDonell, 2021; Mishra et al., 2022).
On the other hand, even though the ICL paradigm has reduced the cost of data collection and labeling considerably, given that modern LLMs are typically used for an extremely diverse set of tasks, obtaining even a small number of labeled examples per task can easily become expensive for many tasks. Furthermore, in some tasks, obtaining even a few examples might require a non-trivial amount of human effort (e.g., summarization of long articles, translation of low-resource languages, and/or domain-specific question answering requiring research or expertise), or simply be impossible for novel tasks that are only revealed at test time.
To address this, we introduce USP (Universal Self-Adaptive Prompting), which specifically pushes the state of the art of ICL in zero-shot settings (while remaining compatible with few-shot) via pseudo-demonstrations (pseudo-demos) constructed from unlabeled queries and model-generated outputs. USP works with fully black-box, inference-only LLMs, and the use of pseudo-demos ensures that USP may operate entirely in the transductive zero-shot setup (Xian et al., 2017) using only unlabeled data. This makes USP extremely versatile, as unlabeled data is typically readily available via, e.g., continuous, on-the-fly collection of user queries. Unlike alternative methods that often require task knowledge beforehand (e.g., class names), USP requires only the task type information to select an appropriate confidence-quantifying metric (e.g., natural language understanding (NLU) or generation (NLG) - these need to be known anyway), while still remaining capable of using additional information like class names if they are indeed available (§3.3). This enables USP to work on arbitrary, potentially novel tasks at test time and/or tasks that simply cannot be cast as classification problems (e.g., open-domain QA and other generative tasks). USP is inspired by recent works leveraging confident predictions for model self-improvement on chain-of-thought tasks (Wang et al., 2022; Huang et al., 2022; Wan et al., 2023), but it inherits the benefits of these works and generalizes them considerably in terms of the scope of applicability. To achieve this, we derive various criteria capable of selecting high-quality pseudo-demos in the absence of any ground-truth labels. To summarize: 1) We propose USP, a versatile and black-box automatic prompting method that can operate zero-shot using only unlabeled data.
2) To achieve this, we select pseudo-demos from model-generated outputs via 3 carefully designed scoring functions suitable for different task types.
3) As shown in Fig. 1, we show that USP realizes large performance gains across more than 40 NLU, NLG, and reasoning tasks with PaLM and PaLM 2 models.

Preliminaries
In-context Learning (ICL). ICL enables LLMs to perform few-shot learning by processing several labeled, exemplary queries similar to the test queries we are interested in solving as demonstrations, or demos, in the prompts (Brown et al., 2020; Dong et al., 2022; Logan IV et al., 2022) (Fig. 2b). Formally, denoting a test query as x, if we have k pairs of related concatenated queries and labels s^(i) = Concat(x^(i), y^(i)) ∀ i ∈ {1, ..., k} serving as demos, we augment the test query by prepending the demos (and instructions, if any) to it:

C(x) = Concat(s^(1), ..., s^(k), x).   (1)

ICL is achieved by obtaining the prediction ŷ by querying C(x) instead of just x. In our zero-shot setup, none of the ground-truth labels (i.e., the ys) are available, and we propose to use the LLM predictions themselves as pseudo-demos. Thus, our zero-shot ICL instead has the form of:

Ĉ(x) = Concat(ŝ^(1), ..., ŝ^(k), x),   (2)

where ŝ^(i) = Concat(x^(i), ŷ^(i)), and the ultimate objective of USP is to generate and identify the most suitable set of such pseudo-demos.
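To make the construction concrete, below is a minimal Python sketch of Eqs. (1) and (2); the Q:/A: template and the llm callable are illustrative assumptions rather than the exact formats used in this work.

    # Minimal sketch of ICL prompt construction (Eqs. (1)-(2)).
    # `llm` is a hypothetical black-box, inference-only completion function.
    def concat_demo(x, y):
        """One demo s = Concat(x, y) in an assumed Q/A template."""
        return f"Q: {x}\nA: {y}"

    def build_icl_prompt(demos, test_query):
        """C(x) = Concat(s_1, ..., s_k, x): prepend the demos to the test query."""
        parts = [concat_demo(x, y) for x, y in demos] + [f"Q: {test_query}\nA:"]
        return "\n\n".join(parts)

    def build_zero_shot_icl_prompt(llm, unlabeled_queries, test_query):
        """Zero-shot variant (Eq. (2)): the 'labels' are the model's own predictions."""
        pseudo_demos = [(x, llm(f"Q: {x}\nA:")) for x in unlabeled_queries]
        return build_icl_prompt(pseudo_demos, test_query)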
Self-consistency. For LLMs, Wang et al. (2022) introduce self-consistency (SC) for chain-of-thought (CoT) reasoning tasks (Wei et al., 2022b) as an effective approximation of the model confidence - SC decodes each test query multiple times using a non-zero temperature to introduce stochasticity. The majority prediction is then chosen as the final prediction.
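A minimal sketch of the SC procedure under stated assumptions: sample_llm(prompt, temperature) is a hypothetical stochastic decoding interface, and the sampled answers are assumed to be already normalized strings.

    from collections import Counter

    def self_consistency(prompt, sample_llm, m=6, temperature=0.7):
        """Decode m times with a non-zero temperature and majority-vote the answers."""
        answers = [sample_llm(prompt, temperature=temperature) for _ in range(m)]
        majority, _count = Counter(answers).most_common(1)[0]
        return majority, answers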
COSP. Inspired by Wang et al. (2022) and entropy minimization (Grandvalet and Bengio, 2004), Wan et al. (2023) propose Consistency-based Self-Adaptive Prompting (COSP) to improve zero-shot CoT reasoning. COSP is the most influential prior work to us: as shown in Fig. 2c, COSP uses a two-stage approach. In Stage 1, COSP performs zero-shot inference with multiple decoding paths in a similar manner to SC and then computes the normalized entropy to quantify model confidence via the discrepancy in predictions from the same query on different decoding paths. COSP then ranks the Stage 1 outputs based on the entropy (and other metrics such as diversity and repetition) and selects the confident outputs as the pseudo-demos. In Stage 2, these pseudo-demos are prepended to the test queries in a manner similar to few-shot ICL, and the final predictions are given by the majority vote over outputs of both stages.

Motivation and Challenges of USP
Inspired by the success of COSP, we argue that the principle of confidence-based prompting should be universally applicable to all tasks, rather than being exclusive to the narrow set of reasoning tasks COSP considered; this forms the motivation and the goal of this paper. However, a number of limitations and challenges prohibit a trivial generalization: first, a universal prompting strategy needs to accommodate numerous, vastly diverse tasks that vary significantly in terms of objective, prompting, evaluation, and, unsurprisingly, confidence/uncertainty quantification. As a result, SC and the techniques developed by Wan et al. (2023) may be sub-optimal or even inapplicable for other task types: for instance, many problems are cast as classification, where the well-calibrated output logits are useful for uncertainty quantification, but such information is not used in the original formulation of COSP. Also, the notion of majority voting, crucial to COSP and SC, may not even exist for creative and generative tasks with many plausible solutions.

Overview of USP
To address the challenges, we present USP (Fig. 2d and Algorithm 1). USP shares some high-level similarities with the COSP formulation: USP also adopts a two-stage approach where, in Stage 1, the LLMs are prompted in a zero-shot manner to generate a collection of candidate responses, from which a few model-generated pseudo-demos are selected; in Stage 2, USP prepends these pseudo-demos to the test queries in a few-shot manner (Eq. (2)) and prompts the LLM again to obtain the final predictions. However, we highlight a few key design decisions, in particular those differing from COSP, that effectively overcome the aforementioned challenges and enable USP to generalize:

1. Task-specific pseudo-demo selector. The pseudo-demo selector, which selects the most suitable query-response pairs from the zero-shot outputs, is central to USP. With reference to Fig. 2c and 2d, whereas COSP only uses the consistency-based selector and hence is only applicable to a limited number of tasks, USP instead uses a task-type specific selector that is key for its versatility - we explain this in detail in §3.3.

2. Separating the test set and the demo-generating dataset. Instead of expecting the full test set T in Stage 1, USP expects a general unlabeled dataset D, which can be the full test set T, a subset of it, a different unlabeled set, or possibly even a model-generated dataset as in Schick and Schütze (2021) (although we always use a subset of T for simplicity in this work). Its sole purpose is to generate the pseudo-demos, enabling USP to work even if T is not known a priori in its entirety. Indeed, as we will show in §5, USP is capable of generating high-quality pseudo-demos with only 64 unlabeled samples per dataset. This makes USP more sample-efficient, due to the smaller number of unlabeled samples required, and more computationally efficient, as the algorithm only needs to iterate through D, which can be modestly sized, in Stage 1.
3. Dropping the reliance on majority vote. The use of the majority vote (as shown in Fig. 2c) is crucial for COSP, but as discussed, the procedure is also computationally expensive and inapplicable when the majority itself is ill-defined. To address this, by default, USP instead only decodes once in Stage 2 with greedy decoding (i.e., temperature = 0) and uses the maximum-likelihood (MLE) outputs as the final predictions. It is worth noting that USP remains compatible with majority voting over multiple decodes (where applicable) for further performance improvements, but it no longer depends on them to function.
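Putting the two stages together, the following is a high-level sketch of the USP pipeline (reusing build_icl_prompt from the earlier sketch). All callables are assumed interfaces; for brevity, Stage 1 is shown with a single decode per query, whereas SFG/LFG tasks use m sampled decodes as detailed in §3.3.

    def usp(unlabeled_queries, test_queries, sample_llm, greedy_llm,
            select_pseudo_demos, k=5):
        # Stage 1: zero-shot inference on the unlabeled set D to build the pool P.
        candidates = [(d, sample_llm(f"Q: {d}\nA:")) for d in unlabeled_queries]
        # Select K pseudo-demos with the task-specific scoring function (Sec. 3.3).
        pseudo_demos = select_pseudo_demos(candidates, k=k)
        # Stage 2: prepend the pseudo-demos (Eq. (2)) and decode greedily once.
        return [greedy_llm(build_icl_prompt(pseudo_demos, x)) for x in test_queries]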

Task-specific Selector
The objective of the selector (Step 7 in Algorithm 1) is 1) to build a pool of candidate pseudo-demos P, whose elements p^(j) are formed by concatenating dataset queries {d^(j)}_{j=1}^{N_u} and their zero-shot LLM predictions {ẑ^(j)}_{j=1}^{N_u}, and 2) to select S, a subset of K pseudo-demos from P, to be prepended to the test queries. We use a function F: P → R (the design of F is explained later in this section) to "score" each candidate. We select the first pseudo-demo in S by finding the maximizer of F(·) in P. For each of the subsequent pseudo-demos k ∈ {2, ..., K}, we instead repeatedly find the maximizer of F(·) with a diversity-promoting term to penalize candidates that are too similar to any of the pseudo-demos already selected, and add it to S:

s^(k) = argmax_{p ∈ P \ S} [ F(p) - λ max_{s ∈ S} S_c(φ(p), φ(s)) ],   (3)

where we follow Wan et al. (2023) to set λ, the trade-off parameter, to 0.2 in all experiments without further tuning, and we use z-score standardization for the two terms in Eq. (3) over P to ensure they are of a comparable magnitude; S_c(·, ·) denotes the cosine similarity and φ(·) is the sentence-level embedding given by an auxiliary model, as in COSP. The design of F(·), therefore, encodes our preference on which pseudo-demos should be prepended to the test queries for ICL. To achieve universal prompting, we categorize a possible task into one of the three generic types in Table 1. We use this categorization to design task-specific scoring functions F(·) below, and we empirically validate the effectiveness of these designs in §5.

Classification (CLS). With reference to Table 1, we first consider problems that feature the selection of a single correct answer from a few possible options - we use the descriptor CLS for "classification", as the label space C in this case is small and known, and the task is to pick the most probable class: ẑ^(j) = argmax_{c ∈ C} P(c|d^(j)). Since the logits are available in this case, we do not need self-consistency to estimate the prediction confidence, although we may still choose to use a self-consistency-based confidence metric if the model is poorly calibrated, or if self-consistency is preferable for other reasons (e.g., when CoT prompting is used and generating diverse reasoning paths via multiple-path decoding is beneficial - see the next paragraph on SFG for details). Instead, for p^(j) = Concat(d^(j), ẑ^(j)) ∈ P, we simply query the LLM once and use the negative entropy of the distribution over C as the function F for the CLS case:

F_CLS(p^(j)) = Σ_{c ∈ C} P(c|d^(j)) log P(c|d^(j)),   (4)

where P(c|d^(j)) is the normalized probability with Σ_{c ∈ C} P(c|d^(j)) = 1 - it is worth noting that, orthogonally, an improved uncertainty metric like the semantic uncertainty (Kuhn et al., 2023) may be used instead, although we do not consider these in the present work. We further use the knowledge of C to ensure good coverage of the label space, which has been shown to be important for a strong ICL performance (Min et al., 2022). Specifically, to build S, instead of simply generating K pseudo-demos from P, we generate K/|C| pseudo-demos for each c ∈ C from a subset P_c ⊂ P, where:

P_c = {p^(j) ∈ P : ẑ^(j) = c}.   (5)

This is because LLMs can be more confident in some classes, and simply choosing the most confident predictions overall as pseudo-demos may lead to a bias towards these classes; we mitigate this to ensure that the K selected pseudo-demos feature each class approximately uniformly. Note that it is possible that K < |C| or mod(K, |C|) ≠ 0. In these cases, we generate ⌈K/|C|⌉ pseudo-demos per class and prepend each test query x^(i) ∈ T with K randomly sampled pseudo-demos to ensure fairness in expectation over T. Lastly, it is possible that some classes are never predicted in D; e.g., an over-confident model may never predict the "not sure" option in inference tasks. As a result, the set P_c in Eq. (5) is empty for these unpredicted classes. To nevertheless generate the most plausible pseudo-demos for them, for an unpredicted class c_u, we pick the top queries in D with the highest model-assigned probability in c_u:

j* = argmax_{j ∈ {1, ..., N_u}} P(c_u|d^(j)),   (6)

noting that the indexing is over the unlabeled dataset D. These queries are then concatenated with the class label c_u to form the pseudo-demos for these unpredicted classes.
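To make the selector concrete, here is a minimal Python sketch of the CLS score (Eq. (4)) and the greedy, diversity-penalized selection (Eq. (3)). It is a simplification under stated assumptions: the embeddings are assumed to be unit-normalized sentence embeddings φ(p), only the score term is z-standardized, and the per-class balancing of Eq. (5) is omitted for brevity.

    import numpy as np

    def f_cls(class_probs):
        """Negative entropy of the normalized class distribution P(.|d) (Eq. (4))."""
        p = np.clip(np.asarray(class_probs, dtype=float), 1e-12, 1.0)
        return float(np.sum(p * np.log(p)))

    def greedy_select(scores, embeddings, k, lam=0.2):
        """Greedy selection of K pseudo-demos with a diversity penalty (Eq. (3)).

        scores: array of F values, one per candidate pseudo-demo in P.
        embeddings: unit-normalized sentence embeddings phi(p) of the candidates.
        """
        scores = np.asarray(scores, dtype=float)
        scores = (scores - scores.mean()) / (scores.std() + 1e-12)  # z-standardize F
        selected = [int(np.argmax(scores))]              # first demo: maximizer of F
        for _ in range(1, k):
            sims = embeddings @ embeddings[selected].T   # cosine sims S_c to chosen demos
            objective = scores - lam * sims.max(axis=1)  # Eq. (3) objective
            objective[selected] = -np.inf                # select without replacement
            selected.append(int(np.argmax(objective)))
        return selected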
Short-form Generation (SFG). We use the descriptor SFG (for Short-form Generation) to denote the class of generation problems typically with many possible responses but only one to a few correct responses; examples include question answering. Alternatively, as discussed in the previous paragraph, we may use the SFG formulation for CLS tasks if we use a text-to-text formulation like T5 (Raffel et al., 2020), have no access to or prefer not to rely on logits, or, as discussed, when self-consistency-style multiple decoding is preferable. Unlike the CLS case, we assume access to only the model outputs ẑ^(j) but not the logit distribution. This covers the case considered in COSP (problems such as arithmetic reasoning fall into this category), and thus we may use the normalized entropy in Wan et al. (2023) to gauge the model confidence, except that for non-CoT-prompted tasks, we skip the rationale generation step and prompt for answers directly. Specifically, for each d^(j) ∈ D, we query the LLM for m repetitions under temperature sampling to obtain m predictions {ẑ^(j)_ℓ}_{ℓ=1}^m; only the majority prediction of each query is added to P (i.e., ẑ^(j) = Maj({ẑ^(j)_ℓ}_{ℓ=1}^m)), but we use all m predictions to score the model confidence for each p^(j) ∈ P:

F_SFG(p^(j)) = (1 / log m) Σ_{α=1}^{µ} P̂(ẑ^(j)_α) log P̂(ẑ^(j)_α),   (7)

where µ ≤ m is the number of unique answers and P̂(ẑ^(j)_α) is the empirical frequency of a unique answer ẑ^(j)_α among all m predictions for d^(j).

Long-form Generation (LFG). The final category, LFG for Long-form Generation, features NLG tasks with longer responses and many plausible responses, with typical examples being summarization and translation. As discussed, Eq. (7) does not effectively approximate confidence/uncertainty in this case, as decoding the same query with temperature sampling m times is unlikely to yield identical responses in terms of surface text due to the length of the generation, even for confident predictions. On the other hand, it would also be challenging to apply logit-based modeling in the face of the high-dimensional joint probabilities and the presence of sequential relationships amongst the generated tokens. To measure confidence in this case, we first follow the SFG case by querying each d^(j) ∈ D for m repetitions to obtain {ẑ^(j)_ℓ}_{ℓ=1}^m. Instead of using Eq. (7), we compute the average pairwise ROUGE score between all pairs of the m responses:

F_LFG(p^(j)) = (2 / (m(m-1))) Σ_{ℓ=1}^{m} Σ_{ℓ'=ℓ+1}^{m} ROUGE(ẑ^(j)_ℓ, ẑ^(j)_ℓ'),   (8)

where another overlap metric, such as the pairwise BLEU (Shen et al., 2019) or the sentence-level embedding cosine similarity from an auxiliary model, may be used instead. Another challenge for LFG tasks is that, unlike SFG, where P can simply be built from the majority predictions for each query d^(j) ∈ D, the "majority" is no longer well-defined. We thus use F_LFG only to rank the confidence of the queries in D and to determine which queries are used in S.
For the response part of the pseudo-demos, we decode the LLM again with argmax decoding to obtain the MLE predictions on the selected queries to build S. Lastly, given that zero-shot text generation is purely driven by prompting and instructions, we observe that the LLMs sometimes generate extremely confident text completions instead of actually completing the instructed tasks (e.g., summarization); selecting these outputs as pseudo-demos, as we investigate in §5, can significantly degrade performance. Given that these outputs often feature an abnormally high F_LFG score, we apply a simple but canonical outlier filtering technique to remove queries with score > upper quartile + 1.5 × interquartile range (IQR) (Tukey et al., 1977).
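To make the two generative scores concrete, here is a minimal Python sketch of Eqs. (7) and (8) and the Tukey outlier filter; the rouge_score package (Google's rouge-score) is assumed for the overlap metric, and any comparable implementation may be substituted.

    import math
    from collections import Counter
    from itertools import combinations

    import numpy as np
    from rouge_score import rouge_scorer  # assumed dependency: rouge-score

    def f_sfg(sampled_answers):
        """Negative normalized entropy over the m sampled answers (Eq. (7)).

        Equals 0 when all m answers agree (maximal confidence); lower otherwise.
        """
        m = len(sampled_answers)
        if m < 2:
            return 0.0
        probs = [n / m for n in Counter(sampled_answers).values()]  # empirical freqs
        return sum(p * math.log(p) for p in probs) / math.log(m)

    _scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)

    def f_lfg(sampled_responses):
        """Average pairwise ROUGE-L F1 among the m decoded responses (Eq. (8))."""
        pair_scores = [_scorer.score(a, b)["rougeL"].fmeasure
                       for a, b in combinations(sampled_responses, 2)]
        return float(np.mean(pair_scores))

    def tukey_filter(scores):
        """Indices surviving the outlier rule: keep score <= Q3 + 1.5 * IQR."""
        q1, q3 = np.percentile(scores, [25, 75])
        return [i for i, s in enumerate(scores) if s <= q3 + 1.5 * (q3 - q1)]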

Cost Analysis
Computing the USP scores itself is cheap, and the cost is thus bottlenecked by the amount of processing on the LLM side. In particular, the additional costs are: • Stage 1: with |D| unlabeled samples, we require |D| additional model queries for CLS tasks and |D| · m queries for SFG and LFG tasks (we use |D| = 64 and m = 6, i.e., 384 additional queries) - it is worth noting that we can also use batching to parallelize this step. As seen in Table 5 in App. B, where the column |D|/|T| gives the fraction of unlabeled samples relative to the size of the entire test set, this additional cost is negligible compared to the cost we need to incur anyway by iterating over the test set, except for some very small-scale toy tasks with small test sets.
• Stage 2: this stage is completely identical to standard few-shot in-context learning. Thus, compared to standard zero-shot learning, USP requires the additional Stage 1, which typically only adds a small amount of cost, as discussed above. In Stage 2, the LLM needs to process a longer context due to the use of pseudo-demos for in-context learning. However, this is due to the use of in-context learning and is not an additional cost uniquely attributable to USP - it is true for all other methods relying on ICL. Compared to few-shot learning, the only additional overhead is Stage 1, but crucially, no labeled data is required at any point in time.

Related Works
Besides those covered in §2, here we discuss other related works in zero-shot automatic prompting. We include an additional literature review in App. A.
AutoCoT (Zhang et al., 2022) also uses model-generated outputs as pseudo-demos but differs in the selection procedure - it computes a sentence embedding of the available queries and uses clustering to select the centroid queries as pseudo-demos. This process, unlike USP, is purely based on query (dis)similarity rather than output quality, and the quality of the selected pseudo-demos is thus, in expectation, the same as the average model performance. AutoCoT is originally designed for reasoning tasks only (hence the name); we empirically compare against a generalized version of it in §5.
Another method, Z-ICL (Lyu et al., 2022), generates pseudo-demos with synonyms of random class names. However, by assuming label knowledge, it is limited to a subset of CLS tasks where it is reasonable to perform label synonym replacement. For example, while it is reasonable to replace simple sentiment-describing labels like "good" or "bad", the same may not be possible for factual labels or when labels are beyond single words (e.g., the race_{h,m} examples shown in Table 9). Randomly selecting labels also only generates correct demos with a probability of 1/|C| - given the recent discovery that modern LLMs genuinely learn from the demos and can be sensitive to their correctness (Wei et al., 2023), providing mostly wrong demos is sub-optimal. To represent this class of methods, we compare against a Random demo baseline in our experiments (see §5 for details).

Discussion of main results. We show the results of the CLS, SFG, and LFG tasks with PaLM-540B in Tables 2, 3, and 4, respectively, and BBH results on PaLM 2-M are shown in Fig. 3 (examples of the generated pseudo-demos in representative tasks are shown in Tables 12 and 13 in App. D.4, and PaLM-62B results are shown in App. D.1). We find that USP greatly improves upon standard zero-shot prompting without any pseudo-demos, outperforms other zero-shot methods using pseudo-demos, and is often competitive with or better than few-shot prompting, all achieved with only 64 unlabeled samples per task. Generally, we find the gain margin to be larger in generative tasks and in larger and/or more advanced models. We hypothesize that 1) LLMs benefit more from the guidance of demonstrations in generative tasks, which essentially feature unbounded action spaces, whereas in CLS, the LLM only needs to select a response out of a few; and 2) larger models and/or those trained with more advanced techniques (e.g., instruction fine-tuning) have stronger ICL capabilities to take advantage of demos of better quality.
Few-shot USP. On the BBH tasks with PaLM 2, we also test a few-shot variant of USP (termed USPfs) that generates additional pseudo-demos on top of scarce, manual demos. We show the results in Fig. 9 in App. D: USPfs outperforms both the zero-shot USP reported in Fig. 3 and standard 1-shot prompting, thereby highlighting the generality of USP.

How does USP work?
To analyze how the USP procedure (§3.3) improves performance, we plot the USP scores against the ground-truth performance (accuracy, EM, or ROUGE) of the queries in the unlabeled datasets D (with |D| = 64) in Fig. 4 (additional results are reported in App. D). We observe that across task types and difficulty levels (as measured by the average performance marked by the gray dashed lines in Fig. 4), the USP scores are generally well-correlated with the ground-truth performance, which also validates the finding that LLMs "mostly know what they know" (Kadavath et al., 2022). The recent findings that larger LLMs genuinely learn information from in-context examples (instead of simply following a prompt format) and thus benefit more from correct examples (Wei et al., 2023) are consistent with the results of USP, which, as we show, is more likely to generate correct/high-quality pseudo-demos. Interestingly, a concurrent work (Margatina et al., 2023) also shows that even when golden labeled examples are available, better in-context examples still tend to exhibit low uncertainty and diversity.

When does USP work better? While USP improves performance in general, there are cases where USP underperforms standard zero-shot - this seemingly counter-intuitive phenomenon is not unique to USP and is common even for few-shot learning with golden examples, both in our results and in previous works (Brown et al., 2020; Chowdhery et al., 2022). Nonetheless, understanding when it happens for specific tasks can be crucial for users' decision-making. As shown in Fig. 5, we find the average Stage 1 USP score across D to be a good zero-shot indicator of the extent of improvement from USP. An intuitive explanation is that the average USP score quantifies the general uncertainty the model has about the task (and potentially the task difficulty): with a high average USP score, the model is already confident under zero-shot, and the benefits from ICL are lower (and ICL may sometimes even worsen performance). On the other hand, a low average USP score suggests high model uncertainty and larger potential gains from additional guidance.

Conclusion
We propose USP, a versatile, zero-shot automatic prompting technique applicable to a wide range of NLU, NLG, and reasoning tasks. We show large improvements over standard zero-shot prompting and other baselines on over 40 tasks with 3 LLMs.

Limitations
We believe that the room for future improvement is ample. First, the present work specifically targets in-context demonstrations, a sub-component of the overall prompt, and it does not attempt to optimize the other components; future work could relax this restriction and combine USP with orthogonal techniques (e.g., calibration methods (Zhao et al., 2021; Han et al., 2023; Zhou et al., 2023a) and black-box methods targeting other parts of the overall prompt (Deng et al., 2022; Zhou et al., 2023b)) for improved flexibility.
Second, while our method is general in terms of tasks, it might be more demanding in terms of model capabilities: for the USP score to function as intended, we implicitly require the model to generate well-calibrated outputs in terms of uncertainty, and the ICL formulation also requires strong in-context learning abilities, both of which have been shown to correlate strongly with model size (Kadavath et al., 2022; Wei et al., 2022a).

Third, the present work only considers tasks with natural language outputs. Given the ever-improving capabilities of LLMs, it would also be interesting to apply the idea in more novel setups, including but not limited to planning (where LLMs act as autonomous, environment-interacting agents) and multi-modal settings beyond pure NLP problems.
Lastly, we note that, especially for the generative tasks (SFG and LFG), in many cases USP greatly improves the zero-shot performance but does not always completely close the gap to the few-shot baseline using golden examples. There are also cases where USP does not meaningfully improve over the zero-shot baselines. While we provide a brief analysis in §5 to investigate when this happens, it would also be helpful to investigate whether there are potential remedies, especially given that, as discussed, such occasional performance deterioration occurs even with few-shot prompting with golden demonstrations. We defer thorough investigations to future work.

A Additional Related Works
In this section, we discuss additional prior works that are related to USP in various aspects.
Bootstrapping LLM knowledge. The promising abilities of LLMs have led to efforts to improve them with their own outputs. For automatic prompt optimization, methods based on discrete prompt search and editing (Prasad et al., 2022; Wen et al., 2023), reinforcement learning (Deng et al., 2022; Zhang et al., 2023), and gradient estimation (Diao et al., 2022) have been proposed. While discrete prompts are more interpretable and (in some cases) compatible with black-box, inference-only LLMs, to our knowledge, none of these works operates in the zero-shot setup, and tasks beyond CLS problems (in our definition in §3.3) are scarcely investigated. Furthermore, unlike USP, these methods also often require hundreds if not thousands of LLM queries before converging to good prompts. As for ICL, most methods focus on retrieving the best in-context examples from a pool of golden examples instead of operating zero-shot (Rubin et al., 2022; Liu et al., 2022); an exception is AutoCoT, which we discuss in §4. Additionally, several other prompting approaches like NPPrompt (Zhao et al., 2022) and Null Prompt (Logan IV et al., 2022) have also been proposed, but these methods again only work for CLS tasks and are orthogonal to USP since they target aspects of prompting other than the in-context examples.
On PaLM 2 models, we use the BIG-Bench Hard dataset consisting of 23 sub-tasks (data available at https://github.com/suzgunmirac/BIG-Bench-Hard/). The tasks, in alphabetical order, are as follows (the results presented in Fig. 3 and Fig. 9 in App. D are also in the following order):

B.2 Models
We conduct experiments on two PaLM model variants - one with 540 billion parameters (PaLM-540B) and one with 62 billion parameters (PaLM-62B). PaLM is a Transformer-based LLM "pretrained on a high-quality corpus of 780 billion tokens that comprise various natural language tasks and use cases. This dataset includes filtered webpages, books, Wikipedia articles, news articles, source code obtained from open source repositories on GitHub, and social media conversations" (Chowdhery et al., 2022). For the pretraining procedure, PaLM was trained over two TPU v4 Pods with 3072 TPU v4 chips (Chowdhery et al., 2022).
In all experiments, we use the quantized PaLM checkpoints (in int8 precision) for inference only without further pretraining or finetuning.
We also experiment on PaLM 2-M, a variant of the PaLM 2 models (Google et al., 2023).PaLM 2, a Transformer-based model trained on UL2-like objectives (Tay et al., 2022), is the successor of PaLM that features stronger multilingual and reasoning abilities.

C.1 Prompt Templates
We largely adopt the prompt format used in GPT-3 (Brown et al., 2020) where possible, and we show the detailed prompt templates in Tables 9, 10 and 11. BBH tasks are formulated as SFG tasks, but they use the CoT prompting templates.
BBH tasks. For experiments using few-shot prompting templates (including few-shot, USP, AutoCoT, and Random demo once the pseudo-demos are acquired), we use the following prompt format to obtain both the rationales and the final answers in one prompting step:

// Demos or pseudo-demos
Q: [QUERY]. A: Let's think step by step. [RATIONALE]. So the answer is [ANS].
...
// Test query
Q: [QUERY]. A: Let's think step by step.

For zero-shot experiments (including standard zero-shot, and USP, AutoCoT, and Random demo in the stage of acquiring pseudo-demos), we use the following prompt format proposed in Kojima et al. (2022) to obtain the rationales and answers in two separate steps:

Q: [QUERY]. A: Let's think step by step.

After the rationales are obtained, the LLM is prompted again to obtain the final answer.

Non-BBH tasks. It is worth noting that some datasets (race_h, race_m, and squad) are not zero-shot in the strictest sense even when no demonstration is provided - we follow the GPT-3 prompt format (Figs. G.1, G.3, and G.28, respectively, for race_h, race_m, and squad in Brown et al. (2020)). In these datasets, each test query consists of a context article and several reading comprehension questions in relation to that article, and even in the absence of demonstrations (in the form of one or more other articles and answered questions associated with those articles), some questions (other than the test question itself) and their solutions for the same article are included nevertheless. Therefore, even in the zero-shot setup, the LLM is shown some demonstrations while being "zero-shot" in the sense that the context article itself is novel. Similarly, "K pseudo-demos" in these datasets refers to K (pseudo-)demonstrations, each of which consists of a single article and its associated questions (of which there can be multiple) - in this sense, (1) there are typically more than K solved questions prepended to the test queries, and (2) even for the model-generated demos, there may be parts of the pseudo-demos that are guaranteed to be correct simply due to the prompting format. Another complication of such a prompt format for methods using pseudo-demos (AutoCoT, Random demo, USP) is that, since the responses to a subset of test queries are used as demonstrations themselves, it is possible that a small number of solutions to some questions are revealed to the LLMs in the form of solved questions in some demonstrations. However, given that only 5 pseudo-demonstrations are used per question, the impact is insignificant, as the test sets of each of these datasets contain thousands to tens of thousands of queries (detailed in Table 5). Furthermore, no method is given an unfair advantage over another, as all methods, including USP and the key baselines we compare against, are subject to the same complication; we thus report results on these datasets nevertheless but mark the impacted results in Tables 2 and 3 with a special note.

C.2 Additional Experimental Details
USP. USP uses an auxiliary language model for computing the similarity term in Eq. (3). We use Sentence-T5-large (Ni et al., 2022) for all our experiments. We use a maximum of 128 decoding steps (tokens) for all experiments. For summarization tasks, we apply an additional filtering rule to retain answers whose number of words is between 5 and 90 (to prune out overly short and overly long summaries, which are obviously sub-optimal). For all tasks, we use the following stop tokens as marks for truncation (words after any stop token, including the stop token itself, are truncated): "Q:", "A:", "\n\n", and other special tokens used in PaLM to signal the end of the response. Additionally,
we also apply several additional post-processing steps for the generative tasks, in USP and all other baseline methods:
1. lambada: retain the first output word.
2. squad: remove punctuation marks, remove article words (a, an, the), and retain the portion of the outputs before any newline (\n).
3. web_questions & natural_questions: replace all punctuation marks with a white space, remove article words (a, an, the), and retain the portion of the outputs before any newline (\n).
4. LFG (summarization): since we used the prefix "Article: " at the start of each article to be summarized, we also add "Article: " to the list of stop tokens, in addition to the general ones described above.
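As an illustration, below is a minimal sketch of the stop-token truncation and the squad-style normalization described above; the stop-token list follows the description, and the PaLM-specific end-of-response tokens are omitted as assumptions.

    import re
    import string

    STOP_TOKENS = ["Q:", "A:", "\n\n"]  # plus model-specific end-of-response tokens

    def truncate_at_stop_tokens(text, stop_tokens=STOP_TOKENS):
        """Drop everything from the first stop token onwards."""
        cut = min((i for i in (text.find(t) for t in stop_tokens) if i != -1),
                  default=len(text))
        return text[:cut].strip()

    def normalize_short_answer(text):
        """Sketch of the squad-style rule: keep the text before any newline,
        then strip punctuation and article words (a, an, the)."""
        text = text.split("\n")[0]
        text = text.translate(str.maketrans("", "", string.punctuation))
        return " ".join(re.sub(r"\b(a|an|the)\b", " ", text).split())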
Baselines. We use the same filtering rules for the baseline methods as for USP. As discussed, the Random demo baseline uses an identical procedure to USP, with the sole exception that it does not rely on the scoring functions in §3.3 to select the set of pseudo-demos; rather, for each test query in T = {x^(i)}_{i=1}^N, it samples K pseudo-demos randomly from all Stage 1 responses (note that for CLS tasks, it also follows the procedures described in §3.3 to ensure a fair allocation of pseudo-demos across classes). For AutoCoT, we adapt from the official implementation available at https://github.com/amazon-science/auto-cot with a few key modifications: (i) following COSP, we replace SentenceBERT (Reimers and Gurevych, 2019) with SentenceT5, a more powerful sentence-level Transformer, for a fair comparison against USP; (ii) given that AutoCoT is originally designed for chain-of-thought (CoT) tasks only, we make the necessary modifications so that it is compatible with the general setup. The changes are, in fact, minimal - we only replace the original filtering rules in AutoCoT with the ones we described above for USP. For the few-shot baseline, we closely follow existing works (Brown et al., 2020; Chowdhery et al., 2022) and sample K demonstrations from the training split of each dataset considered, which are prepended to the test queries; we perform sampling for each test query, and thus the choice and order of the demonstrations, in general, differ from one query to another. We use the identical post-processing rules as USP, mentioned in the previous paragraph, for the baselines.

D Additional Experiments D.1 PaLM-62B Results
In this section, we show the PaLM-62B results in Tables 6, 7 and 8.

D.2 Few-shot USP
In this section, we show the results of applying USP in the few-shot setup. We conduct experiments on the BBH datasets with the PaLM 2-M model, as in the main text. Instead of using zero labeled samples (0-shot in Fig. 3) or 3 labeled samples (3-shot, or few-shot, in Fig. 3), we use 1 labeled sample per query and use USP to generate 2 further pseudo-demos (we name this variant USPfs, where fs stands for "few-shot") - this emulates the setup where scarce labeled data is available and it is desirable to use USP to augment the set of golden demonstrations. We show the results in Fig. 9: we find that while using 3 golden examples is still the best, USPfs outperforms both standard 1-shot and USP without using any labeled example, and it also closes roughly half of the gap between 1-shot and 3-shot. This suggests that the USP routine continues to be effective in the few-shot setup, and can thus also be suitable for setups less strict than zero-shot, but where obtaining many human-labeled demonstrations is still expensive or otherwise challenging.

D.3 Additional Comparison Between USP Scores and Ground-truth Quality
Complementary to Fig. 4 in §5, we show plots of the same relation for other tasks considered with PaLM-540B in Fig. 6, and the aggregated results (across CLS tasks) in Fig. 7; we also show the comparisons on selected BBH tasks with PaLM 2-M in Fig. 8. These give further evidence that the USP heuristic described in §3.3 selects higher-quality demonstrations in comparison to the average model performance.

D.4 Examples of Selected Pseudo-demos
We show some examples of the pseudo-demos generated by USP on a variety of representative tasks in Table 12.

Prompt template

winogrande: The woman avoided the hole but easily stepped over the {hole / pit}, because the hole was very shallow

piqa: Q: To pour hot fudge over ice cream before serving,\nA: {pour the hot fudge over ice cream that has just been pulled from the freezer and scooped out of it's container with an ice cream scoop into a bowl / pour the hot fudge over ice cream that has been pulled out of the freezer and softened for fifteen minutes, then scooped out of it's container with an ice cream scoop into a bowl.}

storycloze: Neil wanted to see ancient temples and ruins. He decided Asia was a great place to start. He flew to Cambodia and went sightseeing. He saw so many old temples in the jungles there. {Neil was bored of the trip and went home.

Figure 1 :
Figure 1: We propose USP, a versatile zero-shot prompting method that improves over standard zero-shot prompting across more than 40 Classification (CLS), Short-form Generation (SFG) and Long-form Generation (LFG) tasks (see §3.3 for further explanations) with PaLM-62B, PaLM-540B and PaLM 2 models.

Figure 2 :
Figure 2: Overview of (a) the zero-shot setup, (b) the few-shot setup with in-context learning, (c) Consistency-based Self-Adaptive Prompting (Wan et al., 2023), and (d) Universal Self-Adaptive Prompting, or USP, the proposed method in this work. The queries without demos, with which LLMs are directly prompted (zero-shot, or Stage 1 in COSP and USP), are marked with red arrows, and the queries prepended with either the handcrafted demos (few-shot) or the model-generated pseudo-demos (Stage 2 in COSP and USP) are marked with blue arrows.
On PaLM-540B and PaLM-62B (Chowdhery et al., 2022), we consider a wide variety of CLS, SFG and LFG tasks; the readers are referred to App. B for more details. We also experiment on the state-of-the-art PaLM 2-M (Google et al., 2023) model and test it on BIG-Bench Hard (BBH) tasks, a suite of challenging tasks often requiring complicated reasoning, logic, or manipulations where previous models underperform humans (Suzgun et al., 2022). We compare USP against (i) standard zero-shot prompting (except for BBH tasks, where we use standard zero-shot-CoT prompting (Kojima et al., 2022)) (0-shot); (ii) an adapted version of AutoCoT (Zhang et al., 2022) for general NLP tasks (AutoCoT); (iii) Random demo, where we follow all of the USP procedure except that we randomly sample K demos from P - this serves both as an ablation baseline to USP and as a generalization of methods like Z-ICL described in §4 which only work for CLS tasks, except that Random demo is arguably stronger as it samples from the model predictions rather than the possible classes, the former of which is more likely to yield correct pseudo-demos.

Figure 3 :
Figure 3: Accuracy on BIG-Bench Hard tasks with PaLM 2-M (each line represents a task of the suite - refer to App. B for full details). The gain/loss of USP over standard 0-shot is shown in percentages. Note that 3 (pseudo-)demos are generated per query following Google et al. (2023). Human refers to the average human performance from Suzgun et al. (2022).

Figure 4 :
Figure 4: USP picks confident predictions that are more likely to be better. Ground-truth performance metrics on the Stage 1 unlabeled samples (D) against USP scores in selected tasks with PaLM-540B: F_CLS against accuracy (CLS), F_SFG against EM (SFG), and F_LFG against ROUGE-LSum (LFG). CLS: single-sample accuracy is binary, so we discretize F_CLS into 10 deciles and show the mean accuracy ± 1 SEM in each bin. SFG: same as CLS, except that F_SFG is already discrete and no further discretization is performed; marker sizes are proportional to the number of samples at each F_SFG value. LFG: both the evaluation metric and F_LFG are continuous and we plot all data without aggregation - since we query each d^(j) ∈ D 6 times, we show the mean ± SEM ground-truth ROUGE score for each d^(j); gray × markers denote outliers. The overall mean performance over D (gray dashed lines) and linear trend lines and confidence intervals are shown in all plots. More results are provided in App. D.3.

Figure 5 :
Figure 5: Gain from USP is larger with higher zero-shot uncertainty. Relative gain of Stage 2 over Stage 1 accuracy/EM on PaLM-540B/CLS tasks (left) and PaLM 2-M/BBH tasks (right) against the average USP score: E_{z∼D}[F_{CLS/SFG}(z)]. A higher average USP score denotes lower zero-shot uncertainty. Trend lines and confidence intervals (shaded) are shown.

Figure 6 :
Figure 6: Complementary to Fig. 4, we show the same plot (USP scores vs. ground-truth performance metrics) for additional tasks with PaLM-540B. Refer to Fig. 4 for further explanations.

Figure 7 :
Figure 7: Comparison of the USP score against accuracy averaged across all CLS tasks considered in this paper for PaLM-62B (left) and PaLM-540B (right). Markers and error bars denote mean ± SEM. It is evident that, in expectation, queries with a higher USP score tend to perform better compared to the average model performance (marked by the gray dashed line).

Figure 8 :
Figure 8: Complementary to Fig. 4, we show the same plot (USP scores vs. ground-truth performance metrics) for additional tasks (BBH tasks with PaLM 2). Refer to Fig. 4 for further explanations.

Figure 9 :
Figure 9: Few-shot accuracy on BIG-Bench Hard tasks with PaLM 2-M (each line represents a task - refer to App. B for full details). The gain/loss of USPfs over standard 1-shot is shown in percentages. USPfs generates 2 pseudo-demos on top of the 1 provided golden demo. Zero-shot USP and 3-shot results are reproduced from Fig. 3.
Algorithm 1 USP. Stage 1 steps are marked in red, and Stage 2 steps are marked in blue.
Input: test set with size N: T = {x^(i)}_{i=1}^N; unlabeled set for demo generation with size N_u: D = {d^(j)}_{j=1}^{N_u} (can be the same as or a subset of T, or a different but related set of unlabeled queries); pool of generated responses P ← ∅; task type t ∈ {CLS, SFG, LFG} (§3.3).
[Stage 1] Query the LLM with each d^(j) ∈ D in a zero-shot manner and add the query-response pairs to P.
Build the pseudo-demo set S = {s_1, ..., s_K} (with |S| = K) from P with one of the selectors in §3.3, depending on t.
for i = 1, ..., N do
  [Stage 2] Concatenate S to x^(i) (Eq. 2) and query again (with greedy decoding for generative (SFG/LFG) tasks) to obtain the final LLM prediction ŷ^(i).
end for

Table 2 :
Accuracy on CLS tasks (Table 1) with PaLM-540B (Chowdhery et al., 2022) (refer to App. D.1 for results with PaLM-62B). Methods in the Zero-shot columns do not use ground-truth label guidance and generate 5 pseudo-demos where applicable, whereas the 5-shot results use 5 human-labeled in-context demos. The top two results for each model are bolded and ranked by color: best and second-best. ↑: larger is better. ↓: smaller is better. *: see notes in App. C.1.

Table 3 :
Exact Match (EM) / F1 on SFG tasks with PaLM-540B (refer to App. D.1 for results with PaLM-62B). a: only EM is shown, as lambada expects a single correct answer. b: used lambada EM for the average F1 score. c: ranked in terms of EM. *: see notes in App. C.1. Refer to Table 2 for further explanations.

Table 4 :
ROUGE / BLEURT (Sellam et al., 2020) scores on LFG tasks with PaLM-540B (refer to App. D.1 for results with PaLM-62B). Note that due to the longer context length in the LFG problems considered, we generate 1 pseudo-demo under the zero-shot setting (if applicable), and use 1 demonstration under the few-shot setting (instead of 5 as in Tables 2 and 3). Refer to Table 2 for further explanations.

Table 5 :
Details of the datasets used in this work for the PaLM models. Note that the test set here refers to the split of the dataset on which the results of this paper are reported - in some datasets, the test labels are not publicly available, and we instead report performance on the dev/validation set. The final column (|D|/|T|) denotes the percentage of the test set that is used as the unlabeled dataset D for pseudo-demo generation in USP, AutoCoT, and Random demo.

Table 6 :
Accuracy on CLS tasks (Table 1) with PaLM-62B.
anli_r{1,2,3}: Telugu film directed by Puri Jagannadh. It features Varun Tej and Disha Patani in the lead roles while Revathi and Posani Krishna Murali appear in crucial supporting roles. The film was officially launched on 8 July 2015 in Hyderabad. Earlier, the makers revealed the first-look posters and trailer of the movie, which received a good response on social media.\nquestion: Varun Tej had billing over Disha Patani in Lofar. Is it true, false, or neither?\nanswer: {true / false / neither}

boolq: Evil Queen (Disney) - This version of the fairy tale character has been very well received by film critics and the public, and is considered one of Disney's most iconic and menacing villains. Besides the film, the Evil Queen has made numerous appearances in Disney attractions and productions, including not only those directly related to the tale of Snow White, such as Fantasmic!, The Kingdom Keepers and Kingdom Hearts Birth by Sleep, sometimes appearing in them alongside Maleficent from Sleeping Beauty. The film's version of the Queen has also become a popular archetype that influenced a number of artists and non-Disney works.\nquestion: are maleficent and the evil queen the same\nanswer: {yes / no}

copa: The tree branch landed in the river {so the branch moved downstream. / the river's current became stronger.}

rte: Tropical Storm Irene on August 11, 2005 at 16:15 UTC. Tropical Storm Irene will increase in strength over the next several days, possibly developing into a hurricane that will hit the east coast of the United States, said the National Hurricane Center of Miami, Florida in a report today. Irene was located approximately 975 kilometers south-southeast of Bermuda at 16:00 UTC today. Forecasters say that the storm is now moving in a west-northwest direction with top sustained winds of 40 miles per hour.\nquestion: A storm called Irene is going to approach the east coast of the US. Is it true or false?\nanswer: {true / false}

race_{m,h}: October is getting closer and it also means that the year of 2014 is coming to an end. "Hooray! It's a holiday!" While you are thinking of putting textbooks aside and playing video games, let's take a look at what children in other continents usually do during their holidays. Children in America don't have much homework to do. They keep themselves busy by playing camp games. A parent says, "My daughter Shirley usually attends different camps. We don't ask her to spend plenty of time on maths problems or spelling tests." Children in Australia take part in activities on over twenty different themes. They learn painting, dancing, singing, history, culture and so on. Parents can _ their kids to enjoy the learning process and to build a closer relationship with them. These are what African kids do: build a boat, have a camel race, make a drum and make a rag football. Don't you think it is interesting that kids in other places have no idea how to make a drum, but kids in Africa do? Plan your holiday well and try what you want to try. Make a good plan and you will have a lot of fun. Q: Where does Shirley come from? A: {America, China, Brazil, Australia}

Table 9 :
Prompt templates (with examples) of the CLS datasets used. Note that the anli_r{1,2,3}, race_{m,h}, and arc_{c,e} datasets are grouped together due to their similar prompt formats. The LLM is asked to output the log-likelihood of each of the options marked in blue as a possible text completion, and the option with the highest predicted probability is selected as the final prediction. Note that the race_{h,m} datasets are not strictly zero-shot, as the prompt already contains several answered questions about the context passage leading up to the test query - see App. C.1 for detailed explanations.

lambada: Yes, I am absolutely sure you did, Cook. I can see the empty egg boxes like you said, thirteen of them."\nCaptain Porter was used to getting to the bottom of these sorts of incidents, especially when it involved some of his boys.\n"Has anyone else been in the kitchen, Cook

web_questions: Q: who were jesus siblings?\nA: {Jude the Apostle / James the Just / Simon (brother of Jesus) / Joses}

natural_questions: Q: how long is the bridge between new brunswick and prince edward island\nA: 2.9-kilometre

triviaqa_wiki: Q: How many medals did the United States win at the 2010 Winter Olympics?\nA: {37 / thirty seven}

squad: Title: Southern_California\n\nBackground: The San Bernardino-Riverside area maintains the business districts of Downtown San Bernardino, Hospitality Business/Financial Centre, University Town which are in San Bernardino and Downtown Riverside.\n\nQ: The San Bernardino - Riverside area maintains what kind of district?\n\nA: business\n\nQ: Other than San Bernardino, what is the name of the other city that maintains the districts including University Town?\n\nA: Riverside\n\nQ: Other than Downtown San Bernardino, and University Town, what is the name of another business district in the San Bernardino-Riverside area?\n\nA: Hospitality Business/Financial Centre\n\nQ: What business districts does the San Bernardino area maintain?\n\nA: no answer\n\nQ: What business districts does the Riverside area maintain?\n\nA: no answer

Table 10 :
Prompt templates (with examples) of the SFG datasets used. The expected response(s) are marked in green. Note that the squad dataset is not strictly zero-shot, as the prompt already contains several answered questions about the context passage leading up to the test query - see App. C.1 for detailed explanations.

xsum: Upsetting events often make the news because they don't happen very often.\nThis section gives you some tips about what to do if you are feeling sad about what you've seen, heard or read.\nYou can rely on Newsround to tell you the important facts about a story - but some things you hear might be a bit scary or make you feel worried.\nRemember that worrying stories are often in the news because they are rare - they don't happen very often.\nIt is incredibly unlikely that what you're reading about or watching might happen near you.\nDiscuss the stories with your parents or friends. You'll feel better that you're not the only one worried.\nYou could also talk to your teacher about it - maybe you could have a class discussion which would help you understand the issue better.\nIf you're having nightmares or trouble sleeping because of something you've heard in the news: \n\ntl;dr: Some stories reported by Newsround can make you feel sad - but you are not the only one and it's OK to have those feelings.

wikilingua: Article: The most commonly used classes of OTC pain medications include Acetaminophen (Tylenol), and a class of drugs called "NSAIDs." NSAIDs stand for "nonsteroidal anti-inflammatory drugs," and include medications such as Ibuprofen (Advil, Motrin), and Naproxen sodium (Aleve). Aspirin is also technically an NSAID, although it is more frequently used in the prevention of heart attacks and strokes than it is in easing chronic pain. [Omitted] This can lead to gastrointestinal bleeding and anemia. Special care should be taken with those who drink alcohol. Always read the label of cold and flu medications carefully to see what ingredients are present in the mixture. If you need OTC drugs for more than 10 days, book an appointment with your physician to do a more detailed assessment of your pain, and to look into alternative modes of treatment that may be more effective (and also safer) for you moving forward. Also consult your doctor if you have other health concerns, such as ongoing heart disease, kidney disease, or liver disease, prior to using OTC medications for your pain.\n\ntl;dr: Be aware of acceptable doses of OTC pain medications. Understand the risks of overusing OTC drugs. Consult your doctor if you are unable to manage your pain without exceeding the recommended daily dosage of OTC drugs.

Table 11 :
Prompt templates (with examples) of the LFG datasets used. The reference summaries are marked in orange. We tried various prompts to elicit the zero-shot summarization ability in PaLM and found that "\n\ntl;dr: " works the best, likely because it is a common shorthand used in online forums, where most of the PaLM pretraining data were obtained.