Let’s Sample Step by Step: Adaptive-Consistency for Efficient Reasoning and Coding with LLMs

A popular approach for improving the correctness of output from large language models (LLMs) is Self-Consistency: poll the LLM multiple times and output the most frequent solution. Existing Self-Consistency techniques always generate a constant number of samples per question, whereas a better approach would be to non-uniformly distribute the available budget based on the amount of agreement in the samples generated so far. In response, we introduce Adaptive-Consistency, a cost-efficient, model-agnostic technique that dynamically adjusts the number of samples per question using a lightweight stopping criterion. Our experiments over 17 reasoning and code generation datasets and three LLMs demonstrate that Adaptive-Consistency reduces the sample budget by up to 7.9× with an average accuracy drop of less than 0.1%.


Introduction
The increasing adoption of large language models (LLMs) across various tasks, such as text generation and reasoning (Wei et al., 2022; Kojima et al., 2022; Wang et al., 2022a; Mishra et al., 2022), mathematical reasoning (Lewkowycz et al., 2022; Gao et al., 2022; Arora et al., 2023), and code generation (Li et al., 2022; Madaan et al., 2023b), has underscored the importance of improving the correctness of their outputs. A popular method for achieving this goal is Self-Consistency (Wang et al., 2022b), a majority voting technique where multiple output samples are generated for a given input, and the final decision is based on the most frequently occurring output among the samples.
Current Self-Consistency methods typically employ a fixed-budget approach, wherein a predetermined number of samples (e.g., 40) are generated to make a decision. However, as LLMs continue to grow in size and complexity, the sampling time and computational costs associated with majority voting become increasingly challenging. This challenge is particularly evident in high-stakes applications like competition-level code generation (Li et al., 2022), where generating a large number of programs, sometimes up to a million, is essential for maximizing performance.
To address this challenge, we introduce Adaptive-Consistency, a cost-efficient, model-agnostic majority voting technique. Adaptive-Consistency employs a lightweight stopping criterion that dynamically adjusts the number of samples (n) for each input, as opposed to using a fixed budget (k). The intuition is that if a clear majority is established with high confidence after sampling fewer than k answers (n < k), there is no need to generate additional samples.
Adaptive-Consistency models the probability distribution over unique samples using a Dirichlet distribution, allowing us to quantify the confidence in the lead of the majority element over the other elements. For instance, if the majority element has a count of 9 out of the first 10 samples, the likelihood of it remaining the majority element even after 40 samples is very high (> 99%). This allows Adaptive-Consistency to stop sampling at this point, reducing the cost by 30 samples, while Self-Consistency would continue to sample all 40 answers. As an inference-time technique requiring no additional training, Adaptive-Consistency provides a convenient off-the-shelf option for all pre-trained language models, offering the flexibility to balance computational cost and performance.
We evaluate Adaptive-Consistency on 17 diverse tasks and three LLMs of different scales (VICUNA-13B, CODE-DAVINCI-002, and GPT-3.5-TURBO). Our experimental results show that Adaptive-Consistency outperforms Self-Consistency in cost efficiency while maintaining comparable output quality. On CODE-DAVINCI-002, Adaptive-Consistency reduces the number of samples required by a factor of 3.4×, with no average drop in accuracy. On VICUNA-13B, it requires sampling 1.9× fewer samples, with almost no drop in accuracy. Similarly, on GPT-3.5-TURBO, it samples 4.4× fewer samples, with less than a 0.2% drop in accuracy. In summary, our contributions are:

• We propose Adaptive-Consistency, a cost-efficient sampling technique for large language models that dynamically adjusts the number of samples using a lightweight stopping criterion based on the stability of the majority element.

• We conduct extensive experiments using three different LLMs on a diverse set of 17 datasets. These datasets encompass a wide range of tasks, including MATH, COMMONSENSE, SYMBOLIC reasoning, and CODE GENERATION tasks. Adaptive-Consistency consistently and significantly outperforms fixed-budget methods like Self-Consistency, requiring an average of 3.3× fewer samples with less than a 0.1% drop in accuracy across all datasets and models.
• Our analysis reveals that for a fixed sampling cost, Adaptive-Consistency consistently achieves better accuracy than Self-Consistency across all datasets (up to 5 absolute points). Additionally, we experiment with various stopping criteria and show the efficacy of Adaptive-Consistency in terms of speed and accuracy.

Background
In-Context Few-Shot Prompting In-context few-shot prompting is a technique employed by large language models (LLMs) to learn and generalize from a limited number of examples provided within the input of a given task. Wang et al. (2022b) proposed Self-Consistency, which improves performance by sampling multiple diverse reasoning chains and aggregating their outputs using a simple majority voting mechanism. However, the higher accuracy comes at an increased computational cost, since the LLM must be prompted multiple times for the same question.

Listing 1: Comparison of Adaptive-Consistency (top) and Self-Consistency (bottom). Self-Consistency always generates a fixed number of samples. In contrast, Adaptive-Consistency uses a lightweight stopping criterion, allowing it to adaptively halt the sampling process, which can lead to improved efficiency and performance.

Adaptive-Consistency
Self-Consistency generates a predetermined number of answers (k) from the language model (LLM) before returning the majority answer. In contrast, Adaptive-Consistency takes an incremental approach to sampling outputs from the language model. After generating each sample, Adaptive-Consistency employs a lightweight stopping criterion to determine whether it should 1) generate an additional sample from the LLM, or 2) cease sampling and report the current majority answer. This flexible strategy enables Adaptive-Consistency to dynamically adjust the number of samples generated (n) for each input. As our experiments demonstrate, n is typically less than k (on average 3.3×, and up to 7.9×, smaller), allowing Adaptive-Consistency to offer greater cost-efficiency than the fixed-budget approach employed by Self-Consistency.
Adaptive-Consistency differs from Self-Consistency only in the stopping criterion (Listing 1). The design of the stopping criterion is crucial to our method, as it aims to minimize the average number of samples generated from the LLM while maximizing accuracy. The simplicity of our algorithm allows various stopping criteria to be used interchangeably, each with its own advantages and disadvantages. We expand on a particular choice of stopping function next.
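The incremental procedure can be sketched as follows. This is an illustration, not the paper's released code: `generate_sample` (a single LLM call) and `stop_probability` (any stopping criterion returning the confidence in the current majority) are hypothetical stand-ins.

```python
from collections import Counter

def adaptive_consistency(generate_sample, stop_probability,
                         max_samples=40, c_thresh=0.95):
    """Draw answers one at a time and stop as soon as the stopping
    criterion is confident enough in the current majority answer."""
    counts = Counter()
    for _ in range(max_samples):
        counts[generate_sample()] += 1            # one more LLM sample
        if stop_probability(counts) >= c_thresh:  # lightweight check
            break
    return counts.most_common(1)[0][0]            # current majority answer
```

Note that Self-Consistency is recovered as the special case where `stop_probability` always returns 0, so all `max_samples` samples are drawn.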

Dirichlet Stopping Criteria
Let n be the number of samples generated from the LLM so far, let v = (v_1, ..., v_m) be the counts of the m unique answers observed, and let p_i = v_i / n be the normalized counts. For instance, suppose n = 10 and m = 3 (10 samples generated, with 3 unique answers): if v = [8, 1, 1], we can be fairly confident that the first answer is the majority, whereas if v = [4, 4, 2], more samples need to be generated. Our goal is to formalize and quantify this intuition.
By convention, let p_1 = max_i(p_i). We want to assess the stability of p_1 as the majority element. Specifically, we ask the following question: what is the probability that p_1 will remain the majority element if we repeat the process of generating n samples? Intuitively, if this probability exceeds some predetermined threshold C_thresh, we can be confident in our decision to stop sampling and return the answer with probability p_1 as the majority element.

To answer this question, we establish a connection with the Dirichlet distribution. Specifically, we note that the counts V parameterize a Dirichlet distribution, Dir(V). This connection allows us to explore the behavior of the sampling process by drawing more samples from Dir(V) and observing the stability of p_1 as the majority element. To compute the probability of p_1 being the majority element, we integrate the joint probability density function of the Dirichlet distribution over the appropriate region of the probability simplex:

$P(p_1 \text{ is majority}) = \int_{0}^{1} \int_{S(p_1')} f(p_1', p_2, \ldots, p_m \mid V)\, dp_2 \cdots dp_m\, dp_1'$ (1)

In Equation 1, $f(p_1', p_2, \ldots, p_m \mid V)$ represents the joint probability density function of the Dirichlet distribution conditioned on the counts V. The bounds on the integral for $p_1'$ range from 0 to 1. The probability simplex $S(p_1')$ is defined for each value of $p_1'$ such that $p_1' > \max_{i=2}^{m} p_i$ and the remaining $p_i$ values sum to $1 - p_1'$. This constraint ensures that we consider all values of $p_1'$ that would maintain its majority status. Here we assume that the number of possible unique answers (m) is known, based on the current set of observations (V). In the Analysis (§5.3), we further evaluate a CHINESE RESTAURANT PROCESS (CRP) stopping criterion, which relaxes this assumption by not requiring the number of possible unique answers (m) to be known in advance.
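Equation (1) has no simple closed form, but it can be estimated numerically. The sketch below is an illustration rather than the paper's implementation: it forms a Monte-Carlo estimate by drawing probability vectors from Dir(counts) via Gamma variates and counting how often the current majority element keeps the largest share (parameterizing directly by the observed counts, which assumes all counts are strictly positive).

```python
import random

def majority_stability(counts, draws=20000, seed=0):
    """Monte-Carlo estimate of Eq. (1): the probability that the currently
    leading answer remains the majority under Dir(counts)."""
    rng = random.Random(seed)
    lead = max(range(len(counts)), key=counts.__getitem__)
    wins = 0
    for _ in range(draws):
        # A Dirichlet sample is a normalized vector of Gamma variates;
        # normalization can be skipped for an argmax comparison.
        g = [rng.gammavariate(v, 1.0) for v in counts]
        wins += max(range(len(g)), key=g.__getitem__) == lead
    return wins / draws
```

On the running example, counts [8, 1, 1] give an estimate close to 1, so sampling can stop, while [4, 4, 2] hovers near one half, so more samples are needed.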
Beta Stopping Criteria Since the number of unique answers in the observation set can be large, Equation (1) is computationally expensive to solve. As an approximation, we observe that establishing the majority of p_1 over the next-largest probability, p_2, is sufficient for our purpose.
In this setting, the probability in Equation (1) simplifies to a Beta distribution with parameters $(v_1 + 1, v_2 + 1)$, and Equation (1) is replaced by Equation (2):

$P(p_1 > p_2) = \int_{0.5}^{1} \frac{p^{v_1} (1-p)^{v_2}}{B(v_1 + 1,\, v_2 + 1)}\, dp$ (2)

This approximation, which assumes a non-informative prior of BETA(1, 1), allows us to efficiently compute the confidence in p_1 being the majority, enabling early stopping decisions without substantial computational overhead.
Empirically, we show the performance to be similar to the Dirichlet stopping criterion but significantly faster (see Section 5.3). Throughout the experiments, we refer to this Beta stopping criterion as Adaptive-Consistency.
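For integer counts, the Beta tail probability $P(p > 1/2)$ under BETA(v_1 + 1, v_2 + 1) has an exact binomial form, so the stopping check needs only standard-library arithmetic; the same value can be obtained with `scipy.stats.beta.sf(0.5, v1 + 1, v2 + 1)`. A minimal sketch (function names are ours, not the paper's):

```python
from math import comb

def beta_confidence(v1, v2):
    """P(p > 1/2) for p ~ Beta(v1 + 1, v2 + 1): confidence that the top
    answer (count v1) truly beats the runner-up (count v2). Uses the
    exact binomial identity for integer-parameter Beta CDFs."""
    n = v1 + v2 + 1
    return sum(comb(n, k) for k in range(v1 + 1)) / 2 ** n

def should_stop(counts, c_thresh=0.95):
    """Stop sampling once the Beta criterion clears the threshold."""
    top = sorted(counts, reverse=True)
    v1 = top[0]
    v2 = top[1] if len(top) > 1 else 0
    return beta_confidence(v1, v2) >= c_thresh
```

On the 9-of-10 example from the introduction, counts [9, 1] give a confidence of about 0.994 > 0.95, so sampling stops after 10 samples; counts [4, 4, 2] give exactly 0.5, so sampling continues.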

Code-Generation
We now turn our attention to CODE GENERATION tasks, which involve generating programs that can correctly pass multiple test cases.More details on test case generation can be found in Appendix A.4.
The configuration of code generation tasks significantly impacts the Self-Consistency measurement since different programs might yield varying outputs for a given set of test cases.This variation can cause simple majority voting schemes to be ineffective in evaluating stability.To address this, we explore two distinct methods for aggregating answers across multiple test cases.
In the first method, inspired by the approach used in AlphaCode (Li et al., 2022), we concatenate the outputs for all test cases into a single vector with t elements and apply Self-Consistency across the entire vector.This implies that two programs are considered identical only if their outputs for all t test cases match exactly.However, this simple setup may overestimate the output variance, as different programs can produce distinct outputs for the set of test cases.
To overcome the limitations of the simple setup, we propose an alternative method that treats test inputs as independent entities and applies Adaptive-Consistency to each test case separately. The sampling process terminates when the normalized probability, expressed as the geometric mean of P across all t test cases, exceeds a predefined threshold (e.g., 0.95):

$\left( \prod_{j=1}^{t} P_j \right)^{1/t} > C_{thresh}$

where each $P_j$ is computed using Equation 1.
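This per-test-case aggregation can be sketched as follows, where each P_j is the majority-stability confidence for test case j, computed with whichever stopping criterion is in use; the function name and the threshold default are illustrative.

```python
import math

def codegen_should_stop(per_test_confidences, c_thresh=0.95):
    """Stop sampling programs once the geometric mean of the per-test
    majority confidences P_j exceeds the threshold.
    Assumes every P_j > 0; log-space keeps the product from underflowing
    when the number of test cases t is large."""
    t = len(per_test_confidences)
    geo_mean = math.exp(sum(math.log(p) for p in per_test_confidences) / t)
    return geo_mean >= c_thresh
```

The geometric mean penalizes a single unstable test case more than an arithmetic mean would: one confidence of 0.5 drags the aggregate well below the threshold even if the others are near 1.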
Models We evaluate our method on three different language models: 1. GPT-3.5-TURBO: an RLHF-finetuned GPT-3-based model (undisclosed number of parameters). 2. VICUNA-13B (Chiang et al., 2023): an open-source transformer model fine-tuned on an instruction-following dataset (Taori et al., 2023) from the base LLaMA series (Touvron et al., 2023). 3. CODE-DAVINCI-002: a publicly available GPT-3-based model (Brown et al., 2020) that is part of the Codex series (Chen et al., 2021) and has 175 billion parameters.

Prompting and Sampling We use prompts similar to those in PAL (Gao et al., 2022) and CHAIN OF THOUGHT (Wei et al., 2022). Specifically, for mathematical reasoning and DATE UNDERSTANDING tasks, we use prompts from PAL. For other commonsense and SYMBOLIC reasoning tasks, we use COT (Wei et al., 2022).
For sampling, we follow the scheme suggested in Wang et al. (2022b). Specifically, we use a temperature of 0.7 and limit the number of generations to a maximum of 40. For coding tasks, we follow the exact procedure used in CodeT (Chen et al., 2022), with 50 samples for APPS, 100 samples for HUMANEVAL and MBPP, and 1000 samples for CODECONTESTS.

Hyperparameters
The only hyperparameters in Adaptive-Consistency are those related to the stopping criterion (C_thresh). We use a high C_thresh = 0.95 for Adaptive-Consistency. By using a high threshold, we aim to maintain high accuracy and prevent the algorithm from stopping too early. For the other stopping criteria, we tune parameters on the training set of GSM-8K and use the same thresholds across all datasets. The impact of the chosen threshold on the performance of our method is further analyzed in the analysis section (§5.1).
Baselines We compare our method against Self-Consistency, the current state-of-the-art method. Further, in Section 5.3, we evaluate Adaptive-Consistency against different stopping criteria: RANDOM stopping, MAJORITY (stopping at majority), ENTROPY, DIRICHLET, and CRP.
Evaluation Metrics We evaluate the performance of our method and the baselines using two metrics: the average number of generations sampled from the LLMs, and overall reasoning accuracy. Our results show that Adaptive-Consistency achieves performance similar to Self-Consistency while often reducing the sample budget considerably.

Results
Table 1 presents the main results and is divided into two parts, showing results across different task categories (top sub-table) and on various language models (bottom sub-table). We focus on the potential tradeoff between efficiency and accuracy.
Results Across Task Categories Our experimental results demonstrate the significant efficiency gains achieved by Adaptive-Consistency across task categories: 3.3× fewer samples in mathematical tasks with a 0.1% accuracy drop, 2.9× fewer samples in commonsense tasks with a 0.2% accuracy drop, 3.8× fewer samples in symbolic reasoning tasks while maintaining accuracy, and 2.4× fewer samples in coding tasks while improving accuracy by 0.4%. These findings confirm the effectiveness of Adaptive-Consistency in identifying the majority element early, highlighting its potential across various applications, including reasoning and coding. Adaptive-Consistency achieves a significant reduction in the number of generations, with a negligible impact on accuracy. The ∆ columns display reductions in generations (Num. Gen.) and accuracy (Acc.) between Self-Consistency and Adaptive-Consistency. Detailed results are in Table 5.
Results Across Language Models Examining the results across different language models, we find that Adaptive-Consistency is model-agnostic and consistently reduces the number of generations with minimal to no impact on accuracy. The reductions are 4.4× for GPT-3.5-TURBO, 1.9× for VICUNA-13B, and 3.4× for CODE-DAVINCI-002, highlighting its cost-effective nature and adaptability to models of different scales. Moreover, the minimal accuracy differences, and occasional slight improvements, showcase the practical utility of Adaptive-Consistency, emphasizing its diverse applicability and model-agnostic characteristics.

Effect of Confidence Threshold in Adaptive-Consistency
The confidence threshold, C_thresh, is a crucial hyperparameter for Adaptive-Consistency, as it determines when to stop sampling based on the desired level of confidence in the majority element. While we set the threshold to a stringent value of 0.95 for all experiments, in this section we analyze the impact of varying C_thresh from 0.5 to 1 to understand the trade-off between model accuracy and cost-efficiency.
In Figure 2, we present a visualization that examines the relationship between the confidence threshold, C_thresh, and the performance of Adaptive-Consistency in terms of both accuracy and cost-efficiency. The x-axis represents the confidence threshold, varying from 0.5 to 1. The left y-axis displays the model's accuracy, while the right y-axis shows the average number of samples drawn.
The plot (for GSM-8K) shows the expected behavior of the two curves: the blue curve (accuracy) increases gradually and then plateaus, while the red curve (average number of samples) initially increases linearly and then climbs more steeply. The plateau in accuracy signifies that the model has reached its maximum achievable accuracy, and further sampling will not improve it much. Meanwhile, the red curve's climbing rate indicates that the model requires more samples to meet an increasingly stringent confidence threshold for stopping, highlighting the trade-off between accuracy and cost-efficiency. We refer readers to Appendix C.4 for more results.

Adaptive-Consistency vs. Self-Consistency
For Equal Average Sample Costs Section 4.1 previously demonstrated that Adaptive-Consistency achieves performance comparable to Self-Consistency using fewer samples. In this section, our primary objective is to compare the performance of Adaptive-Consistency to Self-Consistency across various sampling budgets. For each fixed sampling budget k, we contrast the performance of the two methods, where Self-Consistency distributes the sample budget uniformly across questions while Adaptive-Consistency uses a non-uniform allocation. We evaluate Adaptive-Consistency using varying thresholds, with each threshold producing a distinct point (#samples, performance) on the cost-quality curve. For every specific sample count (#samples) generated by Adaptive-Consistency, we subsequently run Self-Consistency to obtain its corresponding performance. The relationship between the two methods across these data points is visualized in Figure 3, which provides a visual comparison of Adaptive-Consistency and Self-Consistency on GSM-8K. Adaptive-Consistency outperforms Self-Consistency in accuracy across all average sample costs. For example, when the average sample cost is 10, Adaptive-Consistency achieves approximately 3% higher accuracy on GSM-8K. Similar results hold on other datasets; see Appendix C.1 for full results. The success of Adaptive-Consistency can be attributed to the fact that it varies the number of samples based on the complexity of the instance, using more samples where a clear consensus is hard to reach and fewer where answers are consistent. Consequently, Adaptive-Consistency achieves improved overall performance when controlled for cost budget.

Evaluation of Different Stopping Functions
Adaptive-Consistency allows a flexible choice of stopping criterion, based on the intended objective and requirements. Here, we evaluate six different functions: 1) RANDOM: stopping randomly with a probability p, 2) MAJORITY: stopping after the most common answer has a majority above a threshold, 3) ENTROPY: stopping after the entropy of the answers falls below a threshold, 4) BETA: the main criterion used in Adaptive-Consistency, 5) DIRICHLET: the exact criterion of Equation (1), and 6) CRP: the CHINESE RESTAURANT PROCESS criterion, which does not require the number of unique answers to be known in advance. The parameters for all these methods are tuned as discussed in Section 4. Figure 4 compares BETA to ENTROPY and MAJORITY over a range of expected sampling costs. BETA consistently achieves higher accuracy than both for the same sampling cost. Further, we find RANDOM to be the least effective method, as expected, whereas MAJORITY almost consistently underperforms both BETA and ENTROPY. While DIRICHLET and CRP perform similarly to BETA, they are both about four orders of magnitude slower than BETA due to the expensive multivariate integral calculation. Nonetheless, despite being run on a single CPU core, even DIRICHLET and CRP have negligible time and cost compared to LLM inference. The exact timings are presented in Table 2, and detailed results in Appendix C.2, Table 7.
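For concreteness, the MAJORITY and ENTROPY baselines can be sketched as below. The threshold defaults are illustrative placeholders, not the tuned values (the paper tunes them on the GSM-8K training set).

```python
import math
from collections import Counter

def majority_stop(counts, share_thresh=0.8):
    """MAJORITY: stop once the top answer's empirical share is high enough."""
    n = sum(counts.values())
    return counts.most_common(1)[0][1] / n >= share_thresh

def entropy_stop(counts, entropy_thresh=0.5):
    """ENTROPY: stop once the entropy (in nats) of the empirical answer
    distribution drops below a threshold."""
    n = sum(counts.values())
    return -sum((v / n) * math.log(v / n) for v in counts.values()) <= entropy_thresh
```

Both consume the same running `Counter` of answers as the main loop, so the criteria really are interchangeable, differing only in how they map counts to a stop/continue decision.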
In summary, Adaptive-Consistency is particularly effective in two scenarios: (i) when a majority trend is evident early in the sampling process, such as in the SVAMP dataset where it achieves comparable accuracy to Self-Consistency using fewer than 5 samples on average per input; and (ii) for tasks with a limited set of potential answers, such as the BOOLEAN EXPRESSIONS dataset where Adaptive-Consistency reduces the computational budget by 7.9 times without any loss in accuracy.

Related Work
Crowdsourcing and Adaptive Consistency Adaptive-Consistency finds inspiration in techniques from crowdsourcing (Lin et al., 2012; Dai et al., 2013; Weld et al., 2015; Bragg et al., 2016). Traditionally, crowdsourcing involves aggregating diverse human judgments, which presents challenges in managing resource allocation: knowing when to query additional contributors or stop based on the consistency of responses (Doan et al., 2011; Quinn and Bederson, 2011). Early research concentrated on probabilistic models estimating the 'true' answer and worker reliability (Dawid and Skene, 1979; Whitehill et al., 2009), later considering factors like worker expertise, task complexity, and answer quality (Raykar et al., 2010; Welinder et al., 2010). However, rather than addressing these issues with multiple human contributors, Adaptive-Consistency is tailored specifically for LLMs, optimizing for computational efficiency and output accuracy. In line with our vision, Parameswaran et al. (2023) have recently proposed declarative prompt engineering, viewing LLMs like crowd workers and leveraging multiple prompting strategies.
Architectures for adaptive computation A related body of work on adaptive computation aims to preempt computation based on intermediate representations (Liu et al., 2020; Zhou et al., 2020; Schuster et al., 2021; Geng et al., 2021; Xin et al., 2020). Schuster et al. (2022) present CALM, a language model that performs language generation adaptively. Hou et al. (2020) propose DynaBERT, which can adapt the depth and width of the transformer to satisfy various computational constraints. Xing et al. (2020) propose a dynamic deep neural network with an embedded early-exit strategy for enhancing the quality of compressed images. Another line of work focuses on pruning model weights or training sparse weights (Fan et al., 2019; Jayakumar et al., 2021) to reduce training and inference time. In contrast to these methods, our approach completely obviates architectural modifications.
Inference-time adaptive computation These methods focus on adaptive computation at inference time without making architectural modifications to the models. Schwarzschild et al. (2021b,a) focus on three different generalization tasks. They observe that increasing the number of test iterations (which corresponds to the network depth in their setting) helps the models generalize better to difficult problems. Madaan and Yang (2022) leverage two different networks trained for the same task, a larger variant (slow) and a smaller variant (fast). The switch from fast to slow happens during inference, based on the complexity of generation at the current step. Xue et al. (2023) train language models to adaptively read tokens from a tape bank for each input. Different from these works, our focus is on tasks where multiple samples are drawn from a model (vs. iteratively solving a task, which is the focus of these works). Additionally, recent works (Madaan et al., 2023a; Chen et al., 2023) have proposed adaptively selecting models of varying sizes based on verification signals derived from the output of the smaller model. Our method, however, distinguishes itself by not necessitating an additional verifier or multiple models.

Adaptive Sampling in Training and Active Learning

Another line of work focuses on importance-based sampling of input instances during training (Bengio and Senecal, 2008; Prabhu et al., 2019; Berger et al., 2017). In contrast to the aforementioned methods, our approach centers on adaptively sampling multiple outputs per input instance during the inference phase, without soliciting additional labels. Our method is crafted to efficiently obtain reliable predictions from pretrained language models by adaptively sampling their outputs, distinguishing it from both adaptive sampling in training and active learning, which focus on the training phase.

Conclusion and Future Work
This paper presented Adaptive-Consistency, a cost-efficient and model-agnostic technique for improving the correctness of output from large language models (LLMs) using dynamic sampling. Our approach builds upon the Self-Consistency method and introduces a lightweight stopping criterion that allows for adaptive sampling based on the amount of agreement in the samples drawn so far. Adaptive-Consistency is effective across 17 datasets and three LLMs, on both reasoning and coding tasks. It reduces the required sample budget by 2 to 4 times while maintaining comparable accuracy, with an average drop of less than 0.1%.
Our work opens up several avenues for future research. Developing alternative stopping criteria, or combining multiple criteria, could lead to even more efficient sampling techniques. Moreover, in our current approach, the majority decision relies on exact matches to determine the most common answer. However, this may not always capture the true majority, e.g., in generative tasks, where the output can have variations that do not affect the overall correctness or relevance of the answer. To foster further research and enable reproducibility, we have released the code and LLM outputs at https://sample-step-by-step.info/.

Acknowledgments This work was also partially supported by the CSE Research Acceleration Fund of IIT Delhi. Aman is supported by a contract from the DARPA KAIROS program under agreement number FA8750-19-2-0200. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes, notwithstanding any copyright notation thereon. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of the U.S. Government.

Limitations
Despite the promising results of our proposed Adaptive-Consistency method, it has several limitations and leaves scope for future improvement.
• Stopping criterion sensitivity: The current stopping criterion, based on the stability of the majority element in the sample set, may not always indicate sample agreement optimally. Instances may arise where the majority element lacks stability, yet the criterion triggers, potentially leading to suboptimal decisions. Future work could explore more robust or alternative stopping criteria.

• Generalizability: The effectiveness of our method may vary across tasks or models, despite testing on a diverse range of 17 datasets and three LLMs of contrasting scale. Notably, Adaptive-Consistency is anticipated to fail where Self-Consistency fails.

• Task-specific adaptations: The task-agnostic nature of Adaptive-Consistency might limit its performance on tasks that could benefit from task-specific adaptations. Specialized versions of Adaptive-Consistency for specific tasks or domains could potentially enhance performance. We have initiated this by experimenting on CODE GENERATION datasets, but extending Adaptive-Consistency to other domains may not be as straightforward.

• Reliance on the pretrained LLM: Our method depends on the pretrained LLM for generating multiple samples. Consequently, any limitations or biases in the LLM would persist in Adaptive-Consistency. Addressing these issues might require improvements in the LLM training process itself or the integration of external knowledge sources.

A.1 Hyperparameters
The only hyperparameters in Adaptive-Consistency are those related to the stopping criteria (C_thresh). We use a high C_thresh = 0.95 for Adaptive-Consistency. By using a high threshold, we aim to maintain high accuracy and prevent the algorithm from stopping too early. For the other stopping criteria, we tune parameters on the training set of GSM-8K and use the same thresholds across all datasets. The impact of the chosen threshold on the performance of our method is further analyzed in the Analysis section (§5.1). We further evaluate all methods on a set of 3 seeds and report the table with standard deviations in Table 5. We use only a single seed for GPT-3.5-TURBO because of the associated cost.

A.2 Benchmarks
We evaluate our method on a diverse set of coding and reasoning benchmark datasets, encompassing 17 datasets across four distinct categories: 1. MATHEMATICAL Reasoning: To assess mathematical reasoning capabilities, we utilize the following datasets: GSM-8K (Cobbe et al., 2021), SVAMP (Patel et al., 2021), and ASDIV (Miao et al., 2020). These datasets consist of grade-school-level algebra word problems necessitating arithmetic operations and problem-solving based on contextual information.
2. COMMONSENSE Reasoning Tasks: We evaluate Adaptive-Consistency on five COMMONSENSE reasoning tasks. 1.) STRATEGYQA (Geva et al., 2021) comprises questions that demand the model to infer a multi-hop strategy with reasoning steps implicitly embedded in the questions. 2.) DATE UNDERSTANDING entails questions that require the model to deduce dates from natural language descriptions and perform arithmetic operations accordingly. 3.) SALIENT TRANSLATION is a salient translation error detection task that requires the model to identify the type of error in a translation. 4.) SNARKS and 5.) RUIN NAMES both focus on emotional understanding tasks.
3. SYMBOLIC Reasoning Tasks: We examine the performance of our method on five diverse SYMBOLIC reasoning tasks. 1.) TRACKING SHUFFLED OBJECTS is a tracking task that necessitates the model to infer the final state of a system, given its initial state and a sequence of modifications. 2.) LOGICAL DEDUCTION is a task that demands the model to deduce the order of a sequence of objects based on a minimal set of conditions. 3.) BOOLEAN EXPRESSIONS is a boolean expressions task that evaluates whether a language model has learned the rules of deductive reasoning, i.e., formal (zeroth-order) logic associated with the words "and," "or," "not," etc. 4.) DISAMBIGUATION QA is a disambiguation task that necessitates the model to select the person to whom a pronoun refers. 5.) PENGUINS describes a table of penguins and requires the model to answer questions about the penguins' attributes.

4. CODE GENERATION Tasks: We further evaluate the performance of our method by conducting experiments on four diverse standard coding tasks. These tasks encompass a range of programming challenges, including basic human-written and crowd-sourced Python tasks found in the 1.) HUMANEVAL (Chen et al., 2021) and 2.) MBPP (Austin et al., 2021) datasets, as well as more challenging competition-level coding tasks from the 3.) APPS (Hendrycks et al., 2021) and 4.) CODECONTESTS (Li et al., 2022) datasets.

A.3 Tools and Framework
For querying the GPT-3.5-TURBO and CODE-DAVINCI-002 models (Chen et al., 2021), we use the API library provided by OpenAI. We use the official code provided for running the VICUNA-13B model (Chiang et al., 2023), with inference on single A100 GPUs. For coding tasks, we use the outputs provided by CodeT (Chen et al., 2022), where models are zero-shot prompted with temperature = 0.8 and top_p = 0.95. The stopping criteria in Adaptive-Consistency are fast to run, and we use a single-core machine. For numerical integration, we use the SciPy library in Python.

A.4 Test-Case Generation
For CODE GENERATION tasks, we generate test cases in a similar fashion to CodeT (Chen et al., 2022). Specifically, we prompt the model with the function description and ask it to generate assert statements. However, unlike CodeT, we limit ourselves to only 10 test cases, which are generated in 1-2 prompts to the LLM, thus adding a negligible overhead to the code generation itself.
Dataset statistics are presented in Table 3.

B Results
We present the complete results with standard deviation in Table 5. For CODE GENERATION tasks, results are presented in Table 4. Further, in Table 6 we show that the improvements by Adaptive-Consistency are statistically significant across all datasets. We perform a two-sample t-test over 3 random seeds. The p-value for the number of generations is much less than 0.05 (average: 1.5e-3), indicating that our method is significantly more efficient, while the p-value for accuracy is much larger than 0.05 (average: 0.50), indicating that the slight accuracy difference between the baseline and our method is statistically insignificant.
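The significance test above can be sketched as follows. The per-seed numbers below are hypothetical placeholders, not values from the paper, and the pooled two-sample t-statistic shown here is one standard form of the test; the critical value 2.776 is the two-tailed Student's t cutoff at 0.05 for df = 3 + 3 - 2 = 4.

```python
import math
from statistics import mean, variance

def two_sample_t(a, b):
    """Pooled two-sample t-statistic for two small samples."""
    na, nb = len(a), len(b)
    # Pooled variance combines the per-group sample variances.
    sp2 = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    return (mean(a) - mean(b)) / math.sqrt(sp2 * (1 / na + 1 / nb))

# Hypothetical average generations per seed (NOT the paper's numbers).
self_consistency = [40.0, 40.0, 40.0]
adaptive = [12.1, 13.4, 12.8]

t = two_sample_t(self_consistency, adaptive)
T_CRIT = 2.776  # two-tailed 0.05 critical value, df = 4
print(abs(t) > T_CRIT)  # a large |t| indicates a significant efficiency gain
```

With identical samples the statistic is zero, and the larger |t| grows, the smaller the corresponding p-value; the paper reports the p-values themselves, computed with Scipy.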

C Analysis
C.1 Adaptive-Consistency vs. Self-Consistency for Equal Average Sample Costs
In Section 5.2, we demonstrate that Adaptive-Consistency achieves better accuracy than Self-Consistency when both operate at the same expected sample cost; Figure 5 shows the complete results. Section 4.1 previously demonstrated that Adaptive-Consistency achieves comparable performance to Self-Consistency using fewer samples. In this section, we consider a scenario where Adaptive-Consistency and Self-Consistency operate with the same average number of samples. For each fixed sampling budget k of Self-Consistency, we contrast the performance of the two methods, where Adaptive-Consistency uses k samples on average rather than a constant k samples for every instance.
Figure 3 provides a visual comparison of the performance of Adaptive-Consistency and Self-Consistency on GSM-8K: Adaptive-Consistency outperforms Self-Consistency in accuracy across all average sample costs.For example, when the average sample cost is 10, Adaptive-Consistency achieves approximately 3% higher accuracy on GSM-8K.
The success of Adaptive-Consistency can be attributed to its adaptive sampling strategy. By varying the number of samples based on the complexity of the instance (using more samples where a clear consensus is hard to reach and fewer where answers are consistent), Adaptive-Consistency secures improved overall performance even when its average sample cost matches that of Self-Consistency.

C.2 Stopping Criteria
This section follows from the main discussion in Section 5.3. We evaluate six different stopping criteria for Adaptive-Consistency: 1.) RANDOM: stop randomly with a probability p; 2.) MAJORITY: stop once the most common answer has a majority above a threshold; 3.) ENTROPY: stop once the entropy of the answers falls below a threshold; 4.) BETA: the main stopping criterion used in Adaptive-Consistency, based on Equation (2); 5.) DIRICHLET: the stopping criterion based on Equation (1).

6.) CHINESE RESTAURANT PROCESS (CRP): the stopping criterion that models the answer-generation process as a Chinese Restaurant Process, making no assumption on the number of possible unique answers.
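As a concrete illustration of the BETA criterion, when only the two leading answer counts matter and both are integers, the probability that the leading answer's true prevalence exceeds 1/2 under a Beta(v1 + 1, v2 + 1) posterior reduces to a binomial tail sum. The sketch below is not the paper's implementation (which uses numerical integration via Scipy, and whose exact form is given by Equation (2)); the function names are our own, and the 0.95 threshold mirrors the C thresh value discussed in the paper.

```python
from math import comb

def beta_stop_probability(v1, v2):
    """P(p > 1/2) for p ~ Beta(v1 + 1, v2 + 1), where v1 is the count of the
    current majority answer and v2 the count of the runner-up.
    Uses the identity P(Beta(a, b) <= x) = P(Binomial(a + b - 1, x) >= a)
    for integer a, b."""
    a, b = v1 + 1, v2 + 1
    n = a + b - 1
    # P(p > 0.5) = sum over j < a of C(n, j) / 2^n
    return sum(comb(n, j) for j in range(a)) / 2 ** n

def should_stop(v1, v2, c_thresh=0.95):
    """Stop sampling once the majority answer is confident enough."""
    return beta_stop_probability(v1, v2) >= c_thresh

print(should_stop(9, 1))  # strong agreement: stop
print(should_stop(2, 2))  # tie: keep sampling
```

For counts (9, 1) the probability is 2036/2048, which is roughly 0.994, comfortably above the threshold, whereas a 2-2 tie yields exactly 0.5, so sampling continues.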
For comparison, we tune C thresh in each case on the training set of the GSM-8K dataset. Results are presented in Table 7. RANDOM and MAJORITY are inferior to BETA across all datasets and models. Further, while DIRICHLET and CRP perform almost on par with BETA, they are relatively very slow. Although, from Table 7, ENTROPY appears to be on par with BETA, Figure 6 shows that BETA beats ENTROPY given the same expected sampling cost.
Finally, BETA has additional key advantages: it incorporates a measure of uncertainty, which makes it more robust to variations in data order, mitigates the influence of noise, and offers a quantitative measure of confidence in the majority outcome. Consider an extreme case where the first two generated solutions are identical. The majority voting strategy would instantly halt the process, potentially missing out on better solutions. In contrast, BETA keeps sampling because the confidence required for stopping has not yet been reached.

C.3 Chinese Restaurant Process
In the DIRICHLET stopping criterion, we assume that the number of unique answers the LLM can generate is known in advance (and equal to the number of unique answers in the current observation set). However, this assumption may not hold for datasets such as GSM-8K, where numerical answers are expected. The CHINESE RESTAURANT PROCESS (CRP) is a generalization of the DIRICHLET process that addresses this limitation by making no assumption on the number of unique answers.
In CRP, we treat a set of identical answers as a cluster, denoted by c_i, where i is the index of the cluster. Let n_i be the number of elements in cluster c_i, and n be the total number of elements across all clusters. The probability that a new answer belongs to an existing cluster c_i is directly proportional to the size of that cluster:

P(c_i) = n_i / (n + α),

whereas the probability that a new, unseen answer forms a new cluster is:

P(c_new) = α / (n + α),

where α is the concentration parameter, which controls the probability of generating a new answer.
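A minimal sketch of these two probabilities, assuming the cluster sizes n_i are held in a plain list (the function name is ours, not the paper's):

```python
def crp_probabilities(sizes, alpha):
    """Chinese Restaurant Process: probability of a new answer joining each
    existing cluster, plus the probability of opening a new cluster."""
    n = sum(sizes)
    p_existing = [s / (n + alpha) for s in sizes]  # proportional to cluster size
    p_new = alpha / (n + alpha)                    # rate of unseen answers
    return p_existing, p_new

p_existing, p_new = crp_probabilities([5, 3], alpha=1.0)
print(p_existing, p_new)  # [5/9, 3/9] and 1/9; all probabilities sum to 1
```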
Our goal is to calculate the probability that the current majority cluster in the observations remains the majority as more generations are drawn. The first task is to estimate the concentration parameter α. We use the approximation proposed by West (1992) to model α as

p(α | k, n) ≈ G(α; a + k − 1, b + γ + log(n)),

where k is the number of unique answers (clusters) in the current observation, n is the total number of answers, a and b are priors (both set to 1), γ is Euler's constant, and G(α; a + k − 1, b + γ + log(n)) denotes the probability density function of the Gamma distribution with shape parameter a + k − 1 and rate parameter b + γ + log(n).
We sample α multiple times (100), and for each sample, we run a Monte-Carlo simulation (1000 runs) based on the CRP probability model.

Figure 1: An overview of Adaptive-Consistency: Self-Consistency samples a predetermined number of answers, whereas Adaptive-Consistency iteratively samples until a lightweight Stopping Criterion decides to report the majority answer. The figure demonstrates an example where Adaptive-Consistency reduces sampling costs by 4x, requiring only ten samples to report the majority answer. The bottom-left graph contrasts Adaptive-Consistency with Self-Consistency across three reasoning categories, showing an average sample budget reduction of 3.3× with a negligible 0.04% drop in accuracy.

Figure 2: Impact of Confidence Threshold (C thresh) on Adaptive-Consistency for GSM-8K: As C thresh varies, the accuracy of Adaptive-Consistency increases gradually, eventually plateauing. Initially, the average number of generations also increases gradually but then sharply climbs, reflecting the accuracy-confidence trade-off.

Figures 3 and 4: Comparison of Adaptive-Consistency with Self-Consistency at various average sampling costs on 2 datasets: GSM-8K and DATE UNDERSTANDING. Adaptive-Consistency consistently beats Self-Consistency, especially when the sampling cost is low.

Figure 5: Comparison of Adaptive-Consistency with Self-Consistency at various average sampling costs. Adaptive-Consistency consistently beats Self-Consistency, especially when the sampling cost is low. Moreover, C thresh = 0.95 is a good indicator of saturation in accuracy, showing that this value works out of the box for most configurations considered.

Figure 7: Impact of Confidence Threshold (C thresh) on Adaptive-Consistency: As C thresh varies, the accuracy of Adaptive-Consistency increases gradually, eventually plateauing. Initially, the average number of generations also increases gradually but then sharply climbs, reflecting the accuracy-confidence trade-off. The trend is observed almost consistently across all datasets.

Table 4: Comparison of Adaptive-Consistency with Self-Consistency on 4 diverse code generation datasets. The table presents the accuracy of Self-Consistency, the average number of generations (Avg. Gen.) for Adaptive-Consistency, and the accuracy of Adaptive-Consistency. The ∆ columns display the reduction in generations (Gen. Reduc.) and the difference in accuracy (Acc. Diff.) between Self-Consistency and Adaptive-Consistency. Self-Consistency always draws a fixed number of samples: 1000 generations for CODECONTESTS, 50 for APPS, and 100 each for HUMANEVAL and MBPP.

Table 7: Comparison of various stopping criteria in Adaptive-Consistency. In general, BETA outperforms RANDOM and MAJORITY by decent margins across all datasets. BETA has comparable performance to DIRICHLET, but the latter is much slower. ENTROPY performs similarly to BETA but lacks a human-interpretable stopping rationale. In the two representative datasets, BETA consistently beats ENTROPY and MAJORITY for the same sampling cost, showing that in practice BETA performs better than both for the desirable range of accuracy and sampling cost.