A Comprehensive Evaluation of Tool-Assisted Generation Strategies

A growing area of research investigates augmenting language models with tools (e.g., search engines, calculators) to overcome their shortcomings (e.g., missing or incorrect knowledge, incorrect logical inferences). Various few-shot tool-usage strategies have been proposed. However, there is no systematic and fair comparison across different strategies, or between these strategies and strong baselines that do not leverage tools. We conduct an extensive empirical analysis, finding that (1) across various datasets, example difficulty levels, and models, strong no-tool baselines are competitive with tool-assisted strategies, implying that effectively using tools with in-context demonstrations is a difficult unsolved problem; (2) for knowledge-retrieval tasks, strategies that *refine* incorrect outputs with tools outperform strategies that retrieve relevant information *ahead of* or *during generation*; (3) tool-assisted strategies are expensive in the number of tokens they require to work, incurring additional costs by orders of magnitude, which does not translate into significant improvement in performance. Overall, our findings suggest that few-shot tool integration is still an open challenge, emphasizing the need for comprehensive evaluations of future strategies to accurately assess their *benefits* and *costs*.

Recent work proposed a variety of strategies for interfacing between the LM and the tool, such as through demonstrations of API calls (Paranjape et al., 2023) or using the tool to refine the model's output (Gao et al., 2023a); see Figure 2 for an overview. But what are the advantages and tradeoffs of different TA strategies? For example, some strategies incur significantly higher computation costs than others with little to no improvement in performance. There is a gap in the literature on the evaluation of such strategies, in particular against strong baselines and against each other. Concretely, works that report empirical evaluations are often restricted to comparisons of a single proposed strategy against a limited selection of non-TA baselines, using a limited selection of LMs or even a single LM, or focus on evaluating various LMs with a specific TA strategy (Li et al., 2023). Additionally, comparisons often do not consider the increase in computation that each TA strategy requires, which varies significantly and has a large effect on inference time or cost.
The above issues are only some of the pitfalls we observed in the literature, limiting the scope of current evaluations. In §3, we analyze the literature for common pitfalls and collect a set of guidelines towards a fair and reliable evaluation procedure specifically for TA strategies. Next (§4), we conduct a study which addresses all of the observed pitfalls, using GPT-3, Flan-UL2 and Flan-PaLM, and the complex reasoning benchmarks StrategyQA, MuSiQue, GSM8K, and DROP. We report a fair, systematic comparison of five few-shot TA strategies across multiple models and demonstrations, with all strategies using the same set of tools. We analyze the study results (§5) and arrive at surprising conclusions: (1) Non-TA baselines are stronger than initially reported. In most cases, TA strategies do not significantly improve, or do not improve at all, on non-TA strategies on popular question answering datasets. (2) For retrieval tools in knowledge tasks, TA strategies that fix model output after it is generated perform better than TA strategies that prompt the model to interface with the tool directly during generation. For calculator tools in calculation-intensive tasks, the relationship is not decisive. (3) TA strategies incur significantly higher computation costs than non-TA baselines by multiplicative factors, and there is no general correlation between computation cost and performance, with the exception that refinement strategies in retrieval settings are more costly than non-refinement strategies.
In §6 we report a fine-grained analysis of the results. We investigate the effect of each example's difficulty (e.g., very large numbers, or very rare entities) on improvement from tool usage, and find that tools do not systematically improve model performance on harder examples, where they were expected to have the strongest improvement. Finally, based on an error analysis of failure cases, we find that the majority of mistakes follow incorrect tool invocations, rather than incorrect tool responses (in the case of the retrieval tool) or incorrect inferences based on correct tool usage.
In conclusion, we conduct an extensive evaluation of few-shot TA strategies, finding that previous estimates of tool-usage performance are not representative. Overall, this suggests that few-shot tool integration is still an open challenge. We call on the community to evaluate future strategies systematically, while taking into account the significant costs that these strategies require in comparison to their benefits. Towards this, we provide a set of concrete guidelines for fair and reliable evaluation of TA strategies. Moreover, we release the handcrafted collection of 184 demonstrations used in our study (attached in the supplementary material).

Tool-Assisted Language Models
We describe existing few-shot strategies for augmenting LMs with tools and discuss related work.

Few-shot TA strategies
Strategies for tool usage can be broadly divided into two categories: (a) using tools during generation and inserting the tools' outputs into the model's prompt (Figures 1a, 2a); (b) using tools to refine the LM's output after generation (Figures 1b, 2b). Strategies can be further categorized into settings where the tool is heuristically called in a pipeline or called when the model generates pre-specified tool calls. Refer to Mialon et al. (2023) for a review of the literature on TA strategies and models.
Among TA strategies of type (a): SelfAsk (Press et al., 2023) decomposes the task into subtasks as simpler questions, such that a tool can be called on each question. A related strategy is Demonstrate-Search-Predict (Khattab et al., 2023). Inline strategies such as Toolformer (Schick et al., 2023), ART (Paranjape et al., 2023), inter alia (Chen et al., 2022; Gao et al., 2023b; Lyu et al., 2023), demonstrate tool usage with pre-defined words or tokens and tool arguments, halt generation when those tokens and arguments are generated, invoke the tool, and insert its output into the prompt to resume generation. Interleaving Retrieval (Trivedi et al., 2022a) does not directly instruct the model to use tools, but calls the tool on each reasoning step, to provide the model with additional context for future steps. Jiang et al. (2023) propose a similar strategy, opting to re-write each step after using it as a query. There are also strategies such as Decomposed Prompting (Khot et al., 2023) that are generalizations of the previous strategies.
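The generate-halt-invoke-resume loop that inline strategies share can be sketched as follows. This is an illustrative reconstruction under our own assumptions: the bracketed `[Search(...)]` call syntax, the `lm_generate` interface, and the `-> output` insertion format are simplified stand-ins, not the exact tokens used by any particular method.

```python
import re

def inline_tool_loop(lm_generate, tools, prompt, max_calls=5):
    """Generate text, halting at tool calls like [Search(query) so the
    tool can be invoked and its output inserted before resuming."""
    text = prompt
    for _ in range(max_calls):
        chunk = lm_generate(text, stop=["]"])  # halt once a tool call is complete
        text += chunk
        match = re.search(r"\[(\w+)\(([^()]*)\)$", text)
        if match is None:                # no pending tool call: generation is done
            break
        name, args = match.groups()
        output = tools[name](args)       # invoke the tool on the generated arguments
        text += f" -> {output}"          # insert the tool output, then resume
    return text
```

A usage sketch: with a `Search` tool and a model that emits a bracketed call, the loop pauses generation, appends the search result, and lets the model continue conditioning on it.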
Among TA strategies of type (b): RARR (Gao et al., 2023a) involves a pipeline designed for knowledge-based tasks: automatically finding attribution for each claim in the generated output via retrieved evidence, and post-editing the output to fix unsupported content while preserving the original output as much as possible. Check & Fix, a strategy we use in this work (§4), verifies each generated CoT step with a tool and generates a fixed step to replace it if it is found to be incorrect. Figure 2 illustrates Check & Fix and RARR alongside the SelfAsk, Inline, and Chain-of-Thought baselines, with worked examples of search calls and their corrected answers.
Related Work
Training LMs to use tools. While we are primarily concerned with few-shot tool assistance of LM generation, the literature also explores LMs which are trained to use specific tools (Parisi et al., 2022; Hao et al., 2023; Patil et al., 2023). These methods are constrained to the tools seen during training, and require data (annotated, bootstrapped, or synthetically constructed) of tool demonstrations.

Evaluation Pitfalls
While there is a plethora of TA strategies (§2.1), no systematic comparison of these strategies has been conducted. Research that proposes TA strategies in few-shot settings is often not focused on evaluating properties of those strategies, but on other aspects of LM capabilities (Press et al., 2023; Gao et al., 2023a), usage in particular strict contexts (Paranjape et al., 2023), evaluating various LMs themselves with a particular strategy (Mialon et al., 2023), and so on.

Table 1: Summary of evaluation pitfalls of TA strategies (§3) and recommendations to mitigate them.
(1) Coupling the TA strategy and the tool together. Recommendation: Comparisons of TA strategies should use the same tools across strategies.
(2) Forcing no-tool baselines to the framework of the TA strategy. Recommendation: The optimal way to solve the task without tools may be different from solving the task with tools; no-tool baselines should include multiple variants of both free-form and structured strategies, to ensure the TA strategies are not given an advantage.
(3) Using one model across all comparisons. Recommendation: Different models may behave differently when it comes to using tools effectively, based on their training data. Multiple models should be tested, if possible.
(4) Using one prompt and set of demonstrations across all comparisons. Recommendation: Multiple different sets of demonstrations should be used to get reliable estimates of few-shot performance.
(5) Not considering TA strategy costs. Recommendation: TA strategies can be efficient or inefficient with regard to the prompt tokens and generation tokens they require to work, with respect to no-tool baselines or to each other. The differences can be significant (§5). Comparisons of TA strategies should factor in the computation cost of the strategy, which we term token efficiency.
Below we collect observations from the literature that demonstrate the limited evaluation scope of TA strategies, in an effort to establish a set of criteria for future evaluations to be reliable and fair (a summary is provided in Table 1).
(1) Coupling the TA strategy and the tool together. Comparisons may vary the tools and methods together (e.g., TA strategy A with tool A versus TA strategy B with tool B), which conflates the contribution of the strategy with that of the tool.
(2) Forcing baselines to the framework of the TA strategy. Typical baselines for a given TA strategy are to apply that strategy while letting the model generate the tool's output instead of calling the tool, and to use CoT prompting. However, the optimal way to solve the problem without tools may not be the same as the TA strategy in question. In this work, we implement three different baselines (§4) and find no clear winner among two of them (we explore this empirically in §5).
(3) Using one model across all comparisons. Often, a single model is chosen as the underlying model for the TA strategy. This limits the insights from the evaluation to this model in particular, since conclusions may not carry over to other models. In this work, we find that the best-performing strategies vary significantly across different LMs (we explore this empirically in §5).
(4) Using one prompt and one set of demonstrations across all comparisons. Few-shot evaluation is known to be unreliable when using a single set of demonstrations as a single prompt (Perez et al., 2021). Furthermore, some prompts used in TA strategy evaluations, in particular CoT demonstrations, appear so often on the internet that they are suspected to be part of the models' training data, further compromising their function (Jacovi et al., 2023).
(5) Not considering TA strategy costs. In many cases, a TA strategy requires significantly more compute than no-tool baselines, and different TA strategies also require different amounts of computation. Computation cost is not traditionally considered in comparisons.

Experimental Setup
Our goal is to conduct a fair and reliable comparison of TA strategies, without being influenced by properties of specific models, tools, or prompts. To this end, we focus on few-shot tool usage, a popular TA scheme that allows flexibility in using new tools and adapting tools to specific tasks.
In what follows, we describe our experimental setup, which is guided by the aim of performing a comprehensive, rigorous evaluation without the pitfalls of §3. Our evaluation covers 5 different TA strategies, 4 recent LMs, 4 complex reasoning datasets, 3 few-shot prompts, and 2 tools. For each TA strategy + dataset + model combination, we run three experiments with a different number of demonstrations. Overall, our evaluation includes an execution of 342 experiments, each of which generates 250 (GPT-3) or 500 (non-GPT-3) long-form answers. Additional implementation details are in Appendix A.
Tool-assisted strategies. We evaluate the TA strategies shown in Figure 2: SelfAsk, Inline, Interleaving, C&F and RARR. We additionally include variants of SelfAsk and Inline where the model is separately called to summarize tool output in the relevant context, as it can often be very long (SelfAskQA and InlineQA; see Appendix A for details). Finally, in the retrieval settings, we use Top-1 retrieval for all models, and additionally Top-5 retrieval for the Flan-PaLM-540B model (see "Models" below) to check whether additional retrieved information can improve performance despite the significantly longer input and processing cost.
For SelfAsk and RARR we use the original implementations provided by the methods' creators. We implement Interleaving (Trivedi et al., 2022a), as at the time of this research no implementation was available. Importantly, this implementation yields performance similar to that of existing approaches that combine CoT with retrieval from Wikipedia (He et al., 2022). Contemporaneous work (Jiang et al., 2023) implemented methods that apply retrieval and refinement over generated CoT, similar to C&F, and likewise achieves performance similar to ours (see Appendix B). For Inline, we are not aware of reports on few-shot performance of a similar strategy in the literature.
Baseline strategies. We use no-tool versions of SelfAsk and Inline, and standard CoT prompting. The SelfAsk and Inline baselines simply involve giving the model the prompts used for the tool-based versions, while disabling tool calls (such that the model generates the output in place of the tools). These are the baselines used by Press et al. (2023) and Schick et al. (2023), respectively.
Datasets. We consider tasks that require complex reasoning, where models could potentially benefit from external tool usage. Specifically, we use StrategyQA (Geva et al., 2021) and MuSiQue (Trivedi et al., 2022b), which require reasoning about entity knowledge, and GSM8K (Cobbe et al., 2021) and DROP (Dua et al., 2019), which evaluate arithmetic reasoning. In DROP we select examples that have numerical answers. We randomly sample 500 examples from the development set of each dataset (with the exception of StrategyQA, whose test set has 229 examples), and use them for performance evaluation of Flan-UL2, Flan-PaLM-540B and Flan-PaLM-62B. For GPT-3, we use a subset of 250 examples of that set, due to cost. We use standard evaluation measures for every dataset (F1 in the case of MuSiQue). We provide data examples in Appendix A.
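For reference, the token-overlap F1 commonly used for answer strings in datasets like MuSiQue can be computed as below. This is a standard reconstruction of the metric, not the paper's exact evaluation script, and it omits the usual answer normalization (punctuation and article stripping) for brevity.

```python
from collections import Counter

def answer_f1(prediction: str, gold: str) -> float:
    """Token-level F1 between a predicted and a gold answer string."""
    pred_tokens = prediction.lower().split()
    gold_tokens = gold.lower().split()
    if not pred_tokens or not gold_tokens:
        # both empty counts as a match; one empty counts as a miss
        return float(pred_tokens == gold_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)
```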
Models. We evaluate the methods across four LMs: Flan-UL2-20B (Tay et al., 2023), GPT-3 (text-davinci-003) (Brown et al., 2020), Flan-PaLM-540B and Flan-PaLM-62B (Chung et al., 2022). We omit GPT-3 experiments on RARR and Interleaving due to cost. Importantly, our focus is not on comparing the performance of these models, but on using them as samples of different model instances and training schemes against which to compare different TA strategies.
Tools. We strictly use the same tools across all strategies, to ensure a fair comparison: Google Search (Press et al., 2023; Schick et al., 2023; Lewis et al., 2021) for the knowledge tasks, and a calculator (Schick et al., 2023; Qin et al., 2023) for the calculation tasks. RARR, SelfAsk and Interleaving are designed for retrieval settings only, while Inline and Check & Fix can be used in all settings. For the retrieval settings using Google Search and Flan-PaLM-540B, we test retrieval with both the top 1 and top 5 tool-retrieved snippets: the two formats are designed to cover both cases where a shorter tool output may prevent the model's answer from degenerating, and where a longer tool output may help the model with more relevant information.
Few-shot demonstrations. In order to overcome bias from using demonstrations from prior work that were likely seen during training (Jacovi et al., 2023), we re-annotate prompts for all TA strategies, datasets and tools. We randomly sample 8 examples from each dataset's training set, and annotate each example with demonstrations for each TA strategy. Some of the strategies call the model multiple times with different prompts (e.g., Check & Fix, RARR), which requires separate annotations. This effort results in a total of 184 annotated demonstrations, which we release as a resource for future work on TA generation. From each set of 8 demonstrations, we then construct three separate prompts, 3-shot, 5-shot and 7-shot, randomly sampled from the original 8 demonstrations, to get a better estimation of few-shot performance.
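The prompt construction described above can be sketched as follows. The sampling scheme follows the description (one k-shot prompt per size, drawn from a pool of 8 annotated demonstrations); details such as the fixed seed and the blank-line separator are our own assumptions for illustration.

```python
import random

def build_prompts(demos, sizes=(3, 5, 7), seed=0):
    """From one pool of annotated demonstrations, sample one k-shot
    prompt per requested size, without replacement within each prompt."""
    rng = random.Random(seed)  # fixed seed so prompts are reproducible
    prompts = {}
    for k in sizes:
        chosen = rng.sample(demos, k)      # sample k demonstrations
        prompts[k] = "\n\n".join(chosen)   # join into a single prompt string
    return prompts
```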

Comparative Results
Organization of the results. Due to the volume of experiments, full result tables are provided in Appendix B.

Tool vs. no tool. Previous work that proposes TA strategies found that using such strategies consistently improves performance in comparison to no-tool baselines (Press et al., 2023; Jiang et al., 2023; Trivedi et al., 2022a, inter alia).
Figure 3 shows that the TA strategies do not improve performance over the no-tool baselines in our selection of datasets. The figure shows results against the average of the different few-shot scores, though we observe similar trends when using the maximum of scores as well. Full results are in Appendix B. We conclude that for the settings in this work, the no-tool baselines are stronger than initially expected based on the literature. More research is required to investigate whether this relationship holds in other contexts, though we note that the datasets and models used in our experiments are common in TA research (Mialon et al., 2023).
Additionally, our experiments provide empirical justification for Recommendations (2) and (3) in §3. First, we find that the CoT and Inline baselines outperform each other at a roughly equal rate, and neither emerges as a clear winner. This shows that different baselines obtain different results, and so relying on only a single baseline in an evaluation does not necessarily provide a good estimation of no-tool performance (Recommendation (2)). Also, the best-performing strategies vary significantly across models, which highlights the importance of using multiple models for evaluation (Recommendation (3)); for illustration, we report the highest-performing strategies in each setting, showing that the overall conclusion can be distorted by choosing a particular model or strategy. Extended details are in Appendix B.1.
Tool use during generation vs. post-generation refinement. In Figure 3 we compare the strategies that use tools during generation against the strategies that first generate an answer, and then use tools to improve the answer. For retrieval tasks, refinement clearly outperforms non-refinement strategies, but the same does not apply to the calculation tasks. We conjecture that planning calculations ahead of time during generation is more aligned with LM pretraining data, which is based on internet text, than planning retrieval queries in similar contexts.
Token efficiency. TA strategies are typically evaluated in terms of task performance and properties such as factuality and logical correctness. We argue that computational cost is another important factor to consider. Specifically, we propose to evaluate token efficiency, that is, the number of prompt tokens and generated tokens, which have a direct effect on the cost of the TA strategy. Notably, the cost of a TA strategy depends on various variables, including model size, GPU type, caching optimizations, vocabulary size, beam search size, and so on. However, token counts can serve as a plausibly generic proxy for the purpose of comparing the cost of different TA strategies, as other factors are roughly equal across strategies, as long as the same models and tools are used. We consider prompt tokens and generated tokens separately, as they often have different consequences on cost. Tables 3 and 4 show both canonical and empirical comparisons across TA strategies with regards to token efficiency. The canonical comparison is a function of the relevant variables in the "canonical" setting where the model is expected to answer the question perfectly, and use the tool perfectly as intended. Across all TA strategy experiments, we found no general correlation between token efficiency and performance. Concretely: (1) All TA strategies are significantly more expensive than the no-tool baselines by orders of magnitude, while not incurring an improvement worthy of this extra cost. Empirically, using tools can incur extra costs by a factor of 5x to 10x for prompt processing, and 2x to 5x for generation. (2) The refinement strategies are more expensive than the non-refinement strategies. So while they improve performance for retrieval tasks, it comes at a cost.
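A minimal sketch of the token-efficiency comparison is shown below. The accounting scheme is our own illustration of the proposed measure, not the paper's measurement code: prompt and generated tokens are summed over all LM calls a strategy makes per example, and reported as multiplicative factors over a no-tool baseline, kept separate because the two token types are often priced differently.

```python
def token_efficiency(strategy_counts, baseline_counts):
    """Compare a TA strategy's token usage against a no-tool baseline.

    Each argument is a dict with total 'prompt' and 'generated' token
    counts for one example, summed over every LM call the strategy makes.
    Returns the multiplicative cost factors per token type."""
    return {
        "prompt_factor": strategy_counts["prompt"] / baseline_counts["prompt"],
        "generated_factor": strategy_counts["generated"] / baseline_counts["generated"],
    }
```

For example, a refinement strategy that re-prompts the model once per CoT step can easily multiply prompt tokens several-fold while only modestly increasing generated tokens, which the two separate factors make visible.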

Analytical Results
We discuss further analyses of our results, finding that (a) our observations generally hold across different levels of example difficulty, and (b) most prediction errors of tool-augmented LMs stem from incorrect inputs to the tool (and, to a lesser extent, bad outputs from it), and not from a lack of tool usage.

Example Difficulty
It has been shown that LMs have difficulty solving problems involving long-tail entities (Kandpal et al., 2022; Mallen et al., 2022) and complex mathematical reasoning challenges (Mishra et al., 2022; Imani et al., 2023). Accordingly, we ablate the results from §5 along the following axes of example difficulty, in order to understand how tools affect performance on difficult examples. We provide an overview of the trends here; extended results are available in Appendix B.
Measures of difficulty. We investigate the effectiveness of tool usage across varying levels of example difficulty, which we approximate along two axes: the rarity of the entities involved in the retrieval settings (via percentile of entity page views, where fewer views imply higher difficulty), and the size of the numbers involved in the calculation settings (via percentile of numeric range, where larger numbers imply higher difficulty); results are shown in Figures 4a and 4b, respectively. We find that performance uniformly decreases for harder examples in the retrieval setting for all models, but in the calculation setting, this only manifests for Flan-UL2-20B (implying that the larger models are more robust to the numerical ranges in GSM8K and DROP). Overall, in all cases tool use does not improve upon the baselines even when controlling for the harder cases where tools are expected to be more useful. This conclusion is aligned with our error analysis in §6.3, which shows that the common errors stem from incorrect tool arguments more than from correct tool arguments with incorrect inferences based on them. Flan-UL2 with a calculator is an exception, where tool use indeed helps, though more so on the easier examples, likely due to a higher rate of correct arguments to the calculator.
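One way to realize such a percentile-based difficulty split is sketched below. The two-way split at the median is our own simplification for illustration; which half counts as "harder" depends on the axis (lower page views mean harder, larger numbers mean harder).

```python
import statistics

def split_by_difficulty(examples, score_fn):
    """Split examples into two halves by a difficulty proxy score
    (e.g., entity page views, or the magnitude of numbers involved).
    Returns (low, high) halves relative to the median score; the
    caller decides which half is the harder one for its axis."""
    scores = [score_fn(ex) for ex in examples]
    threshold = statistics.median(scores)
    low = [ex for ex, s in zip(examples, scores) if s <= threshold]
    high = [ex for ex, s in zip(examples, scores) if s > threshold]
    return low, high
```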

Tool Usage Statistics
A possible explanation for the similar performance of no-tool baselines could be a lack of tool usage.
To check this, we aggregate tool usage over the different TA strategies, and find that the models indeed use tools in the majority of cases: 70%-80% in SelfAsk, and >90% in the others (see Appendix B). We also investigate usage across other axes, such as models and number of demonstrations, and find similar trends. However, the datasets and tasks we investigate are designed to benefit from the tools in all cases, which shows that few-shot demonstrations are not always sufficient to induce tool use in models. In particular, the SelfAsk strategies exhibit the lowest tool use, being the strategies that use natural language to query whether to use the tool (the answer begins with "Are follow up questions needed here:", to which the model answers "No" in the cases where the tool is not used).

Error Analysis
We sampled 50 instances for which an error was made by the TA models, randomly across the 5-shot experiments, and categorized them into three categories: (A) incorrect tool input; (B) incorrect tool output; (C) incorrect model inferences based on correct tool usage. Error B applies only to the retrieval settings, where the retrieval tool (Google Search in our case) retrieved a wrong or irrelevant snippet. The errors were distributed approximately as 60% (A), 10% (B), and 30% (C) in the retrieval setting, and 80% (A) and 20% (C) in the calculation setting. Li et al. (2023) reported an error analysis for tool assistance in dialogue customer-assistance settings, with similar conclusions regarding error A, although errors B and C do not apply in their context, and other error types manifest instead.

Figure 4: We analyze performance of the strategies across the two areas (no-tool baselines vs. TA strategies), conditioned on example difficulty, as defined by the existence of rare or common entities in the retrieval settings (via percentile of page views) and small or large numbers in the calculation settings (via percentile of numeric range). In (a), lower page views imply higher difficulty, and in (b), larger numbers imply higher difficulty.

Our results suggest that the majority of errors are not due to incorrect tool responses (i.e., issues with Google Search as a choice of retriever), and are influenced more by incorrectly invoking tools to begin with than by invoking them correctly but composing the solution incorrectly.

Conclusions and Takeaways
We conduct a comprehensive assessment of few-shot tool augmentation strategies for LMs, covering hundreds of experiments with multiple LMs, datasets, and tools. Our experiments show that current tool-usage integration approaches are presently a false promise: prompting strategies that do not use tools typically obtain similar task performance, without the high cost of tool execution. Controlling for example difficulty, where tools are expected to provide the most benefit, does not explain the relative strength of the no-tool baselines. Instead, the primary errors we observe are related to incorrect usage of the tools to begin with (i.e., generating incorrect arguments to the tool).
Our findings call for a more robust evaluation of future TA strategies, primarily in more practical settings where models are not expected to leverage inherent abilities to solve tasks. To this end, our work provides concrete evaluation guidelines, such as employing stronger baselines and factoring in computation costs.

Limitations
While our study aims to provide a comprehensive evaluation of TA strategies, there are some limitations. First, recent work (Dodge et al., 2021; Magar and Schwartz, 2022; OpenAI, 2023) suggests that examples from public datasets, like those used in our evaluation, may have leaked into the training data of recent LMs. Such contamination can introduce biases into the evaluation, such as a lack of need for external tools. We are not aware of alternatives without this issue at the time of writing.
Second, due to the high cost of executing large LMs in an exhaustive evaluation, we ran only a single experiment for each combination of TA strategy, model, dataset, and number of demonstrations.However, given the sensitivity of models to the demonstrations (Perez et al., 2021), future work should extend this evaluation to use multiple sets of demonstrations for each such combination.
Last, while our findings show that non-tool models often perform on par with existing TA strategies, our setting favors tool usage.For example, our tasks only require a single type of tool such that the model does not need to choose between multiple tools.Future work that investigates when and how tools can improve performance should consider more realistic evaluation settings, for example, by considering tasks where the model may need to use multiple types of tools together, or tasks where tools may sometimes give unhelpful answers.

A Implementation Details
A.1 Tool-Assisted Strategies.
General Details. In all cases, if the tool invocation fails (e.g., with an ill-formatted calculation, or a null response from Google Search), the model is used to generate the tool's output instead. For all retrieval settings using Google Search, we test both Top-1 and Top-5 retrieval: the two formats are designed to cover both cases where a shorter tool output may prevent the model's answer from degenerating, and where a longer tool output may help the model with more relevant information. Illustrative examples of the data are available in Table 5.
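The fallback behavior described above can be sketched as follows. The function and argument names are illustrative stand-ins, not the paper's implementation: a tool call that raises or returns an empty result is replaced by letting the model generate the tool's output in its place.

```python
def call_tool_with_fallback(tool, args, lm_generate, fallback_prompt):
    """Invoke a tool; if it fails (exception or empty response),
    fall back to letting the model generate the tool's output."""
    try:
        output = tool(args)
    except Exception:
        output = None            # e.g., an ill-formatted calculation
    if not output:               # null/empty response, e.g., no search results
        output = lm_generate(fallback_prompt + args)
    return output
```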
SelfAsk and SelfAskQA. SelfAsk involves decomposing each question into a series of simpler sub-questions, and calling the tool directly for each sub-question. The tool's output is inserted into the prompt as an intermediate answer. When the model generates a step that begins with the string "So the answer is:", it is expected to generate an answer that builds on the previous intermediate answers which were tool outputs. In this work, we use Google Search as the tool, as in the original work by Press et al. (2023).
Our SelfAsk implementation reuses the original implementation by Press et al. (2023). Since SelfAsk is designed specifically for knowledge-based QA, we only evaluate this strategy on the knowledge tasks MuSiQue and StrategyQA.
The SelfAskQA variant involves calling the model for each pair of sub-question and retrieved snippet that (hopefully) contains its answer. This method of recursively calling the model with a different prompt, as if it were another tool, is a technique proposed by Khot et al. (2023). We collect all sub-questions from the SelfAsk prompts in order to construct QA prompts (using the tool to retrieve supporting snippets). The model is called with the QA prompts in order to answer each sub-question based on its snippet. The SelfAskQA variant in essence summarizes each Google Search snippet, which can be as long as a paragraph, into a short answer to the given sub-question, effectively simplifying and shortening the overall answer.
Inline and InlineQA. The Inline strategy format largely mimics the Toolformer format by Schick et al. (2023), but can also be cast into the ART framework by Paranjape et al. (2023) or the Decomposed Prompting framework by Khot et al. (2023). In general, the strategy simply calls for generating the tool call in a predefined format, in our case, square brackets and the tool name. The tool is invoked with the arguments generated by the model inside the brackets, and the tool's output is inserted into the prompt. Our implementation is based on the inference code implemented by Schick et al. (2023), although notably, we focus on few-shot usage, and do not perform the tool-usage pretraining step that largely concerns the referenced work.
We implement two variants: Inline, which uses a tool called "Search" that appends the retrieved snippet or calculation output directly into the prompt, and InlineQA, which uses a tool called "QA" that calls the model with a separate prompt in order to summarize the retrieved snippet into a concise answer, identically to the aforementioned SelfAskQA variant. As with SelfAsk and SelfAskQA, neither of Inline and InlineQA consistently outperforms the other in the knowledge-based tasks.
Interleaving. The Interleaving Retrieval strategy (Trivedi et al., 2022a) proposes to use each reasoning step generated by the model in its CoT answer as a query to a retrieval model. The retrieved snippet is then added to the prompt in order to provide additional information to the model. The structure of each demonstration becomes: (1) all retrieved documents thus far; (2) the question; (3) the generated answer thus far (see Trivedi et al., 2022a for details). In this way, the tool is used heuristically without explicit demonstrations by the model, but the generation of the answer at each CoT step is still conditioned on tool usage from the previous steps.
Check & Fix. We propose this strategy as a more lightweight variant of tool-based refinement compared to RARR, and it is comparable to the contemporaneously proposed method of Jiang et al. (2023): after each CoT step, the step is checked for accuracy using a tool, and if it is found inaccurate, a new, fixed step is generated to replace it.
In the retrieval setting, each step is verified and fixed by prompting the model to classify whether the step is contradicted by the retrieved paragraphs, and if so, to generate the fixed step based on demonstrations. In the calculation setting, each step is first heuristically checked for whether it contains a calculation; if so, the calculation is passed to the calculator tool, and the model is prompted to verify whether the tool output is consistent with the calculation in the text. If it is inconsistent, the model generates the fixed step. In both cases, answer generation continues with the fixed step completely replacing the original incorrect step.
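The calculation-setting variant of Check & Fix can be sketched as follows. The heuristic regex, the `calculator` callable, and the `lm_fix` callable are hypothetical stand-ins; in particular, the paper prompts the model for the consistency check, whereas this toy sketch compares strings directly.

```python
import re

def check_and_fix_step(step, lm_fix, calculator):
    """Sketch of Check & Fix in the calculation setting: if a step contains
    an arithmetic expression, verify it with the calculator tool; on a
    mismatch, ask the model for a replacement step."""
    m = re.search(r"(\d+\s*[-+*/]\s*\d+)\s*=\s*(\d+)", step)
    if not m:
        return step  # heuristic check: no calculation found, keep the step
    expr, claimed = m.group(1), m.group(2)
    correct = calculator(expr)
    if str(correct) == claimed:
        return step  # tool output is consistent with the step's calculation
    # The fixed step fully replaces the original incorrect step.
    return lm_fix(step, expr, correct)
```

A real calculator tool would parse the expression safely; `eval` in the test below is only a toy stand-in for arithmetic.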
RARR. RARR (Retrofit Attribution using Research and Revision; Gao et al., 2023a) was proposed as a post-processing method for refining any text, including LM chain-of-thought outputs. This is done by automatically finding attribution for each claim in the text and post-editing the output to fix unsupported content while preserving the original output as much as possible. Our RARR implementation reuses the original implementation by Gao et al. (2023a).
The RARR process involves the following steps, each considered as a separate tool: 1. Question Generation: First, a series of questions is generated that cover various aspects of a passage x. The generated questions aim to verify and attribute information from the passage. This is done by prompting the LM with few-shot examples.
2. Evidence Retrieval: For each generated question, the Google Search tool is used to retrieve the top-k passages related to the question. In this work, we evaluate both top-1 and top-5.
3. Evidence Ranking: The retrieved evidence passages are then ranked using a query-document relevance scorer. Unlike the original RARR implementation (Gao et al., 2023a), which uses the GTR retrieval model (Ni et al., 2022), we implement the scorer via few-shot LM prompting, as suggested by the authors. The output of this stage is the top-1 ranked evidence.
4. Agreement Phase: Given a triplet of text, question, and evidence, this phase determines whether the text and the evidence imply the same answer to the question. This is implemented via few-shot LM prompting using a chain-of-thought style prompt. 5. Editing Phase: If the Agreement Phase detects a disagreement between the text and the evidence, the (text, question, evidence) triplet is fed to a model that outputs a revised version of the text, resolving the discrepancy between the previous text and the evidence. This is implemented via few-shot LM prompting using a chain-of-thought style prompt similar to that of the previous stage (see Gao et al., 2023a for the exact prompting template).

The Agreement and Editing phases run iteratively until the Agreement Phase detects no further needed revisions.
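The pipeline described in the steps above can be sketched at a high level as follows. All five callables are hypothetical stand-ins for the few-shot-prompted LM and search components, and the fixed `max_rounds` cap is an assumption added to guarantee termination.

```python
def rarr_revise(passage, gen_questions, retrieve, rank, agree, edit, max_rounds=3):
    """High-level sketch of the RARR loop: generate verification questions,
    retrieve and rank evidence, then iterate the agreement and editing
    phases until no revision is needed."""
    text = passage
    for _ in range(max_rounds):
        revised = False
        for question in gen_questions(text):        # 1. question generation
            evidences = retrieve(question)          # 2. evidence retrieval (top-k)
            evidence = rank(question, evidences)    # 3. evidence ranking -> top-1
            if not agree(text, question, evidence): # 4. agreement phase
                text = edit(text, question, evidence)  # 5. editing phase
                revised = True
        if not revised:  # no disagreements detected: stop iterating
            break
    return text
```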

A.2 Baselines
Chain-of-Thought. The CoT baseline is the standard baseline proposed by Wei et al. (2023) and implemented as a baseline by Press et al. (2023) and Paranjape et al. (2023), inter alia. Often, the demonstrations used for this baseline are those originally published by Wei et al. (2023). In this work, we annotate a new sample of examples with CoT answers for a better estimation of CoT few-shot performance, and release our annotations.
Self-Ask. The Self-Ask baseline uses the Self-Ask tool demonstrations, but does not invoke the tool after each "Follow up:" call, and instead generates the entire answer. This is the original no-tool baseline in Press et al. (2023).
Inline. The Inline baseline uses the Inline tool demonstrations, but does not invoke the tool after each tool call, and instead generates the entire answer. This is the original no-tool baseline in Schick et al. (2023).

B Extended Results
We provide the full results for our experiments (described in §4) in §B.1, and further analysis of TA strategy performance and tool usage in §B.2.

B.1 Full Experiment Results
Tables 9 and 10 detail our experiment results. For DROP and MuSiQue, we report F1 using the evaluation scripts provided by Dua et al. (2019) and Trivedi et al. (2022b), respectively. For GSM8K, we normalize the numerical answers and measure exact match. For StrategyQA, we normalize the answers (for capitalization, prefix and suffix punctuation, and so on) and measure exact match against "yes" and "no".
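A minimal sketch of the StrategyQA normalization described above is shown below; the actual evaluation script may differ in details (e.g., which punctuation is stripped), so this is illustrative only.

```python
import string

def normalize_yes_no(answer):
    """Normalize an answer string: strip surrounding whitespace and
    prefix/suffix punctuation, then lowercase."""
    return answer.strip().strip(string.punctuation + " ").lower()

def strategyqa_exact_match(prediction, gold):
    # gold is "yes" or "no"; the prediction is free-form model text.
    return normalize_yes_no(prediction) == normalize_yes_no(gold)
```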
Best-performing strategies and baselines in each setting. In Tables 2 and 6, we show the best-performing baseline and best-performing general strategy for each setting of model and dataset, based on the average scores across the three few-shot experiments. For strategies in general (Table 2), we see that the winning strategies vary significantly across models, which supports Guideline (3) in Table 1.
The distribution among the baselines is split 50%-50% between CoT and Inline. When considering each few-shot experiment separately (i.e., not taking the average), the distribution of which baseline achieves the best-performing score is 60.0%, 37.5%, and 2% for Baseline-CoT, Baseline-Inline, and Baseline-SelfAsk, respectively. This supports Guideline (2) in Table 1.

B.2 Analysis
Example Difficulty. Figures 5 and 6 show extended results for the example difficulty analyses in §6.
Here we consider the median of each difficulty metric (i.e., the difficulty across all entities or numbers in the example) rather than the minimum or maximum, as well as the ablation of refinement strategies against no-refinement strategies. We additionally checked two alternative axes: operation complexity (addition and subtraction as "easy" examples, and multiplication and division as "hard" examples), and popularity measured by links rather than by page views. The trends we observe in the main paper hold in all of these cases.
Tool Usage. Tables 7 and 8 show aggregate tool usage percentages over multiple axes. Overall, few-shot demonstrations induce tool usage in the majority of cases, though not all of them (i.e., below 100%).

Figure 1 :
Figure 1: Illustration of tool-assistance strategies that invoke tools and insert their outputs into the prompt (a), and strategies that first generate some output, and only use tools to fix and refine it (b).

[Figure 2: Example generation traces for the SelfAsk, Inline, and Interleaving strategies on the question "Who lived longer, Muhammad Ali or Alan Turing?", showing follow-up questions, Search calls with retrieved snippets, and intermediate answers.]

Figure 3 :
Figure 3: A comparison of evaluation scores across two areas (§5): (a) no-tool baselines vs. TA strategies; (b) tool usage via refinement of generated text vs. tool usage during generation, where the generated text contains tool arguments and is conditioned on tool outputs. The dark line marks the confidence interval among samples.
(A) Long-tail entities (retrieval): Following Mallen et al. (2022), we extract the entities from the question and associated gold answers in StrategyQA and MuSiQue, and use the corresponding entities' Wikipedia page views as a measure of popularity. (B) Large numbers (calculation): We segment the examples in the calculation tasks based on the range of the median and largest number in the example (question and gold solution in GSM8K, or question and context paragraph in DROP).
Results. Performance across increasing levels of entity popularity and computation complexity, with different LMs and TA strategies, is shown in Figures 4a and 4b.
Figure 4: We analyze the performance of the strategies across two areas (no-tool baselines vs. TA strategies), conditioned on example difficulty, as defined by the existence of rare or common entities in the retrieval settings (via percentile of page views) and of small or large numbers in the calculation settings (via percentile of numeric range). In (a), lower page views imply higher difficulty; in (b), larger numbers imply higher difficulty.
Tables 11, 12, 13, and 14 detail average and max aggregations over the few-shot prompts. As mentioned, we sample 500 examples for the Flan-PaLM-62B, Flan-PaLM-540B, and Flan-UL2-20B experiments, and 250 for the GPT-3 experiments, with the exception of StrategyQA, whose test set has 229 examples.

Figure 5 :
Figure 5: An extension of Table 3 with results for both the average across few-shot experiments (a-b) and the maximum across few-shot experiments (c-d), i.e., the maximum among 3-shot, 5-shot, and 7-shot for each experiment setting.

Figure 6 :
Figure 6: An extension of Table 4. (a-b) refer to taking the minimum of entity page views to ablate examples with rare entities, and the maximum of numbers to ablate examples with large numbers. (c-e) take the median in both cases, and (f) shows the results when comparing refinement against non-refinement TA strategies.

Table 2 :
For each combination of dataset and model, we derive the best-performing strategy based on the average score across the few-shot prompts. Notably, the best-performing strategy varies across models, datasets, and prompts, which means that it is necessary to evaluate over all axes to get a better estimation of general performance.

Table 3 :
Average number of prompt tokens per strategy (5-shot), with n as the CoT prompt length, t as the number of tool calls, and k as the tool's output length. Flan-PaLM-540B has a shorter context window than GPT-3, which limits prompt length. The canonical formula for RARR favorably assumes a single verification question.

Table 4 :
Average number of answer tokens across the 5-shot experiments, for each strategy.The RARR formula assumes a single verification question per step.
Answer: No. Stanley Baldwin was Prime Minister of the United Kingdom from 1923 to 1929. The woman Prime Minister directly before him was Margaret Thatcher, who served from 1979 to 1990. The woman Prime Minister directly after him was Theresa May, who served from 2016 to 2019. So the answer is no.

Table 5 :
Illustrative examples of various datasets, strategies, and model outputs. The answers from the Interleaving, Check & Fix, and RARR models are in the same format as the CoT baseline.

Table 6 :
For each combination of dataset and model, we derive the best-performing baseline on the average score across the few-shot experiments.There is no clear winner: Two of the baselines achieve the best score in 50% of cases.

Table 7 :
Note that RARR and Interleaving are guaranteed to use tools, so they are omitted.

Table 8 :
Overview of average rate of tool usage across experiments. Note that RARR and Interleaving are guaranteed to use tools.

Table 10 :
Results for the calculator settings of DROP and GSM8K. We omit Flan-UL2-20B results on DROP, as the model could not converge to solve the task with our prompts, likely because each example in this task is very long.

Table 11 :
Aggregations by few-shot prompt of the results in Table 9 (baselines).

Table 12 :
Aggregations by few-shot prompt of the results in Table 9 (TA strategies).

Table 14 :
Aggregations by few-shot prompt of the results in Table 10.