CREATOR: Tool Creation for Disentangling Abstract and Concrete Reasoning of Large Language Models

Large Language Models (LLMs) have made significant progress in utilizing tools, but their ability is limited by API availability and the instability of implicit reasoning, particularly when both planning and execution are involved. To overcome these limitations, we propose CREATOR, a novel framework that enables LLMs to create their own tools using documentation and code realization. CREATOR disentangles abstract tool creation and concrete decision execution, resulting in improved performance. We evaluate CREATOR on the MATH and TabMWP benchmarks, respectively consisting of challenging math competition problems and diverse tabular contents. Remarkably, CREATOR outperforms existing chain-of-thought, program-of-thought, and tool-using baselines. Additionally, we introduce the Creation Challenge dataset, featuring 2K diverse questions, to emphasize the necessity and benefits of LLMs' tool creation ability. Further research demonstrates that leveraging LLMs as tool creators facilitates knowledge transfer, and LLMs exhibit varying levels of tool creation abilities, enabling them to adapt to diverse situations. The tool creation ability revolutionizes the LLM's problem-solving paradigm, driving us closer to the next frontier of artificial intelligence. All our code and data are released.

To overcome these concerns, researchers have explored equipping LLMs with external tools to alleviate their memory burden and enhance their expertise (Qin et al., 2023). For instance, integrating tools such as question-answering systems or web search engines enables LLMs to learn how and when to access external resources for problem-solving (Nakano et al., 2021; Schick et al., 2023). Recent studies have also incorporated additional tools for LLMs, such as GitHub resources, neural network models (e.g., the Huggingface library), and code interpreters (e.g., the Python interpreter), aiming to enhance their capabilities (Gupta and Kembhavi, 2022; Surís et al., 2023; Shen et al., 2023; Liang et al., 2023; Lu et al., 2023). These tools require LLMs to provide detailed plans before utilizing them to solve complex problems.
However, tool-augmented LLMs still encounter challenges (Chen et al., 2022; Gupta and Kembhavi, 2022; Schick et al., 2023; Surís et al., 2023), particularly in the following aspects. (1) Limitation in scope: current approaches focus on a limited number of tools, making it difficult to find an appropriate existing tool for new problem types. (2) Fragility in reasoning: given that tasks are often complex, reasoning on the fly case by case can be fragile to random errors, while humans benefit from finding robust commonalities among multiple similar questions. (3) Insufficiency in error handling: current tool utilization pipelines lack automatic and specific error handling, so improvements in accuracy and robustness are needed to ensure reliable execution results.
In this paper, we propose a novel approach to address these challenges. Rather than treating LLMs as users of tools, we empower them to be creators of tools, enabling them to solve problems with higher accuracy and flexibility. We introduce our tool creation framework, CREATOR, which leverages LLMs' ability to create and modify tools based on the problem at hand. Figure 1 illustrates the differences between CREATOR and a general tool-using framework. While the tool-using framework focuses on reasoning to select and plan API usage, our framework emphasizes diversifying tool choices, disentangling abstract and concrete reasoning, and improving robustness and accuracy. Specifically, CREATOR consists of four stages:

• Creation: Create generally applicable tools with documentation and realization through abstract reasoning based on the problem.
• Decision: With available tools, decide when and how to use them to solve the problem.
• Execution: Execute the program, applying the chosen tools to solve the problem.
• Rectification: Make modifications to tools and decisions based on the execution result.

By introducing these four stages, we aim to better inspire the LLM's creativity and enhance the paradigm's robustness. This design sets CREATOR apart from traditional tool use and addresses the three challenges above by (1) leveraging LLMs to create tools with higher generality, reusability, and variety, rather than relying on a limited number of given APIs; (2) offloading the cognitive burden of LLMs and disentangling their ability to perform abstract reasoning (creation of generalizable tools) and concrete reasoning (decision-making with details); and (3) utilizing code as the medium for tool creation, which is more sensitive to errors, and enabling automatic rectification of tools and decisions based on error tracebacks.
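The four-stage loop above can be sketched in code. The following is a minimal, illustrative sketch rather than the paper's actual implementation: `llm(prompt)` stands for a hypothetical chat-completion call, the prompts are placeholders, and execution here runs in-process for brevity (CREATOR itself executes in a subprocess, as described later).

```python
# Minimal sketch of the four CREATOR stages. `llm` is a hypothetical
# text-completion callable; prompts here are illustrative placeholders.
import io
import traceback
from contextlib import redirect_stdout


def execute(code: str):
    """Run a code string, returning (success, printed output or traceback)."""
    buf = io.StringIO()
    try:
        with redirect_stdout(buf):
            exec(code, {})
        return True, buf.getvalue()
    except Exception:
        return False, traceback.format_exc()


def creator(llm, question: str, max_rectifications: int = 3) -> str:
    # Creation: an abstract, reusable tool with documentation
    tool = llm("Create a documented, general tool for: " + question)
    # Decision: how to call the tool; the answer must be print()-ed
    decision = llm("Tool:\n" + tool + "\nDecide how to call it and print the answer.")
    output = ""
    for _ in range(max_rectifications + 1):
        # Execution: run tool + decision as one program
        ok, output = execute(tool + "\n" + decision)
        if ok:
            return output
        # Rectification: revise based on the captured traceback
        tool = llm("Rectify the tool and decision given this traceback:\n" + output)
        decision = ""
    return output
```

A fake `llm` that returns a fixed tool and a fixed call is enough to exercise the loop end to end.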
To evaluate our design's effectiveness, we test CREATOR on two existing benchmarks, MATH (Hendrycks et al.) and TabMWP (Lu et al., 2022a), as well as the Creation Challenge dataset we create. The MATH dataset contains diverse and challenging math competition problems, while TabMWP includes a wide range of tabular contexts for problem-solving. Notably, ChatGPT built on CREATOR achieves remarkable average accuracies of 59.7% and 94.7% on MATH and TabMWP respectively, surpassing the standard chain-of-thought (CoT) (Wei et al., 2022), program-of-thought (PoT) (Chen et al., 2022), and tool-using baselines by significant margins.
As existing benchmarks do not specifically evaluate tool creation, we further introduce the Creation Challenge dataset, which consists of novel and challenging problems that are inadequately solved using existing tools or code packages. This dataset highlights the necessity and advantages of LLMs' tool creation ability. In addition, we show experimental results that provide evidence of how tool creation plays a crucial role in promoting knowledge transfer across similar queries that possess common core knowledge but differ in specific scenarios. We also present case studies highlighting the varying levels of tool creation ability observed in LLMs, allowing them to better adapt to diverse problem settings.

Related Work
Large Language Models. Large Language Models (LLMs) have gained attention for their impressive performance in handling various NLP tasks, following demonstrations and generating high-quality texts and codes (Brown et al., 2020; Chen et al., 2021; Chowdhery et al., 2022; Touvron et al., 2023). Prompting methods such as chain-of-thought (Wei et al., 2022), instruction-following (Wang et al., 2022b; Longpre et al., 2023; Chung et al., 2022; Touvron et al., 2023; Liu et al., 2023), and verification mechanisms (Fung et al., 2023) have been developed to guide LLMs in problem-solving and align their behavior with human expectations. Our work builds upon these areas, incorporating them into our framework and using them as baselines for complex problem-solving.
Tool Use and Tool Creation. As an emerging field within NLP, the active interaction of LLMs with environments is facilitated through tools that serve as the medium (Li et al., 2023). Recent studies address constraints of LLMs, such as limited real-time responsiveness and inaccurate calculations, by incorporating external tools (Trivedi et al., 2022; Komeili et al., 2022; Patel et al., 2021; Lu et al., 2022b). These studies augment LLMs with tools like scratch pads, search engines, QA systems, and calculators (Nye et al., 2021; Shuster et al., 2022; Schick et al., 2023) to improve task performance. More recent efforts integrate LLMs' tool-using abilities into a pipeline for task planning, tool calling, and result synthesis (Wu et al., 2023; Shen et al., 2023; Liang et al., 2023). In contrast, our work goes further by enabling LLMs to create tools instead of relying solely on existing ones. Concurrently with our work, tool creation is also investigated under the LATM framework (Cai et al., 2023) and in LLM customization (Yuan et al., 2023).
Reasoning and Execution with Program. Reasoning with programs is an emerging field in NLP, whose goal is to leverage code for complicated computational reasoning instead of natural language thoughts. Chen et al. (2022) show that code generation improves performance on math datasets, while Gao et al. (2022) and Wang et al. (2022a) further demonstrate the potential of program reasoning on symbolic and algorithmic benchmarks. These efforts present a code-based chain-of-thought with linear logic but produce no enhanced tools capable of being reused or tested. As the concept of tool use emerges, recent studies have begun to incorporate code interpreters as external tools (Lu et al., 2023; Mialon et al., 2023; Wang et al., 2023). In CREATOR, by contrast, we use code as the medium for tool creation rather than as an external tool. Our framework also excels over PoT as we devise a tool creation stage and a code rectification stage, and disentangle the logic in complex reasoning. CREATOR consists of four stages: creation, decision, execution, and rectification, as illustrated in Figure 2.
The utilization of tool creation for problem-solving is inherently straightforward and aligns with LLMs' innate abilities, as illustrated later in Section 5.2. The main design objective of CREATOR is to better inspire LLMs' creativity and facilitate more effective use of it. Previous CoT and PoT methods mainly apply linear reasoning to solve target problems, and their task-solving process lacks reusability. In contrast, the tools created in CREATOR can be transferred to solve other queries, and the incorporated rectification stage makes the reasoning process non-linear. We present a comprehensive comparison between CREATOR and other methods in Table 1. To obtain demonstrations, we build a fixed set in advance and use it subsequently for each task. Specifically, we randomly select a subset from the training set and prompt the LLM with the text instruction for tool creation for each query. We then correct any errors in these generations and remove verbose explanations before using them. Although the demonstrations come from the same task as the test queries, they are not required to be semantically similar to the test queries, as their main purpose is only to inspire the LLM's creativity and regulate its output format.

Design of CREATOR
Ability of Abstract Reasoning. The core importance of the tool creation stage is to trigger the LLM's ability to employ abstract thinking, alleviating the burden of reasoning during later stages. When LLMs create tools, they effectively use abstraction to address a particular problem type, necessitating a focus on the inherent characteristics of the problem rather than the specific numerical details. For example, in Figure 2, the LLM concentrates solely on recognizing the intrinsic nature of the problem and creates a tool for solving a three-variable equation system, disregarding all the numerical details and the specific expression being queried.
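As an illustration of such an abstract tool, a created function for a generic three-variable linear system might look like the following. This is a sketch only: the function name, interface, and use of Cramer's rule are our assumptions, not the tool actually shown in the paper's Figure 2.

```python
def solve_three_variable_system(a, b):
    """Solve a 3x3 linear system a . (x, y, z) = b via Cramer's rule.

    a: 3x3 nested list of coefficients; b: length-3 right-hand side.
    Raises ValueError if the system has no unique solution.
    Note: this tool is abstract -- it works for ANY three-variable
    linear system, independent of the numbers in a specific query.
    """
    def det3(m):
        # Determinant of a 3x3 matrix by cofactor expansion.
        return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
                - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
                + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))

    d = det3(a)
    if d == 0:
        raise ValueError("system has no unique solution")
    solutions = []
    for col in range(3):
        # Replace one column with b (Cramer's rule) and take the ratio.
        m = [row[:] for row in a]
        for r in range(3):
            m[r][col] = b[r]
        solutions.append(det3(m) / d)
    return tuple(solutions)
```

The decision stage would then plug in the concrete coefficients from the query and, as in the Figure 2 example, combine the returned components into the final answer.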

Decision Stage
Implementation Details. Similar to the creation stage, we instruct LLMs with demonstrations to decide how to use tools, using the same prompt text form. Each demonstration follows the format "### Question [QST]\n ### Solution [SOL]", where [SOL] represents the LLM's decision on tool calls in code format. We also derive a fixed demonstration set the same way as in the creation stage, except that the LLM is now prompted to call the given tools instead of creating them, and to print out the final answer along with any important information through "print(...)" in code. This [INSTRUCTION] applies both to obtaining demonstrations and to test-time inference, which ensures that the LLM's answer can be easily extracted from the printed outputs in subsequent stages. A detailed prompt text example is shown in Figure 15.
Ability of Concrete Reasoning. The decision stage necessitates the LLM's meticulous attention to rules and details for problem-solving, which we refer to as concrete reasoning. In Figure 2, the solution obtained from the tool needs to be summed for the final answer. This requires the LLM to understand the tool's outputs and relate them to the specific query to make an informed decision and finally derive the correct answer. By separating creation from decision, CREATOR disentangles two phases of the LLM's abilities, which facilitates a smoother elicitation of different aspects of knowledge and improves task performance.

Execution Stage
The execution stage takes the information from previous stages and executes the tool leveraging the code interpreter. We do not apply the LLM at this stage; the created tools and the LLM's decision are concatenated into a cohesive code block for execution. The tool is encapsulated within a function in the code block, and the LLM's decision calls it for problem-solving. During execution, we capture any printed outputs (as instructed in the decision stage) and any errors encountered (by intercepting error messages in a subprocess). This information serves as input for subsequent stages to determine whether an answer can be obtained or rectifications are needed.
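A minimal sketch of this execution stage, assuming the tool and decision arrive as Python code strings; the helper name and exact capture details are our assumptions, but the shape (concatenate, run in a subprocess, return stdout or the traceback) follows the description above.

```python
# Sketch of the execution stage: concatenate the created tool and the
# decision code, run the result in a subprocess, and capture either the
# printed output or the error traceback for the next stage.
import subprocess
import sys


def execute_in_subprocess(tool_code: str, decision_code: str, timeout: int = 60):
    """Return ("ok", stdout) on success or ("error", stderr) on failure."""
    program = tool_code + "\n\n" + decision_code
    proc = subprocess.run(
        [sys.executable, "-c", program],
        capture_output=True, text=True, timeout=timeout,
    )
    if proc.returncode == 0:
        return "ok", proc.stdout
    return "error", proc.stderr  # traceback text, fed to rectification
```

Running the program in a separate process isolates crashes and lets the stderr traceback be captured verbatim, which is exactly what the rectification stage needs as input.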

Rectification Stage
Implementation Details. During the rectification stage, CREATOR takes one of two paths based on the information passed to it. If an error occurs, the LLM is prompted with demonstrations to rectify it. Applying a similar prompt format as before, each demonstration provides the original tool implementation and calling decision in [ORI], offers the error tracebacks in [ERR], and concatenates natural language reasoning about the error with the rectified code in [REC].
A detailed illustration of the prompt text is shown in Figure 16.
If the execution is successful, the answer is extracted from the model's captured output and compared to the standard answer to measure accuracy.
Significance. During the rectification process, we provide the LLM with error tracebacks, which offer crucial information for it to identify the error's location and causes. Armed with this guidance, the LLM can recover from previous mistakes, adjust its reasoning process, and attempt to solve the problem once again. Subsequent experiments will demonstrate how the inclusion of rectification significantly improves the performance of CREATOR. The success of the rectification stage also showcases the LLM's ability to recognize misconceptions and self-correct.
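To make the mechanics concrete, a rectification prompt along the lines described, with slots for the original code ([ORI]), the traceback ([ERR]), and the rectification ([REC]), might be assembled as follows. The wording is our assumption; the paper's actual prompt is shown in Figure 16.

```python
# Hedged sketch of rectification-prompt assembly. The section names the
# slots [ORI], [ERR], [REC]; the surrounding wording here is assumed.
def build_rectification_prompt(original_code: str, error_traceback: str) -> str:
    """Combine the failed tool+decision ([ORI]) and the captured
    traceback ([ERR]) into a prompt whose completion is expected to
    contain reasoning about the error plus the rectified code ([REC])."""
    return (
        "### Original Tool and Decision\n" + original_code + "\n"
        "### Error Traceback\n" + error_traceback + "\n"
        "### Rectification\n"
        "Reason about the cause of the error, then give the corrected "
        "tool and decision as a single code block."
    )
```

The traceback text captured during execution is pasted in verbatim, which is what lets the LLM localize the failing line rather than guessing.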

Experiments
To evaluate the effectiveness of CREATOR, we conduct experiments on two established benchmarks: MATH (Hendrycks et al.) and TabMWP (Lu et al., 2022a). Additionally, we perform experiments on a newly introduced dataset, Creation Challenge, comprising 2K diverse questions that are inadequately solved using existing tools or code packages. This enables us to further demonstrate the necessity and advantages of the LLM's tool creation ability.

Experimental Setup
Settings. We select ChatGPT as the base model for all methods due to its exceptional capabilities in code generation, decision-making, and logical reasoning. Refer to Appendix A.1 for more details. We evaluate CREATOR on two existing datasets: TabMWP, which includes diverse table-related problems, and MATH, consisting of challenging math competition problems. We apply them as they are representative in terms of diversity in data format and difficulty. We also assess the performance of our framework on Creation Challenge, comprising 2K data points, to explore the impact of tool creation hints on the LLM's performance. Refer to Appendix A.2 for more details.
Baselines.We compare CREATOR against four types of baselines to demonstrate its effectiveness: • Vanilla LLM w/ and w/o CoT: The Vanilla LLM with CoT employs linear reasoning to solve problems, while Vanilla LLM without CoT directly generates the answer.
• PoT: The LLM utilizes a program to reason through the problem step by step. We also incorporate rectification into PoT as a stronger baseline for a fair comparison.
• Tool Use: The LLM utilizes the WolframAlpha API as a general-purpose tool specialized in calculations. It is a fair external tool, as all queries require numerical reasoning to some extent.
• CREATOR-Entangled: The LLM combines the creation and decision stages in CREATOR instead of disentangling them, which serves as a special baseline for the ablation study.

Creation Challenge
Existing benchmarks are not originally designed to evaluate tool creation and thus cannot fully showcase the necessity and advantages brought by the LLM's tool creation ability. We therefore introduce Creation Challenge, comprising 2K diverse questions that are inadequately solved by existing tools or code packages, to test the LLM's tool creation ability with varying levels of hint utilization. We encourage future research to explore the dataset's potential through more flexible usage.

Experimental Results
We present the results on MATH, TabMWP, and Creation Challenge respectively in Tables 2 to 4. CREATOR achieves accuracies of 59.7%, 94.7%, and 75.5% respectively on the three tasks, surpassing the best baseline performance by large margins. To illustrate CREATOR's advantage, we present a case study showing how it outperforms Tool Use in Figure 3A. For all tasks, disentangling the creation and decision stages generally results in better performance compared to CREATOR-Entangled. For Creation Challenge, we also observe that hints for tool creation can raise performance by up to 18.7%. We further analyze the reasons for this improvement in Section 4.4.

Results Analysis
CoT Incompatible with Code. As shown in Figure 3B, adopting brute-force algorithms and straightforward calculations when CoT is not applied yields higher accuracy on MATH.
In contrast, TabMWP involves simpler calculations and more straightforward reasoning paths, promoting consistency between natural language and programming reasoning. Therefore, the application of CoT enhances performance in these cases. We present more case studies to illustrate this in Appendix C.1.
CREATOR is Robust to Challenges. Figure 4 illustrates the performance of the LLM in relation to problem difficulty. CREATOR outperforms all the baselines on both tasks and achieves higher accuracy, particularly on difficult problems. This provides compelling evidence that CREATOR exhibits greater resilience to challenges.

Rectification Raises Performance. The rectification stage allows the LLM to recover from execution errors, enabling it to solve problems with more flexibility and less reasoning burden.

Facilitation of Knowledge Transfer
One of the main purposes of tool creation lies in its reusability. The content of a tool represents the abstraction of knowledge concepts, so the creation of one tool may help solve problems across various scenarios that share the same core concept. For instance, a keyword-extraction tool created for sentiment analysis can be applied to other tasks like document categorization and topic modeling, as they all require the identification and extraction of relevant keywords for problem-solving. By utilizing the knowledge and logic embedded in the tool, the LLM can transfer its understanding to solve similar problems efficiently with higher performance.
Settings. To validate our hypothesis, we construct a small set of questions with 300 data points, detailed in Appendix B.2. We divide the data points into 100 sets, where all three queries in one set share the same core knowledge concept (a key methodology that is universally applicable) but differ in scenario (problem background and the specific details inquired).
Similar to previous experiments, we use ChatGPT as the base LLM with detailed settings unchanged. We first test all the problems under the normal CREATOR framework. Then, we test whether the correct tool created under one scenario can be applied to the other two, and again measure the LLM's performance.
Results Analysis. The statistics are presented in Table 5. Through the application of transferred tools, the LLM's accuracy can be raised by 15.3%. Further analysis shows that 39 sets of queries are positively influenced by this transfer, which highlights that the LLM's tool creation ability can facilitate knowledge transfer, leading to better performance on clusters of problems that share similar core concepts.

Different Levels of LLM's Tool Creation
We observe in our experiments that LLMs can create tools at different levels without special guidance, which affirms that creativity is an intrinsic emergent ability of LLMs. By inspecting the created tools, we find that they fall into three levels, which provides guidance and reference for future development.
1. Enhancement of Existing Tool. First, LLMs demonstrate the capability to enhance existing tools by encapsulating an existing tool or API and repurposing it to serve different needs. The first case in Figure 9 shows how the LLM wraps an existing weather query API into a new tool that calculates the average temperature.
2. Concatenation of Multiple Tools. Second, the LLM can create new tools by organizing multiple APIs into a pipeline, enabling it to fulfill specific purposes. The second case in Figure 9 shows how the LLM calls two existing APIs three times in the new tool for problem-solving.
3. Hierarchical Tool. Third, the LLM can create tools with a clear hierarchy, which establishes clear caller-callee relationships among tools and reduces the burden of repetitive reasoning. The third case in Figure 9 illustrates a hierarchical structure where the first tool serves as the callee, while the second tool primarily solves the problem.
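For illustration, a hierarchical created tool of this kind, in the spirit of the is_prime example discussed in Appendix C.2, might look like the following sketch; the exact code the LLM produced is assumed, not reproduced from the paper.

```python
# Sketch of a hierarchical created tool: `is_prime` is the small callee,
# and the main tool delegates to it, mirroring the caller-callee
# structure described above (exact code assumed).
def is_prime(n: int) -> bool:
    """Callee ("sub-tool"): primality test by trial division."""
    if n < 2:
        return False
    for d in range(2, int(n ** 0.5) + 1):
        if n % d == 0:
            return False
    return True


def count_primes_in_range(lo: int, hi: int) -> int:
    """Main tool: count primes in [lo, hi] by calling is_prime."""
    return sum(1 for n in range(lo, hi + 1) if is_prime(n))
```

Factoring the primality check into its own function means the main tool's logic stays a one-line aggregation, which is exactly the offloading of repetitive reasoning the hierarchy provides.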

Conclusion
We propose the concept of automatic tool creation through LLMs and empirically devise CREATOR, which harnesses the capabilities of LLMs as tool creators. By disentangling the LLM's abstract and concrete reasoning, CREATOR enables clearer logic and enhances overall performance. Through comprehensive evaluations on established benchmarks and Creation Challenge, we demonstrate the superiority and indispensability of CREATOR compared to existing CoT, PoT, and tool-using approaches. We anticipate our study will inspire the development of more sophisticated AI systems leveraging the LLM's tool creation potential.

Limitations
Our experiments are limited to two established benchmarks, MATH and TabMWP, along with our newly introduced dataset, Creation Challenge. It is crucial for future research to expand the application of our framework to a broader array of tasks. This will enable a comprehensive assessment of the generalizability of our results, going beyond the scope of our current investigation.
Furthermore, our demonstration of the LLM's potential in tool creation is limited in scope. For instance, the current LLM is capable of creating tools even at the scale of a full project pipeline, but the execution ability and correctness of such creations still lack proper evaluation and remain questionable. It is incumbent upon future research to delve deeper into the boundaries of LLMs' capabilities and establish clear limits regarding their tool creation potential.

Ethics Statement
We consider the following research issues in this paper:

• Privacy involves safeguarding sensitive information and preventing its unauthorized disclosure. With respect to our framework, privacy becomes a concern when certain stages require demonstration examples and clear instructions, which may inadvertently contain sensitive information or be intentionally designed to prompt the LLM to leak private information. Thus, it is crucial to ensure that personal or sensitive information is not disclosed to the closed-source LLM, and that private information or knowledge about tool creation in the closed-source LLM is well protected.

• Fairness in AI aims to ensure that the outputs and decisions made by AI systems do not perpetuate existing biases or discrimination. When creating tools, care must be taken to mitigate biases in the demonstrations and instructions, monitor the tool's performance across stages, and address any disparities that may arise in the whole generation or rectification process.

• Transparency involves making AI systems and their processes understandable and interpretable.
When the language model creates tools under our framework, it is essential to have transparency regarding how those tools are developed. Developers should document any biases or limitations associated with the created tools, understand the strengths and weaknesses of the tools and how decisions are reached, and make informed decisions about their application.

Appendices A Details about Experiment
A.1 Model Details.
We employ GPT-3.5-turbo as the base model for all our experiments. The maximum generation length for all experiments is set to 512, and a temperature of 0.3 is chosen to encourage deterministic generations while maintaining a certain degree of diversity, particularly during the creation of tools.
For both the MATH and TabMWP datasets, we evaluate questions that have numerical answers (e.g., integers or decimals). This is due to complicated format-matching problems (e.g., matching matrices as answers) that may introduce bias. The tested questions cover approximately 80% of all questions and maintain high diversity, making our results representative. We plan to update our results on all MATH questions by applying post-processing soon², but some matching problems remain hard to solve.
The MATH dataset consists of seven math competition problem domains: algebra, counting and probability, geometry, intermediate algebra, number theory, pre-algebra, and pre-calculus. Each domain is evaluated separately, and the final metric is computed as the weighted average score. The TabMWP dataset includes a wide range of table information and problems of different difficulty levels, spanning from grade one to grade eight.

B Details about New Datasets B.1 Creation Challenge Details
We begin by constructing a seed dataset that involves novel settings and unconventional reasoning processes. Subsequently, we utilize the Text-Davinci-003 model to expand the dataset iteratively. By randomly sampling from the seed data, we encourage variety and novelty in the problems and their reasoning.
Figure 6 illustrates a sample query and its corresponding solution. Each data entry comprises the problem statement, a standard created tool that can be utilized (including utility, input, output, and realization), a tool-calling decision, and a final answer.
² https://github.com/openai/prm800k/blob/main/prm800k/grading/grader.py

[Figure 6 sample, Creation Challenge]
Problem: Suppose you are a scientist studying the spread of a virus. You have collected data on the number of new cases reported each day over a 10-day period. You suspect that the number of cases might be modeled by a polynomial function of degree 5. You want to fit a polynomial function to the data using the method of least squares and predict the number of new cases on the 11th day.
Tool (partially shown):
    ... for i in range(6)]
    predict = sum(terms)
    return predict
Decision:
    x = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
    y = np.array([5, 12, 23, 42, 75, 122, 187, 272, 379, 510])
    # Fit the data and do prediction
    predict = polynomial5(x, y, 11)
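For reference, here is a runnable reconstruction of such a tool, assuming the standard least-squares fit from NumPy (`np.polyfit`); the visible fragment suggests the original evaluated the fitted terms explicitly, but the missing lines are our assumption.

```python
# Reconstruction (assumed) of the Figure 6 polynomial-fitting tool.
import numpy as np


def polynomial5(x, y, x_pred):
    """Fit a degree-5 polynomial to (x, y) by least squares and
    evaluate it at x_pred."""
    coeffs = np.polyfit(x, y, 5)  # highest-degree coefficient first
    terms = [coeffs[i] * x_pred ** (5 - i) for i in range(6)]
    predict = sum(terms)
    return predict


x = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
y = np.array([5, 12, 23, 42, 75, 122, 187, 272, 379, 510])
# Fit the data and predict new cases on day 11
predict = polynomial5(x, y, 11)
```

Note the tool itself is abstract (any data, any evaluation point); only the decision part carries the query-specific numbers.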

B.2 Tool Transfer Dataset Details
We create 300 data points in total and divide them into sets of three. Each set of three queries contains three corresponding answers, one standard tool that can be applied in all three scenarios to solve the problem, and three decisions about how to use the tool respectively. Similar to the construction of Creation Challenge, we manually write the seed data, which includes five sets of queries used as examples to show the format of each data point, sample demonstration examples from these seeds, and leverage Text-Davinci-003 to create more data iteratively.
We present a sample set from the tool transfer dataset we curate in Figure 7. In the set, three different scenarios are provided, each consisting of a query, a sample decision, and an answer (not listed). Though the scenarios seem unrelated, they share the same core knowledge, which can be transferred. In this case, the core knowledge is the calculation of profit. We also provide an example tool that can be applied to all three scenarios with a corresponding introduction. Note that each set we define actually represents three data points.
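As an illustrative sketch of such a shared tool (names and interface assumed, not the dataset's actual code): given the profit formula stated in Figure 7, profit is linear in quantity, so the optimum sits at the demand or capacity cap whenever the unit margin is positive.

```python
# Sketch of a shared profit tool, using the formula from Figure 7:
# Profit = (Selling Price * Quantity) - (Variable Cost * Quantity) - Fixed Cost.
# Names are assumed for illustration.
def optimal_quantity(fixed_cost, variable_cost, price, max_units):
    """Return (best_quantity, best_profit) over quantities 0..max_units,
    where max_units is the demand or production-capacity cap."""
    margin = price - variable_cost  # profit contributed per unit
    q = max_units if margin > 0 else 0
    return q, margin * q - fixed_cost
```

The same tool answers all three scenarios; only the decision differs, e.g. Scenario 3 plugs in fixed_cost=50000, variable_cost=15, price=25, max_units=10000.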

C.1 CoT Incompatible with Code
In this section, we provide more cases to further illustrate our arguments made in Section 4.4 about the conflicts between natural language thoughts and program thoughts. We contrast two additional cases, sourced respectively from MATH and TabMWP, in Figure 8.

[Figure 7: a sample set from the tool transfer dataset. We provide three scenarios sharing the core knowledge and a sample tool that all three scenarios can utilize.]

Scenario 1: Pricing Strategy
Query: A company produces a product that has a fixed cost of $20,000, a variable cost of $10 per unit, and a demand of 10,000 units. The company wants to maximize profit and is considering two pricing strategies. The first strategy is to sell the product at $30 per unit, and the second strategy is to sell the product at $35 per unit. What is the optimal pricing strategy for the company?
Sample Decision: ...

Sample Tool (common to all three scenarios): The tool is used to calculate the optimal number of units to produce to maximize profit for a manufacturing company. It takes into account the fixed costs, variable costs, selling price, and demand for the product. The function uses the formula Profit = (Selling Price * Quantity) - (Variable Cost * Quantity) - Fixed Cost to calculate the profit and returns the optimal quantity to produce.

Scenario 2 (partially shown)
Query: ... The company can sell up to 5,000 units of the product at this price. What is the optimal number of units to produce to maximize profit?
Sample Decision: ...

Scenario 3: Capacity Planning
Query: A company produces a product that has a fixed cost of $50,000, a variable cost of $15 per unit, and a selling price of $25 per unit. The company has a production capacity of 10,000 units. What is the optimal number of units to produce to maximize profit?
Sample Decision: ...
In the case of MATH, the ambiguity of "string manipulation" mentioned in the natural language thoughts leads the model to create a tool that finds the hundredth digit in a hard-coded manner, while pure code generation in creating tools avoids this problem.
Conversely, for TabMWP, CoT helps tool creation by avoiding unnecessary complexity in simple problem-solving. In the second case, the natural language thoughts indicate clearly that only a simple multiplication should be done, while pure code generation gets trapped in a complex and chaotic logic that is prone to error.
These two cases further validate the conflicts between natural language thoughts and program thoughts, especially for challenging problems which may possess multiple reasoning paths that differ in suitability for code and natural language.

C.2 Different Levels of Tool Creation
We present in this section more details about the different levels of tool creation mentioned in Section 5.2. We present three cases in Figure 9.
The enhancement of existing tools is presented in the first case. After the query, the LLM is given an existing API that can be called for a fixed purpose. This mimics the real-world scenario where an API document is given to let one fulfill a particular purpose. At this level, the LLM learns how to create tools first by comprehensively understanding the existing tool and then transferring this knowledge to the new problem scenario. In this case, the LLM learns how temperature should be averaged across several days, and subsequently creates a tool that solves the problem.
The concatenation of multiple tools is presented in the second case. Here, the LLM is given several tools to solve the problem, but the usage of each tool is rather simple to follow. This level of tool creation requires the LLM to plan the use of the tools and organize them with clear logic. Beyond calling tools to serve a different purpose, this level also illustrates the LLM's ability to implement a pipeline that solves specific queries through tool creation.
The hierarchy of tool creation is presented in the third case. This is not only the most common phenomenon we observe in the experiments but also represents the most advanced aspect of the LLM's reasoning potential. By creating tools with a clear hierarchy, the LLM succeeds in offloading more of the reasoning burden and thus solves the problem with higher accuracy. In this case, is_prime represents only a "sub-tool", while the main tool solves the problem with more ease by calling it to help count the valid numbers.
Overall, the presented case studies provide valuable insights into the tool creation abilities of the LLM.

D Prompting Details
All the methods we present in our main experiments require prompting to formalize the LLM's response and better elicit its ability. Specifically, we separate Tool Use into two parts: the first inspires the LLM's ability to call WolframAlpha properly, and the second prompts the LLM to retrieve the final answer. For the CREATOR setting, the prompts are separated according to the different stages. Note that the execution stage does not need prompting.

### Instruction
You are given a math question. You could call the WolframAlpha API to help you solve the question. After seeing a question, you should first generate thoughts and think about how to call the API. Generate "WOLFRAM:" in the last line of your response with the appropriate inputs you'd like to inquire about.

### Question
Point $P$ lies on the line $x = -3$ and is 10 units from the point $(5,2)$. Find the product of all possible $y$-coordinates that satisfy the given conditions.

### Instruction
You are given a math question. You have just called the WolframAlpha API to help you solve the question. Please continue to generate your final numerical answer, using the return from the WolframAlpha API as reference. If there is an error returned from the API, you should continue your thought step by step and give your final answer. Generate "Final Answer:" in the last line with your final numerical answer.

### Question
Point $P$ lies on the line $x = -3$ and is 10 units from the point $(5,2)$. Find the product of all possible $y$-coordinates that satisfy the given conditions.

Figure 1: The difference between CREATOR and a general tool-using framework.

3.1 Creation Stage
Implementation Details. In the creation stage of CREATOR, we explicitly instruct LLMs with demonstrative examples to create tools and documentation to solve the problem. The general prompt text follows the form "###Instruction [INSTRUCTION]\n [EXAMPLE 1]\n [EXAMPLE 2] ...". Here the instruction text "[INSTRUCTION]" describes the goal and format of the output. Each demonstration "[EXAMPLE x]" follows the format "### Question [QST]\n ### Tool [TOOL]". Each [TOOL] contains documentation text as code comments. A detailed example of the prompt text is shown in Figure 14.
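The described format can be assembled as a string template. A minimal sketch, assuming placeholder instruction and example text (the actual prompts are in Figure 14):

```python
def build_creation_prompt(instruction: str,
                          examples: list[tuple[str, str]],
                          question: str) -> str:
    """Assemble the creation-stage prompt in the format described above:
    an instruction block, demonstration (question, tool) pairs, then the query."""
    parts = [f"### Instruction\n{instruction}"]
    for qst, tool in examples:
        parts.append(f"### Question\n{qst}\n### Tool\n{tool}")
    # End with the new question so the LLM continues with the tool.
    parts.append(f"### Question\n{question}\n### Tool")
    return "\n".join(parts)
```

The final "### Tool" header is left open so the model's completion becomes the created tool.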

Figure 2: Overview of our CREATOR framework with four stages: Creation, Decision, Execution, and Rectification. With an LLM like ChatGPT, we successfully leverage its tool creation ability with code as the medium. In each stage we apply instructions and demonstrations in prompts, shown in Figures 14 to 16 in the Appendices.

Figure 3: In subfigure A, we show an example in which Tool Use reasoning (left) fails, while CREATOR (right) succeeds, as it derives a new tool for the novel question. In subfigure B, we present a case comparing the answers given by CREATOR with and without CoT. Challenging problems in MATH cause conflicts between language and program reasoning.

Figure 4: Comparison of the accuracy of baselines and CREATOR w.r.t. problem difficulty.
Figure 5:

Figure 6: An example query and its solution provided in the Creation Challenge dataset.

Figure 7: An example data point in the tool transfer dataset. We provide three scenarios sharing the core knowledge and a sample tool that all three scenarios can utilize.
Figures 14 to 16 show the general prompting format and the formats of demonstrative examples, as detailed in Section 3. For the creation, decision, and rectification stages, we apply demonstration examples to enhance the LLM's abstract and concrete reasoning ability, while the execution stage is intrinsically unrelated to the LLM. We present one demonstrative example about a query in MATH, but other tasks, including TabMWP and Creation Challenge, follow the same prompting format.
Prompting of Baselines. Besides CREATOR, we also apply demonstrative examples when prompting ChatGPT's CoT, PoT, and Tool Use abilities, presented in Figures 10 to 13 respectively. Similar to the prompting of CREATOR, these prompt formats apply to all tasks in the main experiments, including evaluation on MATH, TabMWP, and Creation Challenge.


Figure 10: The instruction and one of the demonstration examples we use when prompting ChatGPT in the CoT setting.
Figure 12: The instruction and one of the demonstration examples we use when prompting ChatGPT in the Tool Use setting. This figure shows the first part, about the WolframAlpha inquiry.


Figure 13: The instruction and one of the demonstration examples we use when prompting ChatGPT in the Tool Use setting. This figure shows the second part, about answer retrieval.

Table 2: The accuracy (%) on the test set of the MATH dataset leveraging ChatGPT. Rec. represents Rectification.
Evaluation. The components of the standard created tool in each data point of Creation Challenge can serve as valuable hints for the LLM's tool creation. Therefore, we extend our experiments on Creation Challenge to assess the LLM's tool creation ability under different levels of hints.

Table 3: The accuracy (%) on the test set of the TabMWP dataset leveraging ChatGPT. Successful Execution indicates whether the LLM provides a valid final answer through words or code within the rectification limit.

Table 4: The accuracy (%) on the Creation Challenge test set leveraging ChatGPT. No hint represents the normal CREATOR framework. Utility hint provides hints about the utility of the tool, while all hint additionally offers hints about the possible inputs and outputs of the tool.

Table 5: Results of the tool transfer experiment. Tool transfer improves accuracy by up to 15.3%.