Reasoning with Language Model Prompting: A Survey

Reasoning, as an essential ability for complex problem-solving, can provide back-end support for various real-world applications, such as medical diagnosis, negotiation, etc. This paper provides a comprehensive survey of cutting-edge research on reasoning with language model prompting. We introduce research works with comparisons and summaries and provide systematic resources to help beginners. We also discuss potential reasons for the emergence of such reasoning abilities and highlight future research directions. Resources are available at https://github.com/zjunlp/Prompt4ReasoningPapers (updated periodically).


Introduction
Reasoning ability lies at the heart of human intelligence, yet in natural language processing (NLP), modern neural networks can hardly reason from what they are told or already know (Duan et al., 2020; Bhargava and Ng, 2022). Fortunately, with the revolutionary development of pre-training (Brown et al., 2020; Chen et al., 2021; Chowdhery et al., 2022), scaling up the size of language models (LMs) has been shown to confer a range of reasoning abilities, such as arithmetic (Lewkowycz et al., 2022), commonsense, and symbolic (Khot et al., 2023) reasoning. As shown in Figure 1, such abilities may be unlocked by prompting strategies, e.g., chain-of-thought (CoT) prompting and generated knowledge prompting, which can dramatically narrow the gap between human and machine intelligence. Likewise, a vast amount of work has been proposed in the NLP community; however, these approaches, scattered among various tasks, have not been systematically reviewed and analyzed.
Figure 1: Reasoning with language model prompting. In-context exemplars, knowledge, or simply "Let's think step by step!" serve as prompts to enhance reasoning in language models.
Organization of This Survey: In this paper, we conduct the first survey of recent progress in reasoning with language model prompting. We first give some preliminaries on this direction ( §2) and then propose to organize relevant works by taxonomy ( §3). We further provide in-depth comparisons with discussion for insights ( §4). To facilitate beginners who are interested in this field, we highlight some open resources ( §5) as well as potential future directions ( §6).

Preliminaries
In this section, we introduce preliminaries of reasoning with LM prompting. For standard prompting, given the reasoning question Q, prompt T, and parameterized probabilistic model p_LM, we aim to maximize the likelihood of the answer A as:

p(A | T, Q) = \prod_{i=1}^{|A|} p_LM(a_i | T, Q, a_{<i}),

where a_i and |A| denote the i-th token and the length of the final answer, respectively. For few-shot prompting, T is comprised of K exemplars of (Q, A) pairs. CoT approaches further add reasoning steps C into the prompt:

p(A | T, Q) = p(C | T, Q) \cdot p(A | T, Q, C),

where p(C | T, Q) and p(A | T, Q, C) are defined as follows:

p(C | T, Q) = \prod_{i=1}^{|C|} p_LM(c_i | T, Q, c_{<i}),
p(A | T, Q, C) = \prod_{j=1}^{|A|} p_LM(a_j | T, Q, C, a_{<j}),

with c_i one step of the total |C| reasoning steps.
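To make the role of T and C concrete, the following sketch (with a hypothetical exemplar) shows how a few-shot CoT prompt is assembled: K exemplars of (Q, C, A) triples followed by the test question Q, with the LM left to generate C and A.

```python
# A minimal sketch (hypothetical data) of assembling a few-shot CoT prompt T:
# exemplars of (question, chain, answer) triples, then the test question.
def build_cot_prompt(exemplars, question):
    """exemplars: list of (question, chain, answer) triples; returns the prompt string."""
    parts = [
        f"Q: {q}\nA: {c} The answer is {a}."
        for q, c, a in exemplars
    ]
    parts.append(f"Q: {question}\nA:")  # the LM continues from here with C and A
    return "\n\n".join(parts)

prompt = build_cot_prompt(
    [("Roger has 5 balls and buys 2 cans of 3 balls each. How many balls now?",
      "He starts with 5. Two cans of 3 is 6. 5 + 6 = 11.", "11")],
    "A baker fills 3 trays with 4 rolls each. How many rolls?",
)
```

The exemplar wording and "The answer is" template are illustrative conventions, not a fixed format required by any model.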
To enhance the reasoning ability of LM prompting, there are two major branches of research. The first one focuses on optimizing the reasoning strategy with prompting as shown in Figure 2, including prompt engineering ( §3.1.1), process optimization ( §3.1.2) and external engine ( §3.1.3).
For prompt engineering ( §3.1.1), many methods try to improve the quality of prompt T , and we call those works single-stage methods, while others append c i into the context of (T , Q) at each reasoning stage or design a specific T c i for each c i , and we regard those as multi-stage methods. Note that one stage here refers to one input-output process. For process optimization ( §3.1.2), the simplest way is to bring in an optimizer with parameters θ to calibrate C when generating A, and we call those works self-optimization methods. Some other methods obtain multiple reasoning processes and assemble them to get the final answer; we regard those works as ensemble-optimization methods. Moreover, the overall optimization process can be iteratively integrated with fine-tuning p LM on generated triplets (Q, C, A), which we regard as iterative-optimization methods. Besides, some works leverage external reasoning engines ( §3.1.3) to produce T , to directly execute C, or to reason by implanting tool API calls in C.
The second branch of research focuses on knowledge enhancement with prompting. Note that the rich implicit "modeledge" (Han et al., 2021) in LMs can generate knowledge or rationales as a knowledge-informed prompt T ( §3.2.1). Meanwhile, explicit knowledge in external resources can also be retrieved and leveraged as knowledgeable prompts to enhance reasoning ( §3.2.2).

Taxonomy of Methods
In this paper, we survey existing reasoning methods with LM prompting, categorizing them as Strategy Enhanced Reasoning ( §3.1) and Knowledge Enhanced Reasoning ( §3.2). As shown in Figure 2, we further refine them according to the distinctive features of different methods.

Figure 3: Prompt Engineering ( §3.1.1) of Strategy Enhanced Reasoning. In each stage, a question (Q, below the dotted line) prompted with several exemplars (above the dotted line) containing reasoning steps (C) is fed into the LM. The outputs are reasoning steps and the answer (A).

Strategy Enhanced Reasoning
The primary purpose of this line of work is to design a better reasoning strategy, concretely embodied in prompt engineering ( §3.1.1), process optimization ( §3.1.2) and external engine ( §3.1.3).

Prompt Engineering
One intuitive approach to improving reasoning with prompting is prompt engineering. As shown in Figure 3, we divide this sort of method into single-stage and multi-stage prompts based on the number of prompting stages.
Single-Stage. Early works leverage template-based prompts for reasoning in NLP. Leveraging the strong in-context learning ability of large LMs (Brown et al., 2020), CoT prompting adds a series of intermediate reasoning steps into the exemplars of a few-shot prompt to induce large LMs to generate a reasoning process before answering. Experiments demonstrate that large LMs exhibit impressive reasoning abilities with CoT prompting. In spite of the large improvement brought by CoT prompting, in-context learning is greatly sensitive to the selection of exemplars, and even a tiny change may cause a large drop in model performance (Webson and Pavlick, 2022). Hence, the quality of exemplars appears to be particularly important. One line of work indicates that prompts with higher reasoning complexity, e.g., with more reasoning steps, achieve better performance on math problems. Another explores the impact of the diversity of exemplars in the prompt: through clustering, it obtains a representative question set as a prompt. By placing more explicit explanations and natural language instructions into the prompt, ambiguity can be relieved for LMs facing out-of-distribution (OOD) algorithmic problems. The above works show that LMs can be outstanding few-shot reasoners. Surprisingly, LMs are also zero-shot reasoners that need no extra exemplars: by merely appending "Let's think step by step!", LMs can consciously generate reasoning steps. Another intriguing phenomenon is that when prompted with "The person giving you this problem is Yann LeCun, who is really dubious of the power of AIs like you.", GPT-4 (OpenAI, 2023) can successfully solve the hard Yann LeCun's gears problem on its own, which it previously failed to do.
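The zero-shot recipe above is commonly run as two passes: first elicit a reasoning chain with the trigger phrase, then append an answer-extraction phrase. The sketch below follows that common recipe; the exact trigger and extraction wordings are the illustrative ones, not a fixed API of any model.

```python
# Zero-shot CoT sketch: no exemplars, just trigger phrases, run in two passes.
REASONING_TRIGGER = "Let's think step by step."
ANSWER_TRIGGER = "Therefore, the answer is"

def zero_shot_cot_first_pass(question):
    """Pass 1: ask the LM to produce a reasoning chain."""
    return f"Q: {question}\nA: {REASONING_TRIGGER}"

def zero_shot_cot_second_pass(question, generated_chain):
    """Pass 2: feed the chain back and ask for the final answer."""
    return (f"Q: {question}\nA: {REASONING_TRIGGER} {generated_chain}\n"
            f"{ANSWER_TRIGGER}")
```

A usage round-trip simply feeds the pass-1 completion into `zero_shot_cot_second_pass` and reads the tokens after the answer trigger.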
Multi-Stage. When humans reason, it is usually challenging to come up with the whole reasoning process in one stroke. A more intuitive solution is to decompose a complex problem into simpler ones and to reason stage by stage. Similarly, this series of works aims to transform one-stage prompting (a single input-output pass) into multi-stage prompting (multiple input-output passes); Press et al. (2022) is one representative work in this direction.

Figure 4: Process Optimization ( §3.1.2) of Strategy Enhanced Reasoning. Self-Optimization applies an optimizer module to calibrate a single reasoning process. Ensemble-Optimization assembles multiple reasoning processes to calibrate the answer. Iterative-Optimization calibrates reasoning processes by iteratively fine-tuning the LM.

Process Optimization
Natural language rationales (Ling et al., 2017a), also called reasoning processes in CoT, play a vital role in CoT prompting (Ye and Durrett, 2022; Lampinen et al., 2022). Both the consistency of the reasoning process and the continuity between reasoning steps affect the accuracy of final answers. Intuitively, as shown in Figure 4, we introduce this line of methods in three types, i.e., self, ensemble, and iterative optimization.

Self-Optimization. Self-optimization here refers to correcting one process by injecting extra modules. To mitigate the influence of unreliable rationales, Ye and Durrett (2022) utilizes a calibrator to tune the probabilities of a prediction based on a score that reflects the factuality of a rationale. During free-text rationale generation, another approach fine-tunes a sequence-to-sequence model as a filter to predict whether the rationale is acceptable.
Ensemble-Optimization. Due to the limitation of a single reasoning path, the following works rely on ensemble calibration among multiple processes. Self-consistency introduces sampling strategies (Ackley et al., 1985; Fan et al., 2018) commonly used in natural language generation to obtain multiple reasoning processes and selects the most consistent answer by majority vote. Motivated by the observation that when a reasoning process reaches a wrong answer, not all steps are responsible for the final incorrectness, a step-aware voting verifier has been proposed to score each reasoning path. When a disoriented majority of processes overwhelms a reasonable minority, the step-aware voting verifier can alleviate the limitation of the vanilla majority vote. Besides, it is empirically observed that decoder sampling in the output space is the key to robustly improving performance, given the brittleness of manual prompt engineering.
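The self-consistency idea can be sketched in a few lines: sample several reasoning chains (here, canned strings standing in for LM samples), extract each final answer with a heuristic, and take a majority vote. The extractor below is an assumption for illustration; real systems use task-specific parsers.

```python
import re
from collections import Counter

def extract_final_number(completion):
    """Heuristic answer extractor: the last integer in the completion, if any."""
    numbers = re.findall(r"-?\d+", completion)
    return numbers[-1] if numbers else None

def majority_vote_answer(completions, extract=extract_final_number):
    """Majority vote over the extracted answers of several sampled chains."""
    answers = [a for a in (extract(c) for c in completions) if a is not None]
    return Counter(answers).most_common(1)[0][0] if answers else None

samples = [
    "5 + 2 * 3 = 11. The answer is 11.",
    "Two cans hold 6 balls, so 5 + 6 = 11.",
    "5 + 2 = 7, times 3 is 21. The answer is 21.",  # a wrong chain, outvoted
]
print(majority_vote_answer(samples))  # 11
```

This is why sampling diversity matters: a single greedy decode would commit to one chain, while the vote lets correct chains outnumber occasional faulty ones.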
Iterative-Optimization. While LMs can achieve excellent performance with few-shot or zero-shot prompts, another paradigm is to calibrate reasoning processes iteratively via LM fine-tuning. Specifically, iterative-optimization-based methods repeat the process of prompting LMs to generate reasoning processes and using the instances with generated reasoning processes to fine-tune the LMs themselves. One approach starts from a small set of exemplars to push LMs to produce reasoning steps and answers on their own; questions and reasoning steps with correct answers are added directly to the dataset for fine-tuning, while incorrect ones are fed into the model again with a hint that reveals the correct answer. A related approach does not need gold labels during self-teaching: it generates multiple reasoning processes and fine-tunes on the most consistent self-generated answers. Recent works also uncover the emergent ability of LLMs to self-reflect, continuously correcting reasoning chains through iterative self-reflection.
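The generate-filter-finetune loop described above can be sketched as follows. Every name here (`generate_rationale`, `fine_tune`, the toy model) is a hypothetical stand-in for a real LM call; this is a sketch of the control flow, not any library's API.

```python
# Toy sketch of iterative optimization (in the spirit of STaR-style self-teaching).
def star_style_training(model, dataset, generate_rationale, fine_tune, rounds=2):
    for _ in range(rounds):
        kept = []
        for question, gold_answer in dataset:
            chain, answer = generate_rationale(model, question)
            if answer != gold_answer:
                # rationalization: retry with the correct answer given as a hint
                chain, answer = generate_rationale(model, question, hint=gold_answer)
            if answer == gold_answer:
                kept.append((question, chain, answer))  # keep (Q, C, A) triples
        model = fine_tune(model, kept)  # fine-tune on self-generated chains
    return model
```

The key design choice is the filter: only chains ending in the gold answer (or, label-free, the most self-consistent answer) enter the next fine-tuning round.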

External Engine
When reasoning with LM prompting, models need both semantic understanding (e.g., of questions) and complex reasoning (e.g., by generating reasoning processes); however, it is hard to excel at both at once (Hendrycks et al., 2021; Nogueira et al., 2021; Lewkowycz et al., 2022). To overcome this obstacle, external reasoning engines lend a helping hand to LMs (see Figure 5).
Physical Simulator. Given a physical reasoning question,  utilizes a computational physics engine (Todorov et al., 2012) to simulate the physical process. The simulation results are treated as prompts to help LMs reason, making up for the lack of physical knowledge in LMs.
Code Interpreter. With the emergence of LMs of code (Chen et al., 2021), combining LMs and code to tackle specific tasks has recently sprung up (Cheng et al., 2022; Wu et al., 2022b). Note that programs offer advantages in robustness and interpretability and can better represent complex structures and carry out complex calculations. Intuitively, one approach reframes structured commonsense reasoning tasks as code generation tasks, replacing natural language with Python class code to represent structured graphs in both few-shot prompts and LM outputs. Gao et al. (2022) offloads solution steps from LMs to a programmatic runtime, leaving learning as the only task for the LMs: in few-shot prompts and LM outputs, the reasoning processes are replaced by a mixture of natural and programming language, where natural language serves as comments to aid the generation of the program. Similarly, program of thoughts (PoT) prompting disentangles computation from reasoning; the main difference is that it also puts forward a zero-shot format of PoT prompting.
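The essence of these approaches is that the LM emits executable code as its "reasoning", and the final answer comes from running the program rather than from the LM's own token-level arithmetic. The program text below is a canned example standing in for an LM completion.

```python
# Illustration of executing an LM-emitted program: the interpreter, not the LM,
# performs the computation that yields the answer.
generated_program = """
tennis_balls = 5
cans = 2
balls_per_can = 3
answer = tennis_balls + cans * balls_per_can
"""

namespace = {}
exec(generated_program, namespace)  # an external interpreter does the computation
print(namespace["answer"])  # 11
```

This division of labor is what buys robustness: even if the chain is long, the arithmetic cannot drift, because it is computed rather than generated.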
Tool Learning. Despite possessing remarkable generation and decision-making capabilities, LLMs struggle with some basic functionalities where much simpler and smaller tools excel (Qin et al., 2023). Building on this insight, one line of work trains models by integrating the usage of various tools, including calculators, Q&A systems, search engines, etc. By implanting tool API calls into the text generation process, the model's capabilities are significantly expanded. Another approach designs tool use for LLMs as an automated schema, eliminating the need for hand-crafting task-specific demonstrations and carefully scripted interleaving of model generations with tool use. A further approach harnesses the powerful decision-making abilities of LLMs, enabling them to combine various external tools to tackle compositional reasoning tasks.
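Mechanically, "implanting tool API calls" means the decoder emits call markers that are intercepted and replaced by real tool results before generation continues. A hedged sketch, where the `[Calculator(...)]` marker syntax is illustrative rather than the exact format of any particular model:

```python
import re

def execute_tool_calls(text):
    """Replace [Calculator(expr)]-style markers with computed results."""
    def run_calculator(match):
        expression = match.group(1)
        # toy calculator: evaluate arithmetic with builtins disabled
        return str(eval(expression, {"__builtins__": {}}))
    return re.sub(r"\[Calculator\(([^)]*)\)\]", run_calculator, text)

out = execute_tool_calls("Roger now has [Calculator(5 + 2 * 3)] tennis balls.")
print(out)  # Roger now has 11 tennis balls.
```

Real systems pause decoding at the marker, call the tool, splice the result into the context, and resume generation; this post-hoc substitution shows only the splicing step.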

Knowledge Enhanced Reasoning
As noted in Manning (2022), knowledge plays a vital role in AI reasoning systems. Knowledge enhanced methods aim to prompt LMs with implicit ( §3.2.1) or explicit ( §3.2.2) knowledge to assist in reasoning (see Figure 6).

Implicit Knowledge
Researchers have shown that LMs contain considerable implicit knowledge (Davison et al., 2019; Petroni et al., 2019; Jiang et al., 2020). The following works try to induce such "modeledge" as knowledge-informed prompts for reasoning. One approach applies GPT-3 (Brown et al., 2020) with few-shot prompting to generate knowledge and uses it to prompt the downstream LM. Another draws support from reinforcement learning (Schulman et al., 2017) to further calibrate the knowledge. Different from approaches that use few-shot prompting only in the knowledge generation stage, a two-stage generative prompting method additionally includes answer generation prompts. Other works (Shridhar et al., 2022; Magister et al., 2022) follow knowledge distillation, generating reasoning samples by prompting a larger LM to teach smaller LMs.
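The two-stage pattern can be made concrete with a minimal sketch: stage one asks the LM for knowledge statements about the question, and stage two prepends them to the answering prompt. The LM call is stubbed out and all prompt wordings are illustrative assumptions.

```python
# Minimal two-stage sketch of generated knowledge prompting.
def knowledge_generation_prompt(question):
    """Stage 1: elicit knowledge statements from the LM."""
    return (f"Generate some knowledge about the question.\n"
            f"Question: {question}\nKnowledge:")

def knowledge_informed_prompt(question, knowledge_statements):
    """Stage 2: prepend the generated knowledge to the answering prompt."""
    knowledge = "\n".join(knowledge_statements)
    return f"{knowledge}\nQuestion: {question}\nAnswer:"

stage1 = knowledge_generation_prompt("Can a fish survive out of water?")
stage2 = knowledge_informed_prompt(
    "Can a fish survive out of water?",
    ["Fish breathe through gills, which require water."],  # stubbed stage-1 output
)
```

In practice several knowledge statements are sampled per question and the answer is aggregated across them, but the prompt plumbing is no more than this.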

Explicit Knowledge
Although large LMs have shown strong generation ability, they still tend to hallucinate facts (Rohrbach et al., 2018) and generate inconsistent knowledge. Recent works show that retrieving prompts for in-context learning is an effective means to achieve good performance (Liu et al., 2022a; Rubin et al., 2022).
Due to the instability of common retrieval approaches in measuring the similarity of structured information, one method proposes dynamic prompt retrieval based on a policy gradient strategy, without brute-force searching. He et al. retrieve knowledge based on the reasoning steps of CoT to provide more faithful explanations.

Comparison of Language Models

Figure 7 compares the performance of LMs at different scales; similar results on commonsense reasoning benchmarks are shown in Appendix A.3. It has been systematically demonstrated that few-shot prompting performs better on almost all tasks as model scale increases, which can be explained by the fact that LMs with larger model sizes contain more implicit knowledge for reasoning (Liang et al., 2022b). Moreover, CoT prompting produces much greater gains, with PaLM-540B showing the greatest improvements, as depicted in Figures 7 and 9. However, when the model scale falls below 100B, CoT prompting yields no performance gain and may even be detrimental. Thus, CoT prompting elicits an ability emergent with model scale. One possibility is that when the stored knowledge reaches a certain level, the reasoning ability of LMs undergoes a qualitative change from quantitative accumulation, leading to the emergence of such capabilities. Additionally, Srivastava et al. (2022) points out that such ability generally occurs in multi-step tasks, which may be explained by the fact that evaluation focuses only on the final answer, ignoring the improvement of intermediate steps brought by increased model scale when the model is not yet large enough. Another intriguing observation, also depicted in Figures 7 and 9, is that PaLM-62B (Chowdhery et al., 2022) even performs better than LaMDA-137B (Thoppilan et al., 2022), possibly because it was trained on a higher-quality corpus. This phenomenon leads us to speculate that such emergent ability is not solely determined by model parameter scale but is also related to the quality of pre-training data.
Notably, Figures 7 and 9 also illustrate that, at the same parameter scale, Codex (Chen et al., 2021) significantly outperforms GPT-3, even though the major difference between them is the training corpus (Codex is a GPT-3 variant trained on code). This phenomenon can also be observed in recent works, indicating that pre-training on code not only enables code generation/understanding but may also trigger reasoning ability with CoT. The exact cause is still elusive, but one intuition is that code is a form of text more similar to reasoning: procedure-oriented programming is analogous to solving problems step by step, and object-oriented programming is analogous to decomposing complex tasks into simpler ones. In addition, Prystawski and Goodman (2023) finds that CoT is beneficial only when the training data exhibits local structure. Because it reasons by navigating through chains of intermediate variables, CoT excels at deducing the relationship between two variables that have seldom been encountered in the same context; however, it may not outperform simple statistical estimators when reasoning about variables that frequently co-occur in the training data.

Comparison of Prompts

Table 1 compares different methods of reasoning with LM prompting. There are three main sources of prompts for existing methods: 1) Manual construction is suitable for template-based prompts and few-shot prompting where the prompt is uncomplicated. 2) LM-generated prompts make up for the shortcomings of manual construction; they can customize specific rationales for each question and provide sufficient knowledge with the prompt for fine-tuning or self-training. 3) Retrieval-based prompts often rely on well-annotated external resources (e.g., Wikipedia) and incur expensive information retrieval, but they can alleviate the instability of generation.

We observe that no matter how the prompt is produced, CoT prompting only works on large LMs; smaller LMs work by fine-tuning with rationales. Combined with the empirical conclusion in Ye and Durrett (2022), these phenomena reveal that high-quality reasoning rationales in the input context are key to reasoning with LM prompting. Although some works have attempted to explain the in-context learning ability of large LMs (Xie et al., 2022; Akyürek et al., 2022), the reason why CoT prompting succeeds is still intriguing to the community and not well understood. One possible hypothesis is that CoT is a side product of training on code that can be unlocked by prompting. Note that exemplars containing CoT in few-shot prompts can be viewed as a kind of instruction that arouses the reasoning ability hidden in large LMs. Chung et al. (2022) verifies a similar result, using CoT in instruction fine-tuning to further advance model performance. In fact, in-context learning can be seen as an intermediate state in the evolution from general prompts to human-readable instructions. Following this trend, prompts may grow into an essential interface of human-machine interaction.

Taxonomy of Benchmarks and Tasks
In this section, we give a brief overview of reasoning benchmarks and tasks. More details of datasets, as well as reasoning with ChatGPT, can be found in Appendix A.4 and A.5.
Arithmetic Reasoning. Arithmetic reasoning, also referred to as mathematical reasoning, is the ability to perform reasoning on math word problems (MWP). Early works on this task (Hosseini et al., 2014; Kushman et al., 2014; Koncel-Kedziorski et al., 2015) focus on relatively small datasets consisting of grade-school single-step or multi-step MWP. Later works increase in complexity, difficulty, and scale. Most recently, existing datasets have been extended into a unified benchmark concerning mathematical abilities, language diversity, and external knowledge.

Commonsense Reasoning. Commonsense knowledge and commonsense reasoning are among the major issues in machine intelligence (Storks et al., 2019; Bhargava and Ng, 2022). When answering a question, people often draw upon their rich world knowledge. For LMs, the major challenge of performing commonsense reasoning lies in how to involve physical and human interactions under the presumption of general background knowledge (Bhargava and Ng, 2022). Many benchmark datasets and tasks (Clark et al., 2018; Mihaylov et al., 2018; Talmor et al., 2019; Bisk et al., 2020; Geva et al., 2021) have been designed, and the most widely used benchmark today is CommonsenseQA (Talmor et al., 2019).
Logical Reasoning. Common forms of logical reasoning include deductive reasoning, inductive reasoning, and abductive reasoning (Sinha et al., 2019; Bao et al., 2022; Young et al., 2022; Bao et al., 2023). Deductive reasoning goes from general information to specific conclusions. Typical datasets in this field consist of synthetic rule bases plus derived conclusions (Clark et al., 2020; Tafjord et al., 2021); Dalvi et al. (2021) creatively proposes a dataset containing multi-step entailment trees together with rules and conclusions. As opposed to deductive reasoning, inductive reasoning aims to draw conclusions by going from specific observations to general principles (Yang et al., 2022c).
Symbolic Reasoning. Symbolic reasoning here refers only to a narrow collection of simple tasks that test a diverse set of symbolic manipulation functions, rather than symbolic AI, which is a more general concept. Typical symbolic reasoning tasks include last letter concatenation, reverse list, and coin flip.
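These tasks are trivial to solve programmatically, which is precisely what makes them clean probes of LM reasoning: reference implementations like the ones below can score model outputs exactly.

```python
# Reference implementations (not LM prompts) of the symbolic tasks above,
# useful for checking model outputs programmatically.
def last_letter_concatenation(name):
    """Concatenate the last letter of each word, e.g. a person's full name."""
    return "".join(word[-1] for word in name.split())

def reverse_list(items):
    return items[::-1]

def coin_flip(starts_heads_up, flip_decisions):
    """Each True in flip_decisions flips the coin; return the final state."""
    state = starts_heads_up
    for flips_it in flip_decisions:
        if flips_it:
            state = not state
    return "heads up" if state else "tails down"

print(last_letter_concatenation("Elon Musk"))  # nk
print(coin_flip(True, [True, False, True]))    # heads up
```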
Multimodal Reasoning. Beyond the textual modality, humans utilize information available across different modalities when performing reasoning. To this end, multimodal reasoning benchmarks (Zellers et al., 2019; Park et al., 2020; Dong et al., 2022) have been presented to narrow this gap. Recently, ScienceQA, a large-scale multimodal multiple-choice dataset, was presented, consisting of diverse science-topic questions with corresponding answers and explanations; another work proposes a new task of multimodal analogical reasoning over knowledge graphs.

Resources
Thanks to the open-source spirit of the NLP community, numerous resources are publicly available alongside papers for researchers to experiment with. ThoughtSource is a central, open resource and community around data and tools related to CoT reasoning in large language models. The LangChain library is designed to help developers build applications that combine LLMs with other sources of computation or knowledge. λprompt allows for building complete large LM-based prompt machines, including ones that self-edit to correct and even self-write their own execution code. Recently, Ou et al. (2023) developed EasyInstruct, a Python package for instructing LLMs like GPT-3.

Future Directions

Theoretical Principle of Reasoning. LMs have been demonstrated to have emergent zero-shot learning and reasoning abilities. To uncover the mystery of such success, many researchers have empirically explored the role of in-context learning (Ye and Durrett, 2022; Liu et al., 2022a) and rationales (Lampinen et al., 2022). Another line of work investigates the architecture of Transformers via knowledge neurons (Dai et al., 2022) or skill neurons. More recent works demonstrate that pre-trained LMs of code handle structured commonsense reasoning and prediction better than LMs of natural language, even when the downstream task does not involve source code at all. However, code-based pre-training (or re-structured pre-training (Yuan and Liu, 2022)) still has limitations, since it has to utilize off-the-shelf structure (e.g., an existing aligned corpus, or one built from scratch via syntax trees or AMR (Banarescu et al., 2013)) to reformulate plain texts. Thus, the truth may be close, and we argue that it is beneficial to study theoretical principles that provide a transparent view of reasoning with LM prompting and further decipher the dark matter of intelligence by highlighting the counterintuitive continuum across language, knowledge, and reasoning.
Note that reasoning in NLP has the potential advantage of complex problem-solving and should better exploit insights from other disciplines (e.g., Theory of Mind (Sap et al., 2022; Moghaddam and Honey, 2023; Shapira et al., 2023)).
Efficient Reasoning. Notably, existing methods mainly depend on large LMs, which may consume substantial computing resources. For practicality, it is necessary to study reasoning with small LMs or to develop efficient reasoning methodologies that pay attention to carbon emissions and energy usage during model training and inference (Xu et al., 2021). One feasible way may be to develop models that generalize across a range of evaluation scenarios, such as Flan-T5 (Chung et al., 2022), which is fine-tuned both with and without exemplars (i.e., zero-shot and few-shot) and with and without CoT. Recently, an intuitive approach has been proposed to transfer the reasoning capabilities of large LMs to smaller LMs via knowledge distillation (Shridhar et al., 2022; Magister et al., 2022). Other promising directions include retrieval augmentation (Li et al., 2022a), model editing (Cao et al., 2021; Mitchell et al., 2022a,b; Cheng et al., 2023), delta-tuning (Mao et al., 2022; Ding et al., 2022), etc.
Robust, Faithful and Interpretable Reasoning. Robustness, faithfulness, and interpretability have long been pursued in deep learning, especially in tasks that require strong logic, like reasoning. Shaikh et al. (2022) demonstrates that zero-shot CoT can produce undesirable toxicity and biases, underscoring the necessity of robust, faithful, and interpretable reasoning. Creswell and Shanahan (2022) leverages a selection-inference (Creswell et al., 2022) multi-stage architecture for faithful reasoning, but there is still a lack of interpretability within each stage. Code-based works achieve robustness and interpretability to some extent, but they rely on an external engine. There is still a long way to go to achieve true robustness, faithfulness, and interpretability with LMs. Fortunately, Dohan et al. (2022) provides a new idea of utilizing probabilistic programs to tackle various reasoning problems. Other solutions may be neural-symbolic approaches (Du et al., 2021; Ouyang et al., 2021; Feng et al., 2022) or human feedback (Ouyang et al., 2022).
Multimodal (Interactive) Reasoning. Textual reasoning is restricted to what can be expressed through natural language. A more promising direction is multimodal reasoning, given the information diversity of the real world in which humans reason. One work generates CoT when dealing with a multimodal dataset; however, it simply extracts textual descriptions from images, so it remains, in effect, a textual reasoning task. Intuitively, it is beneficial to integrate multimodal information such as images, audio, and video into reasoning processes and to design a unified multimodal CoT. Apart from unified multimodal models, it is also promising to chain models of different modalities to conduct interactive reasoning. Besides, Sap et al. (2022) shows that one of today's largest language models (GPT-3 (Brown et al., 2020)) lacks the skill to reason about the mental states and reactions of all people involved. Thus, interactive reasoning methodologies should draw inspiration from other domains (e.g., Cognitive Science (Hollenstein et al., 2019), Social Intelligence (Krishna et al., 2022)), which may offer guidance for reasoning in NLP, since merely increasing the scale of LMs is likely not the most effective way to create AI systems.
Generalizable (True) Reasoning. Generalization is one of the most significant markers of models attaining true reasoning abilities. Given a reasoning task, we hope LMs can handle not only the problem itself but also a group of similar reasoning tasks (not seen during training). Anil et al. (2022) explores the OOD problem on the length of reasoning questions, but true generalization is still far from satisfactory. Meanwhile, Kejriwal et al. (2022) highlights that more comprehensive evaluation methods grounded in theory (e.g., naive physics (Gardin and Meltzer, 1989) and commonsense psychology (Gordon and Hobbs, 2004)) should be proposed. We argue that generalizable reasoning may be closely related to analogical reasoning (Webb et al., 2022), causal reasoning (Feder et al., 2022), compositional reasoning, etc.

Conclusion and Vision
In this paper, we provide a review of reasoning with language model prompting, including comprehensive comparisons and several research directions. In the future, we envision a more potent synergy between methodologies from NLP and other domains and hope sophisticated and efficient LM prompting models will increasingly contribute to improving reasoning performance.

Limitations
In this study, we provide a survey of reasoning with language model prompting. We discuss related surveys in Appendix A.1 and will continue adding more related approaches with more detailed analysis. Despite our best efforts, some limitations may remain in this paper.
References & Methods. Due to the page limit, we may miss some important references and cannot cover all the technical details. We mainly review cutting-edge methods from the past two years (mostly 2022) in §3, drawn mainly from ACL, EMNLP, NAACL, NeurIPS, ICLR, arXiv, etc., and we will continue to follow and supplement the latest works.
Benchmarks. Most of the reasoning benchmarks mentioned in §5 are gathered and categorized from the experimental sections of mainstream works. The definition and boundary of each task may not be accurate enough. Besides, our work may miss some kinds of reasoning tasks, such as reasoning with generics (Allaway et al., 2022), default inheritance reasoning (Brewka, 1987), and non-monotonic reasoning (Ginsberg, 1987) in NLP, and we will try our best to fill this gap.
Empirical Conclusions. We give detailed comparisons and discussions of language models and prompts in §4 and list some promising future directions in §6. All the conclusions are proposed and speculated upon based on empirical analysis of existing works, which may not be macroscopic enough. As the field evolves, we will update the latest opinions in a timely manner.

A.1 Related Surveys

Surveys of pre-trained models (Qiu et al., 2020; Du et al., 2022) are also related to our work. Unlike those surveys, in this paper we conduct a review of reasoning with LM prompting, hoping to systematically understand the methodologies, compare different methods, and inspire new ideas.

A.2 Taxonomy of Methods and Tasks
We list the complete taxonomy of reasoning with language model prompting from methods and tasks in Figure 8.

A.3 Performance Comparison of LMs with Different Scales
To show the generalization of discussions in §4.1 on different reasoning tasks, we additionally show the performance comparison of LMs with different scales on CommonsenseQA (Talmor et al., 2019) of commonsense reasoning in Figure 9.

A.4 Detailed Information of Reasoning Benchmarks
In § 5, we give a brief overview of benchmarks and tasks requiring various reasoning skills. We list more benchmarks and show their key statistics in Table 2. Apart from the specific reasoning tasks mentioned in § 5, there are benchmarks (Lake and Baroni, 2017; Srivastava et al., 2022; Yu et al., 2022) that evaluate more diverse and generalized reasoning capabilities, which can also be included in the category of reasoning tasks.
Table 2 (excerpt): SingleEq (Koncel-Kedziorski et al., 2015), 508 examples; MultiArith, 600; Dolphin18K (Huang et al., 2016), 18,460; MAWPS (Koncel-Kedziorski et al., 2016), 1,921; Math23k (Wang et al., 2017), 23,161; AQUA-RAT (Ling et al., 2017b), 97…

A.5 Reasoning with ChatGPT

Recently, OpenAI (2022) developed ChatGPT, an AI chatbot system that has attracted a tremendous number of users. ChatGPT is trained on a massive text dataset and is able to generate human-like responses to a wide variety of prompts; a key technique behind it is Reinforcement Learning from Human Feedback (Ouyang et al., 2022). The backbone of ChatGPT is a model from the GPT-3.5 large LM series (https://beta.openai.com/docs/model-index-for-researchers). To probe the reasoning ability of large LMs more realistically, we conduct some case tests on ChatGPT. Concretely, we pick one example each from GSM8K (Cobbe et al., 2021), CommonsenseQA (Talmor et al., 2019), and Last Letter Concatenation, which respectively represent arithmetic reasoning, commonsense reasoning, and symbolic reasoning. We then test each selected example on ChatGPT directly. Results are shown in Figures 10-12. Figure 10 shows that, given a math problem from GSM8K (Cobbe et al., 2021), ChatGPT outputs a reasoning process and a correct answer without in-context exemplars, demonstrating its powerful arithmetic reasoning ability. The reasoning process has the same format as the gold label in GSM8K, suggesting that GSM8K may be contained in the training corpus of ChatGPT.
In Figure 11, we test ChatGPT on an example from CommonsenseQA (Talmor et al., 2019). It not only gives the correct answer but additionally details why each option is right or wrong, which does not appear in the gold label of the dataset. This demonstrates the strong commonsense reasoning ability of ChatGPT. Figure 12 shows a case from Last Letter Concatenation. We observe that although ChatGPT gives a detailed and accurate description of last letter concatenation, it fails to answer the given question, showing that its symbolic reasoning capability is not as strong as the former two.