Towards Reasoning in Large Language Models: A Survey

Reasoning is a fundamental aspect of human intelligence that plays a crucial role in activities such as problem solving, decision making, and critical thinking. In recent years, large language models (LLMs) have made significant progress in natural language processing, and it has been observed that these models may exhibit reasoning abilities when they are sufficiently large. However, it is not yet clear to what extent LLMs are capable of reasoning. This paper provides a comprehensive overview of the current state of knowledge on reasoning in LLMs, including techniques for improving and eliciting reasoning in these models, methods and benchmarks for evaluating reasoning abilities, findings and implications of previous research in this field, and suggestions on future directions. Our aim is to provide a detailed and up-to-date review of this topic and stimulate meaningful discussion and future work.


Introduction
Reasoning is a cognitive process that involves using evidence, arguments, and logic to arrive at conclusions or make judgments. It plays a central role in many intellectual activities, such as problem solving, decision making, and critical thinking. The study of reasoning is important in fields like psychology (Wason and Johnson-Laird, 1972), philosophy (Passmore, 1961), and computer science (Huth and Ryan, 2004), as it helps individuals make decisions, solve problems, and think critically.
Recently, large language models (LLMs) (Brown et al., 2020; Chowdhery et al., 2022; Chung et al., 2022; OpenAI, 2022, inter alia) such as ChatGPT have made significant advancements in natural language processing and related fields. It has been shown that these models exhibit emergent behaviors, including the ability to "reason", when they are large enough (Wei et al., 2022a). For example, by providing the models with "chain of thoughts", i.e., reasoning exemplars, or a simple prompt "Let's think step by step", these models are able to answer questions with explicit reasoning steps (Wei et al., 2022b; Kojima et al., 2022), e.g., "all whales are mammals, all mammals have kidneys; therefore, all whales have kidneys." This has sparked considerable interest in the community since reasoning ability is a hallmark of human intelligence that is frequently considered missing in current artificial intelligence systems (Marcus, 2020; Russin et al., 2020; Mitchell, 2021; Bommasani et al., 2021).
However, despite the strong performance of LLMs on certain reasoning tasks, it remains unclear whether LLMs are actually reasoning and to what extent they are capable of reasoning. For example, Kojima et al. (2022) claim that "LLMs are decent zero-shot reasoners (p.1)", while Valmeekam et al. (2022) conclude that "LLMs are still far from achieving acceptable performance on common planning/reasoning tasks which pose no issues for humans to do (p.2)." This limitation is also stated by Wei et al. (2022b): "we qualify that although chain of thought emulates the thought processes of human reasoners, this does not answer whether the neural network is actually reasoning (p.9)." Therefore, in this paper, we aim to provide a comprehensive overview and engage in an insightful discussion on the current state of knowledge on this fast-evolving topic. We initiate our exploration with a clarification of the concept of reasoning (§2). Subsequently, we turn our attention to the techniques for enhancing/eliciting reasoning in LLMs (§3), the methods and benchmarks for evaluating reasoning in LLMs (§4), and the key findings and implications in this field (§5). Finally, we reflect on and discuss the current state of the field (§6).

What is Reasoning?
Reasoning is the process of thinking about something in a logical and systematic way, using evidence and past experiences to reach a conclusion or make a decision (Wason and Johnson-Laird, 1972; Wason, 1968; Galotti, 1989; Fagin et al., 2004; McHugh and Way, 2018). Reasoning involves making inferences, evaluating arguments, and drawing logical conclusions based on available information. Although "reasoning" is a term that is commonly used in literature and daily life, it is also an abstract concept that can refer to many things. To help the reader better understand this concept, we summarize several main categories of reasoning that are commonly recognized:
Deductive reasoning. Deductive reasoning is a type of reasoning in which a conclusion is drawn based on the truth of the premises. In deductive reasoning, the conclusion must necessarily follow from the premises, meaning that if the premises are true, the conclusion must also be true. For example:
• Premise: All mammals have kidneys.
• Premise: All whales are mammals.
• Conclusion: All whales have kidneys.
Inductive reasoning. Inductive reasoning is a type of reasoning in which a conclusion is drawn based on observations or evidence. The conclusion is likely to be true based on the available evidence, but it is not necessarily certain. For example:
• Observation: Every time we see a creature with wings, it is a bird.
• Observation: We see a creature with wings.
• Conclusion: The creature is likely to be a bird.
Abductive reasoning. Abductive reasoning is a type of reasoning in which a conclusion is drawn based on the best explanation for a given set of observations. The conclusion is the most likely explanation based on the available evidence, but it is not necessarily certain. For example:
• Observation: The car cannot start and there is a puddle of liquid under the engine.
• Conclusion: The most likely explanation is that the car has a leak in the radiator.
Other types of reasoning include analogical reasoning, which involves making comparisons between two or more things in order to make inferences or arrive at conclusions; causal reasoning, which involves identifying and understanding the causes and effects of events or phenomena; and probabilistic reasoning, which involves making decisions or arriving at conclusions based on the likelihood or probability of certain outcomes.
Formal Reasoning vs Informal Reasoning. Formal reasoning is a systematic and logical process that follows a set of rules and principles, often used in mathematics and logic. Informal reasoning is a less structured approach that relies on intuition, experience, and common sense to draw conclusions and solve problems, and is often used in everyday life. Formal reasoning is more structured and reliable, while informal reasoning is more adaptable and open-ended, but may also be less reliable. We refer the reader to Galotti (1989); Bronkhorst et al. (2020) for a detailed distinction between them.
Reasoning in Language Models. The concept of reasoning in language models has been around for some time, but there is not a clear definition of what it entails. In the literature, the term "reasoning" is often used to refer to informal reasoning, although it is not always explicitly stated that it is informal (Cobbe et al., 2021; Wei et al., 2022b, inter alia). Different forms of reasoning may be used depending on the task, benchmark, or method being used, e.g., deductive reasoning (Cobbe et al., 2021; Creswell et al., 2022; Han et al., 2022b, inter alia), inductive reasoning (Yang et al., 2022; Misra et al., 2022, inter alia) or abductive reasoning (Wiegreffe et al., 2022; Lampinen et al., 2022; Jung et al., 2022, inter alia). In this paper, we encompass various forms of reasoning, with a particular focus on "informal deductive reasoning" in large language models since it is a widely used form in which the conclusion is guaranteed to be true as long as the premises are true.

Towards Reasoning in Large Language Models
Reasoning, particularly multi-step reasoning, is often seen as a weakness in language models and other NLP models (Bommasani et al., 2021; Rae et al., 2021; Valmeekam et al., 2022). Recent research has suggested that reasoning ability may emerge in language models at a certain scale, such as models with over 100 billion parameters (Wei et al., 2022a,b; Cobbe et al., 2021). In this paper, we follow Wei et al. (2022a) in considering reasoning as an ability that is rarely present in small-scale models like GPT-2 (Radford et al., 2019) and BERT (Devlin et al., 2019), and therefore focus on techniques applicable to improving or eliciting "reasoning" in LLMs such as GPT-3 (Brown et al., 2020) and PaLM (Chowdhery et al., 2022).

Fully Supervised Finetuning
Before discussing reasoning in large language models, it is worth mentioning that there is research on eliciting/improving reasoning in small language models through fully supervised finetuning on specific datasets. For example, Rajani et al. (2019) finetune a pretrained GPT model (Radford et al., 2018) to generate rationales that explain model predictions with the built CoS-E dataset, and find that models trained with explanations perform better on commonsense question answering tasks (Talmor et al., 2019). Talmor et al. (2020) train RoBERTa (Liu et al., 2019)
There are two major limitations of fully supervised finetuning. First, it requires a dataset containing explicit reasoning, which can be difficult and time-consuming to create. Additionally, the model is only trained on a specific dataset, which limits its application to a specific domain and may result in the model relying on artifacts in the training data rather than actual reasoning to make predictions.

Prompting & In-Context Learning
Large language models such as GPT-3 (Brown et al., 2020) have demonstrated remarkable few-shot performance across a variety of tasks through in-context learning. These models can be prompted with a question and a few ⟨input, output⟩ exemplars to potentially solve a problem through "reasoning", either implicitly or explicitly. However, research has shown that these models still fall short when it comes to tasks that require multiple steps of reasoning to solve (Bommasani et al., 2021; Rae et al., 2021; Valmeekam et al., 2022). This may be due to a lack of exploration into the full capabilities of these models, as recent studies have suggested.

Chain of Thought and Its Variants
To encourage LLMs to engage in reasoning rather than simply providing answers directly, we may guide LLMs to generate "reasoning" explicitly. One approach for doing this is chain-of-thought prompting, proposed by Wei et al. (2022b). This approach involves providing a few examples of "chain of thought" (CoT), which are intermediate natural language reasoning steps, in the prompt to LLMs (Figure 2). Specifically, in CoT prompting, ⟨input, output⟩ demonstrations are replaced with ⟨input, chain of thought, output⟩ triples, e.g., "[input] Roger has 5 tennis balls. He buys 2 more cans of tennis balls. Each can has 3 tennis balls. How many tennis balls does he have now? [chain of thought] Roger started with 5 balls. 2 cans of 3 tennis balls each is 6 tennis balls. 5 + 6 = 11. [output] The answer is 11." In this way, given a target question, the model learns to generate an explicit rationale before producing the final answer. Experimental results show that this simple idea can improve LLMs' few-shot performance on arithmetic, symbolic, and commonsense reasoning tasks, sometimes to a striking degree.
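As an illustration of the ⟨input, chain of thought, output⟩ format described above, the following minimal sketch assembles a few-shot CoT prompt as a plain string. The `Exemplar` class, the `build_cot_prompt` function, and the "Q:"/"A:" layout are assumptions made for this example, not an interface prescribed by the surveyed work; the prompt would then be sent to whichever LLM is being studied.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Exemplar:
    """One <input, chain of thought, output> demonstration (hypothetical container)."""
    question: str
    chain_of_thought: str
    answer: str


def build_cot_prompt(exemplars: List[Exemplar], target_question: str) -> str:
    """Concatenate the demonstrations and leave the final answer slot open for the model."""
    blocks = [
        f"Q: {ex.question}\nA: {ex.chain_of_thought} The answer is {ex.answer}."
        for ex in exemplars
    ]
    # The target question ends with an empty "A:" so the model continues with a rationale.
    blocks.append(f"Q: {target_question}\nA:")
    return "\n\n".join(blocks)


demo = Exemplar(
    question=(
        "Roger has 5 tennis balls. He buys 2 more cans of tennis balls. "
        "Each can has 3 tennis balls. How many tennis balls does he have now?"
    ),
    chain_of_thought=(
        "Roger started with 5 balls. 2 cans of 3 tennis balls each is "
        "6 tennis balls. 5 + 6 = 11."
    ),
    answer="11",
)
prompt = build_cot_prompt([demo], "A hypothetical target question goes here?")
```

Dropping the `chain_of_thought` field from each block would recover standard ⟨input, output⟩ few-shot prompting, which makes the two settings easy to compare.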
There are several variants of chain-of-thought prompting that have been proposed in the literature, in a different form or to solve a specific problem.
Different Form: Kojima et al. (2022) introduce Zero-shot-CoT, in which LLMs are simply prompted with the phrase "Let's think step by step" after the input, in order to elicit reasoning without the need for few-shot demonstrations. Madaan et al. (2022); Gao et al. (2022); Chen et al. (2022) find that LLMs trained with code, e.g., Codex (Chen et al., 2021), can achieve better performance on reasoning tasks by framing reasoning as code generation. Wang et al. (2022a) propose to iteratively prompt chain of thought. He et al. (2023) attempt to retrieve external knowledge in CoT to improve faithfulness of reasoning.
Specific Problem/Setting: Before chain of thought, Nye et al. (2022) also try to use intermediate computations, named "scratchpads", to improve language models' reasoning performance in both finetuning and few-shot regimes, with a particular focus on programs. Shi et al. (2022) attempt to solve multilingual reasoning tasks with CoT in the native language, CoT in English (regardless of the problem language), and CoT in English (with the problem translated to English). Chen (2022) apply CoT to table-based reasoning, finding that LLMs can achieve strong performance on table tasks with only one exemplar. Prystawski et al. (2022) demonstrate that CoT can improve LLMs' performance on paraphrase selection for metaphors. Lu et al. (2022) apply chain of thought to solve multimodal science questions.

Rationale Engineering
The original version of chain-of-thought prompting, proposed by Wei et al. (2022b), relies on manually crafted examples of intermediate reasoning steps and applies greedy decoding in the generation. Rationale engineering aims to more effectively elicit or utilize reasoning in LLMs. This can be achieved through rationale refinement, which involves creating more effective examples of reasoning steps, or through rationale exploration and rationale verification, which involve exploring and verifying the rationales produced by LLMs. A summary of rationale engineering is illustrated in Figure 2.
Rationale refinement. The choice of exemplars can significantly affect the few-shot performance of LLMs, as demonstrated in research such as Liu et al. (2022b), and the same holds for chain-of-thought prompting. Rationale refinement aims to create and refine rationale examples that are better able to elicit reasoning in LLMs. Fu et al. (2022b) propose complexity-based prompting to create rationales with more reasoning steps. Their experiments show that the performance of LLMs improves with increased rationale complexity. Similarly, Zhou et al. (2022c) propose algorithmic prompting, which suggests that providing more thorough examples of solutions can help improve reasoning performance on some simple math calculations. Zhang et al. (2022b) design Auto-CoT to automatically construct exemplars by partitioning questions from a given dataset into clusters and then using Zero-Shot-CoT (Kojima et al., 2022) to generate the rationale for a representative question from each cluster. The analysis shows that making exemplars diverse is important in prompting LLMs to produce better rationales.
Rationale exploration. In addition to providing better exemplars, we can allow LLMs to fully explore various ways of reasoning to improve their performance on reasoning tasks, named rationale exploration. Based on the idea that complex problems often admit multiple ways of thinking that can lead to their unique correct answer, Wang et al. (2022c) present a decoding strategy called self-consistency to improve upon the traditional greedy decoding used in chain-of-thought prompting. This strategy involves sampling a diverse set of rationales, rather than just the greedy one, and selecting the most consistent answer by marginalizing out the sampled rationales. The idea is also used in Fu et al. (2022b) to vote over the top complex rationales. To further improve performance, Li et al. (2022b) suggest providing different demonstrations for each question by sampling exemplars from an exemplar base, in order to increase the diversity of the sampled rationales.
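The answer-selection step of self-consistency can be sketched as follows. This is a minimal illustration that assumes the sampled outputs are already available as strings, that each ends with a phrase of the form "The answer is N", and that an unweighted majority vote is used to marginalize over rationales; the function names and the answer pattern are hypothetical choices for this example.

```python
import re
from collections import Counter
from typing import List, Optional

# Assumed answer format: "The answer is <number>" somewhere in the sampled output.
_ANSWER = re.compile(r"The answer is\s+(-?\d+(?:\.\d+)?)")


def extract_answer(sampled_output: str) -> Optional[str]:
    """Return the final numeric answer as a string, or None if none is found."""
    match = _ANSWER.search(sampled_output)
    return match.group(1) if match else None


def self_consistent_answer(sampled_outputs: List[str]) -> Optional[str]:
    """Majority vote over final answers, ignoring the rationales that led to them."""
    answers = [a for a in map(extract_answer, sampled_outputs) if a is not None]
    if not answers:
        return None
    return Counter(answers).most_common(1)[0][0]


samples = [
    "Roger started with 5 balls. 2 cans of 3 is 6. 5 + 6 = 11. The answer is 11.",
    "He buys 2 * 3 = 6 balls. 5 + 6 = 11. The answer is 11.",
    "5 + 2 = 7. The answer is 7.",
    "This sample never states a final answer.",
]
voted = self_consistent_answer(samples)
```

Because only the final answers are compared, two rationales that reach the same answer through different steps reinforce each other, which is the intuition behind sampling diverse reasoning paths.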
Rationale verification. Ensuring that the rationales produced by LLMs are valid is critical, as incorrect rationales can lead to incorrect final predictions (Ye and Durrett, 2022). To address this issue, the process of rationale verification aims to verify whether the rationales produced by LLMs lead to the correct final answers. Cobbe et al. (2021) propose augmenting LLMs with a trained verifier that assigns a score to each rationale and solution generated by the LLM, selecting the highest-ranked solution as the final answer when solving math word problems. Li et al. (2022b) also use this technique to guide rationale selection, in conjunction with the process of rationale exploration. Different from the above methods that train an external verifier to verify the rationales, Weng et al. (2022) suggest using LLMs themselves as the verifiers.
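The score-and-select pattern behind verifier-based approaches can be sketched as below. The verifier here is a toy, hand-written stand-in that only checks simple integer equations in the rationale; it is an assumption for illustration and is not the trained verifier of Cobbe et al. (2021) or the self-verification procedure of Weng et al. (2022). All names are hypothetical.

```python
import re
from typing import Callable, List, Tuple

# Matches simple integer equations such as "5 + 6 = 11" inside a rationale.
_EQUATION = re.compile(r"(\d+)\s*([+\-*])\s*(\d+)\s*=\s*(\d+)")


def toy_verifier(rationale: str) -> float:
    """Toy stand-in for a trained verifier: fraction of stated equations that hold."""
    checks = []
    for left, op, right, stated in _EQUATION.findall(rationale):
        a, b, c = int(left), int(right), int(stated)
        value = a + b if op == "+" else a - b if op == "-" else a * b
        checks.append(value == c)
    return sum(checks) / len(checks) if checks else 0.0


def select_best(
    candidates: List[Tuple[str, str]],
    score_fn: Callable[[str], float],
) -> Tuple[str, str]:
    """Return the (rationale, answer) pair whose rationale receives the highest score."""
    return max(candidates, key=lambda pair: score_fn(pair[0]))


candidates = [
    ("Roger started with 5 balls. 2 * 3 = 6. 5 + 6 = 12.", "12"),
    ("Roger started with 5 balls. 2 * 3 = 6. 5 + 6 = 11.", "11"),
]
best_rationale, best_answer = select_best(candidates, toy_verifier)
```

In practice `score_fn` would be replaced by a learned model or by a prompted LLM; the selection logic stays the same.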

Problem Decomposition
Chain-of-thought prompting, while effective for eliciting reasoning in LLMs, can struggle with complex tasks, e.g., tasks that require compositional generalization (Lake and Baroni, 2018; Keysers et al., 2020). To solve a complex problem, it is helpful to first break it down into smaller, more manageable subproblems. By solving each of these subproblems, we can effectively solve the complex problem. This technique is called problem decomposition or divide and conquer (Talmor and Berant, 2018; Min et al., 2019; Perez et al., 2020).
Based on this idea, Zhou et al. (2022a) propose least-to-most prompting, which consists of two steps: decomposing the complex problem into subproblems and solving these subproblems in a specific order, with each subproblem being facilitated by the answers obtained from previously solved subproblems. As follow-up work, Drozdov et al. (2022) introduce dynamic least-to-most prompting, which is designed to solve more realistic semantic parsing problems by decomposing the problems with prompting-based syntactic parsing and dynamically selecting exemplars based on the decomposition. In addition, Khot et al. (2022) design decomposed prompting, which breaks down a complex problem into subproblems that can be handled by a shared library of prompting-based LLMs, each specialized in a particular subproblem. Furthermore, Dua et al. (2022) develop successive prompting, which iteratively decomposes a complex problem into a simple problem, with the next subproblem prediction having access to the answers to the previous subproblems. While the above methods decompose or solve compositional questions with multiple forward passes, Press et al. (2022) suggest decomposing and solving the input question in one forward pass using CoT prompting. Overall, these techniques show promise for helping LLMs to solve complex tasks by decomposing the problem into more manageable subproblems.
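The two-stage control flow of least-to-most prompting (decompose, then solve subproblems in order while feeding earlier answers forward) can be sketched as follows. The `decompose` and `solve` callables stand in for LLM calls; the toy versions below are hard-coded assumptions so that the sketch runs without a model, and the slide problem is used purely as an illustrative input.

```python
from typing import Callable, List, Tuple


def least_to_most(
    problem: str,
    decompose: Callable[[str], List[str]],
    solve: Callable[[str], str],
) -> List[Tuple[str, str]]:
    """Solve subquestions in order; each prompt includes all earlier question-answer pairs."""
    solved: List[Tuple[str, str]] = []
    for sub in decompose(problem):
        history = "".join(f"Q: {q}\nA: {a}\n" for q, a in solved)
        prompt = f"{problem}\n{history}Q: {sub}\nA:"
        solved.append((sub, solve(prompt)))
    return solved


PROBLEM = (
    "Amy takes 4 minutes to climb to the top of a slide. "
    "It takes her 1 minute to slide down. The slide closes in 15 minutes."
)


def toy_decompose(problem: str) -> List[str]:
    # Hypothetical fixed decomposition; a real system would prompt an LLM for it.
    return [
        "How long does each trip take?",
        "How many times can she slide before it closes?",
    ]


TOY_ANSWERS = {
    "How long does each trip take?": "5 minutes",
    "How many times can she slide before it closes?": "3",
}


def toy_solve(prompt: str) -> str:
    # Hypothetical stand-in for an LLM call: look up the last question in the prompt.
    last_question = prompt.rsplit("Q: ", 1)[1].split("\nA:")[0]
    return TOY_ANSWERS[last_question]


steps = least_to_most(PROBLEM, toy_decompose, toy_solve)
```

The final element of `steps` holds the answer to the original question; the earlier elements are the intermediate subproblem answers that were appended to its prompt.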

Others
There are other techniques that have been developed to facilitate reasoning in LLMs for specific tasks or settings. For instance, Creswell et al. (2022); Creswell and Shanahan (2022) introduce a selection-inference framework that uses LLMs as modules to select and infer reasoning steps from a set of facts that culminate in the final answer. Kazemi et al. (2022) suggest using backward chaining, i.e., from goal to the set of facts that support it, instead of forward chaining like Creswell et al. (2022); Creswell and Shanahan (2022). In addition, Jung et al. (2022) propose a method for solving binary questions by prompting LLMs abductively and recursively to rationalize each option. Zhou et al. (2022b) design a technique for performing numerical reasoning on complex numbers by replacing the complex numbers with simple numbers to produce simpler expressions, and then using these expressions to perform calculations on the complex numbers. There are also efforts to distill reasoning from LLMs into smaller models, such as the work by Li et al. (2022a); Shridhar et al. (2022); Magister et al. (2022). Finally, we refer the reader to Dohan et al. (2022)'s position paper on language model cascades, which presents a unifying framework for understanding chain-of-thought prompting and research in this line.

Hybrid Method
While "prompting" techniques can help elicit or better utilize reasoning in large language models to solve reasoning tasks, they do not actually improve the reasoning capabilities of the LLMs themselves, as the parameters of the models remain unchanged. In contrast, the "hybrid approach" aims to simultaneously improve the reasoning capabilities of LLMs and make better use of these models in order to solve complex problems. This approach involves both enhancing the reasoning capabilities of the LLMs and using techniques such as prompting to effectively utilize these capabilities.

Reasoning-Enhanced Training and Prompting
One approach to improving the reasoning capabilities of LLMs is to pretrain or finetune the models on datasets that include "reasoning". Lewkowycz et al. (2022); Taylor et al. (2022) find that LLMs trained on datasets containing scientific and mathematical data can achieve better performance on reasoning tasks like quantitative reasoning problems when using CoT prompting. Pi et al. (2022) show that continually pretraining with SQL data can boost the performance of language models, e.g., T5 (Raffel et al., 2020), on natural language reasoning such as numerical reasoning and logical reasoning. finetuning and scratchpad prompting results in a significant improvement in LLMs' ability to generalize to longer problems, while this phenomenon is not observed in the standard fully supervised finetuning paradigm.

Bootstrapping & Self-Improving
Instead of finetuning LLMs on pre-built datasets that include reasoning, there are studies that have explored the idea of using LLMs to self-improve their reasoning abilities through a process known as bootstrapping. One example of this is the Self-Taught Reasoner (STaR) introduced by Zelikman et al. (2022), in which an LLM is trained and refined on its own output iteratively. Specifically, with CoT prompting, the model first generates initial rationales. Then, the model is finetuned on rationales that lead to correct answers. This process can be repeated, with each iteration resulting in an improved model that can generate better training data, which in turn leads to further improvements. As a follow-up to this work, Huang et al. (2022a) show that LLMs are able to self-improve their reasoning abilities without the need for supervised data by leveraging the self-consistency of reasoning (Wang et al., 2022c).
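The generate, filter, and finetune loop described above can be sketched as follows. Only the steps stated in the text are modeled: rationales are generated, those leading to correct answers are kept, and the model is updated on them. The `generate` and `finetune` callables are placeholders for real CoT sampling and training; the toy versions, which represent a "model" as a dictionary, are assumptions made so that the sketch is runnable.

```python
from typing import Callable, Dict, List, Tuple

Example = Tuple[str, str, str]          # (question, rationale, answer)
ToyModel = Dict[str, Tuple[str, str]]   # question -> (rationale, answer)


def bootstrap_iteration(
    model: ToyModel,
    dataset: List[Tuple[str, str]],
    generate: Callable[[ToyModel, str], Tuple[str, str]],
    finetune: Callable[[ToyModel, List[Example]], ToyModel],
) -> Tuple[ToyModel, List[Example]]:
    """One round: generate rationales, keep those with correct answers, then finetune."""
    kept: List[Example] = []
    for question, gold in dataset:
        rationale, predicted = generate(model, question)
        if predicted == gold:  # filter on final-answer correctness only
            kept.append((question, rationale, gold))
    return finetune(model, kept), kept


def toy_generate(model: ToyModel, question: str) -> Tuple[str, str]:
    # Hypothetical stand-in for CoT prompting: look up a stored rationale and answer.
    return model.get(question, ("I am not sure.", "unknown"))


def toy_finetune(model: ToyModel, examples: List[Example]) -> ToyModel:
    # Hypothetical stand-in for gradient-based finetuning: copy and store the kept examples.
    updated = dict(model)
    for question, rationale, answer in examples:
        updated[question] = (rationale, answer)
    return updated


initial_model: ToyModel = {
    "What is 2 + 3?": ("2 plus 3 is 5.", "5"),
    "What is 4 * 2?": ("4 plus 2 is 6.", "6"),
}
data = [("What is 2 + 3?", "5"), ("What is 4 * 2?", "8")]
new_model, kept_examples = bootstrap_iteration(initial_model, data, toy_generate, toy_finetune)
```

Calling `bootstrap_iteration` repeatedly with the returned model corresponds to the iterative refinement described in the text; note that the filter checks only the final answer, so a kept rationale is not guaranteed to be valid.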

Measuring Reasoning in Large Language Models
In this section, we summarize methods and benchmarks for evaluating the reasoning abilities of LLMs.

End Task Performance
One way to measure reasoning abilities of LLMs is to report their performance, e.g., accuracy, on end tasks that require reasoning. We list some common benchmarks as follows.
Arithmetic Reasoning. Arithmetic reasoning is the ability to understand and apply mathematical concepts and principles in order to solve problems involving arithmetic operations. This involves using logical thinking and mathematical principles to determine the correct course of action when solving mathematical problems.
Commonsense Reasoning. Commonsense reasoning is the use of everyday knowledge and understanding to make judgments and predictions about new situations. It is a fundamental aspect of human intelligence that enables us to navigate our environment, understand others, and make decisions with incomplete information. Benchmarks that can be used for testing commonsense reasoning abilities of LLMs include CSQA (Talmor et al., 2019), StrategyQA (Geva et al., 2021), and ARC (Clark et al., 2018). We refer the reader to Bhargava and Ng (2022)'s survey for more work in this domain.
Symbolic Reasoning. Symbolic reasoning is a form of reasoning that involves the manipulation of symbols according to formal rules. In symbolic reasoning, we use abstract symbols to represent concepts and relationships, and then manipulate those symbols according to precise rules in order to draw conclusions or solve problems. Two benchmarks of symbolic reasoning are presented in Wei et al. (2022b), including Last Letter Concatenation and Coin Flip.
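To make the two symbolic tasks concrete, the sketch below computes reference answers for them, assuming the usual formulation: Last Letter Concatenation asks for the last letters of the words in a name joined together, and Coin Flip asks whether a coin that starts heads up is still heads up after a sequence of people either flip it or not. The function names and input encodings are choices made for this example.

```python
from typing import List


def last_letter_concatenation(name: str) -> str:
    """Join the last letter of each whitespace-separated word in the name."""
    return "".join(word[-1] for word in name.split())


def coin_still_heads_up(flips: List[bool], starts_heads_up: bool = True) -> bool:
    """Track the coin state; each True entry means that person flips the coin."""
    heads_up = starts_heads_up
    for person_flips in flips:
        if person_flips:
            heads_up = not heads_up
    return heads_up


letters = last_letter_concatenation("Amy Brown")
after_one_flip = coin_still_heads_up([True, False])
after_two_flips = coin_still_heads_up([True, True])
```

That a few lines of rule-based code solve both tasks exactly is part of why they are described as symbolic: the challenge for an LLM lies in following the rule reliably, especially on inputs longer than those seen in the exemplars.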
Others. In practice, there are many benchmarks that can be used to evaluate reasoning abilities of LLMs (indirectly), as long as the downstream task involves reasoning. BIG-bench (Srivastava et al., 2022), for example, includes over 200 tasks that test a range of reasoning skills, including tasks like Date Understanding, Word Sorting, and Causal Judgement. Other benchmarks, such as SCAN (Lake and Baroni, 2018) and the one proposed by Anil et al. (2022), focus on evaluating generalization ability. LLMs can also be tested on their table reasoning abilities using benchmarks such as WikiTableQA (Pasupat and Liang, 2015) and FetaQA (Nan et al., 2022), as suggested by Chen (2022). In addition, there are benchmarks for evaluating LLMs' generative relational reasoning abilities, such as CommonGen (Lin et al., 2020; Liu et al., 2022a) and Open Relation Modeling (Huang et al., 2022b,d).

Analysis on Reasoning
Although LLMs have demonstrated impressive performance on various reasoning tasks, the extent to which their predictions are based on true reasoning or simple heuristics is not always clear. This is because most existing evaluations focus on their accuracy on end tasks, rather than directly assessing their reasoning steps. While some error analysis has been conducted on the generated rationales of LLMs (Wei et al., 2022b; Kojima et al., 2022, inter alia), this analysis has often been limited in depth.
There have been some efforts to develop metrics and benchmarks that enable a more formal/deep analysis of reasoning in LLMs. Golovneva et al. (2022) design ROSCOE, a set of interpretable, detailed step-by-step evaluation metrics covering various perspectives including semantic alignment, logical inference, semantic similarity, and language coherence. Saparov and He (2022) create a synthetic dataset called PrOntoQA that is generated from real or fictional ontologies. Each example in the dataset has a unique proof, which can be converted to simple sentences and back again, allowing for a formal analysis of each reasoning step. Han et al. (2022a) introduce a dataset called FOLIO to test the first-order logic reasoning capabilities of LLMs. FOLIO contains first-order logic reasoning problems that require models to determine the correctness of conclusions given a set of premises. In addition, Wang et al. (2022b) conduct ablation experiments on CoT and find that LLMs may still perform reasoning when prompted with invalid rationales. Their study also suggests that being relevant to the query and correctly ordering the reasoning steps are important for CoT prompting.
In summary, most existing studies primarily report the performance of the models on downstream reasoning tasks, without a detailed examination of the quality of the rationales produced. This leaves open the question of whether the models are actually able to reason in a way that is similar to human reasoning, or whether they are simply able to achieve good performance on the tasks through other means. Further research is needed to more formally analyze the reasoning abilities of LLMs.

Findings and Implications
In this section, we summarize the important findings and implications of studies on reasoning in large language models.
Reasoning seems an emergent ability of LLMs. Wei et al. (2022a,b); Suzgun et al. (2022) show that reasoning ability appears to emerge only in large language models like GPT-3 175B, as evidenced by significant improvements in performance on reasoning tasks at a certain scale (e.g., 100 billion parameters). This suggests that it may be more effective to utilize large models for general reasoning problems rather than training small models for specific tasks. However, the reason for this emergent ability is not yet fully understood. We refer the reader to Wei et al. (2022a); Fu et al. (2022a) for some potential explanations.
Chain of thought elicits "reasoning" of LLMs. The use of chain-of-thought (CoT) prompts (Wei et al., 2022b) has been shown to improve the performance of LLMs on various reasoning tasks, as demonstrated in the experiments of Wei et al. (2022a,b); Suzgun et al. (2022). Additionally, Saparov and He (2022) (§4.2) find that, when using CoT prompts, LLMs are able to produce valid individual proof steps, even when the synthetic ontology is fictional or counterfactual. However, they may sometimes choose the wrong steps when multiple options are available, leading to incomplete or incorrect proofs. Moreover, for many reasoning tasks where the performance of standard prompting grows smoothly with model scale, chain-of-thought prompting can lead to dramatic performance improvement. In addition to these benefits, the use of CoT prompts has been shown to improve the out-of-distribution robustness of LLMs (Wei et al., 2022b; Zhou et al., 2022a; Anil et al., 2022, inter alia), an advantage that is not typically observed with standard prompting or fully supervised finetuning paradigms.
LLMs show human-like content effects on reasoning. According to Dasgupta et al. (2022), LLMs exhibit reasoning patterns that are similar to those of humans as described in the cognitive literature. For example, the models' predictions are influenced by both prior knowledge and abstract reasoning, and their judgments of logical validity are impacted by the believability of the conclusions. These findings suggest that, although language models may not always perform well on reasoning tasks, their failures often occur in situations that are challenging for humans as well. This provides some evidence that language models may "reason" in a way that is similar to human reasoning.
LLMs are still unskilled at complex reasoning. Although LLMs seem to possess impressive reasoning capabilities with the techniques described in §3, they still struggle with more complex reasoning tasks or those involving implicature, according to studies such as Valmeekam et al. (2022); Han et al. (2022a); Ruis et al. (2022). For instance, Valmeekam et al. (2022) find that even in relatively simple commonsense planning domains that humans would have no trouble navigating, LLMs such as GPT-3 (Brown et al., 2020) and BLOOM (Scao et al., 2022) struggle to perform effectively. These findings suggest that existing benchmarks may be too simple to accurately gauge the true reasoning abilities of LLMs, and that more challenging tasks may be needed to fully evaluate their abilities in this regard.

Reflection, Discussion, and Future Directions
Why reasoning? Reasoning is the process of thinking about something in a logical and systematic way, and it is a key aspect of human intelligence. By incorporating reasoning capabilities into language models, we can enable them to perform tasks that require more complex and nuanced thinking, such as problem solving, decision making, and planning (Huang et al., 2022e,f; Song et al., 2022). This can improve the performance of these models on downstream tasks and increase their out-of-distribution robustness (Wei et al., 2022a,b; Suzgun et al., 2022; Zhou et al., 2022a; Anil et al., 2022). In addition, reasoning can make language models more explainable and interpretable, as it provides explicit rationales for their predictions.
Right task/application? As Valmeekam et al. (2022) point out, current benchmarks may not adequately reflect the reasoning capabilities of LLMs. In addition, tasks such as solving simple math problems and concatenating letters in strings (§4.1) are artificial and do not accurately reflect real-world situations. To truly understand the reasoning ability of LLMs, it is important to consider more realistic and meaningful applications such as decision making (Edwards, 1954), legal reasoning (Levi, 2013), and scientific reasoning (Zimmerman, 2000). Our ultimate goal should not be to enable LLMs to solve simple math problems, which can be simply done with other programs. When conducting relevant research, it is essential to ask whether the specific task being tackled is meaningful and whether the proposed method can be generalized to more realistic tasks and applications.

Are language models really able to reason?
There are several indications that LLMs are able to reason, including 1) high performance on various tasks requiring reasoning (Suzgun et al., 2022); 2) the ability to reason step-by-step with chain-of-thought prompting (Wei et al., 2022b); and 3) the reflection of human-like content effects on reasoning (Dasgupta et al., 2022). However, these findings are not sufficient to conclude that LLMs can truly reason. For 1), it is not clear whether the models are making predictions based on reasoning or heuristics (Patel et al., 2021). In fact, for many existing reasoning benchmarks, we can design a program with heuristic rules that achieves very high performance. We usually do not think a program relying on heuristic rules is capable of reasoning.
For 2), although the models seem to reason stepby-step, the generated rationales may be incorrect and inconsistent.It is possible that the models are "generating reasoning-like response" rather than "reasoning step-by-step".For 3), while LLMs display some human-like reasoning patterns, this does not necessarily mean that they behave like humans.Additionally, there are several observations that suggest LLMs may not be capable of reasoning: 1) LLMs still struggle with tasks that require complex reasoning (Valmeekam et al., 2022;Han et al., 2022a;Ruis et al., 2022).If LLMs are really decent reasoners, they should handle tasks that can be simply solved by humans through reasoning; 2) LLMs make mistakes in their reasoning, as explained above; 3) #4 The performance of LLMs on downstream tasks has been found to be sensitive to the frequency of certain terms, such as numbers, in the training data (Razeghi et al., 2022;Jung et al., 2022), which would not be expected if the models were solving mathematical problems through reasoning; 4) # Language models have been found to struggle with associating relevant information that they have memorized (Huang et al., 2022c).
Overall, it is still too early to draw a conclusion on this question. In fact, there is also an ongoing debate about whether language models can actually understand language or capture meaning (Bender and Koller, 2020; Li et al., 2021; Manning, 2022; Piantadosi and Hill, 2022). Further in-depth analysis of factors such as training data, model architecture, and optimization objectives is needed, as well as the development of better benchmarks for measuring the reasoning capabilities of LLMs. However, it is clear that the current models are not yet capable of robust reasoning.
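The point made above, that a program with heuristic rules can score well on a benchmark without reasoning, can be illustrated with a small sketch. Everything here is hypothetical: the keyword lists, the function name `heuristic_solver`, and the three word problems are invented for illustration and do not come from any cited benchmark.

```python
import re


def heuristic_solver(problem: str) -> float:
    """Guess an answer from surface cues only: the first two numbers and a keyword."""
    numbers = [float(n) for n in re.findall(r"\d+(?:\.\d+)?", problem)]
    if len(numbers) < 2:
        raise ValueError("expected at least two numbers in the problem")
    a, b = numbers[0], numbers[1]
    text = problem.lower()
    # Keyword lists are illustrative assumptions, not learned or validated rules.
    if any(cue in text for cue in ("each", "every", "times")):
        return a * b
    if any(cue in text for cue in ("left", "gave", "lost", "remain")):
        return a - b
    return a + b


# The shortcuts happen to work on stereotypical phrasings.
print(heuristic_solver("Tom has 5 apples and buys 3 more. How many apples does he have?"))  # 8.0
print(heuristic_solver("Sara had 9 pens and gave 4 to Ann. How many pens are left?"))  # 5.0
# Rephrasing the second problem breaks the shortcut: the correct answer is 13.
print(heuristic_solver("Sara gave 4 pens to Ann and still has 9 pens. How many pens did Sara have at first?"))  # -5.0
```

The solver never models who holds what at which time, so a benchmark dominated by stereotypical phrasings would reward it, while a small perturbation exposes the absence of reasoning.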

Improving reasoning capabilities of LLMs.
While techniques like chain-of-thought prompting (Wei et al., 2022b) may help to elicit reasoning abilities in large language models, they cannot enable the models to solve tasks beyond their current capabilities. To truly enhance reasoning in LLMs, we need to utilize training data, model architectures, and optimization objectives that are designed to encourage reasoning. For example, fine-tuning a model on a dataset that includes CoT data has been shown to improve reasoning (Chung et al., 2022), and models can also self-improve by bootstrapping their own reasoning (Zelikman et al., 2022; Huang et al., 2022a). Much research remains to be done in this area, and we look forward to future progress in improving reasoning in large language models.
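As a rough illustration of the bootstrapping idea mentioned above, the sketch below shows one data-collection round: a model proposes a rationale and an answer, and only rationales whose answer matches a known gold answer are kept as new training examples. This is a simplified variant under stated assumptions, not the procedure of any cited work: it assumes gold answers are available, `toy_generate` is a hypothetical stand-in for sampling from an LLM, and the fine-tuning step itself is omitted.

```python
from typing import Callable, Dict, List, Tuple

Example = Dict[str, str]


def collect_verified_rationales(
    problems: List[Tuple[str, str]],
    generate: Callable[[str], Tuple[str, str]],
) -> List[Example]:
    """Keep only generated rationales whose final answer matches the gold answer."""
    kept: List[Example] = []
    for question, gold_answer in problems:
        rationale, answer = generate(question)
        if answer.strip() == gold_answer.strip():
            kept.append({"question": question, "rationale": rationale, "answer": answer})
    return kept


def toy_generate(question: str) -> Tuple[str, str]:
    """Hypothetical stand-in for an LLM call; returns canned (rationale, answer) pairs."""
    canned = {
        "2 + 3 = ?": ("2 plus 3 is 5.", "5"),
        "4 * 2 = ?": ("4 plus 2 is 6.", "6"),  # deliberately wrong rationale and answer
    }
    return canned[question]


problems = [("2 + 3 = ?", "5"), ("4 * 2 = ?", "8")]
train_set = collect_verified_rationales(problems, toy_generate)
print(len(train_set))  # 1: only the correctly answered problem is kept
```

In a full loop, the kept examples would be used to fine-tune the model before the next round. Filtering on the final answer is only a proxy, since a flawed rationale can still reach a correct answer.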

Conclusion
In this paper, we have provided a detailed and up-to-date review of the current state of knowledge on reasoning in large language models. We have discussed techniques for improving and eliciting reasoning in LLMs, methods and benchmarks for evaluating reasoning abilities, and the findings and implications of previous studies on this topic. While LLMs have made significant progress in natural language processing and related fields, it remains unclear to what extent they are capable of true reasoning or whether they are simply using memorized patterns and heuristics to solve problems. Further research is needed to fully understand the reasoning abilities of LLMs, improve LLMs' reasoning capabilities, and determine their potential for use in a variety of applications. We hope that this paper will serve as a useful overview of the current state of the field and stimulate further discussion and research on this interesting and important topic.

Limitations
In this paper, we provide an overview of the current state of knowledge on reasoning in large language models. Reasoning is a broad concept that encompasses various forms, making it impractical to summarize all related work in a single paper. Therefore, we focus on deductive reasoning, as it is the most commonly studied in the literature. Other forms of reasoning such as inductive reasoning (Yang et al., 2022; Misra et al., 2022, inter alia) and abductive reasoning (Wiegreffe et al., 2022; Lampinen et al., 2022; Jung et al., 2022, inter alia) may not be discussed in depth.
Additionally, given the rapid evolution and significance of reasoning within large language models, it is crucial to note that new contributions may have emerged in the field concurrent with the writing of this paper. An additional resource to consider is a parallel survey by Qiao et al. (2022), which emphasizes reasoning via language model prompting. Our coverage may not extend to papers released during or after 2023, such as evaluations of ChatGPT (Bang et al., 2023; Zheng et al., 2023). As such, we recommend that readers check the papers that cite this survey for a more comprehensive and updated understanding of this field.

Figure 1: The structure of the paper.

Figure 2: An illustration of Chain-of-Thought Prompting and Rationale Engineering, where asterisk (*) denotes the target problem to be solved.