Logic-LM: Empowering Large Language Models with Symbolic Solvers for Faithful Logical Reasoning

Large Language Models (LLMs) have shown human-like reasoning abilities but still struggle with complex logical problems. This paper introduces a novel framework, Logic-LM, which integrates LLMs with symbolic solvers to improve logical problem-solving. Our method first utilizes LLMs to translate a natural language problem into a symbolic formulation. Afterward, a deterministic symbolic solver performs inference on the formulated problem. We also introduce a self-refinement module, which utilizes the symbolic solver's error messages to revise symbolic formalizations. We demonstrate Logic-LM's effectiveness on five logical reasoning datasets: ProofWriter, PrOntoQA, FOLIO, LogicalDeduction, and AR-LSAT. On average, Logic-LM achieves a significant performance boost of 39.2% over using LLM alone with standard prompting and 18.4% over LLM with chain-of-thought prompting. Our findings suggest that Logic-LM, by combining LLMs with symbolic logic, offers a promising avenue for faithful logical reasoning. Code and data are publicly available at https://github.com/teacherpeterpan/Logic-LLM.


Introduction
Logical reasoning is a cognitive process that involves using evidence, arguments, and logic to arrive at conclusions or make judgments (Huang and Chang, 2023). It plays a central role in intelligent systems for problem-solving, decision-making, and critical thinking. Recently, large language models (LLMs) (Brown et al., 2020; Ouyang et al., 2022a; OpenAI, 2023) have exhibited an emergent ability to "reason" like humans (Wei et al., 2022a). When prompted with step-wise explanations of reasoning ("chain of thoughts"), or with the simple prompt "Let's think step by step.", these models are able to answer questions with explicit reasoning steps (Wei et al., 2022b; Kojima et al., 2022).
Despite the advances of LLMs, they still struggle with complex logical reasoning problems (Liu et al., 2023b). Recent studies (Golovneva et al., 2023; Ribeiro et al., 2023b; Lyu et al., 2023) found that LLMs occasionally produce unfaithful reasoning, i.e., the derived conclusion does not follow from the previously generated reasoning chain. While chain-of-thought may imitate human reasoning processes, the fundamental nature of LLMs remains that of black-box probabilistic models, lacking a mechanism to guarantee the faithfulness of reasoning (Shanahan, 2022). In contrast, symbolic inference engines, such as expert systems (Metaxiotis et al., 2002), are faithful and transparent because the reasoning is based on symbolically represented knowledge and follows well-defined inference rules that adhere to logical principles. The main obstacle is how to accurately translate a problem into symbolic representations, considering the inherent ambiguity and flexibility of natural language. This is precisely where LLMs excel, making LLMs a promising complement to symbolic solvers.
This drives our exploration of neuro-symbolic methods that integrate LLMs with symbolic reasoning. As illustrated in Figure 1, we present LOGIC-LM, a novel framework that decomposes a logical reasoning problem into three stages: Problem Formulation, Symbolic Reasoning, and Result Interpretation. During problem formulation, an LLM converts the natural language description of the problem into an appropriate symbolic formulation, identifying key entities, facts, and rules present in the problem statement. Subsequently, at the symbolic reasoning stage, a deterministic symbolic solver performs inference on the symbolic formulation. Lastly, a result interpreter explains the output and maps it to the correct answer. By combining LLMs with symbolic solvers, we can exploit the robust natural language understanding capabilities of LLMs to precisely represent the problem using symbolic representations, while also taking advantage of the logical faithfulness and transparency offered by symbolic solvers. To improve the accuracy of the symbolic parsing, we also incorporate the idea of self-refinement to iteratively revise the generated logical form using the error messages from the symbolic solver as feedback.
We showcase the adaptability and effectiveness of LOGIC-LM on five logical reasoning datasets: ProofWriter (Tafjord et al., 2021), PrOntoQA (Saparov and He, 2023), FOLIO (Han et al., 2022), AR-LSAT (Zhong et al., 2022), and the LogicalDeduction dataset from BigBench (Srivastava et al., 2022). These datasets cover a wide range of logical reasoning problems, including:
• Deductive Reasoning problems
• First-Order Logic (FOL) reasoning problems
• Constraint Satisfaction Problems (CSP)
• Analytical Reasoning (AR) problems
We integrate four types of symbolic inference tools tailored to these problems: 1) a logic programming engine that supports deductive reasoning through forward/backward chaining; 2) an FOL inference engine that derives new conclusions based on FOL rules and facts; 3) a constraint optimization engine that provides solvers for CSP over finite domains; and 4) a boolean satisfiability (SAT) solver that solves analytical reasoning problems.
Our evaluations show that the strategy of integrating LLMs with symbolic solvers performs significantly better than purely relying on LLMs for logical reasoning, with an average improvement of 39.2% over standard prompting and 18.4% over chain-of-thought prompting (§ 4.1). We also find that LOGIC-LM becomes increasingly effective as the required reasoning depth increases (§ 4.3). Finally, by analyzing the impact of self-refinement, we highlight the effectiveness of incrementally revising symbolic formalizations when interacting with the symbolic solver (§ 4.4).

Related Work
Language Models for Logical Reasoning. Recent works in adapting LLMs for logical reasoning tasks can be broadly categorized into two groups: 1) fine-tuning approaches that optimize LLMs' reasoning ability through fine-tuning or training specialized modules (Clark et al., 2020; Tafjord et al., 2022; Yang et al., 2022), and 2) in-context learning approaches that design special prompts to elicit LLMs' step-by-step reasoning capabilities. Typical methods include chain-of-thought prompting (Wei et al., 2022b; Wang et al., 2023), which generates explanations before the final answer, and least-to-most prompting (Zhou et al., 2023), which breaks the problem down into simpler components that can be solved individually. Both of the above approaches perform reasoning directly over natural language (NL), providing greater flexibility than symbolic-based reasoning. However, the intrinsic complexity and ambiguity of NL also bring undesired issues such as unfaithful reasoning and hallucinations.
Different from prior works, we use symbolic language as the basic unit of reasoning. This effectively transfers the burden of executing complex, precise reasoning from LLMs to more reliable, interpretable external symbolic solvers. Simultaneously, we leverage the strong in-context learning ability of LLMs to formulate the NL-based problem into suitable symbolic representations, thus maintaining the benefit of flexibility.
Although prior works (Mao et al., 2019; Gupta et al., 2020; Manhaeve et al., 2021; Cai et al., 2021; Tian et al., 2022; Pryor et al., 2023) also propose neuro-symbolic methods to combine neural networks with symbolic reasoning, these methods suffer from limitations such as hand-crafted or specialized module designs that are not easily generalizable, or brittleness due to the difficulty of optimization. In contrast, we propose a more generalizable framework that integrates modern LLMs with symbolic logic without the need for training or designing complex problem-specific modules.
Tool-augmented Language Models. Language models have inherent limitations such as the inability to access up-to-date information, take actions, or perform precise mathematical reasoning. To address this, recent work has begun to augment language models with access to external tools and resources, such as the information retriever (Nakano et al., 2021; Shi et al., 2023; Lazaridou et al., 2022), calculator (Cobbe et al., 2021), code interpreter (Wang et al., 2022), planner (Liu et al., 2023a), and other pre-trained models (Shen et al., 2023). Recent works (Gao et al., 2023; Chen et al., 2022) have achieved improved performance on arithmetic reasoning tasks by generating Python programs that specify the reasoning procedure as chained commands in the order of execution. However, this idea has not been extended to logical reasoning problems, primarily due to the challenge of representing their highly "non-linear" reasoning procedure (e.g., hypothesizing, case-by-case analysis, and the process of elimination) with functional programming. Our work provides a novel way to solve this within the framework of augmented LLMs. Instead of parsing the problem-solving procedure as programs, we only describe the problem with symbolic language using LLMs and then offload the reasoning to external symbolic solvers.
[Figure 2 (recovered text). Panel labels: Logic Programming, SMT Solver, Answer. Example problems: (i) "No giant language model could have bad performance. If a language model has good performance, it is used by some researchers. A work used by some researchers should be popular. If BERT is a giant language model, then the same for GPT3. BERT is a giant language model." Is the statement "GPT3 is popular" true, false, or unknown? Answer: true. (ii) Is the statement "Nails cannot conduct electricity" true, false, or unknown? Answer: false. (iii) "In an antique car show, there are three vehicles: a tractor, a convertible, and a minivan. The tractor is the second-newest. The minivan is newer than the convertible." Which of the following is true? Answer: the convertible is the oldest.]
Auto-Formalization. The concept of converting natural language into symbolic representations has been widely adopted in auto-formalization for mathematical reasoning (Wu et al., 2022; Drori et al., 2022; He-Yueya et al., 2023; Jiang et al., 2023). These works demonstrate the proficiency of LLMs in translating a considerable fraction of mathematical problems into formal specifications defined in tools like SymPy (Meurer et al., 2017), Isabelle/HOL (Paulson, 1994), and Lean (de Moura et al., 2015). Mathematical reasoning can be considered a specialized subset of logical reasoning, primarily focused on numeric deductions. Due to this numeric specificity, mathematical problems are often more readily translatable to symbolic forms. In contrast, logical reasoning covers a wider array of problem types, often requiring a deeper understanding of world knowledge and commonsense for effective parsing into symbolic forms. Despite the many works studying mathematical reasoning, our work pioneers the extension of auto-formalization to a broader range of logical reasoning tasks with modern LLMs.

LOGIC-LM
As shown in Figure 2, the inputs of our model are a logical reasoning problem P described in natural language, along with a goal G in the form of a multiple-choice or free-form question. LOGIC-LM then follows a problem formulation-and-reasoning paradigm to solve the problem.
In the Problem Formulation stage, we prompt an LLM to translate the problem and the goal into a task-specific symbolic language. In the Symbolic Reasoning stage, we call a deterministic symbolic solver, e.g., a logic programming engine, to obtain a symbolic-represented answer. Finally, an LLM- or rule-based Result Interpreter is responsible for translating the answer back to natural language. Using this approach, the reasoning is guaranteed to be faithful as long as the problem formulation is correct, since the answer A is the result of executing deterministic algorithms (e.g., forward/backward-chaining) embedded within the symbolic reasoner. Compared to previous methods based on chain-of-thought, our framework reduces the burden of LLMs by shifting their focus from "solving the problem by reasoning step-by-step" to "representing the problem in symbolic language".
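The three-stage flow described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: `logic_lm`, `toy_formulate`, `toy_solve`, and `toy_interpret` are hypothetical names, the formulator is hard-coded where the real system would prompt an LLM, and the rule "things made of iron conduct electricity" is an assumed premise for the nails example.

```python
from typing import Callable, Dict


def logic_lm(problem: str, goal: str,
             formulate: Callable[[str, str], Dict],
             solve: Callable[[Dict], str],
             interpret: Callable[[str], str]) -> str:
    """Wire the three stages together; each stage is passed in as a callable."""
    symbolic = formulate(problem, goal)   # Problem Formulation (an LLM call in the real system)
    raw_answer = solve(symbolic)          # Symbolic Reasoning (deterministic solver)
    return interpret(raw_answer)          # Result Interpretation


def toy_formulate(problem: str, goal: str) -> Dict:
    # Hypothetical stand-in for the LLM: returns a fixed symbolic formulation.
    return {
        "facts": {"MadeOfIron(Nails)"},
        # Assumed rule, for illustration only: iron objects conduct electricity.
        "rules": [(frozenset({"MadeOfIron(Nails)"}), "ConductElectricity(Nails)")],
        "query": "ConductElectricity(Nails)",
    }


def toy_solve(symbolic: Dict) -> str:
    # One pass over the rules is enough for this toy problem.
    derived = set(symbolic["facts"])
    for premises, conclusion in symbolic["rules"]:
        if premises <= derived:
            derived.add(conclusion)
    return "Entailment" if symbolic["query"] in derived else "Unknown"


def toy_interpret(raw_answer: str) -> str:
    return {"Entailment": "true", "Unknown": "unknown"}[raw_answer]


answer = logic_lm("Nails are made of iron.", "Nails conduct electricity.",
                  toy_formulate, toy_solve, toy_interpret)
print(answer)
```

Because the solver stage is deterministic, any wrong answer in this setup can only come from the formulation stage, which is the faithfulness argument made in the text.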

Problem Formulator
Intuitively, LLMs may struggle with directly solving complex reasoning problems. However, they have demonstrated a notable ability to comprehend textual inputs and translate them into formal programs, such as mathematical equations (He-Yueya et al., 2023) or Python code (Gao et al., 2023). We posit that this capability to formulate problems into different languages can be extended to symbolic languages as well. We leverage the few-shot generalization ability of LLMs to achieve this. By providing the LLM with detailed instructions about the grammar of the symbolic language, alongside a few demonstrations as in-context examples, we observe that LLMs, like InstructGPT (Ouyang et al., 2022b) and GPT-4 (OpenAI, 2023), can effectively follow the instructions to identify key entities, facts, and rules present in the problem statement, and then translate these elements into symbolic language following our defined grammar.
Specifically, we use four different symbolic formulations to cover four common types of logical reasoning problems: deductive reasoning, first-order logic reasoning, constraint satisfaction problems, and analytical reasoning. These formulations provide a foundation for translating natural language-based problem statements. By defining additional problem-specific formulations, our framework retains the flexibility to accommodate a wider range of reasoning tasks. Next, we will delve into the grammar of each symbolic formulation. Examples of each problem type are in Figure 2.
Logic Programming (LP) Language. Deductive reasoning typically starts from known facts and rules, and iteratively makes new inferences until the goal statement can be proved or disproved (Poole and Mackworth, 2010). The Prolog logic programming language (Clocksin and Mellish, 2003; Körner et al., 2022) is arguably the most prominent symbolic language to describe deductive reasoning problems. We adopt its grammar to represent a problem as facts, rules, and queries.
• Facts: a fact $F$ is a simple statement with a predicate and a set of arguments, formulated as $P(a_1, \cdots, a_n)$, where $P$ is the predicate name and each argument $a_i$ can be a variable, entity, number, or bool. For example, Age(Peter, 31) means "Peter's age is 31", and MadeOfIron(Nails, True) represents the fact "Nails are made of iron".
• Rules: rules are written in the form of clauses: $F_1 \wedge \cdots \wedge F_m \rightarrow F_{m+1} \wedge \cdots \wedge F_n$, where each $F_i$ is a fact.
• Queries: a query $Q$ is simply another fact required to be proved based on known facts and rules.
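The facts, rules, and queries above can be made concrete with a small forward-chaining sketch. This is an illustrative assumption-laden example, not the grammar or engine used in the paper: facts are encoded as tuples, variables are marked with a leading `$` (a convention chosen here), and the two rules about metals are hypothetical premises added so that the nails example has something to derive.

```python
def is_var(term):
    # Convention assumed for this sketch: variables are strings starting with "$".
    return isinstance(term, str) and term.startswith("$")


def match(pattern, fact, binding):
    """Try to extend `binding` so that `pattern` equals `fact`; return None on failure."""
    if len(pattern) != len(fact) or pattern[0] != fact[0]:
        return None
    new_binding = dict(binding)
    for p, f in zip(pattern[1:], fact[1:]):
        if is_var(p):
            if new_binding.setdefault(p, f) != f:
                return None
        elif p != f:
            return None
    return new_binding


def substitute(pattern, binding):
    return tuple(binding.get(t, t) if is_var(t) else t for t in pattern)


def forward_chain(facts, rules):
    """Apply rules (premises -> conclusions) until no new fact can be derived."""
    known = set(facts)
    changed = True
    while changed:
        changed = False
        for premises, conclusions in rules:
            bindings = [{}]
            for premise in premises:
                bindings = [b2 for b in bindings for fact in known
                            if (b2 := match(premise, fact, b)) is not None]
            for b in bindings:
                for conclusion in conclusions:
                    new_fact = substitute(conclusion, b)
                    if new_fact not in known:
                        known.add(new_fact)
                        changed = True
    return known


facts = {("Age", "Peter", 31), ("MadeOfIron", "Nails", True)}
rules = [
    # Hypothetical rules: F_1 -> F_2 with a shared variable $x.
    ([("MadeOfIron", "$x", True)], [("IsMetal", "$x", True)]),
    ([("IsMetal", "$x", True)], [("ConductElectricity", "$x", True)]),
]
query = ("ConductElectricity", "Nails", True)

derived = forward_chain(facts, rules)
print(query in derived)
```

A query is answered by checking membership in the derived fact set, which mirrors the statement that a query is "simply another fact required to be proved".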
First-Order Logic (FOL). While the logic programming language efficiently represents common deductive reasoning problems, it may fail to represent more complex first-order logic (FOL) problems. To address this, we also include the FOL grammar (Enderton, 2001) in Appendix A. A problem is then parsed into a list of FOL formulas, which are divided into Premises (the known information from the problem) and Conclusion (the unknown formula to be proved). An example sentence and its FOL formula are given in Table 1.
Constraint Satisfaction (CSP). Constraint satisfaction problems (CSPs) (Kumar, 1992) aim to find a value assignment for a set of objects that satisfies a number of constraints. A CSP is often defined as a triple $(X, D, C)$, where $X = \{x_1, \cdots, x_n\}$ is a set of variables, $D = \{D_1, \cdots, D_n\}$ is a set of their respective domains of values, and $C = \{C_1, \cdots, C_m\}$ is a set of constraints. Each variable $x_i$ can take on the values in the nonempty domain $D_i$. Every constraint $C_j$ is a pair $\langle t_j, R_j \rangle$, where $t_j \subset X$ is a subset of $k$ variables and $R_j$ is a $k$-ary relation on the corresponding subset of domains $D_j$. We use the above syntax to define a CSP problem as variables, domains, and constraints. An example is given in both Figure 2 and Table 1.
Boolean Satisfiability (SAT) Formulation. SAT is the problem of deciding if there is an assignment to the variables of a Boolean formula such that the formula is satisfied. Many analytical reasoning problems can be formulated as SAT problems. We adopt the grammar defined in Ye et al. (2023) to formulate an SAT problem $P$ as $(\Phi, \mathcal{T}, Q)$, where $\Phi$ is a set of constraints defined under the theory $\mathcal{T}$, and $Q$ is the query of interest.
Table 1 summarizes the four types of logical reasoning problems, their typical datasets, and the symbolic formulation used to represent each type of problem. We also give an example of a natural language statement with its corresponding symbolic formulation for each type. Appendix C shows the full prompts we use for the problem formulator. To teach LLMs to better align each statement with its corresponding symbolic form, we use the format SYMBOLIC_FORMULA ::: NL_STATEMENT in in-context examples to enable better grounding.
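To illustrate the SYMBOLIC_FORMULA ::: NL_STATEMENT format, the sketch below splits such lines back into (formula, statement) pairs, as one might do when post-processing LLM output before handing the formulas to a solver. The function name `parse_grounded_lines` and the sample text are assumptions made for this example; the actual prompts are in Appendix C and may differ in layout.

```python
def parse_grounded_lines(text):
    """Split 'FORMULA ::: STATEMENT' lines into (formula, statement) pairs.

    Lines without the ':::' separator (headers, blank lines) are skipped.
    """
    pairs = []
    for line in text.splitlines():
        line = line.strip()
        if ":::" not in line:
            continue
        formula, statement = line.split(":::", 1)
        pairs.append((formula.strip(), statement.strip()))
    return pairs


# Sample text built from the two facts used as examples in this section.
sample = """
Facts:
Age(Peter, 31) ::: Peter's age is 31.
MadeOfIron(Nails, True) ::: Nails are made of iron.
"""

pairs = parse_grounded_lines(sample)
formulas = [formula for formula, _ in pairs]
print(formulas)
```

Only the formula half is passed to the solver; the natural-language half serves as grounding for the LLM and as a readable trace for debugging.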

Symbolic Reasoner
After the problem formulator parses the problem P and the goal G into symbolic representations P̂ and Ĝ, we call a deterministic external solver depending on the task, to obtain the answer A. Table 1 summarizes the symbolic solvers we use for each type of logical reasoning problem.
LP System. For deductive reasoning, we incorporate the Pyke expert system (Frederiksen, 2008), which makes inferences based on the logic programming language. In response to a query, Pyke first creates a knowledge base, populating it with known facts and rules. Subsequently, it applies forward- and backward-chaining algorithms to infer new facts and substantiate the goal.
FOL Prover. We use Prover9 as the FOL inference engine. Prover9 is an automated theorem prover that supports first-order logic and equational logic. It initially converts FOL statements to conjunctive normal form (CNF) and then performs resolution (Robinson, 1965) on the CNF to deduce whether a conclusion is true, false, or unknown.
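The true/false/unknown decision via resolution can be illustrated with a propositional sketch. This is a deliberate simplification and not how Prover9 is invoked: real first-order resolution needs unification over variables, whereas here every formula is already grounded and written by hand in CNF. The clause encoding (frozensets of string literals, `-` for negation) and the function names are assumptions of this example; the premises are a grounded reading of the "Black Mirror" problem from the case study.

```python
from itertools import combinations


def negate(literal):
    return literal[1:] if literal.startswith("-") else "-" + literal


def resolve(c1, c2):
    """Return all non-tautological resolvents of two clauses."""
    resolvents = []
    for literal in c1:
        if negate(literal) in c2:
            resolvent = (c1 - {literal}) | (c2 - {negate(literal)})
            if not any(negate(l) in resolvent for l in resolvent):
                resolvents.append(frozenset(resolvent))
    return resolvents


def unsatisfiable(clauses):
    """Saturate under resolution; deriving the empty clause means unsatisfiable."""
    clauses = set(clauses)
    while True:
        new = set()
        for c1, c2 in combinations(clauses, 2):
            for resolvent in resolve(c1, c2):
                if not resolvent:
                    return True
                new.add(resolvent)
        if new <= clauses:
            return False
        clauses |= new


def prove(premises, goal):
    if unsatisfiable(premises | {frozenset({negate(goal)})}):
        return "True"      # premises entail the goal
    if unsatisfiable(premises | {frozenset({goal})}):
        return "False"     # premises entail the negation of the goal
    return "Unknown"


# Grounded for "Black Mirror" only; atom names are chosen for this sketch.
premises = {
    frozenset({"-Popular", "Binge"}),     # popular -> Karen binge-watches it
    frozenset({"-Binge", "Download"}),    # binge-watches <-> downloads (two clauses)
    frozenset({"-Download", "Binge"}),
    frozenset({"-Download"}),             # Karen does not download it
    frozenset({"-Binge", "Shares"}),      # binge-watches -> shares it to Lisa
}

print(prove(premises, "Popular"), prove(premises, "Shares"))
```

The two refutation attempts correspond to the three possible labels: the goal is provable, its negation is provable, or neither is.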
CSP Solver. Solving a CSP is to find value assignments for all variables that satisfy all given constraints. Commonly used algorithms for this task include backtracking, constraint propagation, and local search variants. To this end, we incorporate the python-constraint package, which offers solvers for CSPs over finite domains.
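As a sketch of the backtracking approach mentioned above, the following standard-library example solves the antique car problem from Figure 2. It does not use python-constraint; the `backtrack` function and the encoding (position 1 is the oldest, 3 the newest, so "second-newest" among three vehicles is position 2) are assumptions made for this illustration. Each constraint is a (scope, relation) pair, following the ⟨t, R⟩ form given earlier.

```python
from itertools import combinations


def backtrack(variables, domains, constraints, assignment=None):
    """Depth-first search that checks each constraint once its scope is assigned."""
    assignment = {} if assignment is None else assignment
    if len(assignment) == len(variables):
        return dict(assignment)
    var = next(v for v in variables if v not in assignment)
    for value in domains[var]:
        assignment[var] = value
        consistent = all(relation(assignment)
                         for scope, relation in constraints
                         if all(v in assignment for v in scope))
        if consistent:
            result = backtrack(variables, domains, constraints, assignment)
            if result is not None:
                return result
        del assignment[var]
    return None


variables = ["tractor", "convertible", "minivan"]
domains = {v: [1, 2, 3] for v in variables}   # 1 = oldest, 3 = newest

constraints = [
    (("tractor",), lambda a: a["tractor"] == 2),                               # second-newest
    (("minivan", "convertible"), lambda a: a["minivan"] > a["convertible"]),   # newer than
]
# Each vehicle occupies a distinct position.
constraints += [((x, y), lambda a, x=x, y=y: a[x] != a[y])
                for x, y in combinations(variables, 2)]

solution = backtrack(variables, domains, constraints)
print(solution)
```

The resulting assignment matches the one quoted in the Result Interpreter section, where the convertible has position 1.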
SAT Solver. For solving SAT problems, we use the Z3 theorem prover (de Moura and Bjørner, 2008), a satisfiability modulo theories (SMT) solver developed by Microsoft. The SMT solver provides algorithms to determine whether a set of mathematical formulas is satisfiable. It generalizes the SAT problems to more complex formulas involving real numbers, integers, and various data structures such as lists, arrays, bit vectors, and strings. Many real-world analytical reasoning problems can be represented as problems of solving a system of equations.

Self-Refiner
For complex problems, generating the correct logical form may become challenging for LLMs. To address this, we introduce a self-refinement module that learns to modify inaccurate logical formulations using the error messages from the symbolic reasoner as feedback. Recent works (Chen et al., 2023; Madaan et al., 2023) have adopted similar ideas to improve code generation, by teaching LLMs to debug their predicted programs via few-shot demonstrations. Here we extend this idea to refine generated logic representations. If the symbolic solver returns an execution error, we instruct the LLM to refine the incorrect logical form, by prompting it with the erroneous logical form, the solver's error message, and a set of demonstrations showing common error cases (e.g., a free variable is not bound by any quantifier in FOL) and their remedies. We run this process iteratively until either no error messages are returned, or the maximum number of allowable revisions is reached.
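The refinement loop described above can be sketched as follows. This is a hedged illustration rather than the authors' code: `refine_until_executable` is a hypothetical name, the default of three rounds is an assumption, and the toy solver and toy reviser (which only check and repair unbalanced parentheses) stand in for a real symbolic solver and an LLM prompted with the error message.

```python
def refine_until_executable(formulation, run_solver, revise, max_rounds=3):
    """Re-run the solver, revising the formulation after each error.

    Returns (final_formulation, answer_or_None, rounds_used).
    """
    for round_idx in range(max_rounds + 1):
        answer, error = run_solver(formulation)
        if error is None:
            return formulation, answer, round_idx
        if round_idx < max_rounds:
            formulation = revise(formulation, error)
    return formulation, None, max_rounds


def toy_run_solver(formulation):
    # Stand-in for a symbolic solver: only checks that parentheses balance.
    if formulation.count("(") != formulation.count(")"):
        return None, "SyntaxError: unbalanced parentheses"
    return "executed", None


def toy_revise(formulation, error):
    # Stand-in for the LLM call that sees the formulation and the error message.
    missing = formulation.count("(") - formulation.count(")")
    return formulation + ")" * max(missing, 0)


result = refine_until_executable("Popular(blackMirror", toy_run_solver, toy_revise)
print(result)
```

Note that the loop stops as soon as the solver runs without error; it has no way of telling whether the now-executable formulation is also semantically correct.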

Result Interpreter
Finally, the result interpreter translates the results returned from the symbolic solver back to a natural language answer. For certain problems, this can be achieved through predefined rules; for example, mapping Entailment to true. However, this process can be more complex for CSPs, e.g., translating {convertible: 1, tractor: 2, minivan: 3} to "the convertible is the oldest.". To handle these varying levels of complexity, we designed both rule-based and LLM-based result interpreters. Details of the result interpreter are given in Appendix D.
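A rule-based interpreter for the two cases above might look like the following sketch. Only the Entailment-to-true mapping comes from the text; the other label names, the function names, and the option texts are assumptions for illustration, and Appendix D describes the interpreters actually used.

```python
# Only "Entailment" -> "true" is stated in the text; the other entries are assumed.
LABEL_MAP = {"Entailment": "true", "Contradiction": "false", "Unknown": "unknown"}


def interpret_label(solver_output):
    return LABEL_MAP[solver_output]


def interpret_ordering(assignment, options):
    """Turn a CSP assignment (1 = oldest) into a sentence and match it to an option."""
    oldest = min(assignment, key=assignment.get)
    sentence = f"The {oldest} is the oldest."
    for letter, text in options.items():
        if text == sentence:
            return letter, sentence
    return None, sentence


options = {
    "A": "The tractor is the oldest.",
    "B": "The convertible is the oldest.",
    "C": "The minivan is the oldest.",
}
choice = interpret_ordering({"convertible": 1, "tractor": 2, "minivan": 3}, options)
print(interpret_label("Entailment"), choice)
```

Exact string matching against the options is brittle, which is one reason an LLM-based interpreter can be preferable for free-form answer choices.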

Experiments
Datasets. We evaluate LOGIC-LM on five common logical reasoning datasets, as follows.
PrOntoQA (Saparov and He, 2023) is a recent synthetic dataset created to analyze the capacity of LLMs for deductive reasoning. We use the hardest fictional characters version of the dataset, based on the results in Saparov and He (2023). Each version is divided into different subsets depending on the number of reasoning hops required. We use the hardest 5-hop subset for evaluation. Each question in PrOntoQA aims to validate a new fact's veracity, such as "True or false: Alex is not shy.".
ProofWriter (Tafjord et al., 2021) is another commonly used dataset for deductive logical reasoning. Compared with PrOntoQA, the problems are expressed in a more naturalistic language form. We use the open-world assumption (OWA) subset in which each example is a (problem, goal) pair and the label is one of {PROVED, DISPROVED, UNKNOWN}. The dataset is divided into five parts, each part requiring 0, ≤ 1, ≤ 2, ≤ 3, and ≤ 5 hops of reasoning, respectively. We evaluate the hardest depth-5 subset. To reduce overall experimentation costs, we randomly sample 600 examples in the test set and ensure a balanced label distribution.
FOLIO (Han et al., 2022) is a challenging expert-written dataset for logical reasoning. The problems are mostly aligned with real-world knowledge and use highly natural wordings, and the questions require complex first-order logic reasoning to solve. We use the entire FOLIO test set for evaluation, consisting of 204 examples.
LogicalDeduction is a challenging logical reasoning task from the BigBench (Srivastava et al., 2022) collaborative benchmark. The problems are mostly about deducing the order of a sequence of objects from a minimal set of conditions. We use the full test set consisting of 300 examples.
AR-LSAT (Zhong et al., 2022) is a dataset that collects all analytical logic reasoning questions from the Law School Admission Test from 1991 to 2016. We use the test set, which has 231 multiple-choice questions. AR-LSAT is particularly challenging, with state-of-the-art models only achieving performance slightly better than random guessing (Liang et al., 2022; Ribeiro et al., 2023a).
Baselines. We compare our model against two baselines that depend solely on LLMs for logical reasoning: 1) Standard LLMs, which leverage in-context learning to directly answer the question; and 2) Chain-of-Thought (CoT) (Wei et al., 2022b), which adopts a step-by-step problem-solving approach, generating explanations before providing the final answer. We separately evaluate the settings in which ChatGPT (gpt-3.5-turbo), GPT-3.5 (text-davinci-003) (Ouyang et al., 2022a), and GPT-4 (gpt-4) (OpenAI, 2023) serve as the underlying LLMs for all models. To ensure fair comparisons, we use the same in-context examples for all models. For reproducible results, we set the temperature to 0 and select the response with the highest probability from LLMs. Since all examples are formed as multiple-choice questions, we evaluate model performance based on the accuracy of selecting the correct answer.

Main Results
We report the results of LOGIC-LM (without self-refinement) and baselines in Table 2. For LOGIC-LM, a symbolic solver does not return an answer when there are grammar errors in the symbolic formulation. For these un-executable cases, we fall back on using chain-of-thought to predict the answer. We have three major observations.
1. Logic-LM significantly outperforms standard LLMs and CoT across all datasets. With GPT-3.5, our method outperforms standard LLM on all datasets, with an average improvement of 39.2%. This highlights the benefit of combining LLMs with external symbolic solvers for logical reasoning. LOGIC-LM also improves CoT by a large margin of 18.4% on average, showing that offloading the reasoning to symbolic solvers greatly improves faithfulness compared with pure language-based reasoning with CoT.
2. GPT-4 outperforms GPT-3.5 by a large margin of 48.46% on average for the standard prompting. This aligns with the assertion that the main enhancement of GPT-4 lies in its ability to carry out complex reasoning (OpenAI, 2023). Although this may indicate that the logical reasoning capability can be boosted by scaling up the LLM, we observe that GPT-4 still makes numerous unfaithful reasoning errors. By delegating the reasoning to symbolic solvers, our method can further improve GPT-4 by an average of 24.98% and 10.44% for standard prompting and CoT prompting, respectively.
3. While integrating CoT generally enhances LLM performance, we find its benefits comparatively less substantial or even negative on FOLIO, LogicalDeduction, and AR-LSAT, with changes of 11.75%, 9.41%, and -3.2%, respectively. In contrast, the benefits of CoT on PrOntoQA and ProofWriter are 51.59% and 33.82%, respectively. A plausible explanation is that CoT emulates human forward-chain reasoning: beginning with known facts and sequentially deriving new conclusions until the goal is met. This reasoning style aligns well with problems in the PrOntoQA and ProofWriter datasets. However, FOL and CSP problems often necessitate more sophisticated reasoning strategies that are "non-linear" compared to standard forward-chain reasoning. These include hypothesizing, conditioning, recursive inference, and the process of elimination. Compared to CoT, the integration of symbolic solvers is better suited to these reasoning styles, hence yielding a more marked improvement on FOLIO (+21.85%), LogicalDeduction (+45.67%), and AR-LSAT (+24.14%).

Effectiveness of Problem Formulator
We then evaluate how well the LLM can translate a given problem into the symbolic formulation used by each symbolic solver. For each dataset, we report the percentage of symbolic formulations that are executable by the corresponding symbolic solver (Exe_Rate). Generally, the LLM demonstrates high proficiency in transcribing problems into symbolic formats, evidenced by its near 100% Exe_Rate on PrOntoQA, ProofWriter, and LogicalDeduction. However, the high performance on these datasets is somewhat anticipated, given that their problems are mostly synthetically generated, limiting language variability. When it comes to datasets comprising real-world, expertly crafted problems, such as FOLIO and AR-LSAT, GPT-4's performance is notably less promising, with Exe_Rate scores of 79.9% and 32.6%, respectively. This discrepancy underscores the inherent challenges associated with converting real-world problems into their logical equivalents.
Exe_Rate only reflects the grammatical correctness of the logical form. We also report the accuracy of the executable samples (Exe_Acc) to measure the semantic correctness. We find that logical forms generated by GPT-4 generally achieve high Exe_Acc, even for the most challenging AR-LSAT dataset. Such performance accentuates the potential of symbolic solvers in bolstering the model's logical reasoning prowess, contingent on the precise translation of problems into symbolic forms.
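The relationship between Exe_Rate, Exe_Acc, and the overall accuracy with chain-of-thought fallback can be made explicit with a short sketch. The record layout (`solver_answer`, `cot_answer`, `gold`) and the five toy records are invented for illustration and are not data from the paper.

```python
def summarize(records):
    """Compute (Exe_Rate, Exe_Acc, overall accuracy with CoT fallback)."""
    executable = [r for r in records if r["solver_answer"] is not None]
    exe_rate = len(executable) / len(records)
    exe_acc = (sum(r["solver_answer"] == r["gold"] for r in executable) / len(executable)
               if executable else 0.0)
    # Un-executable formulations fall back on the chain-of-thought prediction.
    final = [r["solver_answer"] if r["solver_answer"] is not None else r["cot_answer"]
             for r in records]
    accuracy = sum(pred == r["gold"] for pred, r in zip(final, records)) / len(records)
    return exe_rate, exe_acc, accuracy


# Toy records: solver_answer is None when the formulation failed to execute.
records = [
    {"solver_answer": "A", "cot_answer": "B", "gold": "A"},
    {"solver_answer": "B", "cot_answer": "B", "gold": "B"},
    {"solver_answer": "C", "cot_answer": "A", "gold": "A"},
    {"solver_answer": None, "cot_answer": "B", "gold": "B"},
    {"solver_answer": None, "cot_answer": "A", "gold": "C"},
]

exe_rate, exe_acc, accuracy = summarize(records)
print(exe_rate, round(exe_acc, 3), accuracy)
```

A low Exe_Rate with a high Exe_Acc, as reported for AR-LSAT, means the overall accuracy is dominated by the fallback predictions rather than by the solver.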

Robustness of Reasoning
Incorporating symbolic solvers also leads to more robust reasoning. To illustrate this, we report the performance of LOGIC-LM and baselines for questions of varying complexity levels. We randomly selected 300 examples from each subset of ProofWriter, ensuring a balanced label distribution. The problems in these subsets require 0, ≤ 1, ≤ 2, ≤ 3, and ≤ 5 hops of reasoning, respectively. The results, shown in Figure 3, indicate that LOGIC-LM becomes increasingly effective as the required reasoning depth increases.
[Figure 3. Legend: CoT (GPT-3.5), Logic-LM (GPT-3.5), CoT (GPT-4), Logic-LM (GPT-4).]

Impact of Self-Refinement
In Table 3, we find that self-refinement is effective in fixing the un-executable symbolic formulations, increasing the Exe_Rate by 5.01 on average. For an in-depth analysis, we then evaluate the accuracy and Exe_Rate across different rounds of self-refinement on FOLIO, namely, 0 (no refinement), 1, 2, and 3 rounds. The results are in Figure 4. We find that as the rounds of self-refinement increase, the percentage of executable formulations consistently increases, leading to an enhancement in the final performance. This suggests that self-refinement serves as an effective tool in aiding the LLM to accurately frame the problem. However, the accuracy tends to stagnate in subsequent rounds, even though the Exe_Rate continues to increase. This can be attributed to the type of feedback received by the self-refiner, which is the error message from the symbolic solver. This feedback aids in converting "invalid" symbolic representations into valid ones. However, a valid symbolic representation does not necessarily equate to a "correct" problem formulation that accurately represents the problem. This issue could be tackled by enhancing the self-refiner to incorporate feedback beyond the error message, e.g., a reward signal from an additional module evaluating the accuracy of a generated symbolic form. We leave this as a promising direction for future exploration.
[Figure 5 (recovered text)]
Problem: "Stranger Things" is a popular Netflix show. If a Netflix show is popular, Karen will binge-watch it. If and only if Karen binge-watches a Netflix show, she will download it. Karen does not download "Black Mirror". "Black Mirror" is a Netflix show. If Karen binge-watches a Netflix show, she will share it to Lisa.
Question: Is the following statement true, false, or uncertain? "Black Mirror" is popular.
Conclusion: Popular(blackMirror) # "Black Mirror" is popular.
Predicted answer: B

Case Study
In Figure 5, we show an example of the symbolic representations generated by GPT-4, together with the predicted answer. In general, LOGIC-LM has demonstrated a potent capacity to interpret complex problems into symbolic forms. Nonetheless, there remain certain difficulties in accurately understanding the semantics of the problem.
We further analyze some error cases in Figure 6 of Appendix E. Example 1 shows a case where GPT-4 generates an incorrect FOL representation, stemming from its inability to define appropriate predicates. Here, instead of creating the predicate EasternWildTurkey, the model generates a constant, WildTurkey(eastern), in which WildTurkey is the predicate and eastern is the constant. While this representation is valid in isolation, it does not interact well with subsequent constants. This inconsistency is a recurring issue in GPT-4's symbolic form generation, illustrating that the model sometimes struggles to maintain an overarching understanding of the problem when forming logical symbols. Example 3 highlights a case where GPT-4 struggles to interpret specific expressions accurately. In this case, the model fails to distinguish between the meanings of "below" and "above", resulting in an incorrect constraint Dan > Eve. Example 4 exemplifies GPT-4's challenge with fully grasping the rules of FOL grammar, evidenced by the invalid generated formula: Rating(subway, y) ∧ y > 9. These error cases underscore that transforming problems into logical forms remains a challenging task for modern LLMs, due to the intricacies of FOL formulation, the innate flexibility of natural language, and the complexity of global problem comprehension.

Conclusion and Future Work
In this work, we propose a novel approach to address logical reasoning problems by combining large language models with symbolic solvers. We introduce Logic-LM, one instantiation of such a framework, and demonstrate how it significantly improves performance over pure LLMs and chain-of-thought prompting techniques.
While Logic-LM has proven to be a capable system, it can be further improved by extending it to more flexible and powerful logic systems. For example, statistical relational learning (SRL) systems such as Markov logic networks (Richardson and Domingos, 2006) and probabilistic soft logic (Bach et al., 2017) have demonstrated great promise in reasoning under uncertainty, and integration with our framework would enable even more adaptive problem-solving capabilities. Additionally, our method can be extended to reasoning problems requiring commonsense, which remains a significant challenge as they often require reasoning over complex and ambiguous rules.

Limitations
We identify two main limitations of LOGIC-LM. First, LOGIC-LM relies on translating reasoning problems into logical formats that can be tackled by symbolic solvers. As a consequence, the model's applicability is inherently bounded by the expressiveness of the symbolic solver; for example, not all problems can be easily encoded in first-order logic. Nevertheless, this limitation can be mitigated by integrating a more diverse set of symbolic solvers. The flexible design of LOGIC-LM facilitates this integration. The wide range of reasoning tasks on which we can instantiate our LOGIC-LM framework shows its general applicability.
Second, LOGIC-LM depends on in-context learning coupled with self-refinement to convert a natural language (NL) problem into the symbolic representation. While this method has proven to be effective, it may face difficulties when dealing with logical representations with intricate grammar structures, such as probabilistic soft logic. This arises from the difficulty in conveying complex grammatical rules to the language model through a limited number of demonstrations within a constrained context size. As a potential solution, future work could explore the development of specialized modules to enhance the mapping between NL and symbolic language, e.g., fine-tuning LLMs with synthetic data generated via symbolic solvers.

Ethics Statement
The use of large language models requires a significant amount of energy for computation for training, which contributes to global warming (Strubell et al., 2019). Our work performs few-shot in-context learning instead of training models from scratch, so the energy footprint of our work is smaller. The large language models whose API we use for inference, especially GPT-4, consume significant energy.

A Syntax for First-order Logic (FOL)

Chain-of-Thought Prompting
Task Description: Given a problem statement as contexts, the task is to answer a logical reasoning question.

Context:
The following paragraphs each describe a set of five objects arranged in a fixed order.
The raven is the third from the left.
Question: Which of the following is true?

Options:
A) The quail is the rightmost.
B) The owl is the rightmost.
C) The raven is the rightmost.
D) The falcon is the rightmost.
E) The robin is the rightmost.

Reasoning:
The owl is the leftmost. This means owl is not the rightmost. (• • • more reasoning here • • •) This means raven is also not the rightmost. So, the answer is: A) The quail is the rightmost.

Logic-LM
Task Description: You are given a problem description. The task is to parse the problem as a constraint satisfaction problem, defining the domain, variables, and constraints.

Context:
The following paragraphs each describe a set of three objects arranged in a fixed order.
The minivan is newer than the convertible.
Question: Which of the following is true?

Options:
A) The station wagon is the second-newest.
B) The convertible is the second-newest.
C) The minivan is the second-newest.

Domain:
1: oldest
3: newest

Question: Which one of the following CANNOT be true of the week's tour schedule?

Options:
A) The division that is toured on Monday is also toured on Tuesday.
B) The division that is toured on Monday is also toured on Friday.
C) The division that is toured on Tuesday is also toured on Thursday.
D) The division that is toured on Wednesday is also toured on Friday.
E) The division that is toured on Thursday is also toured on Friday.
The correct option is: C

Chain-of-Thought Prompting
Task Description: Given a problem statement as contexts, the task is to answer a logical reasoning question.
Context: During a single week, from Monday through Friday, tours will be conducted of a company's three divisions: Operations, Production, and Sales. Exactly five tours will be conducted that week, one each day. (• • • more context here • • •) If the Operations division is toured on Thursday, then the Production division is toured on Friday.
Question: Which one of the following CANNOT be true of the week's tour schedule?

Options:
A) The division that is toured on Monday is also toured on Tuesday.
B) The division that is toured on Monday is also toured on Friday.
C) The division that is toured on Tuesday is also toured on Thursday.
D) The division that is toured on Wednesday is also toured on Friday.
E) The division that is toured on Thursday is also toured on Friday.

D Result Interpreter Implementation
For PrOntoQA and ProofWriter, the Pyke logic programming engine returns the inferred value of the variable in the query, or Unknown if the variable cannot be determined. For example, for the query ConductElectricity(Nail, x), Pyke may return x = True. Comparing this with the goal statement ConductElectricity(Nail, False), we conclude that the goal to be proved is False.
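The comparison described above can be sketched in a few lines of Python. This is only an illustration: the function name `interpret_query_result` is hypothetical, the solver's output is represented as a plain `Optional[bool]` rather than an actual Pyke result object, and no Pyke API is called.

```python
from typing import Optional


def interpret_query_result(inferred: Optional[bool], goal_value: bool) -> str:
    """Compare the value inferred by the solver with the value asserted in the goal.

    `inferred` is None when the solver cannot determine the query variable
    (an assumption about how the solver output is represented here).
    """
    if inferred is None:
        return "Unknown"
    # The goal holds only if the inferred value matches the asserted one.
    return "True" if inferred == goal_value else "False"


# Example from the text: the solver infers x = True for ConductElectricity(Nail, x),
# while the goal statement is ConductElectricity(Nail, False).
label = interpret_query_result(True, False)
print(label)
```

Under these assumptions, the example from the text yields the label False, since the inferred value contradicts the value asserted in the goal.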
For FOLIO, the FOL inference engine directly returns the veracity label of the goal as ENTAILMENT, CONTRADICTION, or CONTINGENT, which can be mapped to True, False, and Unknown, respectively. For LogicalDeduction, the solver returns all the possible value assignments in an array. We write rules to parse each option into the corresponding value and check whether it is in the generated array. For AR-LSAT, we attempt to separately prove each option to find the correct answer.
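A minimal Python sketch of these three interpretation rules follows. All function names, the dictionary-based representation of assignments, and the `prove` callable are assumptions made for illustration; they stand in for the actual solvers and parsing rules, which are not shown here. The sample assignment is likewise hypothetical, chosen only to be consistent with the constraint "the minivan is newer than the convertible".

```python
from typing import Callable, Dict, List, Optional, Tuple

# Label mapping stated in the text for FOLIO.
FOLIO_LABELS: Dict[str, str] = {
    "ENTAILMENT": "True",
    "CONTRADICTION": "False",
    "CONTINGENT": "Unknown",
}


def interpret_folio(label: str) -> str:
    """Map the inference engine's veracity label to an answer."""
    return FOLIO_LABELS[label]


def pick_from_assignments(
    options: Dict[str, Tuple[str, int]],
    solutions: List[Dict[str, int]],
) -> Optional[str]:
    """Return the first option whose (variable, value) pair occurs in a solution."""
    for letter, (variable, value) in options.items():
        if any(s.get(variable) == value for s in solutions):
            return letter
    return None


def pick_by_proving(
    options: Dict[str, str], prove: Callable[[str], bool]
) -> Optional[str]:
    """Try to prove each option separately; `prove` is a stand-in for a solver call."""
    for letter, statement in options.items():
        if prove(statement):
            return letter
    return None


# Hypothetical solver output (1 = oldest, 3 = newest) and parsed options,
# where "second-newest" is parsed to the value 2.
solutions = [{"station_wagon": 1, "convertible": 2, "minivan": 3}]
options = {"A": ("station_wagon", 2), "B": ("convertible", 2), "C": ("minivan", 2)}
answer = pick_from_assignments(options, solutions)
print(answer)
```

Passing the prover as a callable keeps the option loop independent of any particular solver, which mirrors the modular design described for the framework.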

Figure 1: Overview of our LOGIC-LM framework.

Figure 2: Overview of our LOGIC-LM model, which consists of three modules: (1) Problem Formulator generates a symbolic representation for the input problem with LLMs via in-context learning, (2) Symbolic Reasoner performs logical inference on the formulated problem, and (3) Result Interpreter interprets the symbolic answer.
We convert all examples into a standard multiple-choice format, comprising a problem statement, a question, and potential answers, as shown in Figure 2. We also select 1-5 examples from the training set of each dataset as in-context examples. Detailed data statistics are in Appendix B.

Figure 5: An example of the generated symbolic representation and the predicted answer by LOGIC-LM.
Max is a yumpus. Each yumpus is a dumpus. (• • • more reasoning here • • •) Tumpuses are not sour. So Max is not sour.
The correct option is: B

Logic-LM
Task Description: You are given a problem description and a question. The task is to:
1) define all the predicates in the problem
2) parse the problem into logic rules based on the defined predicates
3) write all the facts mentioned in the problem
4) parse the question into the logic form
Context: Each jompus is fruity. (• • • more context here • • •) Rompuses are zumpuses. Alex is a tumpus.
Question: True or false: Alex is not shy.
Predicates:
Jompus($x, bool) ::: Does x belong to Jompus?
(• • • more predicates here • • •)
Zumpus($x, bool) ::: Does x belong to Zumpus?

The following paragraphs each describe a set of seven objects arranged in a fixed order. (• • • more context here • • •) Eve finished below Ada. Rob finished below Joe.
Question: Which of the following is true?
A) finished third.
B) Eve finished third.
C) Ada finished third.
D) Dan finished third.
E) Rob finished third.
F) Amy finished third.
G) Joe finished third.
The correct option is: A

Table 1: A summary of the symbolic formulations (with examples) and symbolic solvers we use for the five datasets in our study, representing four different types of logical reasoning problems.

Table 3: Analysis of accuracy and execution status of LOGIC-LM. We present the percentage of executable logical formulations (Exe_Rate) together with the accuracy of the execution (Exe_Acc). SR represents before (−) and after (+) self-refinement.
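The two metrics in this caption can be illustrated with a short Python sketch. It assumes that Exe_Acc is computed only over the formulations that executed successfully, and that a non-executable formulation is recorded as `None`; the function name `execution_stats` and the sample records are hypothetical and are not results from the paper.

```python
from typing import List, Optional, Tuple


def execution_stats(
    records: List[Tuple[Optional[str], str]]
) -> Tuple[float, float]:
    """Compute (Exe_Rate, Exe_Acc) from (solver_answer, gold_answer) pairs.

    A solver_answer of None marks a formulation that failed to execute.
    """
    executed = [(pred, gold) for pred, gold in records if pred is not None]
    # Exe_Rate: fraction of formulations the solver could execute.
    exe_rate = len(executed) / len(records) if records else 0.0
    # Exe_Acc: accuracy restricted to the executable formulations (assumption).
    correct = sum(pred == gold for pred, gold in executed)
    exe_acc = correct / len(executed) if executed else 0.0
    return exe_rate, exe_acc


# Made-up records for illustration only.
records = [("A", "A"), (None, "B"), ("C", "B"), ("True", "True")]
exe_rate, exe_acc = execution_stats(records)
print(exe_rate, exe_acc)
```

Here three of the four made-up formulations execute and two of those three are answered correctly.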

Table 3, we report the percentage of symbolic formulations that are executable by the corresponding symbolic solver for
Context:
Question: Is the following statement true or false? Max is sour.