Maieutic Prompting: Logically Consistent Reasoning with Recursive Explanations

Pre-trained language models (LMs) struggle with consistent reasoning; recently, prompting LMs to generate explanations that self-guide the inference has emerged as a promising direction to amend this. However, these approaches are fundamentally bounded by the correctness of explanations, which themselves are often noisy and inconsistent. In this work, we develop Maieutic Prompting, which aims to infer a correct answer to a question even from the unreliable generations of an LM. Maieutic Prompting induces a tree of explanations abductively (e.g. X is true, because ...) and recursively, then frames the inference as a satisfiability problem over these explanations and their logical relations. We test Maieutic Prompting for true/false QA on three challenging benchmarks that require complex commonsense reasoning. Maieutic Prompting achieves up to 20% better accuracy than state-of-the-art prompting methods, and as a fully unsupervised approach, performs competitively with supervised models. We also show that Maieutic Prompting improves robustness in inference while providing interpretable rationales.


Introduction
Following the remarkable success of few-shot prompting powered by large language models (e.g. Brown et al., 2020), recent studies on prompting methods suggest that LMs' reasoning capability can be further promoted by generating a sequence of explanations for a given problem prior to inferring the answer (Liu et al., 2021). The so-called explanation-based prompting helps an LM better elicit its knowledge and reason by leveraging its own generated explanations, whether it be a commonsense knowledge statement (Liu et al., 2021), a solution for a math word problem (Wei et al., 2022), or a scratchpad for intermediate computation (Nye et al., 2021a).

[Figure 1: Examples of logically inconsistent explanations and answers generated by an LM. Type I (41%): the explanation contradicts the inferred answer, e.g. "Smoke is not the source of fire?" answered False with the explanation "Smoke is a result of fire."; the model also infers True for both "One is a number that comes before zero?" and "One is a number that comes after zero?", and answers "Butterflies have 4 wings?" as False while itself explaining that "Butterflies have 2 wings on each side of their body."]
Explanation-based prompting is intuitively motivated by the steps of reasoning humans typically employ while solving a problem (Hausmann and VanLehn, 2007). However, we find that this intuition is faulty in practice, as model-generated explanations are often logically inconsistent and unreliable. For example, we manually inspected 100 samples from a QA task (Figure 1) and found that for a considerable number of cases, (1) the generated explanation does not logically lead to the inferred answer, (2) the model infers the same label for a statement and its negation (Kassner and Schütze, 2020), and (3) the model falsifies its own generated explanation. These findings raise fundamental questions about the role of explanations in LM reasoning: if the explanation is correct, is there a guarantee that the LM will infer a label that is consistent with the explanation? And if the explanation is wrong, is there a way to make use of even the wrong explanation in inferring the correct answer?

[Figure 2: An overview of MAIEUTIC PROMPTING. Given a question Q, we generate a maieutic tree consisting of abductive and recursive explanations, define the relations between them, and employ MAX-SAT to find the best truth-value assignments to the explanations and Q.]
To this end, we propose MAIEUTIC PROMPTING, a novel few-shot inference method that infers a correct answer by enumerating a structure of explanations, possibly noisy and contradictory, and resolving them with a symbolic inference algorithm. Inspired by the maieutic method1 of Socrates, MAIEUTIC PROMPTING induces the LM to generate abductive explanations for diverse hypotheses with deep recursive reasoning, and collectively identifies and eliminates contradicting candidates, resulting in consistent answers. Figure 2 shows an overview of MAIEUTIC PROMPTING. First, we prompt the LM to abductively (Peirce, 1974) rationalize both possible answers, True and False, rather than generating a single explanation and then connecting it to one of the answer choices. Moreover, we do not expect the 1-hop explanations to always be correct; thus, we further validate the LM's confidence in its explanations by recursively prompting the model with its own generation as the question. Our generation process derives a tree structure of generated propositions, where one proposition establishes a logical ground for the correctness of another.
To infer the answer for the original question, we quantify the strength of the LM's belief in each proposition and the logical relationships between propositions in the maieutic tree. We then employ the weighted MAX-SAT (Battiti, 2009) solver to collectively infer the truth-values of all the propositions (including the original question) that best satisfy the set of observed relations. This way, we symbolically induce the subset of generations that makes the most probable and consistent inference.
1 The maieutic method "brings out definitions implicit in the interlocutor's beliefs, ... is a method of hypothesis elimination, steadily identifying and eliminating those that lead to contradictions" (Vlastos, 1991).
Our experimental results show that the performance of MAIEUTIC PROMPTING exceeds that of all the few-shot prompting baselines (e.g. Chain of Thought; Wei et al., 2022) on three commonsense reasoning and fact verification benchmarks. Using a small NLI model to infer the relations between propositions, MAIEUTIC PROMPTING performs up to 20% better than other prompting methods, and performs on par with or even better than fine-tuned models. Further analyses show that MAIEUTIC PROMPTING is robust to perturbations in both the questions and prompts, and offers an interpretable interface for understanding the rationale behind the model's inference.

Problem Setup and Background
Our goal is to infer whether a given statement Q makes sense, i.e. to infer the truth value A of Q. Conventionally, this can be done by prompting a language model in one of the following two ways:

Standard Prompting Let Q be a statement whose truth value (either True or False) we want to infer. In standard k-shot prompting, the model-inferred answer Â is defined as:

Â = argmax_{A ∈ {True, False}} p(A | C, Q)

where C = {(q_1, a_1), · · · , (q_k, a_k)} denotes the few-shot examples for in-context learning.
Explanation-based Prompting In explanation-based prompting, the inference process is factorized into two steps: the model first generates an explanation Ê = argmax_E p(E | C, Q) for the given statement, then infers the answer conditioned on both the statement and the explanation, Â = argmax_{A ∈ {True, False}} p(A | C, Q, Ê).
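For concreteness, the two schemes differ only in whether an explanation slot precedes the answer. The following is a minimal sketch, where complete() stands in for any LM completion call and the demonstration examples are hypothetical, not the exact prompts used in our experiments:

```python
# A minimal sketch of the two baseline prompting schemes.
# `complete(prompt)` stands in for any LM completion API call.

FEW_SHOT = [  # hypothetical demonstrations (q_i, e_i, a_i)
    ("A duck has two legs?", "Ducks are birds, and birds have two legs.", "True"),
    ("The sun orbits the Earth?", "The Earth orbits the sun, not vice versa.", "False"),
]

def standard_prompt(q: str) -> str:
    # k-shot context C = {(q_i, a_i)} followed by the query Q.
    demos = "\n".join(f"Q: {qi}\nA: {ai}" for qi, _, ai in FEW_SHOT)
    return f"{demos}\nQ: {q}\nA:"

def explanation_prompt(q: str) -> str:
    # Step 1: elicit an explanation E; the answer A is then inferred
    # conditioned on both Q and E in a second call (step 2).
    demos = "\n".join(f"Q: {qi}\nExplanation: {ei}\nA: {ai}" for qi, ei, ai in FEW_SHOT)
    return f"{demos}\nQ: {q}\nExplanation:"
```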

Maieutic Prompting
In this section, we introduce MAIEUTIC PROMPTING, which performs inference over a maieutic tree of generated explanations. First, we introduce logical integrity, a key concept that is used to determine the reliability of propositions.
Language models often generate logically inconsistent propositions; for instance, in Figure 1, the model infers True when prompted with either "One is a number that comes before zero." or "One is a number that comes after zero.". In this sense, p(True|Q) does not provide a reliable value to determine whether Q is true or not. We formalize this idea as logical integrity: a proposition Q is logically integral when the LM consistently infers the truth value of Q and ¬Q (i.e. Q as True and ¬Q as False, or vice versa).
Formally, we define a boolean function integral(E) over two conditions: (1) the LM infers E to be True and ¬E to be False; (2) the LM infers E to be False and ¬E to be True. Then:2

integral(E) = 1 if condition 1 or condition 2 is satisfied, and 0 otherwise.

2 Given E, ¬E can be automatically generated simply by inserting a prefix (e.g. It is wrong to say that), or by prompting the LM to negate the given sentence.
A statement is considered to be logically integral / True when condition 1 is met, and logically integral / False when condition 2 is met. Intuitively, the truth values of logically integral propositions are more credible than non-integral ones, in that the LM is guaranteed to be logically consistent with respect to their negated counterparts. For example, "One is a number that comes before zero." in Figure 1 would not be logically integral, as the model assigns the same truth value to both Q and ¬Q.
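The check can be sketched as follows, assuming a hypothetical helper p_true(s) that returns the LM's probability of answering True when prompted with the statement s as a question:

```python
# A sketch of the logical-integrity check. `p_true(s)` is a hypothetical
# helper returning the LM's probability of answering True for statement s.

def negate(e: str) -> str:
    # ¬E via a simple prefix, as noted in footnote 2; one could also
    # prompt the LM to negate the sentence instead.
    return f"It is wrong to say that {e[0].lower()}{e[1:]}"

def integral(e: str, p_true) -> bool:
    believes_e = p_true(e) > 0.5              # LM labels E as True?
    believes_not_e = p_true(negate(e)) > 0.5  # LM labels ¬E as True?
    # Integral iff E and ¬E receive opposite labels
    # (condition 1: True/False, condition 2: False/True).
    return believes_e != believes_not_e
```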
For the rest of this section, we first search for logically integral propositions by constructing the maieutic tree (Section 3.1); we then quantify the relations between the propositions (Section 3.2), on which basis we infer the final consistent answer (Section 3.3).

Abductive Explanation Generation
Given a question, we require the LM to post-hoc rationalize both the True and False labels. This abductive explanation generation has several advantages over an ad-hoc approach that first generates an explanation and then predicts the label. First, in the ad-hoc setting, the model is required not only to stay relevant to the given question, but also to generate a discriminative explanation that helps in choosing one label over the other. Abductive generation, on the contrary, pushes the model to consider each possible answer rather than discriminating between them, which often reveals an explanation that otherwise would not have been generated. Second, the model is able to incorporate label information into its generation. Intuitively, this may help elicit more specific explanations and mitigate the issue of bland and generic generations that do not help the inference, a well-known weakness in LM-based conditional generation (Adiwardana et al., 2020).
Concretely, we define a function abductive that takes the statement Q as input and outputs a tuple of two abductive explanations generated with True and False given as the answer, respectively:

abductive(Q) = (E_T, E_F)

Figure 2 shows a concrete example of generating E_T given Q. With Q, we prompt the model to rationalize True as the answer: "War cannot have a tie? True, because", which the LM then completes with an explanation: "In a context of war, there's always a victor and a loser.".

[Figure 3: Illustrative example of maieutic tree generation, with the max tree depth set to 2. For visual clarity, we generate only 1 E_T and 1 E_F per question and omit the width-wise spanning of knowledge.]
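A minimal sketch of this function, reusing the hypothetical complete() helper from the earlier sketch:

```python
# A sketch of abductive generation: the same question is rationalized
# under both answer labels. The prompt template mirrors the running
# "Q? True, because ..." example; `complete` is a stand-in LM call.

def abductive(q: str, complete) -> tuple[str, str]:
    stem = q.rstrip("?.")
    e_true = complete(f"{stem}? True, because")
    e_false = complete(f"{stem}? False, because")
    return e_true, e_false
```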

Depth-wise Knowledge Spanning
As substantiated in Figure 1, LM-generated explanations are noisy and inaccurate by nature. Prior works indirectly compensate for the untrustworthy generations by independently sampling multiple generations and then aggregating them at the answer level (e.g., through majority voting; Wang et al., 2022), but it is questionable whether the voting-based aggregation is indeed capable of filtering out only the incorrect explanations. To systematically address this issue, we harness the LM itself to validate its own generations, by recursively prompting the LM with the generated explanations. As Figure 2 shows, this corresponds to a depth-wise spanning of knowledge that induces a maieutic tree, a multi-depth structure of generated propositions and the relations between them. Let S_i denote the set of nodes at depth i in the maieutic tree T. Each node in S_i is an explanation for an answer label (True or False), recursively generated given its parent node as the question: each parent node E_l ∈ S_{i−1} spawns the children abductive(E_l) = (E_lT, E_lF), so that

|S_i| ≤ 2 · |S_{i−1}|

For instance, in Figure 2, "There can be cases where the loser is not clear." (E_TF) is generated by prompting the LM with its parent node "In a context of war, there's always a victor and a loser." (E_T). Note that T is a full tree when the equality holds for all depths.
In practice, we sample multiple explanations with the same Q and A through nucleus sampling (Holtzman et al., 2019). This corresponds to the width-wise spanning of knowledge, enhancing the diversity and coverage of generated explanations.

When to Stop Generating
Generating a full tree could be computationally expensive, as the number of generations grows exponentially with the maximum tree depth. Therefore, in each branch, we stop generating further once we reach a logically integral proposition; intuitively, this aligns with our goal of finding generations for which the LM consistently infers a particular truth value. Figure 3 illustrates an example of maieutic tree generation where the maximum depth of the tree is set to 2. For visual clarity, we only generate one explanation per Q and A. Given Q, we first generate E_T and E_F, then validate whether each of these explanations is logically integral. Since E_T is logically integral, we stop generating in this branch, but continue generating from E_F, which is not logically integral. After reaching the maximum depth, we prune the branches leading to leaf nodes that are still not logically integral. This procedure guarantees one simple invariant over the maieutic tree: only generations that lead to a logically integral proposition are kept. We provide a formal description of the generation process in Appendix A.
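Putting the pieces together, the generation process can be sketched as follows, reusing the abductive and integral helpers from the earlier sketches (the final leaf-pruning step is omitted for brevity):

```python
# A sketch of maieutic tree generation with the integrity-based stopping
# rule. `abductive`, `integral`, `complete`, and `p_true` follow the
# earlier sketches; the sampling widths are illustrative.

def build_tree(q, complete, p_true, max_depth=2, width=3):
    # Nodes are (id, parent_id, answer_label, text); node 0 is Q itself.
    nodes = [(0, None, None, q)]

    def expand(parent_id, statement, depth):
        if depth > max_depth:
            return
        # Width-wise spanning: several sampled pairs at depth 1
        # (nucleus sampling in our setup); one pair at deeper levels.
        for _ in range(width if depth == 1 else 1):
            e_t, e_f = abductive(statement, complete)
            for ans, e in (("True", e_t), ("False", e_f)):
                node_id = len(nodes)
                nodes.append((node_id, parent_id, ans, e))
                if not integral(e, p_true):  # depth-wise stopping rule
                    expand(node_id, e, depth + 1)

    expand(0, q, 1)
    return nodes  # pruning of non-integral leaves omitted here
```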

Defining the Relations
Now that we have generated the maieutic tree, we seek to define the relations between propositions and quantify their strength as scalar weights. For illustration, assume that an LM has generated the following E_F for the given Q:

Q: Captain Kirk is part of Star Wars?
A: False, because Captain Kirk is a character in Star Trek.
The generation can be logically interpreted as follows: (1) the LM believes that Captain Kirk is a character in Star Trek, (2) the LM believes that the proposition Captain Kirk is a character in Star Trek can be a reason to deny that Captain Kirk is part of Star Wars. Accordingly, we define belief and consistency to represent the two dimensions of the logical relationship.
Belief w_E corresponds to the LM's belief that the proposition E is true (and therefore, ¬E is false). To quantify belief, we prompt the LM with E and ¬E respectively as a question, then compare the probabilities assigned to True:

w_E = p(True | E) / (p(True | E) + p(True | ¬E))

Note that calculating this does not require any additional prompting, as we already gained access to these values while checking the logical integrity of each proposition.
Consistency w_{E,Q,A} corresponds to the consistency of the generated E with the given Q and A. Intuitively, if the LM is logically consistent, the likelihood of E being generated given an answer (e.g. the likelihood of E_F given False) should be larger than its likelihood given the opposite answer (e.g. E_F given True). Following this intuition, we compute the consistency as:

w_{E,Q,A} = p(E | Q, A) / (p(E | Q, A) + p(E | Q, ¬A))
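A sketch of the two weights, reusing negate() from the integrity sketch; logp(text, prompt) is a hypothetical helper returning the LM log-likelihood of text as a continuation of prompt:

```python
import math

# Sketches of the two relation weights; `p_true` and `logp` are
# hypothetical LM helpers, and the templates mirror the running example.

def belief_weight(e: str, p_true) -> float:
    # Compare the probability of answering True for E vs. for ¬E.
    pt_e, pt_not_e = p_true(e), p_true(negate(e))
    return pt_e / (pt_e + pt_not_e)

def consistency_weight(e: str, q: str, a: str, logp) -> float:
    # Likelihood of E generated under the given answer vs. its opposite,
    # normalized; written in logistic form for numerical stability.
    other = "False" if a == "True" else "True"
    lp_a = logp(e, f"{q.rstrip('?.')}? {a}, because")
    lp_other = logp(e, f"{q.rstrip('?.')}? {other}, because")
    return 1.0 / (1.0 + math.exp(lp_other - lp_a))
```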

Inference
The two types of relations formulate a set of unary and binary logical constraints, based on which we assign truth values to all nodes in the maieutic tree T and, in consequence, infer the answer to the original question. First, we represent the set of beliefs C_blf as a set of unary constraints: for each leaf node E in T, we add the clause E with weight w_E when the LM believes E (i.e. E is integral as True), and the clause ¬E with weight 1 − w_E otherwise. Note that all the leaf nodes in T are logically integral, hence we can count on the credibility of belief for these nodes. We now define the set of all belief constraints C_blf as the collection of these weighted unary clauses over the leaf nodes of T. For example, the nodes E_F and E_TF in Figure 2 would have a belief constraint in C_blf.
Likewise, for consistency, we define C_con as the set of binary constraints using logical implication: for each edge (E_l, E_lA) in T, we add the clause

E_lT → E_l with weight w_{E_lT, E_l, True}, and E_lF → ¬E_l with weight w_{E_lF, E_l, False}

i.e. a child explanation generated under the answer True implies its parent proposition, and a child generated under False implies the parent's negation. Our objective is to assign the truth values for all Es and the root node Q in T such that we maximize

Σ_{c ∈ C_blf ∪ C_con} w_c · 1{c = True}

which is the sum of the weights of the satisfied constraints.
This problem is naturally formulated as weighted MAX-SAT, which can be algorithmically solved using an off-the-shelf solver. Specifically, we use RC2 solver (Morgado et al., 2014) to find the assignments for Es and Q that max-satisfy the logical constraints in C blf ∪ C con .
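As a minimal sketch of this step with PySAT (the python-sat package, which ships RC2): RC2 expects positive integer weights, so the real-valued weights are scaled; the variable numbering and clause values below are illustrative:

```python
# A sketch of the weighted MAX-SAT step with PySAT's RC2 solver.
from pysat.formula import WCNF
from pysat.examples.rc2 import RC2

def solve(belief_clauses, implication_clauses):
    # belief_clauses: [(lit, w)] unary constraints on leaf nodes, where a
    #   negative literal encodes ¬E; implication_clauses: [(p, c, w)],
    #   where "p -> c" becomes the CNF clause (¬p ∨ c).
    wcnf = WCNF()
    for lit, w in belief_clauses:
        wcnf.append([lit], weight=max(1, round(100 * w)))
    for p, c, w in implication_clauses:
        wcnf.append([-p, c], weight=max(1, round(100 * w)))
    model = RC2(wcnf).compute()  # assignment maximizing satisfied weight
    return {abs(lit): lit > 0 for lit in model}

# Toy example: Q = 1, E_T = 2, E_F = 3, with E_T -> Q and E_F -> ¬Q,
# plus belief constraints on the two (integral) leaves.
assignment = solve([(2, 0.9), (-3, 0.6)], [(2, 1, 0.8), (3, -1, 0.7)])
print(assignment[1])  # inferred truth value of Q
```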

Auxiliary Verifier
One limitation of the consistency definition in Section 3.2 is that it only considers the relationship between a parent node and a child node. Since the definition builds upon the likelihood of each generation from an LM, we cannot take into account the relationships across branches, e.g. E_T and E_F in Figure 3. This motivates us to introduce a small NLI model as an auxiliary verifier, which can infer the relationship between an arbitrary pair of nodes in T. Following previous works (Minervini and Riedel, 2018; Wang et al., 2019), we convert the NLI labels into logical relations as follows: for all pairs of nodes (E_1, E_2) ∈ node(T)^2 with E_1 ≠ E_2, we obtain the clause E_1 → E_2 if E_1 entails E_2, and E_1 → ¬E_2 if E_1 contradicts E_2. For NLI-based clauses, we fix the weights to 1. While the objective function stays the same, C_con is now replaced with C_NLI, a set of clauses induced by the verifier model.
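A sketch of such a verifier with an off-the-shelf MNLI model from HuggingFace (roberta-large-mnli, whose label order is 0 = contradiction, 1 = neutral, 2 = entailment); the 0.5 decision threshold is illustrative:

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tok = AutoTokenizer.from_pretrained("roberta-large-mnli")
nli = AutoModelForSequenceClassification.from_pretrained("roberta-large-mnli")

def relation(e1: str, e2: str):
    """Map NLI output on the ordered pair (E1, E2) to a logical clause."""
    inputs = tok(e1, e2, return_tensors="pt")
    with torch.no_grad():
        probs = nli(**inputs).logits.softmax(dim=-1).squeeze()
    if probs[2] > 0.5:
        return "E1 -> E2"    # entailment
    if probs[0] > 0.5:
        return "E1 -> ¬E2"   # contradiction
    return None              # neutral: no clause added
```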

Experiments
Datasets We evaluate MAIEUTIC PROMPTING on three commonsense reasoning and fact verification benchmarks in binary QA format: Com2Sense (Singh et al., 2021), CSQA 2.0 (Talmor et al., 2021), and CREAK (Onoe et al., 2021). Com2Sense and CSQA 2.0 consist of adversarial commonsense questions generated to mislead a proxy model. CREAK tests for a combination of commonsense reasoning and accurate fact retrieval, consisting of long-tail questions such as "Harry Potter can teach how to fly on a broomstick?". Despite their simple format, these datasets require both a substantial amount of knowledge and robust reasoning, which makes them challenging even for billion-scale fine-tuned LMs (Table 1).
Baselines We compare our method with both few-shot prompting methods and supervised models. For few-shot prompting, we consider both standard prompting and explanation-based prompting methods, including Chain of Thought (Wei et al., 2022), Self-Consistency (Wang et al., 2022) and Generated Knowledge Prompting (GKP) (Liu et al., 2021). For supervised models, we consider the strong baselines used for the respective datasets, such as fine-tuned RoBERTa (Liu et al., 2019), T5 (Raffel et al., 2020), UnifiedQA (Khashabi et al., 2020) and Unicorn (Lourie et al., 2021).
Configuration Details For all prompting methods, we use the same set of 6 demonstration examples and the same version of GPT-3 (text-davinci-001) as the LM. In maieutic tree generation, we set the maximum depth to 2. For depth 1, we use nucleus sampling (p = 0.7) (Holtzman et al., 2019) to generate 3 E_Ts and 3 E_Fs from Q. For depth 2, we use greedy decoding to generate 1 E_T and 1 E_F from each parent node. This constrains the generated tree to have at most 18 nodes excluding the original Q. In Section 4.3, we conduct an ablation study on this depth-adaptive decoding scheme and analyze the effect of the tree size. For the main experiments, we use an off-the-shelf RoBERTa (Liu et al., 2019) fine-tuned on MNLI (Williams et al., 2018) as the verifier model.

Main Results Table 1 presents the overall evaluation results of MAIEUTIC PROMPTING along with the prompting and supervised baselines. MAIEUTIC PROMPTING significantly outperforms all prompting methods across all benchmarks. Notably, GKP and Self-Consistency ensemble more one-hop explanations than the maximal size of the maieutic tree;4 our superior performance compared to these methods confirms the sample efficiency of depth-wise knowledge spanning. Moreover, MAIEUTIC PROMPTING is the only prompting method that performs better than even the smallest supervised baseline (RoBERTa-large) on Com2Sense and CREAK. In fact, MAIEUTIC PROMPTING allows us to use an off-the-shelf LM to achieve comparable performance to a large fine-tuned LM by simply plugging in our inference algorithm. Although explanation-based prompting methods do improve the model's accuracy compared to standard prompting, the gap gets smaller on more challenging benchmarks. For instance, while the gap between standard prompting and GKP is substantial on CREAK, it reduces to only around 5% on both Com2Sense and CSQA 2.0. Unlike these baselines, MAIEUTIC PROMPTING consistently improves performance by more than 10% over the standard baseline across all benchmarks. On the challenging CSQA 2.0, MAIEUTIC PROMPTING achieves similar performance to fine-tuned Unicorn, a pre-trained model specialized for commonsense-related tasks.

4 Both GKP and Self-Consistency employ an ensemble strategy, generating N different samples of explanations and then aggregating their answers. For a fair comparison with ours, we set N = 20 for both methods, generating more explanations than the maximal possible size of the maieutic tree in our setting.

Robustness Analysis
We perform additional analyses to understand the working of our method under semantic perturbations and different prompt formats.
Robustness to semantic perturbations In addition to the standard accuracy, we report two additional metrics, pairwise accuracy and contrast set accuracy, in Table 1. In the Com2Sense test set and the CREAK contrast set, each question is paired with a complementary counterpart whose surface form is similar but whose answer should be the opposite (e.g. "Barack Obama only has daughters." vs. "Barack Obama has no daughter."). Under pairwise accuracy, a model should get both sentences correct for the pair to count as correct. Since fine-tuned models are not exposed to the complementary sentences during training, those that rely on surface-form heuristics perform worse on these metrics than on standard accuracy. On these metrics, the gap between MAIEUTIC PROMPTING and the baselines widens substantially, indicating the robustness of our method against semantic perturbations.
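For clarity, the pairwise metric can be written as a short function (a sketch; the contrast set metric is computed analogously over contrast groups):

```python
def pairwise_accuracy(pairs):
    # pairs: [((pred1, gold1), (pred2, gold2)), ...] for each
    # complementary question pair; both must be correct to count.
    hits = sum(1 for (p1, g1), (p2, g2) in pairs if p1 == g1 and p2 == g2)
    return hits / len(pairs)
```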
Robustness to different prompts Prior works revealed that prompting performance can be sensitive to the choice of few-shot examples and their order (Lu et al., 2021b). We investigate whether this holds true for MAIEUTIC PROMPTING, as shown in Figure 4. We compare different prompting methods run with 3 different sets of few-shot examples (left), and 5 different permutations of the few-shot examples (right). In both settings, while Self-Consistency and MAIEUTIC PROMPTING are much more stable than the other two methods, our method has slightly less variance.

Ablation Study
We ablate different components of MAIEUTIC PROMPTING to investigate their respective contributions as shown in Table 2.
Generation First, we consider MAIEUTIC PROMPTING without abductive generation: with all the other stages staying the same, we generate each explanation without providing an answer label, i.e. in a fashion identical to Chain of Thought. In this setting, the performance of MAIEUTIC PROMPTING on Com2Sense degrades by about 4%, alluding to the importance of abductive generation in eliciting latent knowledge from the LM. Next, we ablate the depth-adaptive decoding mechanism (Section 3.1.2), by applying either greedy decoding or nucleus sampling at all depths of the maieutic tree. All-greedy decoding restrains the width-wise spanning of knowledge, and hence leads to a large degradation in performance. All-nucleus sampling performs much closer to our best configuration, although the stochastic decoding produces slightly more errors in the explanations.
Consistency We ablate the NLI-based clauses and replace them with the original C_con discussed in Section 3.2. With the LM-likelihood based clauses, the accuracy reduces by about 7%, but still prevails over the prompting baselines in Table 1. The result clearly shows that the verifier model indeed benefits the inference process, providing more accurate relations between generated explanations. Nonetheless, our method performs competently even without access to this verifier model.

Effect of tree size
We also investigate how the size of the maieutic tree influences the performance.
In Table 3, we present the performance of MAIEUTIC PROMPTING on the Com2Sense dev set with various values of maximal depth and width. In both dimensions, the accuracy saturates after a certain threshold. We attribute this to (1) the topic drift in generation, which intensifies as the depth grows, and (2) larger overlaps in generated knowledge as we sample more explanations width-wise.

Human Evaluation
We qualitatively analyze actual inference results of MAIEUTIC PROMPTING through human evaluation. For each sample, we first retrieve the true Es (the set of generated Es that are inferred to be True by MAIEUTIC PROMPTING). We then evaluate them on the four criteria from Liu et al. (2021): (1) Grammar: whether each explanation is grammatically correct, (2) Relevance: whether the explanation is topically relevant to the question, (3) Factuality: whether each explanation states facts, and (4) Helpfulness: whether the explanation explicitly leads to one of the answer labels. Six NLP experts scored a total of 100 examples sampled from the CSQA 2.0 dev set, of which 50 were answered correctly (Set 1) and 50 were answered incorrectly (Set 2) by the model. The average Krippendorff's alpha (Krippendorff, 2007) was 0.64, indicating substantial inter-annotator agreement. Figure 5 presents the evaluation results. For both the correct and incorrect sets, over 99% of the true Es are grammatically perfect, and most of them provide relevant evidence for the question. Surprisingly, the LM often generates both factual and helpful explanations even when its answer differs from the ground truth: over 40% of the true Es for incorrectly answered examples are perfectly factual, and over 23% of them are completely helpful in correctly answering the question. We find that in many of these cases, the questions did not have a clear-cut answer; as exemplified in Figure 6, the explanations generated and validated by MAIEUTIC PROMPTING are compelling enough to serve as an alternative to the ground-truth answer.

Related Work
Numerous prior works have leveraged natural language explanations (NLEs) to promote model reasoning, either by training a model to explain (Rajani et al., 2019; Camburu et al., 2018; Chen et al., 2022; Wiegreffe and Marasović, 2021), generating unsupervised answers to pre-defined queries, or collecting distantly supervised rationales using LMs (Shwartz et al., 2020; Brahman et al., 2021). Incorporated with large-scale LMs capable of in-context learning (Brown et al., 2020; Chowdhery et al., 2022), these efforts have led to explanation-based prompting (Liu et al., 2021; Lampinen et al., 2022). MAIEUTIC PROMPTING builds upon these works, rethinking the role of NLEs in LM-based inference.

[Figure 6: Examples of MAIEUTIC PROMPTING. We present a case where MAIEUTIC PROMPTING correctly infers the ground-truth answer (above, "War cannot have a tie."), and a case where the inferred answer is different from the ground truth (below, "In football, the top division almost always contains the same clubs."). Even in the latter case, the generated explanations make sense and logically lead to the inferred answer. We provide more examples in Appendix B.]
Despite their success, recent observations reveal that LM-generated explanations are unreliable, as they often lack logical consistency and are not factually grounded (Ye and Durrett, 2022; Kassner and Schütze, 2020). These findings are closely related to the broader limitations of generative LMs, which assign high probability to unlikely sentences (Welleck et al., 2020; Holtzman et al., 2021) and are sensitive to semantic perturbations. MAIEUTIC PROMPTING overcomes these limitations by avoiding the use of explanations "as-is", and instead inferring the answer based on the relationships between the explanations.
Another line of relevant works harnesses NLEs to improve model interpretability. A mainstream approach in this direction is to train a model that explains its inference post-hoc or in parallel with the answer (Camburu et al., 2018; Narang et al., 2020; Jacovi et al., 2021). Unlike these works, the explanations in our work are designed to be intrinsic: the explanations themselves explicitly participate in the inference.
Our work also relates to the recent thread of works that apply symbolic methods on top of LMs to improve their consistency. The symbolic methods take the form of either a lexical constraint on sequence decoding (Lu et al., 2021a), or an auxiliary symbolic module that keeps the generation consistent with a world model (Nye et al., 2021b) or performs discrete operations (e.g. Cobbe et al., 2021). Other works explore training a model to simulate the symbolic reasoning process, such as logical transformation (Bostrom et al., 2021) and consistent generation of beliefs (Dalvi et al., 2022). However, these models require a curated set of human annotations, which limits their application to specific benchmarks and domains. MAIEUTIC PROMPTING generalizes the broad idea of these neuro-symbolic approaches in an unsupervised setup, employing a MAX-SAT algorithm to symbolically determine the true subset from a noisy pool of neural generations.

Conclusion
In this work, we present MAIEUTIC PROMPTING, a novel few-shot inference method inspired by the Socratic way of conversation. We systematically generate a tree of explanations that bear logical relations to each other, then assign the truth values to explanations that max-satisfy these relations. Empirical results on multiple benchmarks demonstrate both the competitiveness and robustness of MAIEUTIC PROMPTING compared to diverse baselines. Qualitative analyses show that our method also provides intrinsic interpretations of its inference.

[Figure 7: Example of correct inference by MAIEUTIC PROMPTING. We show the generated maieutic tree along with the assigned truth values for each proposition.]