Entailer: Answering Questions with Faithful and Truthful Chains of Reasoning

Our goal is a question-answering (QA) system that can show how its answers are implied by its own internal beliefs via a systematic chain of reasoning. Such a capability would allow better understanding of why a model produced the answer it did. Our approach is to recursively combine a trained backward-chaining model, capable of generating a set of premises entailing an answer hypothesis, with a verifier that checks that the model itself believes those premises (and the entailment itself) through self-querying. To our knowledge, this is the first system to generate multistep chains that are both faithful (the answer follows from the reasoning) and truthful (the chain reflects the system's own internal beliefs). In evaluation using two different datasets, users judge that a majority (70%+) of generated chains clearly show how an answer follows from a set of facts, substantially better than a high-performance baseline, while preserving answer accuracy. By materializing model beliefs that systematically support an answer, new opportunities arise for understanding the model's system of belief, and diagnosing and correcting its misunderstandings when an answer is wrong.


Introduction
Although pretrained language models (PTLMs) have shown remarkable question-answering (QA) performance, it is often unclear why their answers follow from what they know. While there has been substantial work on training models to also generate explanations for their answers (Wiegreffe and Marasović, 2021), or produce them via few-shot prompting, e.g., "chains of thought" (Wei et al., 2022), those explanations may not be faithful (the answer does not necessarily follow from them) and may not be truthful, in the sense that the language model itself does not believe the explanation statements that it generated. Rather, our goal is to generate answers that systematically follow from the model's own internal beliefs, materializing those beliefs as explicit statements that can then be inspected. Such a capability offers new opportunities for understanding, diagnosing, and ultimately correcting errors in a language model's behavior.

Figure 1: Given a question, Entailer searches for an answer hypothesis that is supported by an entailment proof. First it over-generates candidate proofs, then it removes those that the model itself does not "believe" (i.e., confirms via self-querying that it considers all the generated proof elements to be true). Finally it selects the best verified proof. Multistep proofs are generated by iteratively backward chaining on the premises (Section 3.2).
Our approach uses a combination of generation and verification, implemented in a system called Entailer. Chains are constructed by backward chaining from candidate answers, recursively using a language model (LM) trained for a single backward-chaining step. For each step, Entailer over-generates candidate entailments, then filters out those that do not conform to its own internal knowledge ("beliefs") by self-querying, asking itself whether (a) the generated premises (leaves of the proof step) are true, and (b) each entailment step is valid (Figure 1). It then recursively backward-chains on premises until the overall proof confidence cannot be further improved (or a depth limit d is reached). Finally, the candidate answer supported by the highest-scoring chain of reasoning is returned. As a result, the system has materialized some of its latent knowledge from which the selected answer follows. Most significantly, the resulting proof is thus both faithful (the answer follows from the proof) and truthful (the proof reflects the system's beliefs), providing a previously unavailable window into the model's beliefs about the world and their implications, e.g., Figure 2.

Figure 2: Questions (from the OBQA dataset) and Entailer's answers, showing its chain of reasoning:
Q: A magnet will stick to (A) a belt buckle (B) a wooden table (C) a plastic cup (D) a paper plate
A: (A) A magnet will stick to a belt buckle, because: A belt buckle is sometimes magnetic (because: Metal is sometimes magnetic; A belt buckle is made of metal); A magnet will stick to magnetic metals.
Q: You can make a telescope with a (A) straw (B) glass (C) candle (D) mailing tube
A: (D) You can make a telescope with a mailing tube, because: A telescope is made of a tube for observing / seeing; A mailing tube is a kind of tube.
Q: Quartz may produce rainbows when light is shined (A) around the crystal's area (B) through any of its sides (C) in the room it's in (D) in to a mirror at it
A: (B) Quartz may produce rainbows when light is shined through any of its sides, because: A rainbow is produced when light shines through a prism (because: The rainbow is made of all different colors in visible light; A prism can split light into different colors); A quartz is a kind of prism.
To train the Entailer model, we use a combination of the existing EntailmentBank dataset (Dalvi et al., 2021) plus a new crowd-annotated dataset that we construct by bootstrapping (train an initial model, generate candidate entailment examples with it, then annotate those examples as extra training data). The model is then frozen, and Entailer is applied zero-shot to new datasets; i.e., Entailer is treated as a general-purpose, fixed model specialized for reasoning, rather than requiring fine-tuning for new tasks.
We evaluate Entailer on two existing datasets, OBQA (Mihaylov et al., 2018) and QuaRTz (Tafjord et al., 2019). We find that its reasoning-based QA accuracy is similar to its direct QA accuracy, with the advantage that a supporting chain of reasoning is also produced. We also perform a human evaluation, and find that 70% of the time users judge the chains to clearly show how an answer follows from its premises, substantially higher than for explanations produced by a comparable high-performance QA system, Macaw (Tafjord and Clark, 2021). Our contributions are thus: (1) the first QA system to generate multistep chains of reasoning that are both faithful (the answer follows from the chain) and truthful (the chain reflects the system's own internal beliefs); and (2) evaluations on two datasets showing that users judge a large majority of the generated chains to clearly show how an answer follows from a set of facts, while answer accuracy is preserved.

Related Work
Systematic Reasoning: Several recent systems have demonstrated the ability to perform systematic reasoning directly over natural language (Natural Language Inference (Manning and MacCartney, 2009)), namely deriving conclusions from known facts via step-wise application of well-defined inference operations. One approach is to retrain a blackbox model end-to-end (Clark et al., 2020), but this has been limited to small rulebases. An alternative approach, which we follow here, is to have an outside loop around a model, where the model generates individual inference steps (i.e., rules), and a controller chains them together. SCSearch (Bostrom et al., 2022), NLProofS (Yang et al., 2022), IRGR (Ribeiro et al., 2022), ProofWriter (iterative version) (Tafjord et al., 2020), and Selection-Inference (Creswell et al., 2022) do this in a forward-chaining fashion, MetGen (Hong et al., 2022) does this bidirectionally, while Braid (Kalyanpur et al., 2020) (like us) does this in a backward-chaining fashion. In all these systems, the required facts were expected to be provided explicitly to the model. In contrast, Entailer's reasoning uses its own internal, latent knowledge, as well as (optionally) externally provided facts. LeapOfThought (Talmor et al., 2020) demonstrated that reasoning with a combination of implicit and explicit knowledge was possible for simple 1-step inferences. We expand this to multi-step inference, and (unlike LeapOfThought) have the system also explicitly articulate the implicit knowledge it uses, and its chain of reasoning.
Recent work has shown that generating a freeform explanation ("chain of thought") before an answer can also improve performance on a variety of tasks (Wei et al., 2022; Cobbe et al., 2021; Nye et al., 2021). In these works, however, the explanations are unstructured, and there are no claims of faithfulness that the answer follows from the generation, nor that the explanations themselves represent model beliefs.
Materializing a Model's Internal Knowledge: Pretrained LMs contain a vast amount of knowledge, and can be thought of as a kind of knowledge base to tap into (Petroni et al., 2019). Recent work has shown that this latent knowledge can be materialized as explicit English sentences or a knowledge graph using generative techniques, e.g., COMeT (Bosselut et al., 2019), ParaCOMET (Gabriel et al., 2021). Our work with Entailer similarly materializes its latent knowledge, but here in a goal-directed way, namely by producing a faithful chain of reasoning from facts it validates ("believes") as true to an answer. This articulation can be seen as a kind of self-talk, where a self-generated context can improve QA (Shwartz et al., 2020). However, here our generations are not used as context for opaque problem-solving, but are assembled into a well-defined chain of reasoning.

Beliefs:
We refer to the model's factual opinions as "beliefs" rather than "knowledge" because those opinions may be wrong. In general, an agent can be said to believe p if it acts as if p were true (Schwitzgebel, 2019). Following Kassner et al. (2021), we take a simple, syntactic operationalization of this, namely the agent answers "yes" to the question "p?", but also note that more semantic versions could be used, e.g., the agent also answers "yes" to paraphrases and implications of p. In general, models can sometimes be inconsistent in their beliefs (Elazar et al., 2021; Kassner and Schütze, 2020; Ribeiro et al., 2019). For our purposes here, we simply note that such inconsistencies may occasionally exist, and that techniques for inconsistency resolution could be applied in future to reduce them, e.g., (Kassner et al., 2021; Li et al., 2019).

Figure 3: An entailment tree is a set of multi-premise, 1-step entailments (red boxes) showing how the hypothesis (root node, green) is entailed from the leaf nodes (white). If all the leaf nodes are true, and all the 1-step entailment relations are valid, then we say the tree is a valid chain of reasoning for the hypothesis.

Approach
Like several previous systems (Section 2), Entailer treats reasoning as Natural Language Inference (NLI). In NLI, the basic unit of knowledge is (represented as) a sentence rather than a structure, and a proof is a tree of multi-step, multi-premise entailments, e.g., Figures 2 and 3.
Within this framework, given a question, Entailer first generates candidate answers, then tries to prove each one, selecting the answer with the highest-scoring proof. We now describe these steps.

Hypothesis Generation
Given a question, Entailer first generates candidate answers and converts these into declarative hypotheses (e.g., "Is the sky (A) blue (B) yellow" → {H_1 = "The sky is blue.", H_2 = "The sky is yellow."}). An N-way multiple-choice question yields N hypotheses. A true/false question yields two hypotheses, the statement and its negation.

Generating a Backward-Chaining Step

Models:
The core of Entailer is generating and validating a single entailment step that entails a hypothesis. We define the following data types:
H: A hypothesis (English statement) to prove.
P: A set of premises {p_1, ..., p_i} (sentences) that together may entail the hypothesis H. Together, P and H form a one-deep entailment step, denoted by P ⊢ H.
Q: A question posed to Entailer.
A: A candidate answer for consideration.
C: An optional context (set of sentences) relevant to the problem. This allows Entailer to also use external knowledge, if available, when generating a tree.
We train a model (details in Section 4) with the three input/output behaviors below (optional inputs shown in parentheses):
(QAC)H → P: Given a hypothesis H, generate a set of premises P that may entail H.
(QAC)H → S_d: Score whether the model believes that a hypothesis H (or premise p_i) is true (S_d > 0.5) or not, i.e., perform yes/no QA. We call this the direct score (range 0-1).
(QAC)PH → S_e: Score whether the model believes a candidate entailment (P ⊢ H) is valid (S_e > 0.5) or not, i.e., P validly entails H (range 0-1).
Examples of these three angles are in Table 1.
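To make the three angles concrete, below is a minimal sketch of how they might be invoked with an off-the-shelf seq2seq model. The input formats ("H: ... P:", "V:", "I:") follow Table 1; the checkpoint name, the premise delimiter handling, and the reading of S_d/S_e as a normalized "true"-vs-"false" token probability are assumptions for illustration, not Entailer's actual implementation.

```python
# A minimal sketch of the three angles, assuming a seq2seq model (e.g., T5).
# Input formats follow Table 1; everything else here is an assumption.
import torch
from transformers import T5ForConditionalGeneration, T5Tokenizer

tok = T5Tokenizer.from_pretrained("t5-large")            # stand-in checkpoint
model = T5ForConditionalGeneration.from_pretrained("t5-large")

def angle_H_to_P(h: str, k: int = 6) -> list[str]:
    """H -> P: over-generate k candidate premise sets via nucleus sampling."""
    inputs = tok(f"H: {h} P:", return_tensors="pt")
    outs = model.generate(**inputs, do_sample=True, top_p=0.95,
                          num_return_sequences=k, max_new_tokens=80)
    return [tok.decode(o, skip_special_tokens=True) for o in outs]

def _true_false_score(prompt: str) -> float:
    """Assumed scoring: normalized probability of 'true' vs. 'false' as the
    first decoded token, used as a 0-1 score."""
    inputs = tok(prompt, return_tensors="pt")
    start = torch.tensor([[model.config.decoder_start_token_id]])
    logits = model(**inputs, decoder_input_ids=start).logits[0, -1]
    t, f = (logits[tok.convert_tokens_to_ids(w)] for w in ("▁true", "▁false"))
    return torch.softmax(torch.stack([t, f]), dim=0)[0].item()

def angle_H_to_Sd(h: str) -> float:
    """H -> S_d: direct belief score for a hypothesis (or premise) h."""
    return _true_false_score(f"H: {h} V:")

def angle_PH_to_Se(premises: list[str], h: str) -> float:
    """PH -> S_e: validity score for the entailment step P |- H."""
    p = " ".join(f"[PREMISE] {p_i}" for p_i in premises)
    return _true_false_score(f"H: {h} P: {p} I:")
```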

Algorithm:
To generate a single backward-chaining step we adopt an overgenerate-and-filter approach, also found useful elsewhere (Yang et al., 2022; Cobbe et al., 2021; Li et al., 2022). First, given H, we use the H → P angle to generate a set of premises P that may entail H. We then check that the model believes all the premises p_i ∈ P, using the H(= p_i) → S_d angle, and that it also believes the inference step P ⊢ H itself is valid (independent of whether the p_i are true or not), using the PH → S_e angle. The proof score, denoting how well the 1-step proof supports the hypothesis, is the product of the premises' direct scores and the entailment score:

s_proof(P ⊢ H) = s_e(P ⊢ H) × ∏_{p_i ∈ P} s_d(p_i)

We repeat this k times using nucleus sampling to obtain a diversity of alternative proof steps, and then select the highest-scoring step P ⊢ H, as illustrated in Figure 1.
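A minimal sketch of this overgenerate-and-filter step, building on the hypothetical angle_* helpers sketched in Section 3.2.1 above; the parsing of "[PREMISE]"-delimited output is an assumption.

```python
# Sketch of one backward-chaining step: sample k candidate premise sets,
# verify each by self-querying, and keep the highest-scoring one.
import math

def backchain_step(h: str, k: int = 6):
    best_score, best_premises = 0.0, None
    for cand in angle_H_to_P(h, k=k):                   # k sampled candidates
        premises = [p.strip() for p in cand.split("[PREMISE]") if p.strip()]
        if not premises:
            continue
        s_e = angle_PH_to_Se(premises, h)               # entailment validity
        # proof score = s_e * product of premise belief scores, as above
        score = s_e * math.prod(angle_H_to_Sd(p) for p in premises)
        if score > best_score:
            best_score, best_premises = score, premises
    return best_premises, best_score
```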

Backward Chaining
This one-step backward chainer is embedded in a larger control algorithm that grows a multi-step entailment tree, searching for the best tree. This algorithm is illustrated in Figure 4 and described below; the full algorithm is in Appendix A. Each node N in the tree has two associated scores: the direct ("fast") score s_d, denoting the model's direct belief in N (in red in Figure 4), and (for internal nodes) the proof ("slow") score s_r, denoting how confidently the model can prove N (in blue), computed from its children. The overall score s is the max of the two. The proof score s_r is recursively defined as the product of the children's overall scores times the entailment score:

s_r(N) = s_e(P ⊢ N) × ∏_{c ∈ children(N)} s(c)

The algorithm starts with an initial hypothesis node H, then iteratively expands each leaf node N, looking for a proof that scores higher than the direct score s_d of that node. In other words, can the system prove N with more confidence than simply "guessing" whether N is true? If it can, the node's overall score s (max of the two) will increase, that increase is propagated up the tree, and the expansion is retained, e.g., step 3 in Figure 4. If expansions cannot improve a node's score further (e.g., steps 2 and 4), the expansions are pruned and that node becomes a leaf (red bars in Figure 4).
Note that because premises are self-generated rather than externally provided, this stopping condition has different semantics from earlier work, e.g., (Kalyanpur et al., 2020; Bostrom et al., 2021): rather than stopping when externally known facts are reached, Entailer stops when strongly believed facts are reached and more backward chaining can no longer improve a hypothesis' confidence.
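The following sketch illustrates this expand-or-prune rule in a simplified, greedy form (the full algorithm, including propagating score increases up the tree, is in Appendix A); it reuses the hypothetical backchain_step and angle_* helpers from the earlier sketches.

```python
# Simplified, greedy sketch of the recursive search: a node is expanded only
# when a proof would score higher than the node's direct belief score s_d;
# otherwise the expansion is pruned and the node stays a leaf.
import math
from dataclasses import dataclass, field

@dataclass
class Node:
    statement: str
    s_d: float                          # direct ("fast") belief score
    s_r: float = 0.0                    # proof ("slow") score, from children
    children: list["Node"] = field(default_factory=list)

    @property
    def s(self) -> float:               # overall score: max of the two
        return max(self.s_d, self.s_r)

def grow(node: Node, depth: int = 0, max_depth: int = 3) -> Node:
    if depth >= max_depth:
        return node
    premises, _ = backchain_step(node.statement)    # best 1-step proof of node
    if not premises:
        return node
    s_e = angle_PH_to_Se(premises, node.statement)  # entailment validity
    children = [grow(Node(p, s_d=angle_H_to_Sd(p)), depth + 1, max_depth)
                for p in premises]
    s_r = s_e * math.prod(c.s for c in children)    # recursive proof score
    if s_r > node.s_d:                              # expansion helps: keep it
        node.s_r, node.children = s_r, children
    return node                                     # else node stays a leaf
```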
This whole procedure is repeated for each candidate answer hypothesis (Section 3.1). Finally, the system selects the answer corresponding to the hypothesis with the top-scoring proof s(H).
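Putting the pieces together, a sketch of this per-hypothesis loop follows; the crude string concatenation standing in for the question-to-hypothesis conversion of Section 3.1 is purely illustrative.

```python
# Sketch: try to prove each answer hypothesis and return the option whose
# tree has the highest overall score s(H).
def answer(question: str, options: list[str]) -> str:
    best_score, best_opt = -1.0, options[0]
    for opt in options:
        h = f"{question.rstrip('?')} {opt}."   # stand-in for declarativization
        tree = grow(Node(h, s_d=angle_H_to_Sd(h)))
        if tree.s > best_score:
            best_score, best_opt = tree.s, opt
    return best_opt
```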

Model Training
The core of Entailer is the model for one-step inference (Section 3.2.1), with the three functionalities listed in Table 1. As Entailer is intended to be a general-purpose system, the model is trained once for these three functionalities, and then frozen. It is then applied zero-shot to new datasets, e.g., in the evaluation in Section 5.

EntailmentBank
To train Entailer's model, we leverage (the training partition of) the EntailmentBank dataset (Dalvi et al., 2021). EntailmentBank contains 1840 multiple-choice science questions (1313 in the training partition) along with their correct answer, a hypothesis expressing the question and answer in declarative form, and a multistep entailment tree expressing how the hypothesis follows from a set of facts drawn from a corpus of science facts (the WorldTree corpus (Xie et al., 2020)). Using the notation introduced earlier, each EntailmentBank example is of the form:

⟨Q, A, H_0, {(P_i ⊢ H_i)*}⟩

where (P_i ⊢ H_i)* denotes a set of entailment steps forming a tree (with root H_0), describing how the corpus facts entail the hypothesis H_0.
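As an illustration, one such example might be represented as the following data structure (field names are ours, not the dataset's):

```python
# Illustrative representation of one EntailmentBank example
# <Q, A, H_0, {(P_i |- H_i)*}>.
from dataclasses import dataclass

@dataclass
class EntailmentStep:
    premises: list[str]              # P_i
    conclusion: str                  # H_i (an intermediate or the root H_0)

@dataclass
class EBExample:
    question: str                    # Q
    answer: str                      # A
    hypothesis: str                  # H_0, the root of the tree
    steps: list[EntailmentStep]      # the tree; one step concludes H_0
```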

Crowdsourced Data
EntailmentBank only contains positive examples of entailments. To obtain negative examples, we first ran an earlier, positive-only-trained version of Entailer to generate 1-step proofs of (hypotheses for) incorrect answer options in EntailmentBank's questions (4-way multiple choice), resulting in (necessarily bad) proofs for the three false hypotheses. (This was done using just the H → P angle, without verification.) Note that Entailer will generate some kind of proof even for false hypotheses, e.g.:

A rabbit has six legs because: 1. A rabbit is an animal. 2. Animals have six legs.

These invalid proofs will be incorrect either because a generated fact is false, and/or because the inference itself is not a valid entailment. To distinguish these, we use crowdworkers to assess whether the generated facts were true, and whether the entailment itself was valid. The 1313 questions result in 3939 proofs for false hypotheses. Dropping the few with more than two premises (to simplify the crowdsourcing interface), crowdworkers annotated 3673 proofs, using labels T/F/? for each premise and T/F/? for the entailment itself. Each proof was annotated by 3 crowdworkers, with an additional 3 workers providing annotations for cases with no initial majority verdict. Dropping premises/entailments without a final majority verdict, we end up with 7013 additional labeled premises for the H → S_d angle, and 3391 additional labeled entailments for the PH → S_e angle.
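The majority-verdict rule described above amounts to the following simple aggregation (a sketch; the exact tie handling is an assumption):

```python
# Labels are 'T', 'F', or '?'; items with no strict majority after 3
# annotations receive 3 more, and are dropped if still undecided.
from collections import Counter
from typing import Optional

def majority_verdict(labels: list[str]) -> Optional[str]:
    label, count = Counter(labels).most_common(1)[0]
    return label if count > len(labels) / 2 else None   # None => drop the item
```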

The crowdworker interface is in Appendix B.

Optional Fields
We also augment the training data with duplicate examples that include additional, optional input fields:
QA: The input question/answer-option pair, as well as the hypothesis H.
C: A context consisting of up to 5 relevant sentences, allowing explicit external knowledge to be provided if available.
To generate examples of C during training, we use a mixture of sentences drawn from (a) the gold (target) entailment (i.e., the gold premises), and (b) sentences retrieved from a corpus of similar-styled knowledge (namely all the leaf sentences used in the EntailmentBank entailment trees), mixed in different ratios so the model is exposed to a mixture of high and medium relevance sentences.
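A sketch of what this context construction might look like; the exact mixing ratios are not specified above and are an assumption here.

```python
# Sketch of training-time context construction: mix gold premises with
# retrieved corpus sentences, up to 5 total, in varying (assumed) ratios.
import random

def build_context(gold: list[str], retrieved: list[str],
                  max_size: int = 5) -> list[str]:
    n_gold = random.randint(0, min(len(gold), max_size))   # vary the ratio
    ctx = random.sample(gold, n_gold) + retrieved[: max_size - n_gold]
    random.shuffle(ctx)                                    # no positional cue
    return ctx
```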
In this work we do not use any context C at test time, but it is utilized in concurrent work for providing feedback to the overall system (Mishra et al., 2022).
Further details of training are given in Appendix C1.

Model Details
We train a T5-11B multi-angle model, Entailer, following a multi-task setup similar to Macaw (Tafjord and Clark, 2021), for the three functionalities described in Table 1. Details of hyperparameters and other settings are provided in Appendix C2.

Evaluation
Our goal is to generate answers supported by faithful, truthful chains of reasoning, without significantly impacting performance. Our two corresponding research questions are:
1. How does Entailer's proof-based answer accuracy compare with the direct QA accuracy (both zero-shot)?
2. How good are the generated entailment-based proofs? And are they better than those produced by a purely generative model?
For the first question, we evaluate using two existing multiple-choice datasets, namely OBQA (Mihaylov et al., 2018) and QuaRTz (Tafjord et al., 2019). These datasets contain multiple-choice questions that (typically) require multihop reasoning, rather than being simple lookup questions. We use the test partitions, with sizes of 500 questions (OBQA) and 557 questions (QuaRTz).
For the second question, we collect human judgements of whether the hypothesis clearly follows from facts in the proof, and compare its proofs with explanations generated by a high-quality baseline model, Macaw (Tafjord and Clark, 2021).

QA Accuracy
We evaluate Entailer's proof-based QA accuracies under various conditions, and compare them against its direct QA accuracy:
1. Direct QA: Here we measure the model's ability to directly answer the test questions (zero-shot) using the H → S_d angle. One can think of this as the "fast thinking" answer. Note that this capability of the frozen model was trained on the same data as the rest of Entailer, so this is a fair comparison.
2. Entailer: Here we measure Entailer's ability to answer the test questions (again zero-shot) by generating, scoring, and comparing entailment proofs for each answer option. One can think of this as the "slow thinking" answer. We vary:
• maximum proof depth: d = 1 (blue in Figure 5) or d = 3 (green);
• degree of search: (a) greedy: use the first (k = 1) generated entailment at each step, or (b) sampled top level: pick the best of k = 6 entailments for expanding the root hypothesis node (≈ six times more computationally expensive than (a)).
Note that these proofs are faithful explanations of why an answer was chosen, as the selected answer by definition is always the one with the highest-scoring proof.
3. Entailer + Direct: Here we combine the two by selecting the overall most confident answer of Direct and Entailer (using c_d(H) or c_r(H), Appendix A). Here, the proofs are not always faithful explanations of an answer, as the chosen answer may not be the one with the highest-scoring proof. (A sketch of these three selection strategies is below.)
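The three selection strategies can be summarized as follows; score_option, returning the direct and proof confidences for one answer option's hypothesis, is an assumed helper.

```python
# Sketch of the three answer-selection strategies compared in Section 5.1.
def select(question: str, options: list[str], mode: str = "entailer") -> str:
    confidence = {
        "direct":   lambda c_d, c_r: c_d,            # "fast thinking" only
        "entailer": lambda c_d, c_r: c_r,            # faithful: best proof wins
        "combined": lambda c_d, c_r: max(c_d, c_r),  # may break faithfulness
    }[mode]
    return max(options,
               key=lambda opt: confidence(*score_option(question, opt)))
```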

Results
The results are shown in Figure 5, and suggest the following. Proof scores are competitive with direct answer scores: in Entailer's best configuration (sampled top-level k = 6, max depth d = 3, last bar), its reason-based answers have an accuracy of 75.2/75.4 for OBQA/QuaRTz respectively, not significantly different from the direct scores of 75.2/74.1. This is important, as it suggests there is no significant accuracy penalty for proof-based answers.
Sampling proofs helps: Using sampling for the top-level proofs (last 4 bars) always outperforms greedy proof selection by a small amount.
Allowing deeper proofs does not significantly affect accuracy: although deeper proofs slightly help in the combination of Entailer + Direct, and may provide more information to a user, the accuracy differences are not significant. When Entailer + Direct are combined by selecting the most confident answer (last two columns), we lose the guarantee of faithfulness, as the selected answer may not be the one with the highest-scoring proof. In practice, this occurs for 16.8% of the questions (OBQA) and 14.2% (QuaRTz). In addition, we find this combination does not yield significant performance gains, so it has no obvious advantage in these experiments.
Note that we are measuring zero-shot performance on our test datasets, so our results are not comparable with leaderboard entries. More importantly, though, our goal is not a state-of-the-art zero-shot model, but rather a model that can show how answers systematically follow from its own internal beliefs. Our results suggest this is possible with high reliability and, as an additional bonus, without loss of zero-shot accuracy. Thus, rather than just answers, we now get answers supported by faithful chains of reasoning.

Human Judgements
For our second question of evaluating the quality of Entailer's proofs, we compare against explanations generated by Macaw, a public, state-of-the-art QA system with explanation capability (Tafjord and Clark, 2021). Examples of explanations from both systems are in Appendix D. Six annotators compared 1-deep Entailer proofs with explanations from Macaw, scoring each along four dimensions and then comparing the two:
1. Does the conclusion clearly follow from the premises?
2. Are all the premises correct?
3. Are all of the premises relevant?
4. Does the explanation introduce something new and useful, i.e., does it do more than just restate the conclusion in different words?
5. Which explanation do you prefer?
Answer options for questions 1-4 were yes/somewhat/no, and for question 5 were first/similar/second. The ordering of explanations was scrambled so the annotators did not know which explanation was which; in fact, they were unaware that the explanations had been generated by different systems. We collected annotations for 100 pairs of explanations for correct answers to 100 OBQA dev questions. The annotators were recruited from our institute, spanning a broad range of ages (20-60) and experience.
(Note that comparing with a baseline explanation generator trained on our data, i.e., just using the H → P angle without verification, would not be informative, as such explanations would necessarily be worse: by definition, the verifiers remove explanations with premises believed to be false and/or with bad inferences, hence removing the verifiers would increase the frequency of false premises and bad inferences (confirmed through sample analysis). Hence we instead use a strong, external system as a comparative reference point.)
Figure 6: Users' assessments of Entailer's proofs (red), compared to Macaw's explanations (blue), showing the percentage of times annotators answered "Yes". Entailer's conclusions were judged to "clearly follow from the premises" in over 70% of the proofs (first bar), substantially more than Macaw's explanations (34%).

The results (Figure 6) suggest the following:
1. Entailer's conclusions were judged to clearly follow from the premises in the large majority (over 70%) of the explanations, substantially more than Macaw's explanations (34%). This potentially contributes to system trustworthiness, where understanding how evidence supports a conclusion is critical.
2. ≈90% of Entailer's self-verified premises were judged correct by users. Of the remainder, virtually all were labeled "unsure" (only 1 Entailer fact was labeled not correct), indicating that there are few false beliefs in proofs for correct answers. Rather, vague facts (e.g., "Claws are used for cracking open shells") make up the remainder.
3. Entailer's explanations were clearly preferred (57% to 23%, last bar) over Macaw's. In particular, Entailer's arrangement of facts into a tree discourages irrelevant information (bar #3).
Finally, we note that Entailer's proofs are faithful (showing how an answer was derived) and truthful (reflecting the system's own beliefs), while Macaw's explanations are post-hoc generated text. These all suggest the value of entailment trees as trustable explanations of a system's behavior.

Failure Analysis
If Entailer answers a question incorrectly, either a model belief (belief error) and/or the reasoning itself (reasoning error) must necessarily be incorrect, unless the question itself is malformed (dataset error). To better understand these, we classified the 51/500 cases in OBQA where Entailer selected the wrong answer while the Direct answer was correct, and found:
1. belief errors (33%), where an incorrect belief led to a wrong answer, for example: In a desert... plants grow closer together because 1. A desert... contains a low amount of water. 2. As the amount of water decreases, plants are forced to grow closer together to conserve water. Note that here the reasoning is correct but the second premise is false.
2. dataset errors (20%). In several cases, the question was ambiguous (e.g., does "animal" include humans?) or more than one answer option was valid (OBQA is a crowdsourced dataset).
3. reasoning errors (47%):
3a. Near-tautologies (20%): of the form "X because 1. X′, 2. ...", where X′ is a near-repeat of the hypothesis. In such cases, the proof offers little new information.
3b. Invalid entailments (10%): where the generated premises (e.g., "... 2. New leaves begin to grow.") simply do not support the hypothesis; here the entailment itself is simply invalid.
3c. Incorrect abductive inferences (9%): of the form A because (A → B) and B. While sometimes useful, these inferences are not sound and can produce incorrect conclusions, e.g., Cooking food requires a fire to be built because 1. Cooking food requires heating the food. 2. Fire can be used to heat things.
3d. Quantification and exceptions (8%): where both beliefs and reasoning seem reasonable, but the specific case does not hold, e.g., Seeds are often found inside a strawberry because 1. Seeds are often found inside fruits. 2. A strawberry is a kind of fruit.

When do proofs do better?
At its best, Entailer decomposes an uncertain hypothesis H into premises P about which it is very confident. For example, Entailer is unsure whether A suit of armor conducts electricity, but it confidently believes the generated premises: A suit of armor conducts electricity because 1. A suit of armor is made of metal. 2. Metals conduct electricity. Thus an uncertain conclusion is replaced by more certain premises, and we see Entailer performing particularly well on such questions. However, there are questions that are largely "fact lookup", where a proof may be less helpful. For example, the model is already very confident about the hypothesis Jupiter is the largest planet in the Solar System; decomposing this into 1. The largest planet has the greatest mass plus 2. Jupiter has the greatest mass in the solar system has not obviously made answering easier. In fact, Entailer's algorithm is specifically designed to account for this, only growing a proof when the confidence in the premises improves the confidence in H, thus tailoring its degree of reasoning to the complexity of the problem at hand.

Towards Teachable Systems
If a system can show how its answers systematically follow from its own beliefs, articulating those beliefs in the process, then this opens a window into the model's internal state for the user. As a result, new opportunities arise for identifying and correcting the model's misunderstandings, so the system gradually improves over time (sometimes referred to as explanatory interactive machine learning (XIL) (Teso and Kersting, 2019)). One vehicle for doing this is to use Entailer's (currently unused) context field C at runtime: if the user asks a question, receives a wrong answer, and sees an incorrect belief in the proof, they can provide the corrected belief and then re-ask the question with the corrected belief in the context. This encourages the model to use the corrected belief in its new answer and proof, rather than just repeat the same bad belief. Such overrides would then be stored in, and retrieved from, a persistent memory for use with future questions as well. A simple, hypothetical dialog illustrating this is shown in Figure 8. This is an exciting avenue made possible by this work, currently used by the TeachMe system (Mishra et al., 2022).
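A hypothetical sketch of this correction loop, reusing the _true_false_score helper from the Section 3.2.1 sketch; the "C: ..." input prefix for the context field is an assumption.

```python
# Hypothetical correction loop: user corrections are stored in a persistent
# memory and passed via the context field C when re-scoring beliefs.
memory: list[str] = []

def add_correction(corrected_belief: str) -> None:
    memory.append(corrected_belief)        # e.g., "Copper is not magnetic"

def belief_score_with_memory(h: str) -> float:
    context = " ".join(memory[-5:])        # C holds up to 5 sentences (Sec. 4.3)
    return _true_false_score(f"C: {context} H: {h} V:")
```

For example, after add_correction("Copper is not magnetic"), re-proving "A magnet attracts a penny" would score the faulty premise lower, steering the search towards the corrected proof shown in Figure 8.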

Conclusion
Our goal is a system that can show how its answers systematically follow from its own internal beliefs, and materialize those beliefs in the process. Entailer is, to our knowledge, the first system to demonstrate this capability, achieved using an "over-generate and verify" approach for each backward-chaining step. Our evaluation suggests that Entailer's proof-based answer accuracy is similar to the "direct" QA accuracy, with the additional benefit of providing a faithful, truthful chain of reasoning showing how its answers follow from its internal beliefs. The significance of this is that these chains provide a window into the model's internal beliefs and their relationships, allowing users to verify a system's answer or, if the system provides an incorrect answer, to identify the incorrect belief leading to that error. This in turn offers new opportunities for a user to correct the system's misunderstanding by overriding a faulty belief, e.g., by adding a memory of user corrections/overrides (Tandon et al., 2022), or by editing the model parameters directly (Mitchell et al., 2021), a step towards interactive systems that can both explain to, and learn from, users over time (Mishra et al., 2022). Entailer data and models are available at https://allenai.org/data/entailer. We look forward to future developments in these directions.

[Figure 8 content:]
U: Can a magnet attract a penny?
[Entailer tries to prove H (A magnet attracts a penny) and the negation neg(H), and reports the best proof.]
E: A magnet can attract a penny because: 1. A penny is made of copper. 2. Copper is magnetic. Do you agree?
[Here Entailer gives the wrong answer, arising from incorrect belief #2. The user offers a correction:]
U: No. Copper is not magnetic.
[The system tries again, attempting to prove H and neg(H) with context C = "Copper is not magnetic". C biases the model away from its prior mistake, effectively changing the model's belief. The new best proof is reported.]
E: A magnet cannot attract a penny because: 1. A penny is made of copper. 2. Copper is not magnetic. Do you agree?
U: Yes
[The system now gets the right answer, having learned that copper is not magnetic from the user. C is added to memory, accessed via information retrieval, for use with future questions.]

Figure 8: A hypothetical dialog illustrating how a user (U) might identify and correct Entailer's (E) incorrect beliefs through interaction. Here Entailer initially gets the wrong answer due to an incorrect model belief ("Copper is magnetic"). The user offers a correction, which is then provided as context when re-asking the question, effectively overriding the prior bad belief (the user has "taught" the model). By storing such corrections in a memory, such belief updates persist over time.

Limitations
We have shown how a neural system can expose how its answers systematically follow from its own internal beliefs, providing a window into the model's system of beliefs. While exciting, the current work has several limitations, which also suggest opportunities for the future.
First, the system is not perfect at generating coherent chains of reasoning, sometimes producing entailments that are invalid or nearly tautologous (Section 5.4.1 and Figure 7). Improved proof generation and scoring techniques would help address this.
Second, like others, we use textual entailment as the basic reasoning operation, but the definition of entailment remains somewhat imprecise (a valid entailment is one that "a person would typically infer" (Dagan et al., 2013)), contributing to noise in the model's training data. A more precise characterization of reasoning validity would help in both generation and evaluation of reasoning chains.
Third, we assume the model is generally consistent about its beliefs, but in some cases the model may verify contradictory statements, making it less clear what the model actually believes in such cases. We currently do not handle such situations. Use of a global notion of belief (rather than per question) would be a valuable avenue to explore, e.g., (Kassner et al., 2021).
Fourth, as a practical matter, recursive construction of proofs is computationally expensive (≈360 seconds/question for up to depth-3 proofs for 4-way multiple-choice, Appendix A.2). Improvements to the search algorithm would allow faster experimentation, and also help deploy the model in a practical setting.
Finally, we have speculated that showing users faithful, truthful chains of reasoning might be a basis for a conversational system, where users could correct and teach the system in cases where it was wrong. However, this is currently just a conjecture; future explorations into how this might be realized would be valuable.


Table 1: Examples of the three input/output angles used by Entailer. The first generates a candidate entailment rule P ⊢ H given H. The second and third score whether each premise, and the entailment itself, is valid, using tokens V/I in the input to indicate that S_d/S_e is the desired output.

Angle: H → P
  Input: "H: A paperclip is made of metal. P:"
  Output: "[PREMISE] A paperclip is made of steel. [PREMISE] Steel is a metal."
Angle: H → S_d
  Input: "H: A paperclip is made of steel. V:"
  Output: 0.995
Angle: PH → S_e
  Input: "H: A paperclip is made of metal. P: [PREMISE] A paperclip is made of steel. [PREMISE] Steel is a metal. I:"
  Output: 0.998

Figure 4: The entailment tree is grown recursively, the algorithm looking for the best tree (maximizing the overall score of the root node). Each node has a fixed, direct ("fast") score (in red) and (for internal nodes) a proof ("slow") score (in blue) computed from its children. The overall node score (highlighted) is the higher of the two. If expanding a node increases its overall score (e.g., step 3), that increase is propagated upwards and recursion continues. If expansions cannot improve a node's score further (e.g., steps 2 and 4), the expansions are pruned and that node becomes a leaf (red bars).

Figure 5: QA accuracy of Direct QA, Entailer, and the two combined on two datasets.