METGEN: A Module-Based Entailment Tree Generation Framework for Answer Explanation

Knowing the reasoning chains from knowledge to the predicted answers can help construct an explainable question answering (QA) system. Recent advances in QA explanation propose to explain the answers with entailment trees composed of multiple entailment steps. While current work proposes to generate entailment trees with end-to-end generative models, the steps in the generated trees are not constrained and could be unreliable. In this paper, we propose METGEN, a Module-based Entailment Tree GENeration framework that has multiple modules and a reasoning controller. Given a question and several supporting facts, METGEN can iteratively generate the entailment tree by conducting single-step entailment with separate modules and selecting the reasoning flow with the controller. As each module is guided to perform a specific type of entailment reasoning, the steps generated by METGEN are more reliable and valid. Experiment results on the standard benchmark show that METGEN can outperform previous state-of-the-art models with only 9% of the parameters.


Introduction
Explanation is recognized as a key factor toward responsible AI systems (Arrieta et al., 2020). In the context of question answering (QA), providing an explanation of the predicted answers can help improve the understandability, debuggability, and trustworthiness of QA models. Great efforts have been devoted to revealing how the models predict the answers and give explanations in various forms, including showing an attention map over passages (Seo et al., 2017), giving a snippet of textual evidence (DeYoung et al., 2020), and selecting answer-supporting sentences (Xie et al., 2020; Jansen and Ustalov, 2019). Among all explanation forms, the entailment trees (Dalvi et al., 2021) provide the most detailed and informative explanation by exposing the chains of reasoning from the knowledge to the predictions. As shown in Figure 1(a) and (c), given a hypothesis (summarizing a question+answer pair) and supporting facts (retrieved from a corpus), the goal is to generate an entailment tree where each non-leaf node is an entailment of its children. Providing a valid entailment tree would help users to understand how the hypothesis is proved, obtain novel intermediate conclusions from the basic knowledge, and gain detailed information to support decision making.
Figure 1: Given facts related to the question+answer, METGEN iteratively generates an entailment tree that contains the hypothesis (green), used facts (orange), and intermediate conclusions (blue) with several separate entailment modules and a reasoning controller.
To generate the entailment trees, Dalvi et al. (2021) propose EntailmentWriter, an end-to-end sequence-to-sequence generative model, trained by maximizing the generation likelihood of the linearized gold trees. However, they do not have an explicit strategy to constrain the validity of every single step and the tree structure. Thus, the steps are not guaranteed to satisfy the reasoning rules and could be incorrect and unreliable. For example, the step conclusion may not be entailed by the input premises or simply repeat one of the input premises (Dalvi et al., 2021). Furthermore, although their outputs are trees that can indicate the reasoning chains, the mapping mechanisms from the inputs to the trees remain implicit and invisible.
To tackle the above problems, we propose METGEN, a module-based framework to generate entailment trees in a more explicit approach and constrain the entailment steps with reasoning rules. As shown in Figure 1(b), given the target hypothesis and known facts, METGEN first uses the reasoning controller to select some steps that can help get closer to the hypothesis. Subsequently, METGEN executes the selected steps with single-step entailment modules and adds the generated intermediate facts into the known facts for the next round of reasoning. Through this iterative approach, METGEN proves the hypothesis step by step and generates the overall entailment tree.
Each module in METGEN is a generative model that can perform a specific type of entailment reasoning (e.g., making a substitution inference). To guide the modules to generate correct and sound conclusions, we train the modules with well-formed synthetic data containing the corresponding logical regularities of the reasoning types (Bostrom et al., 2021). Inspired by the forward chaining and backward chaining algorithms in logic programming (Chein and Mugnier, 2008), we adopt both deductive and abductive modules to execute forward and backward reasoning steps, respectively.
Experiments on the standard benchmark EntailmentBank (Dalvi et al., 2021) show that METGEN can outperform the previous best model with 9.0% of the model parameters. Manual evaluation results demonstrate that METGEN can generate more reliable steps. Further experiments under the data-scarce setting and cross-dataset setting (on eQASC and eOBQA (Jhamtani and Clark, 2020)) show that METGEN is more data-efficient and has better generalization capability compared with the baselines.

Related Works
Explainability in Question Answering. Recent works have explored the explainability of QA in various forms (Seo et al., 2017; Ye et al., 2020; Dalvi et al., 2021; Lamm et al., 2021; Wiegreffe and Marasovic, 2021; Thayaparan et al., 2020; Rosenthal et al., 2021). One way is to retrieve multiple supporting facts related to the question or answer (Xie et al., 2020; Jansen and Ustalov, 2019; Jhamtani and Clark, 2020; Inoue et al., 2020; Yadav et al., 2019, 2020; Valentino et al., 2021; Cartuyvels et al., 2020; Zhang et al., 2020). These "rationales" (DeYoung et al., 2020) provide insights into what is used by the model to inform its predictions, but do not show how the facts are combined to generate novel intermediate conclusions. Some other works explain QA systems in a generative way, including generating explanation sentences that directly link a question to an answer (Camburu et al., 2018; Rajani et al., 2019) and thus expose the relevant knowledge used by models (Latcinnik and Berant, 2020; Shwartz et al., 2020). However, as these models generate explanations in a free form, the generated facts may not necessarily be sound (Bostrom et al., 2021). Recently, Bostrom et al. (2021) propose ParaPattern, an automated pipeline for building two kinds of single-step deductions. Different from the above work, our method generates the explanations in a multi-step tree structure (Dalvi et al., 2021), showing which facts are combined, and how, to draw novel intermediate conclusions and reach the final answer. The intermediate conclusions are generated by deductive and abductive entailment modules that are constrained to perform specific types of reasoning.
Multi-Hop Proof Generation. Recently, several works propose to use transformers for multi-hop logical reasoning and generate reliable formal proofs (Clark et al., 2020; Talmor et al., 2020; Saha et al., 2020, 2021; Tafjord et al., 2021). However, they mainly focus on synthetic sentences, which have low linguistic variation and struggle to represent the flexible sentences in real QA scenarios.
Neural Module Networks. Decomposing the reasoning process into several pre-defined operations overlaps with the idea of neural module networks (Andreas et al., 2016; Hu et al., 2017; Gupta and Lewis, 2018; Gupta et al., 2020; Jiang et al., 2019). They typically assume that the question could be parsed into an executable program, i.e., the question explicitly describes the process to arrive at the answer. In our work, we tackle the questions/hypotheses that do not trivially describe the reasoning process and could be more challenging.
Figure 2: The goal is to prove the hypothesis with the given facts through reasoning iterations (the upper part). In the first reasoning iteration (the lower part), the initial state is denoted as H ⇐ {s 1 , s 2 , s 3 , s 4 , s 5 }. First, the controller selects promising steps, such as the backward abductive step H − s 5 and the forward deductive one s 4 + s 5 . Then, single-step entailment modules perform the reasoning steps and generate novel intermediate facts including i 1 , i 2 , i 3 . After that, the controller verifies that the states i 2 ⇐ {s 1 , s 2 , s 3 , s 4 } and H ⇐ {s 1 , s 2 , s 3 , i 1 } are closer to the completion of reasoning and thus selects them for the next reasoning iteration.

Task Definition
As shown in Figure 1, the inputs are a hypothesis H and some fact sentences S = {s 1 , s 2 , . . . , s n } (including both relevant and irrelevant ones) expressing knowledge. H is a declarative sentence derived from a question+answer pair and can be proved by the knowledge in S. The desired output is a valid entailment tree T with the root node being H, the leaves being facts selected from S, and the intermediate nodes being novel intermediate facts (e.g., i 1 , i 2 ). T is considered valid if each non-leaf node is a valid entailment (a conclusion that "a person would typically infer" (Dagan et al., 2013)) of its immediate children. We denote the annotated gold tree as T gold and its leaf facts as S gold . Following Dalvi et al. (2021), we consider three increasingly difficult tasks with different S: Task1 (no-distractor): S = S gold , Task2 (distractor): S = S gold + 15-20 distractors, Task3 (full-corpus): S = a corpus C.

METGEN
Figure 2 illustrates the reasoning process of METGEN. We reason one step at a time and iteratively generate the entailment trees. In each iteration, given a reasoning state (e.g., the initial state R 0 : H ⇐ S, where we aim to prove H using S), the reasoning controller selects promising steps, including forward deductive steps and backward abductive ones. We then use the corresponding modules to perform single-step entailment on the selected steps and generate novel intermediate facts. Finally, we use the controller to verify the generated facts and select the correct states to perform further reasoning. We introduce details about the module design, reasoning controller, and reasoning algorithm in Sec 4.1, 4.2, and 4.3, respectively.

Module Definition
We propose to divide the single-step entailment reasoning ability into a set of well-defined basic logical operations. Such a design could help improve the generalization capability (Bostrom et al., 2021;Rudin, 2019). As shown in Table 1, we adopt three common reasoning types, covering over 90% of the steps in EntailmentBank according to the analysis by Dalvi et al. (2021). Note that the entailment module types could be adjusted according to the specific tasks or domains, which allows our method to be flexibly applied to other problems.
We adopt both the deductive and abductive versions of the reasoning types. Take a gold step s 1 + s 2 → i 1 as an example. Deduction is the process of reasoning from the premises to reach a logical conclusion. A deductive module takes the two premises s 1 and s 2 as inputs and outputs a conclusion î 1 according to its reasoning types (denoted as s 1 + s 2 → î 1 ). Abduction is to find the best explanation given complete/incomplete observations (Harman, 1965). In the context of the entailment steps, given a conclusion i 1 and a premise fact s 2 as observations, the abductive module yields a plausible premise ŝ 1 (denoted as i 1 − s 2 → ŝ 1 ), where the generated premise ŝ 1 and the observed premise s 2 would most likely infer the conclusion i 1 . Although the steps in EntailmentBank may have more than two premises, we only consider the case of two premises. The reason is that the n-premise step (n > 2) could be further decomposed into several valid 2-premise steps (Dalvi et al., 2021) (See Appendix Figure 8 for a specific example).
Table 1: The used reasoning types. Here, s 1 and s 2 denote input premises for deductive modules, while i 1 denotes the entailed conclusion. For logical regularity, P (x) means that the predicate P is true for the entity x.
Example steps (type: s 1 | s 2 | i 1 ):
• substitution: an animal is a kind of organism. | an example of camouflage is organism having the same color as its environment. | an example of camouflage is an animal having the same color as its environment.
• substitution: nuts are a kind of fruit. | fruit contains seeds. | nuts contain seeds.
• substitution: xylem transports materials through the plant. | water is a kind of material that is required for plants' survival. | xylem transports water that is required by plants.
• substitution: the mass of a planet causes the pull of gravity on that planet. | earth is a kind of planet. | the mass of earth causes the pull of gravity on earth.
• conjunction: grass is a kind of green plant. | a tree is a kind of plant. | a tree and grass are both kinds of plants.
• conjunction: chemical splashing can cause harm to humans / to the eyes. | chemical splashing sometimes occurs during experiments. | chemical splashing during experiments can cause harm to the eyes.
• conjunction: planets orbit stars. | gravity causes orbits. | gravity causes planets to orbit stars.
• if-then: feeders by a road may cause animals to be killed by a car. | if something kills an animal then that something is not protecting that animal. | feeders by a road do not protect animals.
• if-then: a forest contains a large amount of wood. | if something contains something else then that something else can be found in that something. | large amounts of wood can be found in a forest.
• if-then: if a cell converts something into something else then that cell is a source of that something else. | solar cells convert solar energy into electrical energy. | solar cells are a source of electrical energy.

Module Training
Training the entailment modules with data that contains the corresponding logical regularities would guide them to perform correct inferences and ensure soundness (Bostrom et al., 2021). We first train the modules with synthetic sentences to learn the logical transformations and then further fine-tune them with the end task. We follow ParaPattern (Bostrom et al., 2021), a pipeline based on syntactic retrieval, rule-based example construction, and automatic paraphrasing, to collect synthetic sentences from Wikipedia. Since Bostrom et al. (2021) only consider the substitution and contraposition deductions, we extend the method to conjunction and if-then deductions by designing the specific syntactic templates and construction rules (See Appendix A.1). In addition, we also consider the abductive form of these modules. We then fine-tune the modules with corresponding steps in EntailmentBank to adapt the modules to the science domain. Since the original steps in EntailmentBank are not annotated with reasoning types, we manually label 400 steps of the training split and train a classifier with these steps. The remaining steps are labeled with the pseudo labels predicted by the classifier. We freeze the parameters of modules once the training is complete.

Reasoning Controller
In addition to single-step reasoning modules, we need to search for the correct path to reach the target hypothesis. The entire reasoning search space would grow rapidly as the number of input facts increases and there would also be complex branching in the trees. We introduce a reasoning controller to filter out incorrect facts, steps, and states to reduce the search space and complete the reasoning accurately and efficiently. Figure 2 shows how the controller is used in each reasoning iteration. At the beginning of the iteration, the controller scores all possible steps and selects the most promising ones for single-step entailment. After the entailment modules generate intermediate facts, the controller estimates which state with a generated fact gets closer to the completion of reasoning and selects the best states for the next iteration. Besides the usage within each iteration, the controller also rates all facts at the start of the whole reasoning process and keeps only the relevant facts for the initial state when fact distractors exist.

Controller Model
The controller model scores steps, facts, and states based on a transformer, and its structure is shown in Figure 3.
Figure 3: Reasoning controller illustration. Given a state, the controller predicts a score for the whole state, scores for facts, and scores for all possible steps.
Encoding. We first encode the target hypothesis and facts of the state with a pre-trained transformer:
tion of all tokens within the sentence.
Steps. We introduce feed forward networks FFN ded and FFN abd for deductive steps and abductive steps, respectively. Each combination of two facts is a possible deductive step (s i , s j ). Each combination of the target hypothesis and a fact is a possible abductive step (H, s k ). We score them by a score function G step , where [·] is the concatenate operation. We normalize the step scores by applying Softmax over all possible deductive and abductive steps.
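The candidate step space and the joint normalization described above can be illustrated with a minimal sketch. It only enumerates the candidates and applies Softmax over externally supplied raw scores; the feed forward networks themselves are not modeled. The function names, the `ratio` parameter, and the rounding rule used when keeping a fraction of the steps are assumptions made for illustration.

```python
import math
from itertools import combinations


def enumerate_steps(fact_ids, hypothesis_id="H"):
    """List candidate steps for one state.

    Deductive candidates pair two facts; abductive candidates pair the
    target hypothesis with one fact.
    """
    deductive = [("ded", a, b) for a, b in combinations(fact_ids, 2)]
    abductive = [("abd", hypothesis_id, k) for k in fact_ids]
    return deductive + abductive


def softmax(raw_scores):
    """Normalize raw scores jointly over all candidate steps."""
    shift = max(raw_scores)  # subtract the max for numerical stability
    exps = [math.exp(x - shift) for x in raw_scores]
    total = sum(exps)
    return [e / total for e in exps]


def select_top_steps(steps, raw_scores, ratio=0.1):
    """Keep the highest-scoring fraction of steps (rounding up is an assumption)."""
    probs = softmax(raw_scores)
    ranked = sorted(zip(steps, probs), key=lambda pair: pair[1], reverse=True)
    keep = max(1, math.ceil(ratio * len(ranked)))
    return ranked[:keep]


steps = enumerate_steps(["s1", "s2", "s3", "s4", "s5"])
# Placeholder raw scores; in the framework these would come from the controller.
top = select_top_steps(steps, [float(i) for i in range(len(steps))])
```

With five facts, the sketch produces ten deductive and five abductive candidates, which shows how quickly the step space grows as facts are added.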
Facts. The fact score indicates whether the fact is useful by how similar the fact is to the state's target hypothesis. We assume that if a fact has a smaller depth in the gold entailment tree (i.e., closer to the root), it would be more similar to the target hypothesis than those facts with a larger depth. We introduce FFN fact as a learnable similarity function and determine the fact score by comparing it with the target, where σ is the Sigmoid function.
State. The state score reflects the quality of the current state and indicates whether this state should be used for further reasoning. We assign the state score using the following two parts: where λ is a learnable weight, f [CLS] is the representation of [CLS], FFN cls is a feed forward network. The first part helps choose states that contain more relevant facts and fewer distractors. The second part comprehensively considers the whole state and gives the promising one a higher score.

Controller Training
Training State Construction. We decompose the gold entailment trees into several intermediate states for training. We add disturbances to the trees to make positive and negative states. For each gold deductive step (e.g., s 1 + s 2 → i 1 ), we use the deductive module to predict a conclusion î 1 . If the predicted î 1 is correct, we replace i 1 in the state with î 1 to make new positive states. Otherwise, we replace i 1 with î 1 to make negative states. The abductive modules are also used in a similar way.
Loss Function. We train the controller with corresponding margin ranking losses L step , L fact , and L state to learn to rank the correct steps, facts, and states ahead of incorrect ones, respectively. Specifically, the loss for scoring steps is where p + and p − are the positive and negative step, is the margin loss, and m step is the margin for steps. For facts, we have where s + 1 is a fact which has smaller depth in the gold tree than s + 2 , s − is the distractor, N 2 is the number of (s + 1 ,s + 2 ) pairs, N 3 is the number of distractors, and m fact is the margin for facts.
For states, we sample a positive state R + and a negative state R − from a tree and train the controller with where m state is the margin for states.
Finally, we average the above losses over all trees in the training split and train the controller with the resulting objective. Appendix B gives more controller training details.
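As an illustration of the ranking objective, the following sketch uses the standard hinge form of a margin ranking loss. This form, the function names, and the simple averaging are assumptions; the exact normalization and the way the three losses are combined are not specified by this sketch.

```python
def margin_ranking_loss(pos_score, neg_score, margin):
    """Hinge-style ranking loss (assumed standard form).

    The loss is zero once the positive item outscores the negative one
    by at least `margin`.
    """
    return max(0.0, margin - (pos_score - neg_score))


def mean_pairwise_loss(pairs, margin):
    """Average the margin loss over (positive_score, negative_score) pairs."""
    if not pairs:
        return 0.0
    return sum(margin_ranking_loss(p, n, margin) for p, n in pairs) / len(pairs)


# Hypothetical controller scores for (correct step, incorrect step) pairs.
step_pairs = [(0.9, 0.2), (0.5, 0.45)]
step_loss = mean_pairwise_loss(step_pairs, margin=0.1)
```

The same pairwise form can serve for facts (a shallower gold fact against a deeper one, or a gold fact against a distractor) and for states (R + against R − ), each with its own margin.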

Reasoning Algorithm
Since the entailment trees are generated iteratively and the search space for reasoning could be large for each iteration, we adopt beam search for efficient reasoning. Given the initial state H ⇐ S, we first remove s i with a low fact score to filter distractors. Subsequently, we perform several reasoning iterations until the target hypothesis is proved or the maximum reasoning depth is reached. In each iteration, we select the steps with the highest step scores, execute the steps with all types of deductive or abductive modules, and construct novel states with the generated intermediate facts. We retain the top-K states ranked by state scores for the next iteration, where K is the beam size. More algorithm details are in Appendix Algorithm 1.
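The iterative procedure can be sketched as a generic beam search over reasoning states. This is a simplified illustration, not Appendix Algorithm 1: the `State` class and all callables (`fact_score`, `propose`, `state_score`, `is_proved`) are hypothetical stand-ins for the controller and the entailment modules, and the toy functions at the end exist only to make the sketch runnable.

```python
from dataclasses import dataclass
from itertools import combinations
from typing import Callable, Iterable, Optional, Tuple


@dataclass(frozen=True)
class State:
    target: str                   # sentence that remains to be proved
    facts: Tuple[str, ...]        # facts currently available as premises
    steps: Tuple[str, ...] = ()   # steps taken so far


def beam_search(
    hypothesis: str,
    facts: Iterable[str],
    fact_score: Callable[[str, str], float],
    propose: Callable[[State], Iterable[State]],
    state_score: Callable[[State], float],
    is_proved: Callable[[State], bool],
    fact_threshold: float = 0.001,
    beam_size: int = 10,
    max_depth: int = 5,
) -> Optional[State]:
    # Drop facts whose fact score falls below the threshold (distractor filtering).
    kept = tuple(f for f in facts if fact_score(hypothesis, f) >= fact_threshold)
    beam = [State(hypothesis, kept)]
    for _ in range(max_depth):
        proved = [s for s in beam if is_proved(s)]
        if proved:
            return max(proved, key=state_score)
        # `propose` stands in for step selection plus module execution.
        candidates = [nxt for s in beam for nxt in propose(s)]
        if not candidates:
            return None
        # Keep only the top-K states for the next iteration.
        beam = sorted(candidates, key=state_score, reverse=True)[:beam_size]
    proved = [s for s in beam if is_proved(s)]
    return max(proved, key=state_score) if proved else None


# Toy stand-ins, purely for illustration: a "module" that joins two facts.
def toy_fact_score(hypothesis: str, fact: str) -> float:
    return 1.0 if fact in hypothesis else 0.0


def toy_propose(state: State):
    for a, b in combinations(state.facts, 2):
        new_fact = f"{a} and {b}"
        rest = tuple(f for f in state.facts if f not in (a, b))
        step = f"{a} + {b} -> {new_fact}"
        yield State(state.target, rest + (new_fact,), state.steps + (step,))


def toy_is_proved(state: State) -> bool:
    return state.target in state.facts


def toy_state_score(state: State) -> float:
    return 1.0 if toy_is_proved(state) else 0.0


result = beam_search(
    "plants need water and plants need light",
    ["plants need water", "plants need light", "rocks are hard"],
    toy_fact_score, toy_propose, toy_state_score, toy_is_proved,
)
```

In this toy run the distractor is filtered before the first iteration, and a single forward step proves the target. An abductive step would fit the same interface by returning a state with a new `target`.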

Experiments
We conduct experiments on EntailmentBank (Dalvi et al., 2021), the first dataset supporting QA explanations in the form of the entailment tree. EntailmentBank contains 1,840 entailment trees, each of which corresponds to a question from the ARC dataset (Clark et al., 2018). On average, each tree contains 7.6 nodes across 3.2 steps. Summary statistics are shown in Table 2.

Evaluation Metrics
Following Dalvi et al. (2021), we first align nodes in the predicted tree T pred with nodes in the gold tree T gold and then evaluate with three dimensions:
• Leaves: To evaluate whether T pred uses the correct leaf facts, we compute F1 score by comparing the predicted leaf facts S pred to S gold .
• Steps: To evaluate whether the individual steps are structurally correct, we compare all steps in two trees and compute F1. A predicted step is considered structurally correct if its children's identifiers (e.g., s 1 , i 2 ) perfectly match the gold ones.
• Intermediates: To evaluate whether the intermediate conclusions are correct, we report the F1 of comparing the aligned intermediate conclusions.
The AllCorrect score is 1 if F1 is 1, 0 otherwise 2 . Given the above scores, we comprehensively evaluate T pred with Overall AllCorrect, whose value is 1 if and only if the leaves, steps, and intermediates are all correct. This is a strict metric since any error in T pred will lead to a score of 0.
1 The threshold was picked using 300 manually labeled pairs (Dalvi et al., 2021).
2 We repair a bug in the official evaluation code, which makes the Intermediate AllCorrect = 1 if the precision = 1 (rather than if F1 = 1), which leads to an overestimation on the Intermediate AllCorrect.
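The relation between F1, AllCorrect, and Overall AllCorrect can be made concrete with a small sketch. It assumes the tree alignment has already been done and that items are compared as sets of identifiers; the function names are hypothetical and this is not the official evaluation code.

```python
def f1_score(predicted, gold):
    """Set-based F1 between predicted and gold items."""
    predicted, gold = set(predicted), set(gold)
    overlap = len(predicted & gold)
    if overlap == 0:
        return 0.0
    precision = overlap / len(predicted)
    recall = overlap / len(gold)
    return 2 * precision * recall / (precision + recall)


def all_correct(f1):
    """AllCorrect is 1 only when F1 is exactly 1."""
    return 1 if f1 == 1.0 else 0


def overall_all_correct(leaf_f1, step_f1, intermediate_f1):
    """Overall AllCorrect is 1 only when every dimension is all correct."""
    return int(all(all_correct(x) for x in (leaf_f1, step_f1, intermediate_f1)))


gold_leaves = {"s2", "s4", "s5"}
# One extra (distractor) leaf lowers precision, so F1 < 1 and AllCorrect = 0.
leaf_f1 = f1_score({"s1", "s2", "s4", "s5"}, gold_leaves)

# Precision can be 1 while F1 is below 1 when gold items are missing.
partial_f1 = f1_score({"i1"}, {"i1", "i2"})
```

The second case shows why checking precision alone would overestimate AllCorrect: every predicted item is right, yet one gold item is missing.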

Baselines
We compare with the SOTA entailment tree generation method EntailmentWriter (Dalvi et al., 2021), which directly generates the linearized trees (e.g., s 2 + s 5 → i 1 : eruptions block sunlight; s 4 + i 1 → H) given H + QA + S with an end-to-end encoder-decoder framework. We also follow the "Iterative" ProofWriter (Tafjord et al., 2021), which is one of the SOTA proof generation methods for logical reasoning, to extend EntailmentWriter to EntailmentWriter-Iter. EntailmentWriter-Iter iteratively generates a part of the linearized tree in one forward process (e.g., s 2 + s 5 → i 1 : eruptions block sunlight;) and concatenates all parts to make the final tree. It completes the step selection and entailment reasoning in a seq2seq model and does not provide the reasoning types of steps.

Implementation Details
Modules. We implement the entailment modules on top of T5-large (Raffel et al., 2020) with the following two implementations. (1) Separated. We implement each module separately. We have six models in total, corresponding to the three reasoning types of deductive and abductive versions. (2) Prefixed. We implement all modules with a single model. To specify which reasoning type the model should perform, we follow Raffel et al. (2020) to add a type-specific prefix (e.g., "deductive substitution:") to the input before feeding it to the model. To evaluate the modules, we annotate the types of 275 steps in the dev split. We train the modules with a batch size of 20 for 100 epochs.
Controller. The controller is implemented with albert-xxlarge-v2 (Lan et al., 2019). We train two individual controllers for Task1 and Task2. For Task3, we reuse the Task2 model without additional training. The controllers are trained with a batch size of 10 for 1,000 epochs. The margins m step , m fact , and m state are tuned on the development split and all set to 0.1.
Algorithm. For Task1, we iterate until all facts in S are used. For Task2, we use a fact score threshold of 0.001 to filter distractors and a maximum reasoning depth of 5. We select the top 10% steps for each state and set the beam size to 10. All hyper-parameters are selected using the dev split (Appendix C). For Task3, we follow Dalvi et al. (2021) to retrieve 25 sentences from the corpus C using H as the query. We use the same retrieval results as EntailmentWriter for a fair comparison. Model checkpoints are selected using the dev split. More implementation details can be found in the Appendix.
Table 4: Entailment tree evaluation results on 100 uniformly sampled questions from the test split. We report the proportion (%) of the predicted trees that are rated as valid, following automatic and manual evaluation.
Result Analysis

Automatic Evaluation
As shown in Table 3, our methods outperform all baseline methods on the strictest metric Overall AllCorrect for all three tasks. Notice that the trees generated by our methods only contain 2-premise steps, which would lead to a 0 Overall AllCorrect score on 26% of test samples whose annotations contain n-premise (n > 2) steps. Even so, our METGEN-separated still obtains an absolute improvement of 1.4%/2.4%/5.7% on Task1/2/3 in comparison to the strongest baseline. With only 9.0% of the model parameters, METGEN-prefixed can outperform EntailmentWriter (T5-11B) by an absolute 0.9%/2.1%/5.7% on Task1/2/3. In the case of using a comparable amount of model parameters, METGEN-prefixed also outperforms EntailmentWriter-Iter (T5-large) by a large margin. For Task3, we note that all methods perform poorly. The main reason is that the retrieved facts may not contain all the required facts S gold (68% of the cases). We note that METGEN underperforms the baselines on some metrics, probably due to the inaccuracy of the tree alignment algorithm in the automatic evaluation (Appendix G).

Manual Evaluation
As analysed by Dalvi et al. (2021), the automated metrics might misjudge some valid trees and thus underestimate the performance. To make a more accurate comparison, we perform a manual evaluation. We compare three methods with a comparable amount of model parameters: EntailmentWriter (T5-large), EntailmentWriter-Iter (T5-large), and METGEN-prefixed. For each step and tree, we invite three students as experts to evaluate the validity. The inter-annotator agreement (Cohen's kappa statistic) is 0.85/0.76 for the step/tree, indicating substantial agreement between annotators.
Validity of Full Entailment Trees. As shown in Table 4, under the manual evaluation, METGEN outperforms the baselines by large margins.
Validity of Individual Entailment Steps. We review the validity of the single-step entailment and annotate each step with one of the four categories:
• Valid: The step conclusion can be inferred from the premises and does not trivially repeat them.
• Unsupported: The conclusion is in conflict with, irrelevant to, or does not follow from the premises.
• Repeat premises: The conclusion trivially repeats one or more of the premises.
• Missing premises: The conclusion uses knowledge unstated in the premises. The step would be correct if one additional premise from S is added.
As shown in Figure 4, METGEN achieves considerable improvement in the validity of steps compared to the baseline methods. We note that 17% of the steps of EntailmentWriter belong to missing premises. METGEN constrains the reasoning types of steps and uses the premise-related and context-independent entailment modules to perform every single step. This can reduce the cases of missing premises (from 17% to 2%) and improve the validity of the conclusions (from 38% to 70%).

Ablation Study
Entailment Modules Analysis. Table 5 reports the ablation results on modules. We report the Overall AllCorrect on the test split and the single-step entailment accuracy on the labeled dev steps, and make the following observations. (1) Separated vs. Prefixed. We can see that METGEN-prefixed achieves slightly worse performance than METGEN-separated ((a) vs. (f) and (d) vs. (g)). This is mainly because separate modules could better learn different types of reasoning. However, in our final system, we still choose to use METGEN-prefixed due to the consideration of model size.
(2) Clarifying Reasoning Types. We train a module to infer without distinguishing or assigning specific reasoning types. We find that the performance drops from 27.4% to 25.9% ((g) vs. (h)), suggesting that clarifying the reasoning types of the entailment steps is crucial for generating entailment trees. (3) Training Data. Comparing (a) and (d), we find that training with the synthetic data could improve the accuracy. Without tuning on EntailmentBank (setting (e)), the modules might not adapt to the science domain and obtain low step accuracy. However, the well-trained controller would verify and filter the erroneous conclusions, thus our method can still achieve 23.5% on Overall AllCorrect. (4) Generative Model. A stronger generative model, which achieves higher single-step accuracy, could achieve higher tree generation performance (comparing (a), (b) and (c)), indicating that our method can be further improved with stronger entailment modules.
Controller and Algorithm Analysis. (1) Is the reasoning controller necessary? To answer this question, we design a heuristic generation algorithm without the controller (Appendix D). It uses the BLEURT scores as heuristic information to guide the reasoning. As shown in Table 6, the heuristic method achieves observably lower performance. The controller could aid in eliminating the erroneous steps and states, so as to find the valid trees efficiently and accurately. Without the controller, we find it difficult to find effective heuristic information.
(2) Effect of Abductive Steps. The generation performance drops when abductive steps are not used. This suggests that abductive steps, as a way of backward searching, could help improve the quality of generated trees.
Figure 5 reports the results in the data-scarce setting. Our method is more data-efficient. With only 1% of the EntailmentBank training data, our [...] When the data is scarce, the advantage of training our modules with synthetic data becomes more significant. It can help alleviate overfitting on the few EntailmentBank sentences.

Cross-dataset Setting
To test the generalization capability of our method, we conduct cross-dataset experiments on the datasets eQASC and eOBQA (Jhamtani and Clark, 2020), which collect one-step entailment trees for questions from QASC (Khot et al., 2020) and OpenBookQA (Mihaylov et al., 2018), respectively. Given H and S, their task requires selecting the valid one-step trees (e.g., s 1 + s 2 → H) from a candidate set. We apply the Task2 models (without fine-tuning on eQASC or eOBQA) to select from the candidate trees (Appendix E). Following Jhamtani and Clark (2020), we evaluate models with the P@1 and NDCG metrics. Questions with no valid tree are filtered. As shown in Table 7, our method achieves better generalization performance. Instead of training a seq2seq model with a single generation loss, our method explicitly models the step and state selection ability (Equations (1) and (3)) and guides the controller with specific losses to rank the correct ones ahead of incorrect ones. This design could help alleviate overfitting on the training data and improve generalization.
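For reference, the two ranking metrics can be sketched as follows. The sketch assumes binary relevance labels (1 for a valid candidate tree, 0 otherwise) listed in the order ranked by the model, and the common log2-discounted form of NDCG; the exact variant used in the evaluation may differ.

```python
import math


def precision_at_1(ranked_labels):
    """P@1: 1.0 if the top-ranked candidate is valid, else 0.0."""
    return float(ranked_labels[0]) if ranked_labels else 0.0


def dcg(ranked_labels):
    """Discounted cumulative gain with a log2 position discount."""
    return sum(rel / math.log2(pos + 2) for pos, rel in enumerate(ranked_labels))


def ndcg(ranked_labels):
    """DCG normalized by the DCG of the ideal (sorted) ranking."""
    ideal = dcg(sorted(ranked_labels, reverse=True))
    return dcg(ranked_labels) / ideal if ideal > 0 else 0.0


# Hypothetical ranking of three candidate trees: valid, invalid, valid.
labels = [1, 0, 1]
p1 = precision_at_1(labels)
score = ndcg(labels)
```

Questions whose candidate set contains no valid tree make the ideal DCG zero, which is consistent with filtering such questions before evaluation.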

Conclusion
We propose METGEN, a module-based framework to generate the entailment trees for explaining answers. METGEN reasons with single-step entailment modules and the reasoning controller. Experiments on the EntailmentBank benchmark show METGEN can generate valid trees with reliable steps and achieve SOTA performance.

A.1 Synthetic Data
We follow the ParaPattern (Bostrom et al., 2021) to collect synthetic training data for the entailment modules. Since they only consider the substitution and contraposition deductions, we extend the method to conjunction and if-then deductions by designing the specific syntactic templates and construction rules. Table 9 shows the used syntactic patterns. We use Spacy 3 to match sentences from Wikipedia (version "20200501.en"). In total, we collect about 24k, 443k, and 97k sentences for substitution, conjunction, and if-then modules, respectively. We follow Bostrom et al. (2021) to train the modules on the synthetic data with a learning rate of 3e-5 for 1 epoch.

A.2 Reasoning Type Annotations of EntailmentBank
The original steps in EntailmentBank are not annotated with reasoning types. We manually annotate the reasoning types of 400 steps in the training split (Train-manual) and 275 steps in the development split (Dev-manual). To label the remaining steps in the training split, we train a RoBERTa-large (Liu et al., 2019) classifier on the Train-manual steps. It achieves an accuracy of 88% on the Dev-manual steps. We use the classifier to predict the reasoning types of the remaining 2-premise steps and take the predicted types as pseudo labels (Train-pseudo).

B Controller Training Details
Training Data. We decompose the gold entailment trees into several intermediate states for training. For example, the tree in Figure 1(c) can be decomposed into the following positive states: $R_0: H \Leftarrow \{s_1, s_2, s_3, s_4, s_5\}$, $R_1: H \Leftarrow \{s_1, s_3, s_4, i_1\}$, and $R_2: i_1 \Leftarrow \{s_1, s_2, s_3, s_5\}$. The state $R_0$ has two distractors $s_1$ and $s_3$, one positive deductive step $s_2 + s_5 \rightarrow i_1$, and one positive abductive step $H - s_4 \rightarrow i_1$. We add disturbances to the trees to make positive and negative states. For the state $R_1$, the fact $i_1$ is the conclusion of the gold step $s_2 + s_5 \rightarrow i_1$. We use a deductive module to predict a conclusion $\hat{i}_1$ given $s_2$ and $s_5$. If the predicted $\hat{i}_1$ is correct, we replace $i_1$ with $\hat{i}_1$ to make a new positive state $R_1^+: H \Leftarrow \{s_1, s_3, s_4, \hat{i}_1\}$, which can be used for further reasoning. Otherwise, we replace $i_1$ with $\hat{i}_1$ to make a negative state $R_1^-$, which contains an incorrect conclusion $\hat{i}_1$ and thus should not be used for further reasoning. The reasoning controller is trained to distinguish between $R_1^+$ and $R_1^-$ and to give $R_1^+$ a higher state score than $R_1^-$. To judge whether the generated $\hat{i}_1$ is correct, we follow the evaluation metrics of Dalvi et al. (2021) and use BLEURT: the predicted $\hat{i}_1$ is considered correct if the BLEURT score between $\hat{i}_1$ and the gold $i_1$ is larger than 0.28.
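The labeling rule described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: `make_perturbed_state` and `token_jaccard` are hypothetical names, and `token_jaccard` is only a stand-in for BLEURT so that the example runs without a model. Only the 0.28 threshold comes from the text.

```python
from typing import Callable, List, Tuple

THRESHOLD = 0.28  # correctness threshold stated in the text (applied to BLEURT there)


def token_jaccard(candidate: str, reference: str) -> float:
    """Stand-in scorer (assumption): token-set Jaccard instead of BLEURT."""
    a, b = set(candidate.lower().split()), set(reference.lower().split())
    return len(a & b) / len(a | b) if a | b else 0.0


def make_perturbed_state(
    facts: List[str],
    gold_conclusion: str,
    predicted_conclusion: str,
    scorer: Callable[[str, str], float] = token_jaccard,
    threshold: float = THRESHOLD,
) -> Tuple[List[str], str]:
    """Swap the gold intermediate for the module prediction and label the state."""
    new_facts = [predicted_conclusion if f == gold_conclusion else f for f in facts]
    is_correct = scorer(predicted_conclusion, gold_conclusion) > threshold
    return new_facts, "positive" if is_correct else "negative"


# Hypothetical state: three leaf facts (by name) plus one gold intermediate.
gold_i1 = "hot chocolate is a source of heat"
state_facts = ["s1", "s3", "s4", gold_i1]
print(make_perturbed_state(state_facts, gold_i1, "hot chocolate is a heat source"))
print(make_perturbed_state(state_facts, gold_i1, "the moon orbits the earth"))
```

Passing a real BLEURT scorer as `scorer` would bring the sketch closer to the described procedure; the state layout used here is illustrative only.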

C Reasoning Algorithm and Hyperparameter Analysis
Algorithm 1 shows the whole reasoning process. The hyperparameters are selected on the development split, as shown in Figure 6. We select a beam size of 10, a maximum reasoning depth of 5, a distractor threshold of 0.001, and a step sampling rate of 10%. We only consider steps whose sentences have word overlap. When constructing the entailment tree, we use BLEURT scores to align the target of a state to the most similar fact. Note that when making a new reasoning state with the step p and the novel intermediate fact i, if p is a backward abductive step, we replace the original target hypothesis with i and treat i as the target hypothesis that the new state aims to prove (as shown in Figure 2). We run our method three times and report the average performance.
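Three of the pruning choices mentioned above (distractor threshold, word-overlap filter, beam size) can be illustrated with a short sketch. It is not Algorithm 1 itself: the function names, the stopword list, the `>=` comparison, and all scores in the demo are assumptions made for illustration. Only the values 10 and 0.001 come from the text.

```python
import heapq
from itertools import combinations
from typing import Dict, List, Set, Tuple

BEAM_SIZE = 10                # reported in the text
DISTRACTOR_THRESHOLD = 0.001  # reported in the text
STOPWORDS = {"a", "an", "the", "is", "are", "of", "to"}  # assumption, demo only


def content_tokens(sentence: str) -> Set[str]:
    """Lowercased whitespace tokens without the (assumed) stopwords."""
    return set(sentence.lower().split()) - STOPWORDS


def drop_distractors(fact_scores: Dict[str, float],
                     threshold: float = DISTRACTOR_THRESHOLD) -> List[str]:
    """Keep facts whose fact score reaches the threshold."""
    return [fact for fact, score in fact_scores.items() if score >= threshold]


def overlapping_pairs(facts: List[str]) -> List[Tuple[str, str]]:
    """Candidate 2-premise steps: fact pairs sharing at least one content token."""
    return [(a, b) for a, b in combinations(facts, 2)
            if content_tokens(a) & content_tokens(b)]


def prune_beam(scored_states: List[Tuple[float, str]],
               beam_size: int = BEAM_SIZE) -> List[Tuple[float, str]]:
    """Keep the beam_size states with the highest state scores."""
    return heapq.nlargest(beam_size, scored_states, key=lambda item: item[0])


# Made-up fact scores for illustration.
fact_scores = {
    "a spoon is a kind of object": 0.62,
    "metal is a thermal conductor": 0.91,
    "spoons are usually made of metal": 0.88,
    "the moon orbits the earth": 0.0004,
}
kept = drop_distractors(fact_scores)
print(overlapping_pairs(kept))
print(prune_beam([(0.2, "R_a"), (0.9, "R_b"), (0.5, "R_c")], beam_size=2))
```

Without stemming, "spoon" and "spoons" do not match, so the overlap test here is stricter than a lemma-based one would be.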

D Heuristic Reasoning Algorithm without the Controller
To investigate the effect of the reasoning controller on entailment tree generation, we design a heuristic generation algorithm that does not use the reasoning controller. Since traversing the entire search space is unaffordable, we adopt beam search. In each reasoning state, we try all possible steps with the entailment modules and make new candidate reasoning states. To select the correct states, we use BLEURT scores as heuristic information to guide the search. Specifically, given a candidate state $R: H \Leftarrow S$, we estimate the similarity between a fact $s_i \in S$ and the target $H$ and then score the candidate state accordingly. The top-K candidate states with the highest state scores are selected for further reasoning, where K is the beam size. We use the same beam size as the algorithm with the controller.

Algorithm 1 (final lines): Align the target of R to the most similar fact sentence of R to make a tree T. 26: end for. 27: Select the tree $\hat{T}$ with the highest score. 28: Return the entailment tree $\hat{T}$.

E Experiment Details on eQASC and eOBQA
For each question+answer pair, eQASC/eOBQA provides the corresponding hypothesis H, about 10/4 facts as S, and a candidate set of steps. Each candidate step is a 2-premise single step from two facts to H (e.g., $s_1 + s_2 \rightarrow H$) and can be viewed as a one-step entailment tree with three nodes. The target is to select the correct trees/steps from the candidate set. There might be more than one correct tree in the candidate set. We conduct experiments on the questions with at least one correct entailment tree (677 eQASC questions and 79 eOBQA questions). Since the given S contains distractors, we adopt the Task2 models trained on EntailmentBank (without further fine-tuning on eQASC and eOBQA) to perform cross-dataset experiments.
For our method, we follow our Task2 reasoning algorithm to select from the candidate trees/steps. Specifically, we first filter out the facts in S with low fact scores using a threshold (selected using the development split). Then we predict the step scores for the candidate steps and select the step with the highest score. For the EntailmentWriter, we feed S and H to the EntailmentWriter and score each candidate step with $1/\mathrm{PPL}$, where PPL is the perplexity of the sequence segment representing the step (e.g., sent1 & sent2 for $s_1 + s_2$ in the official EntailmentWriter implementation).
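The $1/\mathrm{PPL}$ score used for the EntailmentWriter baseline can be sketched as below, assuming per-token log-probabilities of the step segment are already available from the generator. The function names and the numbers in the demo are hypothetical; extracting the log-probabilities from the model is not shown.

```python
import math
from typing import Dict, List


def inverse_perplexity(token_logprobs: List[float]) -> float:
    """Return 1/PPL, where PPL = exp(-mean(log p(token)))."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_logprobs) / len(token_logprobs)
    return 1.0 / math.exp(mean_nll)


def best_step(candidates: Dict[str, List[float]]) -> str:
    """Pick the candidate step segment with the highest 1/PPL score."""
    return max(candidates, key=lambda step: inverse_perplexity(candidates[step]))


# Made-up natural-log probabilities for two candidate step segments.
candidates = {
    "sent1 & sent2": [math.log(0.5), math.log(0.5)],
    "sent1 & sent3": [math.log(0.1), math.log(0.4)],
}
for step, logprobs in candidates.items():
    print(step, round(inverse_perplexity(logprobs), 3))
print(best_step(candidates))
```

Because perplexity normalizes by segment length, longer step segments are not penalized merely for containing more tokens.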
We follow the official evaluation metrics of eQASC and eOBQA. The P@1 (Precision@1) measures the fraction of cases where the selected tree (topmost ranked) is correct. It is equivalent to the Overall AllCorrect score between the top-1 predicted one-step tree and the best-matching gold tree. The NDCG (Normalized Discounted Cumulative Gain) metric measures how well ranked the candidate trees are when ordered by the predicted scores. It reflects the model's ability to distinguish the validity of trees and rank the correct trees ahead of the incorrect ones.
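For concreteness, the two ranking metrics can be sketched for a single question as follows. This is a generic formulation with binary relevance and a log2 discount, which is an assumption; the official eQASC/eOBQA scripts may differ in details such as tie handling. The scores and labels in the demo are made up.

```python
import math
from typing import List


def ranked_labels(scores: List[float], labels: List[int]) -> List[int]:
    """Labels reordered by descending predicted score."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return [labels[i] for i in order]


def precision_at_1(scores: List[float], labels: List[int]) -> float:
    """1.0 if the top-ranked candidate tree is correct, else 0.0."""
    return float(ranked_labels(scores, labels)[0])


def dcg(labels_in_rank_order: List[int]) -> float:
    """Discounted cumulative gain with a log2(rank + 1) discount."""
    return sum(rel / math.log2(rank + 1)
               for rank, rel in enumerate(labels_in_rank_order, start=1))


def ndcg(scores: List[float], labels: List[int]) -> float:
    """DCG of the predicted ranking divided by the DCG of an ideal ranking."""
    ideal = dcg(sorted(labels, reverse=True))
    return dcg(ranked_labels(scores, labels)) / ideal if ideal > 0 else 0.0


# One question with three candidate trees; label 1 marks a valid tree.
scores = [0.9, 0.2, 0.5]
labels = [0, 1, 1]
print(precision_at_1(scores, labels), round(ndcg(scores, labels), 4))
```

Dataset-level numbers would be obtained by averaging these per-question values over all retained questions.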

F Main Experimental Environments
We deploy all models on a server with 500GB of memory and one 40GB A100 GPU. Specifically, the configuration environment of the server is Ubuntu 21.04 and our code mainly depends on

Substitution
Original Sentence: Slime molds like Physarum polycephalum are useful for studying cytoplasmic streaming.
Premises: Physarum polycephalum are a slime mold. Slime molds are useful for studying cytoplasmic streaming.
Paraphrased: The polycephalum is a slime mold. The slimy molds are useful for studying streaming.
Conclusion: Physarum polycephalum are useful for studying cytoplasmic streaming.

Conjunction
Original Sentence: Aman is an Indian anti-war movie directed by Mohan Kumar.
Premises: Aman is an Indian anti-war movie. Aman is directed by Mohan Kumar.
Paraphrased: Aman is a Indian film that is anti-war. Mohan Kumar was the director of Aman.
Conclusion: Aman is an Indian anti-war movie directed by Mohan Kumar.

If-then
Original Sentence: If the rebels occupy territory, they make a gain.
Premises: If the rebels occupy territory they make a gain. The rebels occupy territory.
Paraphrased: The rebels are able to make a gain if they hold on to territory. The territory was occupied by the rebels.
Conclusion: The rebels make a gain.

G Discussion on the Automatic Evaluation
As discussed by Dalvi et al. (2021), the automatic entailment tree evaluation metrics might misjudge some cases (e.g., tree structure variation) and still need to be improved. In fact, how to quantitatively evaluate a predicted tree remains a challenging problem. In the existing metric, the first step is the tree alignment algorithm (Dalvi et al., 2021). The nodes in the predicted tree $T_{pred}$ are aligned to the nodes in the gold tree $T_{gold}$ for further comparison. Each non-leaf node $i_{pred}$ of $T_{pred}$ is aligned to the first non-leaf node $i_{gold}$ where the Jaccard similarity of their respective leaf sentences is maximum. Any $i_{pred}$ with zero Jaccard similarity to all gold nodes is aligned to a dummy gold node with a blank conclusion. In the official implementation, (1) each $i_{gold}$ may correspond to more than one $i_{pred}$, while there is no penalty for duplication when calculating Intermediate F1; (2) the root node (the given hypothesis sentence, which is identical in $T_{pred}$ and $T_{gold}$) is trivially viewed as a normal intermediate node. Figure 7 shows a specific case. To alleviate the inaccuracy caused by the above reasons, we mainly use the stricter metrics (i.e., Leaves/Steps/Intermediates/Overall AllCorrect) for comparison. Furthermore, we adopt manual evaluation on the full trees and individual steps to make a more accurate comparison (Sec. 6.2).

Figure 7 (node text). First tree: H: cellular respiration produces energy for animal activities by extracting energy from food; 1: cellular respiration is when cells extract energy to produce energy for animal activities; 1: cellular respiration is when cells extract energy from food to produce energy; 3: cellular respiration is cellular digestion; 2: cellular respiration is a source of energy for animal activities. Second tree: H: cellular respiration produces energy for animal activities by extracting energy from food; 1: cellular respiration is when cells extract energy from food to produce energy; 2: cellular respiration is a source of energy for animal activities.
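The alignment step described in this section can be sketched as follows. This is an illustrative reading of the description rather than the official evaluation code: nodes are represented only by the set of leaf identifiers beneath them, and `align_nodes`, the node names, and the leaf names are hypothetical.

```python
from typing import Dict, Optional, Set


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard similarity of two leaf sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def align_nodes(pred_nodes: Dict[str, Set[str]],
                gold_nodes: Dict[str, Set[str]]) -> Dict[str, Optional[str]]:
    """Map each predicted non-leaf node to the first gold node with maximum Jaccard."""
    alignment: Dict[str, Optional[str]] = {}
    for pred_name, pred_leaves in pred_nodes.items():
        best_name, best_sim = None, 0.0
        for gold_name, gold_leaves in gold_nodes.items():
            sim = jaccard(pred_leaves, gold_leaves)
            if sim > best_sim:  # strict '>' keeps the first gold node on ties
                best_name, best_sim = gold_name, sim
        alignment[pred_name] = best_name  # None stands for the dummy gold node
    return alignment


# Hypothetical trees: one gold node, three predicted nodes.
gold = {"g1": {"a", "b", "c"}}
pred = {"p1": {"a", "b"}, "p2": {"a", "b", "c"}, "p3": {"z"}}
print(align_nodes(pred, gold))
```

In this toy case both `p1` and `p2` map to `g1`, which illustrates point (1) above: several predicted nodes can share one gold node without any duplication penalty.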

H Case Study
We show some entailment trees generated by our METGEN-separated on the Task2 questions in Figures 8, 9, 10, and 11. METGEN can generate a valid entailment tree that may have a different structure from the gold one (Figure 8). METGEN can handle medium-complexity questions, generate valid entailment trees, and provide the reasoning types of steps (Figures 9 and 10). Questions that require more complex reasoning (e.g., the gold tree in Figure 11 requires 11 leaf facts and 8 entailment steps) remain challenging. Although the full tree generated by our method for such a complex question may not be entirely correct, the intermediate conclusions (e.g., $i_1$ and $i_2$ in Figure 11) are still reliable.
Question Q: Which phase of the Moon occurs after a waxing gibbous?
Answer A: full moon
Hypothesis H: a full moon is the moon phase that occurs after a waxing gibbous
Facts S:
s1: state of matter is a property of matter and includes ordered values of solid / liquid / gas
s2: the moon is earth 's moon
s3: usually means most of the time
s4: phase means state
s5: occur is similar to appear
s6: the moon orbits the earth
s7: to be found in means to be contained in
s8: revolving around something means orbiting that something
s9: the moon orbiting the earth occurs once per month
s10: a complete revolution / orbit of the moon around the earth takes 1 / one month
s11: amount is a property of something and includes ordered values of none / least / little / some / half / much / many / most / all
s12: a waxing gibbous is a kind of phase of the moon
s13: warm / becoming warm means heat is added
s14: a phase change is when matter / a substance changes from one state of matter into another state of matter
s15: a full moon occurs after a waxing gibbous moon
s16: the moon reflects sunlight towards the earth
s17: generally means usually
s18: to happen means to occur
s19: type of moon / kind of moon means moon phase
s20: a full moon is a kind of phase of the moon
s21: visible means able to be seen
s22: motion / movement means moving / to move
s23: the moon completes a lunar cycle over a period of 29 days
s24: the moon rising occurs once per day
s25: lunar phase is synonymous with moon phase

Gold Tree (nodes):
H: a full moon is the moon phase that occurs after a waxing gibbous
s12: a waxing gibbous is a kind of phase of the moon
s15: a full moon occurs after a waxing gibbous moon
s20: a full moon is a kind of phase of the moon

Predicted Tree (nodes):
i1: a full moon and a waxing gibbous moon are kinds of phases of the moon
H: a full moon is the moon phase that occurs after a waxing gibbous
s12: a waxing gibbous is a kind of phase of the moon
s15: a full moon occurs after a waxing gibbous moon
s20: a full moon is a kind of phase of the moon

Figure 8: Case 1. The predicted entailment tree consists of two 2-premise steps, while the gold tree consists of one 3-premise step. Under the automatic evaluation metric, the predicted tree would be rated as invalid (Overall AllCorrect = 0), since the predicted steps do not match the gold step. However, the predicted tree should be valid because each step in the tree is a valid entailment (i.e., the 3-premise step can be decomposed into two valid 2-premise steps). It would be rated as valid under manual evaluation.

Question Q: According to the Periodic Table of the Elements, which set of elements has similar properties?
Answer A: He, Ne, Ar
Hypothesis H: he, ne, ar have similar properties
Facts S:
s1: cannot is the opposite of can
s2: helium / neon / argon belong to noble gases family , group 18 on the periodic table
s3: a periodic table is a kind of scientific model
s4: charge is a property of an object / a material / a substance and includes ordered values of negatively-charged / neutral / positively-charged
s5: including means containing
s6: similar means in common
s7: magnetism is a property of materials / objects and includes ordered values of nonmagnetic / magnetic
s8: a proton has a positive 1 electric charge
s9: the chemical symbol for helium is he
s10: similarity means the same property
s11: identical means copy
s12: chemical reactivity is a property of elements and includes ordered values of reactive / unreactive
s13: the chemical symbol for argon is ar
s14: amount is a property of something and includes ordered values of none / least / little / some / half / much / many / most / all
s15: an element is identified by its number of protons
s16: according to is similar to be determined by
s17: a group / family in the periodic table means a column in the periodic table
s18: characteristic means property
s19: the chemical symbol for neon is ne
s20: same means identical / equal in value / amount / number / quantity
s21: made of is similar to contains
s22: elements in the same group on the periodic

Figure 9: Case 2. Explaining the question and answer in this case requires 5 leaf facts from the given 25 facts. METGEN can select the correct facts, generate valid entailment trees, and provide the reasoning types of steps.

Question Q: Which trait would a cat most likely inherit from its parents?
Answer A: having white fur
Hypothesis H: the cat will inherit the white colored fur from its parents
Facts S:
s1: heredity is similar to inheritance
s2: the color of / coloration of fur is an inherited characteristic
s3: the mature / adult form of a kitten is called a cat
s4: if an organism passes on its traits then future generations will have those traits
s5: freckles are an inherited characteristic
s6: a cat is a kind of animal
s7: inheriting is when an inherited characteristic is passed from parent to offspring
s8: the parent cats have white fur
s9: coloration means a thing 's color
s10: an animal knows how to do instinctive behaviors when it is born
s11: color is a kind of physical / visual property
s12: offspring receives half of the genes from each parent
s13: a homozygous recessive organism contains only recessive genes
s14: animals produce offspring
s15: trait means property
s16: the size of an organism is an inherited characteristic
s17: genetic / hereditary means of genes / heredity
s18: coat means fur coat
s19: genes contains genetic information
s20: white fur is white in color
s21: coloration means a pattern of colors
s22: hair / fur is a part of skin for protection / keeping warm
s23: the shape of body parts is an inherited characteristic
s24: hair is similar to fur
s25: dna are a vehicle for passing genes from parent to offspring

First tree (nodes):
H: the cat will inherit the white colored fur from its parents
i1: the offspring will inherit the color of the fur of its parent.
s2: the color of / coloration of fur is an inherited characteristic
s7: inheriting is when an inherited characteristic is passed from parent to offspring
s8: the parent cats have white fur.
s14: animals produce offspring
s6: a cat is a kind of animal
i2: a cat will inherit the color of the fur of its parents
s20: white fur is white in color.

Second tree (nodes):
H: the cat will inherit the white colored fur from its parents
i1: animals can inherit characteristic from their parents
s2: the color of / coloration of fur is an inherited characteristic
s7: inheriting is when an inherited characteristic is passed from parent to offspring
s8: the parent cats have white fur.
s14: animals produce offspring
s6: a cat is a kind of animal
i2: animals inherit color / coloration of fur from their parents
s20: white fur is white in color.
i3: a cat inherits color / coloration of fur from its parents
i4: the fur of the parent cats is white in color

Panel label: Gold Tree
Reasoning-type labels: If-then, Substitution, Conjunction, If-then

Figure 10: Case 3. METGEN can handle medium-complexity questions and provide the reasoning types of steps.

Hypothesis H: heat is transferred to the spoon from the hot chocolate through conduction
Facts S:
s1: static electricity is when electrons are exchanged between objects through friction
s2: touching is similar to being exposed to
s3: if something transfers energy to something else then that something else absorbs that energy
s4: metal is a thermal conductor
s5: friction occurs when two object 's surfaces move against each other
s6: a spoon is a kind of object
s7: if something is in something else , then that something is exposed to that something else
s8: hot chocolate is kind of hot substance
s9: conductivity is a measure of how easily electricity travels through a material
s10: friction causes the temperature of an object to increase
s11: the heat energy in the cooler object increases in thermal conduction
s12: if one object absorbs more energy than another object , then the object will be warmer
s13: a hot substance is a source of heat
s14: conductivity is a kind of physical property
s15: if a spoon is used to stir a liquid then that spoon is touching that liquid
s16: thermal energy is a kind of energy
s17: absorbing energy causes objects / materials / substances to heat
s18: if heat is conducted to an object then that object will become hot
s19: sending electricity through a conductor causes electricity / electric current to flow through that conductor
s20: spoons are usually made of metal
s21: if a thermal conductor is exposed to a source of heat then that conductor may become hot / warm
s22: heat means heat energy
s23: hot chocolate is a kind of liquid
s24: heat energy is synonymous with thermal energy
s25: a spoon is used to stir a cup of hot chocolate

Gold Tree (nodes):
[...] source of heat
s15: if a spoon is used to stir a liquid then that spoon is touching that liquid
s13: a hot substance is a source of heat
s8: hot chocolate is kind of hot substance
s23: hot chocolate is a kind of liquid
s6: a spoon is a kind of object
s4: metal is a thermal conductor
s2: touching is similar to being exposed to
i3: if a spoon is used to stir hot chocolate then that spoon is touching that liquid
i4: a spoon is exposed to the hot chocolate
i2: spoons are usually thermal conductor
i1: if heat is conducted to a spoon then that spoon will become hot
i7: the spoon in the hot chocolate will become hot
i6: the spoon being exposed to the hot chocolate is an example of thermal conductor being exposed to a source of heat
s25: a spoon is used to stir a cup of hot chocolate
s20: spoons are usually made of metal
s18: if heat is conducted to an object then that object will become hot
s21: if a thermal conductor is exposed to a source of heat then that conductor may become hot / warm

Predicted Tree (nodes):
s13: a hot substance is a source of heat
s8: hot chocolate is kind of hot substance
s6: a spoon is a kind of object
s18: if heat is conducted to an object then that object will become hot
i1: hot chocolate is a source of heat
i2: if heat is conducted to a spoon then the spoon will become hot

Labels: C, S, I; Substitution, Substitution, If-then

Figure 11: Case 4. The question requires more complex reasoning, where the gold tree contains 11 leaf facts and 8 entailment steps. Although the full tree generated by METGEN is not entirely correct, the intermediate conclusions i1, i2 are still reliable.