Entailment Tree Explanations via Iterative Retrieval-Generation Reasoner

Large language models have achieved high performance on various question answering (QA) benchmarks, but the explainability of their output remains elusive. Structured explanations, called entailment trees, were recently suggested as a way to explain and inspect a QA system's answer. In order to better generate such entailment trees, we propose an architecture called Iterative Retrieval-Generation Reasoner (IRGR). Our model is able to explain a given hypothesis by systematically generating a step-by-step explanation from textual premises. The IRGR model iteratively searches for suitable premises, constructing a single entailment step at a time. In contrast to previous approaches, our method combines generation steps and retrieval of premises, allowing the model to leverage intermediate conclusions and mitigating the input size limit of baseline encoder-decoder models. We conduct experiments using the EntailmentBank dataset, where we outperform existing benchmarks on premise retrieval and entailment tree generation, with around 300% gain in overall correctness.


Introduction
Large neural network models have successfully been applied to different natural language tasks, achieving state-of-the-art results in many natural language benchmarks. Despite this success, these results came at the expense of AI systems becoming less interpretable (Jain and Wallace, 2019; Rajani et al., 2019a).
With the desire to make the output of such models less opaque, we propose a question answering (QA) system that is able to explain its decisions not only by retrieving supporting textual evidence (rationales), but by showing how the answer to a question can be systematically proven from simpler textual premises (natural language reasoning).*

* Work done during an internship at AWS AI. Code and model checkpoints are publicly available at https://github.com/amazon-research/irgr.

Figure 1: The task takes as input a hypothesis H (e.g., an answer to a question) and a corpus of premises C (simple textual evidence); the goal is to generate an entailment tree that explains the hypothesis H using premises from C.

These explanations are represented using entailment trees, as depicted in Figure 1. First introduced in prior work, entailment trees represent a chain of reasoning that shows how a hypothesis (or an answer to a question) can be explained from simpler textual evidence. In comparison, other explanation approaches such as retrieval of passages (rationales) (DeYoung et al., 2020) or multi-hop reasoning (chaining) (Jhamtani and Clark, 2020) are less expressive than entailment trees, which are composed of multi-premise textual entailment steps.
In order to generate such entailment trees, previous works (Bostrom et al., 2021) have used encoder-decoder models that take as input a small set of retrieved premises and output a linearized representation of the entailment tree. Such models are limited because (1) the language model has a fixed input size, and the model may construct incorrect proofs when the retrieval module cannot fetch all relevant premises at once, and (2) such approaches do not leverage the partially generated entailment trees.

Figure 2: The approach in (D) allows for more detailed inspection of the reasoning behind an explanation. Nodes in gray are retrieved from a corpus, nodes in blue are generated, and the red node is the hypothesis or answer that is being explained.

In contrast, we propose Iterative Retrieval-Generation Reasoner (IRGR), a novel architecture that iteratively searches for suitable premises, constructing a single entailment step at a time. At every generation step, the model searches for a distinct set of premises that will support the generation of a single step, thereby mitigating the language model's input size limit and improving generation correctness.

Our contributions are two-fold. First, we design a retrieval method that is able to better identify the premises needed to generate a chain of reasoning that explains a given hypothesis. Our retrieval method outperforms previous baselines by 48.3%, while allowing for a dynamic set of premises to be retrieved. Second, we propose an iterative retrieval-generation architecture that constructs partial proofs and augments the retrieval queries using intermediate generation results. We show that integrating the retrieval module with iterative generation can significantly improve explanations. Our proposed approach achieves new state-of-the-art results on entailment tree generation, with over 306% better results on the All-Correct metric (strict comparison with golden data), while using a model with one order of magnitude fewer parameters.

Related Work
Traditionally, natural language processing (NLP) frameworks were based on white-box methods such as rule-based systems (Allen, 1988;Ribeiro et al., 2019;Ribeiro and Forbus, 2021) and decision trees (Boros et al., 2017), which were inherently inspectable (Danilevsky et al., 2020). More recently, large deep learning language models (black-box methods) have gained popularity (Song et al., 2020;Raffel et al., 2020), but their improvements in result quality came with a cost: the system's outputs lack explainability and inspectability.
There have been many attempts to mitigate this issue, including input perturbation (Ribeiro et al., 2018) and premise selection (DeYoung et al., 2020). One promising explanation approach is to combine the model's output with a human-interpretable explanation. For instance, Camburu et al. (2018) introduced the concept of natural language explanation in their e-SNLI dataset, while Rajani et al. (2019b) expanded this idea to commonsense explanations. Jhamtani and Clark (2020) further explored the notion of explanation in multi-hop QA, where explanations contain a chain of reasoning instead of simple textual explanations. Different from these explanation approaches, our work generates explanations in the form of entailment trees, which are composed of multi-premise textual entailment steps. Entailment trees are more detailed explanations, making it easier to inspect the reasoning behind the model's answer. Figure 2 shows a diagram comparing different natural language explanation methods according to their structure and use of textual evidence.
The first approach used to generate entailment trees was based on the EntailmentWriter model. However, that approach is limited by the input size of encoder-decoder language models, where a fixed set of supporting facts is used to generate an explanation. Instead, our model iteratively fetches a set of premises using dense retrieval conditioned on previous entailment steps, allowing for more precise explanations.
Our work is also related to recent approaches that combine retrieval and neural networks for QA tasks (Guu et al., 2020). Conditioning the retrieval of a passage on previously retrieved passages has been explored in the context of multi-hop QA and multi-hop explanations (Valentino et al., 2021; Cartuyvels et al., 2020). However, these approaches either are not used to generate explanations or do not use inferred intermediate reasoning steps to retrieve premises.

Figure 3: IRGR is composed of two modules, the IRGR-retriever and the IRGR-generator. The IRGR-retriever iteratively fetches a set of premises from a corpus C in order to generate an entailment tree (a structured explanation for a given hypothesis). The IRGR-generator computes a single entailment step at a time, and the intermediate generated steps are stored and used for subsequent retrieval and generation.

Problem Definition
The problem input consists of a corpus of premises C (simple textual statements) and a hypothesis h. The objective is to generate an entailment tree T that explains the hypothesis h using a subset of the premises in C as building blocks. Entailment trees are represented as a tuple T = (h, L, E, S), where leaf nodes l_i ∈ L are retrieved from the corpus (i.e., L ⊆ C), internal tree nodes e_i ∈ E are intermediate conclusions (new sentences, not present in corpus C, that must be generated by the model), and S is a list of entailment steps s_i that together explain the hypothesis h, which is always the tree root and the final conclusion. An illustration of the problem and the expected entailment tree can be found in Figure 1.

Each entailment step s_i represents one inference from a conjunction of premises to a conclusion. For instance, "l_1 ∧ l_2 ⇒ e_1" or "l_1 ∧ l_2 ∧ e_1 ⇒ h" could be valid entailment steps in S. Note that the root of T is always the node representing h.
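As a concrete illustration, the tuple T = (h, L, E, S) can be modeled with simple data structures. This is only a sketch; the class and field names are our own and not part of the paper's implementation:

```python
from dataclasses import dataclass, field

@dataclass
class Step:
    premises: list   # antecedents: ids of leaf or intermediate nodes
    conclusion: str  # id of the node this step concludes

@dataclass
class EntailmentTree:
    hypothesis: str                                    # h, the root / final conclusion
    leaves: dict = field(default_factory=dict)         # L: id -> premise text, subset of C
    intermediates: dict = field(default_factory=dict)  # E: id -> generated sentence
    steps: list = field(default_factory=list)          # S: ordered entailment steps

# toy usage with the paper's "sent"/"int" identifier convention
tree = EntailmentTree(hypothesis="an astronaut requires the oxygen in a space suit backpack to breath")
tree.leaves["sent1"] = "a human is a kind of animal"
tree.steps.append(Step(premises=["sent1", "sent2"], conclusion="int1"))
```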

Architecture
Our approach, which we call Iterative Retrieval-Generation Reasoner (IRGR), consists of two modules, the IRGR-retriever and the IRGR-generator. The initial input to the model is the hypothesis h and the corpus of premises C. The generation process is performed through multiple iterations. At each iteration step t ≥ 1, the IRGR-retriever selects a subset of premises from the corpus, L_t ⊆ C. The IRGR-generator outputs one entailment step s_t per iteration until the entailment tree T is fully generated. Given S_{1:t-1} = (s_1, ..., s_{t-1}), the list of entailment steps generated up to the previous iteration t - 1, the generator takes as input L_t and S_{1:t-1} and produces the next entailment step s_t. The generation stops when the entailment step's conclusion is the hypothesis h, i.e., the proof is finished. Formally, the t-th iteration of the generation process is defined as:

    L_t = IRGR-retriever(h, S_{1:t-1})
    s_t = IRGR-generator(h, L_t, S_{1:t-1})

The IRGR-retriever searches over the premises in corpus C using dense passage retrieval.
Meanwhile, the IRGR-generator was implemented using T5, the Text-to-Text Transformer (Raffel et al., 2020), while any other sequence-to-sequence language model could also be used. An overview of the model can be seen in Figure 3.
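The retrieve-then-generate iteration described above can be sketched as follows. The `retriever` and `generator` arguments stand in for the trained IRGR-retriever and IRGR-generator modules; this is illustrative structure under our own naming, not the paper's actual code:

```python
def irgr(hypothesis, corpus, retriever, generator, max_iters=10):
    """One pass of iterative retrieval-generation.

    retriever(hypothesis, steps, corpus) -> premises L_t
    generator(hypothesis, premises, steps) -> next entailment step s_t
    Each step object is assumed to expose a `conclusion` identifier,
    with the tree root encoded as "hypothesis" (the paper's convention).
    """
    steps = []  # S_{1:t-1}, the entailment steps generated so far
    for _ in range(max_iters):
        premises = retriever(hypothesis, steps, corpus)  # L_t
        step = generator(hypothesis, premises, steps)    # s_t
        steps.append(step)
        if step.conclusion == "hypothesis":              # proof finished
            break
    return steps
```

A stub retriever and generator are enough to exercise the control flow, which is the point of the sketch: the loop terminates as soon as a step concludes the hypothesis.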

IRGR-retriever
The IRGR-retriever module proposed in this work aims to retrieve premises from the corpus C. In existing baseline models, retrieval is done in a single step, fetching a fixed set of premises before generation. However, the generation of entailment trees requires a different set of leaves for each entailment step. To address this issue, our IRGR-retriever fetches k_t premises from C to produce L_t at iteration step t. Note that the size of C can be very large (k_t ≪ |C|). The value k_t is chosen such that L_t is small enough to fit in the context of a language model while still being large enough to cover as many relevant premises as possible (in our experiments, k_t is always below 25). We define the retrieval probability of a premise c ∈ C at iteration step t as:

    p(c | h, s_{t-1}) ∝ exp( ⟨ϕ(h ⊕ s_{t-1}), ϕ(c)⟩ )

where ϕ is the sentence encoder function used to encode both premises and queries, transforming the input text into a dense vector representation in R^d, ⊕ denotes text concatenation, and ⟨·,·⟩ represents the inner product between two vectors.
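A minimal sketch of this dense-retrieval scoring follows, with a toy bag-of-characters encoder standing in for the fine-tuned sentence encoder ϕ. The function names and the softmax normalization of the inner-product scores are our assumptions:

```python
import math

def toy_phi(text):
    """Toy bag-of-characters encoder, standing in for the real sentence encoder."""
    v = [0.0] * 26
    for ch in text.lower():
        if ch.isalpha():
            v[ord(ch) - ord("a")] += 1.0
    return v

def dot(u, v):
    """Inner product between two dense vectors."""
    return sum(a * b for a, b in zip(u, v))

def retrieve_top_k(phi, hypothesis, prev_step, corpus, k):
    """Score each premise c by <phi(q), phi(c)>, where the query q concatenates
    the hypothesis with the previous entailment step, then normalize the
    scores into a retrieval distribution with a softmax."""
    query = hypothesis if prev_step is None else hypothesis + " " + prev_step
    q = phi(query)
    scores = [dot(q, phi(c)) for c in corpus]
    m = max(scores)
    exp = [math.exp(s - m) for s in scores]
    probs = [e / sum(exp) for e in exp]
    ranked = sorted(range(len(corpus)), key=lambda i: -scores[i])
    return [corpus[i] for i in ranked[:k]], probs
```

In practice ϕ would be a trained SBERT-style encoder and the top-k search would use an approximate nearest-neighbor index rather than a linear scan.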
The encoder follows the Siamese network architecture of Reimers and Gurevych (2019). For training, we select a set of N positive and negative examples in the form of query-premise pairs {(q_j, c_j)}, j = 1, ..., N. Queries q_j encode both the hypothesis h and the previous entailment step s_{t-1} by concatenating their textual values. The positive examples are taken from the golden entailment trees, where c_j ∈ L. For negative examples, we pair a query q_j with either random premises from C or premises retrieved using the encoder before fine-tuning (hard negatives).
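This fine-tuning setup can be sketched as below. The λ-discounted label rule and the mean-squared-error form of the objective are our reading of the definitions in the surrounding text, so treat both as assumptions rather than the paper's exact implementation:

```python
def training_label(leaf_id, antecedents, lam=0.75):
    """Regression label for a positive (query, premise) pair: 1.0 when the
    leaf is an antecedent of the current step s_{t-1}, otherwise the
    discount lambda; negative pairs would get label 0.0."""
    return 1.0 if leaf_id in antecedents else lam

def encoder_loss(similarities, labels):
    """Mean-squared error between encoder similarity scores and labels
    (the usual Sentence-Transformers regression objective)."""
    n = len(labels)
    return sum((s - y) ** 2 for s, y in zip(similarities, labels)) / n
```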
We define y_j as the label given to the training example (q_j, c_j). For positive examples, the label y_j depends on how close the leaf node l_i ∈ L is to the intermediate step s_{t-1} in the golden tree:

    y_j = 1 if l_i ∈ ant(s_{t-1}), and y_j = λ otherwise,

where ant(s_{t-1}) denotes the set of antecedents of entailment step s_{t-1}, and l_i ∈ ant(s_{t-1}) means that the leaf node l_i is used in that entailment step. The value λ ∈ [0, 1] gives lower priority to leaf nodes not relevant to the current entailment step (λ = 0.75 gave the best results in our experiments). Finally, we fine-tune the encoder ϕ by minimizing the following loss function L_ϕ, where N is the number of training examples:

    L_ϕ = (1/N) Σ_{j=1}^{N} ( ⟨ϕ(q_j), ϕ(c_j)⟩ - y_j )²

One significant challenge is that for the first generation step, when t = 1, the list of previously generated entailment steps S_0 is empty. The retrieval then depends only on h, meaning L_1 = IRGR-retriever(h). It is more difficult to retrieve premises for leaf nodes when the entailment tree's depth is large, since deep leaf nodes have low syntactic and semantic similarity with the hypothesis h. For instance, the example in Figure 4 shows how the leaf node "a human is a kind of animal" (depth 3) is needed to build the entailment tree, but is syntactically distinct from the hypothesis "an astronaut requires the oxygen in a space suit backpack to breath".
To mitigate this problem, we perform a conditional retrieval on the first step, where the retrieval module uses partial results as part of the query, as depicted in Algorithm 1. This algorithm assumes that leaf nodes (premises) further from the root node (hypothesis) are more similar to each other than to the root node itself. The parameter ω (ω = 15 yields the best results on the development set) splits the search, such that part of the retrieved premises depend only on the hypothesis h, while the remaining retrieved premises depend on both the hypothesis and previously retrieved premises stored in the set Q.
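Since Algorithm 1 itself is not reproduced here, the following is a hypothetical sketch of such an ω-split conditional retrieval: the first ω premises are fetched with the hypothesis alone, and the remainder with queries augmented by already-retrieved premises. `retrieve(query, k)` is a placeholder for top-k dense retrieval:

```python
def conditional_first_retrieval(retrieve, hypothesis, k=25, omega=15):
    """First-iteration retrieval split by omega: the first `omega` premises
    depend only on the hypothesis; the rest are fetched with queries
    augmented by already-retrieved premises (the queue plays the role of Q)."""
    results = retrieve(hypothesis, omega)
    queue = list(results)  # Q: premises available to augment later queries
    while len(results) < k and queue:
        anchor = queue.pop(0)
        for c in retrieve(hypothesis + " " + anchor, k - len(results)):
            if c not in results:
                results.append(c)
                queue.append(c)
    return results[:k]
```

The queue-based expansion is one reasonable realization of "conditioning on previously retrieved premises"; the paper's actual algorithm may differ in how Q is consumed.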

IRGR-generator
The IRGR-generator consists of a sequence-to-sequence model that outputs a single entailment step given a context. One key aspect of this module is encoding the input and output as plain text.
Encoding Entailment Trees: Entailment trees are linearized from leaves to root. Each leaf node l_i ∈ L, intermediate node e_i ∈ E, and the root node h are encoded with the symbols "sent", "int", and "hypothesis", respectively. Entailment steps represent conjunction with "&" and entailment with the symbol "->". For instance, an entailment tree such as the one depicted in Figure 1 can be represented in the form:

    sent1 & sent2 -> int1: <text of int1>; int1 & sent3 -> hypothesis

Note that the text of intermediate nodes has to be explicitly represented, since they are not part of the corpus C and ultimately have to be generated by the model. The input to the model encodes the hypothesis h and the retrieved premises l_i^t ∈ L_t as plain text in the same notation. When a leaf sentence l_i^t is used in an entailment step, it is removed from the context for the following step, and its "sent" identifier is not reused to encode newly retrieved premises. A detailed example of input and output for the IRGR-generator module is shown in Appendix A.3.
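The linearization convention can be sketched as a small helper. For brevity this version emits only node identifiers, omitting the intermediate-node text that the real encoding must include:

```python
def linearize(steps):
    """Linearize an entailment tree leaves-to-root: premises joined with '&',
    entailment written '->', steps separated by ';'. Each step is a pair
    (premise_ids, conclusion_id); intermediate-node text is omitted here."""
    return "; ".join(" & ".join(prem) + " -> " + concl for prem, concl in steps)

proof = linearize([(["sent1", "sent2"], "int1"),
                   (["int1", "sent3"], "hypothesis")])
# proof == "sent1 & sent2 -> int1; int1 & sent3 -> hypothesis"
```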

Datasets
We evaluate our architecture on the ENTAILMENTBANK dataset, which is comprised of 1,840 questions (each associated with a hypothesis h_i and an entailment tree T_i) with 5,881 total entailment steps. On average, each entailment tree has 7.6 nodes (including leaf, intermediate, and root nodes) and around 3.2 entailment steps. The corpus of premises C has around 11K entries and is derived from WorldTree V2 (Xie et al., 2020), in addition to a few premises created by the EntailmentBank annotators.

Retrieval
We evaluate our IRGR-retriever module using two different metrics. The first is "Recall at k" (R@k), a standard evaluation metric for information retrieval. The second metric, "All-Correct", is stricter: the result is only considered correct if all the premises from the golden tree are retrieved. Formally, given the retrieved premises L and the set of gold premises L*, the metric R@k is given by |L ∩ L*| / |L*|, and the metric All-Correct is 1 if |{x ∈ L* : x ∉ L}| = 0, and 0 otherwise. For our experiments, we consider k = 25, since that is roughly the maximum number of sentences that fit in the T5 language model's 512-token context.
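Both retrieval metrics follow directly from their definitions:

```python
def recall_at_k(retrieved, gold):
    """R@k = |L ∩ L*| / |L*| over the top-k retrieved premises."""
    return len(set(retrieved) & set(gold)) / len(gold)

def all_correct(retrieved, gold):
    """1 if every gold premise was retrieved, 0 otherwise."""
    return int(set(gold) <= set(retrieved))
```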

Entailment Tree Generation
We adopt the evaluation metrics defined in prior work, which compare the generated entailment tree T = (h, L, E, S) with the golden entailment tree T* = (h, L*, E*, S*). The metrics evaluate correctness along four dimensions: (1) leaf nodes, (2) entailment steps, (3) generated intermediate nodes, and (4) overall correctness. The first step is to align the nodes from T with the nodes from T* by Jaccard similarity (the alignment algorithm and further details of the metrics are described in Appendix A.2). This alignment tries to ignore variations between predicted and gold trees that do not change the semantics of the output. The four metric dimensions are described below.
For each metric with F1 value, there is also a strict "All-Correct" metric that is equal to 1 when F1 = 1 and 0 otherwise.

Leaf (F1, All-Correct):
Tests if the predicted and golden leaf nodes match. This metric compares the sets L and L * using F1 score.
Steps (F1, All-Correct): Tests if the predicted entailment steps follow the correct structure. Given that s i ∈ S matches s j ∈ S * according to the alignment algorithm, tests if the premises of s i are equal to those of s j , and computes the F1 score according to the set of all matched steps.
Intermediates (F1, All-Correct): Tests if the sentences of the generated intermediate nodes are correct. Given that intermediate nodes e i ∈ E and e j ∈ E * were matched by the alignment algorithm, the F1 score is computed by comparing the textual similarity between the set of the aligned and correct pairs e i and e j .
Overall (All-Correct): Tests all previous metrics together. The All-Correct value is only 1 if the All-Correct values for leaves, steps, and intermediates are 1. Note that this is a strict metric, and any semantic difference between T and T * will cause the score to be zero.
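The set-level F1 used for leaves (and, after alignment, analogously for steps and intermediates), together with the strict overall score, can be sketched as:

```python
def f1(pred, gold):
    """Set-level F1, e.g. between predicted and gold leaf sets."""
    pred, gold = set(pred), set(gold)
    tp = len(pred & gold)
    if tp == 0:
        return 0.0
    precision, recall = tp / len(pred), tp / len(gold)
    return 2 * precision * recall / (precision + recall)

def overall_all_correct(leaf_f1, steps_f1, inter_f1):
    """Overall All-Correct is 1 only when every component is perfect."""
    return int(leaf_f1 == 1.0 and steps_f1 == 1.0 and inter_f1 == 1.0)
```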

Implementation Details
All experiments were conducted using a machine with 4 Tesla V100 GPUs with 16GB of memory. Our code is based on HuggingFace's Transformers (Wolf et al., 2020) implementation of the t5-large model (Raffel et al., 2020). The retrieval module uses the Sentence Transformers (Reimers and Gurevych, 2019) sentence embeddings by fine-tuning the all-mpnet-base-v2 encoder. Please refer to Appendix A.1 for further details on hyper-parameters and training settings.

Retrieval Results
We compare our retrieval module against two baselines: Okapi BM25 and the retrieval module of EntailmentWriter, which consists of a classifier that retrieves relevant sentences using RoBERTa and performs re-ranking with Tensorflow-Ranking-BERT (Han et al., 2020). For comparison, we break down the results of our approach (the IRGR-retriever module) into three variations. The IRGR-retriever (sing.) method retrieves premises from the corpus using a single query element, namely the hypothesis h. The IRGR-retriever (cond.) method performs conditioned retrieval as described by Algorithm 1; this retrieval method is not iterative and fetches a fixed set of premises per example. Finally, IRGR-retriever emulates the retrieval when combined with the generation module: it not only performs conditional retrieval, but also fetches a different set of premises for each iteration, depending on the generated intermediate nodes. In this retrieval experiment, the IRGR-retriever uses the intermediate nodes from the golden entailment trees.

Table 3: Entailment tree scores for baseline methods and IRGR, along four different dimensions (test set). The "Gold" and "Gold+Dist." tasks do not require retrieval and evaluate solely the model's entailment tree generation capabilities.
Therefore, IRGR-retriever results should be considered an upper bound, since the generator might not produce the desired intermediate steps used for queries. Table 1 shows the R@25 and All-Correct metric results for the different methods. Our premise retrieval module performs consistently better than the baselines. For instance, the "IRGR-retriever (cond.)" outperforms the retriever from EntailmentWriter by 14.2% on R@25 and 28.8% on the All-Correct metric. Note that "IRGR-retriever" may retrieve a variable number of premises (greater than 25), so we do not report R@25 for this method.

Entailment Tree Generation Results
We compare our method against the EntailmentWriter baseline model on entailment tree generation. As shown in Table 2, our method outperforms EntailmentWriter in all metrics. The overall tree structure better matches the golden tree, where the score for the Overall All-Correct metric increases by over 300%. Note that EntailmentWriter uses the T5-11B model, which has around 10 times more parameters than our model.
We also show the ablation results of combining different retrieval modules with our proposed generation module on Table 2. The "w/o iter." method does not iteratively retrieve premises, relying on one-shot retrieval at the beginning of the generation. As for the "w/o iter. & cond." method, the model does not use the conditioned retrieval, only relying on the trained dense retrieval with the hypothesis h as the query instead.
Prior work defines two other simplified entailment tree generation tasks for further ablation studies. We report the results for what they define as "Task-1" and "Task-2", which are generation tasks where the golden premises are given as input, disregarding the retrieval component. Results in Table 2 report what they define as "Task-3". For clarity, we rename "Task-1" and "Task-2" to "Gold" and "Gold+Dist.", respectively, and show the results in Table 3. In the "Gold" task, each context uses the golden leaves as input, while the "Gold+Dist." task uses the golden leaves plus some distractors (up to 25 distractors). When comparing models with the same number of parameters (we use their reported T5-large results), the generation results without retrieval are roughly the same as those of the EntailmentWriter method. This experiment shows that, when golden premises are provided, iterative generation can create explanations as accurate as single-pass generation.

Results Breakdown
We investigate how well the system performs relative to the number of steps in the gold tree. Figure 5 contains two graphs with a breakdown of the results. The top graph shows the All-Correct metric values for all three tasks (gold, gold + distractors, and retrieval). The bottom graph shows all F1 metrics (leaves, steps, and intermediates), but only for the "retrieval" task.
The results demonstrate that generating entailment trees becomes increasingly difficult as the size of the tree increases. The IRGR model cannot perfectly predict trees with more than four steps for any of the three different tasks. For the "retrieval" task (without the golden leaf sentences provided as input), the IRGR model cannot successfully generate trees with three or more steps. This could be explained by the fact that the all-correct metric is very strict, and missing or misplacing a single leaf sentence can result in an incorrect tree.
This downward trend is also present in the "Break Down by Metrics" graph. Most noticeably, the "Intermediates (F1)" metric is especially challenging, with values close to zero for entailment trees with more than five steps. This metric is one of the main bottlenecks that lowers the value of the "Overall All-Correct" metric.

Analysis
To understand the strengths and weaknesses of our model, we conduct further analysis of the output of the IRGR. When analyzing errors in the generation of entailment trees, we use the results on the development set for the task with distractors. We manually annotate 50 predicted trees that contain some error compared to the golden tree. We categorize the different types of errors, identifying both individual generated steps errors and entailment tree errors.

Retrieval Error Analysis
We use ENTAILMENTBANK's development set to automatically compute metrics that will give us some insights into the type of errors made by the IRGR-retriever module. We use "IRGR-retriever (cond.)" to fetch a set of 25 premises for each data point, where we identify the set of true positives (correctly retrieved premises) and the set of false negatives (missing premises).
To understand whether the false negatives are more challenging to retrieve than the true positives, we compute the number of overlapping uni-grams and bi-grams between premises and hypotheses in these two sets. We observe that true positives have 28.5% more uni-gram overlap and 68.6% more bi-gram overlap with the hypothesis compared to the false negatives. These results suggest that premises lexically dissimilar to the hypothesis are indeed more challenging to retrieve.
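The overlap statistic used in this analysis can be computed as follows (whitespace tokenization and lowercasing are our simplifying assumptions):

```python
def ngram_overlap(premise, hypothesis, n=1):
    """Number of n-grams shared by a premise and the hypothesis
    (whitespace tokenization, lowercased)."""
    def ngrams(text):
        toks = text.lower().split()
        return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}
    return len(ngrams(premise) & ngrams(hypothesis))
```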
We also investigate how the depth (number of edges in a path from the tree root) of a leaf node in the gold tree correlates to the errors of the IRGR-retriever module. We compute the average depth of true positive nodes as 2.3, while for falsenegative nodes, the average depth is 3.0. These results strengthen the idea that leaf nodes deeper in the tree tend to be harder to retrieve, as depicted in Figure 4.

Entailment Step Error Analysis
The first error case is called invalid entailment steps (56% of errors), meaning that the conclusion of a step did not follow from the premises. For instance, in "kilogram is used to measure heavy objects" ∧ "an automobile is usually a heavy object" ⇒ "kilogram can be used to measure the mass of an automobile", the model assumes that "measure" is the same as "measure of mass", even though that is not explicitly stated.
The second error case accounts for misevaluation and irrelevance (27% of errors). It happens when the step is correct but does not match the golden tree, or when the step is correct but is not relevant or well placed in the final entailment tree. In the third error case, labeled repetition (17% of errors), the conclusion directly copied the premises, not creating a new sentence for the intermediate step.

Entailment Tree Error Analysis
When analyzing errors between the entire generated and golden trees, we noticed that incorrect or missing leaves (52% of errors) is the most common type of problem. For instance, when explaining the hypothesis "light year can be used to measure the distance between the stars in milky way" the premises "the milky way is a kind of galaxy" and "a galaxy is made of stars" are missing from the generated tree, making it impossible to explain the second part of the hypothesis.
The remaining errors are categorized as invalid or skipped steps (32% of errors), where the model draws an invalid conclusion from the premises; this error often overlaps with missing leaves, since the model uses fewer premises when it skips important intermediate steps. Imperfect evaluation (12% of errors) occurs when the produced tree is valid but does not match the golden tree. Finally, disconnected or degenerate trees (4% of errors) occur when the generated output does not form a tree or does not follow the desired output format.

Conclusion
As deep learning models become more ubiquitous in the natural language field, it is desirable that users can understand the model's answer by inspecting the reasoning chain from simple premises to the answer hypothesis. To generate rich, systematic explanations, we proposed a method that can iteratively generate and retrieve premises to produce entailment trees. We show how our approach has advantages over previous baselines, where the retrieved premises and generated explanations are more accurate.
In future work, we plan to improve the generation module by leveraging the structure of the entailment tree instead of relying purely on the encoder-decoder models. This idea could potentially fix the issues with "invalid entailment steps" and "repetition", which account for 73% of entailment step errors. We also plan to understand how explanations can be generated in the case of a false hypothesis, where we would expect the model to build a conclusion explaining why a statement is incorrect. It could help users verify false claims and understand the meaning behind their incorrectness.