Investigating Transformer-guided Chaining for Interpretable Natural Logic Reasoning


Most traditional neural approaches tackle multi-step reasoning as a single-pass, 'all-at-once' inference. For reasoning problems that are inherently multi-step, it is more natural to consider symbolic machinery in tandem with the neural model. Taking inspiration from this philosophy, a new class of works has emerged recently that combines neural models (popularly transformers (Vaswani et al. 2017)) with symbolic chaining. The central idea is to iteratively perform 1-step neural inferences and chain together the results to generate a multi-step reasoning trace. ProofWriter (Tafjord et al. 2021) was one of the first to explore this idea, demonstrating >95% multi-hop reasoning accuracy on several synthetic datasets. (Picco et al. 2021) and (Bostrom et al. 2022) reported similar results. Recently, several works (Qu et al. 2022; Yang et al. 2022; Tafjord et al. 2022; Ghosal et al. 2022; Ribeiro et al. 2022; Hong et al. 2022) applied variants of this approach to EntailmentBank (Dalvi et al. 2021) and showed superior performance. The iterative approach is attractive because i) it is faithful, in that it naturally reflects the internal reasoning process, and is inherently interpretable; ii) it has been shown to be easily adapted for multiple-choice Q&A (Shi et al. 2021) and open-ended Q&A (Tafjord et al. 2022), besides Natural Language Inference (NLI); and iii) it enables teachable reasoning (Dalvi et al. 2022).
While the above results are promising, we argue that an unbiased third-party investigation is important to facilitate a better understanding of the strengths and weaknesses. This is the main goal of this paper. Towards this, we develop a reference implementation, called Chainformer, that captures the core idea behind the chaining approach, and benchmark it on a multi-hop FOL reasoning task using a recently proposed diagnostic dataset, called LogicNLI (Tian et al. 2021). The dataset is composed of a rich class of FOLs that go beyond conjunctive implications and is non-trivial, with a reported human reasoning accuracy of 77.5% (Tian et al. 2021).
We conduct several experiments to analyze the performance in terms of accuracy, generalization, interpretability, and expressiveness over FOLs.
Our key findings are: 1) human-level multi-step reasoning performance is achieved (84.5% machine vs 77.5% human) with a minimalist transformer-guided chaining implementation, and even with a base model (80.4% base vs 84.5% large). However, this requires that the 1-step inferences be carefully trained for high accuracy; 2) the inferred reasoning chains are correct 78% of the time, but can be more than twice as long as the optimal chains; 3) FOLs with simple conjunctions and existential quantifiers are easier to handle, whereas FOLs with equivalence are harder, especially with universal quantifiers and disjunctions. Our results highlight the key strengths of the transformer-guided chaining approach, and of faithful reasoning in general, and suggest possible weaknesses that could motivate future research in multi-hop reasoning.
In related work, (Yu et al. 2020; Liu et al. 2020; Dalvi et al. 2021; Tian et al. 2021) performed diagnostic studies on popular language models and pointed out limitations in their logic reasoning capabilities. (Li et al. 2022) investigated NLU datasets to measure correlation with logic reasoning as a key skill. Our focus is different: we aim to specifically analyze the iterative reasoning strategy for multi-hop logic reasoning, which is novel.

Problem Definition
We consider the NLI setting (Bowman et al. 2015; Storks et al. 2019). Let F = {f₁, f₂, …, fₙ} be n simple sentences, called Facts, and R = {r₁, r₂, …, rₘ} a set of m compound sentences, called Rules. Then, given the tuple P = (F, R), called the Premise, and a statement h, called the Hypothesis, the inference problem is to determine i) the inference relation of h, and ii) a reasoning chain X = {x₁, x₂, …, xᵢ, …, xₖ}, a sequence such that xᵢ = (rᵢ, Fᵢ), where rᵢ ∈ R and Fᵢ is a set of intermediate facts, with members not necessarily from F.
The inference relations can be entailment, contradiction, neutral, or paradox, as defined in Table 1, where ⊢ is the entailment operator.
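For concreteness, the four relations can be stated as follows. This is a sketch assuming the standard LogicNLI definitions, since Table 1 itself is not reproduced here; it is consistent with the 'Paradox' analysis in the Appendix, which notes that two reasoning chains are required for that label:

```latex
% Sketch of the four inference relations (assumed standard LogicNLI definitions)
\begin{align*}
\text{entailment}    &:\; P \vdash h \ \text{and}\ P \nvdash \lnot h\\
\text{contradiction} &:\; P \nvdash h \ \text{and}\ P \vdash \lnot h\\
\text{neutral}       &:\; P \nvdash h \ \text{and}\ P \nvdash \lnot h\\
\text{paradox}       &:\; P \vdash h \ \text{and}\ P \vdash \lnot h
\end{align*}
```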
It is easy to see that the complexity of the problem varies based on the constraints imposed on F, R, X, and the target inference labels of h. For example, RuleTaker (Clark et al. 2020) considers h to be 'true' or 'false'. Additionally, R is restricted to implication rules with conjunctions and negations. ProofWriter (Tafjord et al. 2021) adopts a similar setting but also allows h to be undetermined ('neutral'). In this paper we consider a more general NLI problem following (Tian et al. 2021), where i) R is expressed using a rich class of FOLs with universal ∀ and existential ∃ quantifiers, and logic connectives such as disjunction ∨, implication →, equivalence ≡, and negation ¬; and ii) h can take any of the 4 inference labels (Table 1). Figure A-1 in the Appendix presents a sample problem instance.

Logic Reasoning Method
Logic reasoning using the chaining strategy can be implemented in several ways, e.g., with fact selection (Bostrom et al. 2022), rule selection (Sanyal et al. 2022), inference verification (Tafjord et al. 2022), etc. We aim to adopt a minimalist implementation, as we believe it facilitates better examination of the strengths and weaknesses of the central methodology.
We consider the Forward Chaining algorithm from Sec 9.3.2 of (Russell et al. 2010), which is known to be sound and complete for a rich class of FOLs. Basically, the algorithm starts with the known facts and repeatedly applies rules whose preconditions are satisfied, inferring new facts until the hypothesis can be verified. To extend this to natural language, our idea is to employ a transformer model to do fact unification and rule inference, and a second transformer to verify the given hypothesis against the currently known facts.
Rule Inference: In this step, given the current known facts and a rule, the rule's preconditions are matched through unification to check for a rule match. If the match succeeds, new facts (intermediate facts) are inferred; otherwise, no facts are generated and control moves to the next rule.
We model this as an abstractive Q&A task, with the current facts as the 'context', the chosen rule as the 'question', and the inferred facts as the desired 'answer'. A T5 transformer model (Raffel et al. 2020) is employed for this purpose. In particular, the processed input to the model is 'question: <rule> context: <known facts>' and the output is the inferred facts if the rule can be triggered, and 'none' otherwise.
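As a concrete illustration, a minimal sketch of this rule-inference step with a Hugging Face T5 model might look as follows. The checkpoint name and generation settings are placeholders; the fine-tuned weights described in the Appendix are assumed:

```python
from transformers import T5ForConditionalGeneration, T5Tokenizer

# Hypothetical checkpoint: a T5 fine-tuned for 1-step rule inference as in Sec 3.
MODEL_NAME = "t5-base"
tokenizer = T5Tokenizer.from_pretrained(MODEL_NAME)
model = T5ForConditionalGeneration.from_pretrained(MODEL_NAME)

def infer_step(rule: str, known_facts: list[str]) -> str:
    """One rule-inference step: returns the inferred facts, or 'none'."""
    # Input format from Sec 3: 'question: <rule> context: <known facts>'
    prompt = f"question: {rule} context: {' '.join(known_facts)}"
    inputs = tokenizer(prompt, return_tensors="pt", truncation=True, max_length=512)
    outputs = model.generate(**inputs, max_length=64)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)
```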
Facts Checking: This step verifies the given hypothesis against the currently known facts, based on Table 1. In our implementation, we accomplish this by formulating a 2-class NLI task, inferring F′ ⊢ h and F′ ⊢ ¬h, where F′ is the set of currently known facts.
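The two binary checks combine into the four labels of Table 1. A minimal sketch follows, where the `entails` scorer stands in for the second, fine-tuned NLI transformer and is an assumption, as is the illustrative negation of h (in practice the negated hypothesis would come from the dataset's templates):

```python
def check_hypothesis(entails, known_facts: list[str], h: str) -> str:
    """Map the two 2-class NLI decisions (F' |- h, F' |- not-h) to a label."""
    pos = entails(known_facts, h)            # does F' entail h?
    neg = entails(known_facts, f"not {h}")   # does F' entail the negation of h?
    if pos and neg:
        return "paradox"
    if pos:
        return "entailment"
    if neg:
        return "contradiction"
    return "neutral"
```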

Experiments and Results
We perform several experiments using the multi-hop FOL reasoning dataset LogicNLI (Tian et al. 2021). The dataset includes 16K/2K/2K train/dev/test instances, with each instance consisting of over 12 facts and 12 rules, along with labeled statements and reasoning chains (called proof paths). A sample instance is shown in Figure A-4.
The results are presented in the following subsections. In all the tabulated results, the performance metrics are averaged over 10 runs and quoted in % for easier interpretation, unless stated otherwise. Details about the implementation, hyper-parameter settings, and machine configuration are provided in Appendix A.

Comparison with Baselines (Table 2)
Firstly, we compare performance in terms of accuracy against the baseline language models BERT (Devlin et al. 2019), RoBERTa (Liu et al. 2019), and XLNet (Yang et al. 2019). Additionally, we considered a naïve algorithm, called NaiveFactsChecker, that does facts checking as in Sec 3 but without rule inference. We observe that NaiveFactsChecker achieved ~50% accuracy (2x Random), suggesting that about 50% of the hypotheses in LogicNLI may be verifiable from the given facts alone. All LM baselines, barring BERT-base, performed better, with RoBERTa-large+MLP being the best baseline. In comparison, Chainformer significantly outperformed all baselines and even exceeded human performance. This is surprising given that our implementation was minimalist, without the other functionalities often used in the published approaches. We argue that these results highlight the strength of iterative LM-guided reasoning over the 'all-at-once' approach. Furthermore, the t5-base model version was comparable in performance to the t5-large version, which gives promise for low-compute possibilities in implementing logic reasoning.

Detailed Performance Analysis
Here we investigate the Chainformer approach in more detail to derive further insights.

Performance over FOLs (Table 3, 4 & 5)
For our next experiment, we studied the ability to reason over various FOL classes. LogicNLI contains 23 FOL classes in total, and we first analyzed Chainformer to determine the respective inference accuracies. A summary of the results is presented in Tables 3, 4, and 5. Details about the individual classes and the respective accuracies are provided in the Appendix.
We notice that, rather unsurprisingly, FOLs with logical equivalence are harder than those with implication, and those with neither are the easiest (Table 4). Similarly, disjunctions are harder than conjunctions (Table 5). Universal quantifiers are harder than no quantifiers, but existential quantifiers are comparatively easier (Table 3). A possible explanation is that neural unification is easier when matching any one relevant fact is sufficient, rather than requiring a match against all relevant facts. However, this depends on the modeling and implementation specifics. It might be possible to alter this behavior with other approaches, e.g., using a different formulation than 'Abstractive Q&A' (Section 3), but this needs more research.

Interpretability (Table 6 & 7)
We next analyzed the interpretability of the predicted reasoning chains by asking two questions: i) Is the chain correct? ii) Is the chain optimal compared to the ground truth?
Towards this, we define two metrics, viz. correctness and minimality. A chain is deemed correct if and only if every chain fragment corresponds to a valid entailment. Minimality is defined as the ratio of the length of the target chain over the length of the predicted chain. Note that a chain is incorrect even if only one step corresponds to an invalid entailment. Thus, we may have situations where the hypothesis is successfully inferred but the chain is incorrect. Such chains are called partially correct.
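A sketch of the two metrics, assuming a step-level validity oracle `is_valid_entailment` (in our study this judgment was made by a human annotator, not a function):

```python
def correctness(chain_steps: list, is_valid_entailment) -> bool:
    """A chain is correct iff every fragment is a valid entailment."""
    return all(is_valid_entailment(step) for step in chain_steps)

def minimality(target_chain: list, predicted_chain: list) -> float:
    """Ratio of target (gold) chain length to predicted chain length.

    1.0 means the predicted chain is as short as the optimal one;
    e.g., a score of 0.42 implies the prediction is over twice as long.
    """
    return len(target_chain) / len(predicted_chain)
```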
As an exhaustive analysis of all chains is arduous, as a preliminary evaluation we sampled 200 'entailment' and 200 'contradiction' instances from the predicted chains, and tasked a student (not part of the project) with manually labeling the validity of every chain fragment. The labels were later verified via a random check by two project members to remove incorrect entries. The results are presented in Table 6.
On average, we observed that 78.8% of the chains were fully correct (Table 6), providing support for chaining as a faithful reasoning approach. In fact, about 10% of the chains were partially correct, and only 11.2% were incorrect.
To analyze minimality, we extracted the correct chains and computed the minimality score against the gold-standard chains. An overall average score of 0.42 was observed (Table 7), implying that the correctly predicted chains could be 2.3 times longer than the optimal ones.

Discussion and Conclusions
We considered the recently emerging neuro-symbolic approach for addressing multi-step natural logic reasoning, called transformer-guided chaining. The approach adopts an iterative reasoning strategy, in contrast to the traditional neural approaches that tackle multi-step reasoning as a single-pass 'all-at-once' inference. The iterative approach is attractive as it offers several advantages: i) it is faithful, in that it naturally reflects the internal reasoning process; ii) it is inherently interpretable; iii) it can be applied to multiple-choice Q&A and open-ended Q&A, besides Natural Language Inference.
We performed a detailed empirical investigation of this approach, using a challenging FOL reasoning dataset. Our key findings are: 1) human-level performance is achieved on the multi-hop FOL reasoning task with a minimalist implementation (84.5% machine vs 77.5% human), and even with a base model (80.4% base vs 84.5% large). This provides support for the potential of the chaining strategy and encourages possible applications on real-life texts; 2) FOLs with simple conjunctions and existential quantifiers are easier to handle, whereas FOLs with equivalence are harder, especially with universal quantifiers and disjunctions, suggesting scope for further research; 3) the predicted reasoning chains are correct 78% of the time, but can be more than twice as long as the optimal chains. The latter implies that two or more correct reasoning chains are possible, and the iterative reasoning strategy might return any one of them (though sub-optimal). This underscores the importance of human validation in interpretability evaluation, as automating it, say by scoring exact match, is likely to underestimate the true performance.

A key observation is that the approach hinges on how accurately the 1-step inferences can be performed, as small errors can propagate over multiple iterations and get magnified. For example, if the rule inference step results in false positives/negatives, it is unclear how the chaining performance will be impacted. In addition, if facts are incomplete or even inconsistent, how effective will the reasoning be? These are interesting research questions for further investigation. (Ghosal et al. 2022; Dalvi et al. 2022) are steps along this direction.
In another direction, most of the chaining-based works have mainly considered 'entailment' as the inference relation. To handle real-life texts, it is important to go beyond simple entailment relations and consider more sophisticated ones, e.g., necessity, possibility, and rebuttal (MacCartney and Manning, 2014; Huang et al. 2022). To cover such relations, new models and approaches are required, and they could help enhance the scope of current faithful reasoning approaches towards addressing advanced multi-hop reasoning scenarios.

Limitations
Our work is one of the first to perform a detailed empirical investigation of transformer-guided chaining, but it is clearly preliminary. The following are some key limitations:
- Evaluation of Interpretability: A fair evaluation of interpretability is not straightforward. In this paper, we reported results from a preliminary study with limited human labor.
- Analysis of negations: The LogicNLI dataset uses negations in the facts, rules, and statements, but it is difficult to disentangle them for a fair investigation. Hence, we were unable to rigorously analyze the ability to handle negations.
- Evaluation on Real-life data: Our reported work focused on a synthetic dataset. For a more rigorous evaluation, it is imperative to consider more datasets, including real-life ones.

Baseline Models
Initially, we performed evaluation on the LogicNLI dataset (Tian et al. 2021). The LogicNLI dataset contains different sections: facts, rules, statements, and labels. We used 16000/2000/2000 train/dev/test examples for our models. For the baseline experiments, we re-implemented the fine-tuned BERT (Devlin et al., 2019) and RoBERTa (Liu et al., 2019) base versions, and used '[CLS] facts rules [SEP] statement [SEP]' as input to the transformers to predict the logical relation. BERT uses 12 layers, 768 hidden units, 12 heads, and 110M parameters for the base version, and RoBERTa uses 123M parameters.
Our models are trained end-to-end using the AdamW optimizer with a decay rate of 0.9. In addition, we experimented with different learning rates to understand whether there was any change in performance. A learning rate of 5e-6 showed a steady linear increase with the specified decay rate for the RoBERTa model. Hence, we retained hyper-parameters similar to those mentioned for the LogicNLI dataset (Tian et al. 2021) for our BERT and RoBERTa base versions. RoBERTa performs better than BERT-base, achieving 59% on the validation set and 57% on the test set.
The hyper-parameters are listed in Table A-1.

Logic Reasoning Model
Rule Inference: We apply T5 (Raffel et al., 2020) as the encoder-decoder model to generate new facts given the input facts and a rule. Given the labeled reasoning chains in the LogicNLI dataset (Tian et al., 2021), it is not straightforward to train the model, as they provide only 'positive' examples. We build our training set as follows. Given a training instance, we take the logic representations of the facts and rules, and apply every rule expression to the fact expressions to generate 1-step inferences with an off-the-shelf logic reasoner. The inferred facts are converted to natural language using a simple rule-based technique. The natural language versions of the source rules and facts are extracted from the dataset, and a training set is prepared using the processed input 'question: <rule> context: <known facts>' and the output 'inferred facts' if the rule can be triggered, and 'None' otherwise.
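A sketch of this data-generation loop, where `apply_rule` (standing in for the off-the-shelf logic reasoner), `to_nl` (the rule-based verbalizer), and the instance field names are hypothetical stand-ins:

```python
def build_training_examples(instances, apply_rule, to_nl):
    """Build (input, target) pairs for 1-step rule-inference training.

    `apply_rule(rule_expr, fact_exprs)` is assumed to return the logic
    expressions of newly inferred facts, or an empty list if the rule
    does not fire; `to_nl` verbalizes a logic expression.
    """
    examples = []
    for inst in instances:
        fact_exprs, rule_exprs = inst["facts_logic"], inst["rules_logic"]
        context = " ".join(inst["facts_nl"])  # natural-language facts
        for rule_expr, rule_nl in zip(rule_exprs, inst["rules_nl"]):
            inferred = apply_rule(rule_expr, fact_exprs)
            # Positive example if the rule fires, negative ('None') otherwise.
            target = " ".join(to_nl(f) for f in inferred) if inferred else "None"
            examples.append((f"question: {rule_nl} context: {context}", target))
    return examples
```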
During training, we set the number of beams to 50 and the number of returned sequences to 5. We randomly split the 9600 instances into 80% training and 20% test, repeat this 5 times, and report the average performance. We measure accuracy using the exact-match ratio.
As the input size depends on the facts, which may grow over multiple iterations, the token-size limitation has an impact. We analyzed the instances and found that the average size of an instance is 191.758 tokens (Min: 171; Max: 240). Our current T5 model can handle sequences of up to 512 tokens. Assuming the worst case (Max size 240; 4 tokens/fact), the chaining process can go up to depth = 68 before the limit is reached. We argue that this is sufficiently large for the LogicNLI dataset. For inference on real-world examples, working with documents longer than 512 tokens, we can chunk the document (facts/rules) and use RoBERTa to encode each chunk accordingly.
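The depth bound follows from simple arithmetic over the remaining token budget (a sketch; the 4-tokens-per-fact figure is the worst-case estimate quoted above):

```python
TOKEN_LIMIT = 512      # max sequence length of the T5 model
MAX_INSTANCE = 240     # largest observed instance size, in tokens
TOKENS_PER_FACT = 4    # worst-case tokens added per inferred fact

# Each chaining step appends at least one new fact to the context,
# so the remaining token budget bounds the reachable depth.
max_depth = (TOKEN_LIMIT - MAX_INSTANCE) // TOKENS_PER_FACT
print(max_depth)  # 68
```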

Machine Configuration
For the baseline models, we initially used NVIDIA GeForce RTX 2080 series machines with eight GPU cores for all our experiments. Later, in order to train the t5-large models, we used NVIDIA Tesla V100 SXM2-32GB series machines with 5 GPU cores. Models took 3-5 hours for training and reasoning.

Analysis of Inference Relations (Table A-2)
Here, we present the detailed reasoning performance for the four labels. 'Entailment' and 'Contradiction' performance were similar. 'Paradox' was the toughest (F1 = 74.4) among all. It had high precision but low recall, as two reasoning chains are required for its classification. In contrast, 'Neutral' had lower precision but higher recall, since most of the missed hypotheses get labeled thus.

Algorithm Chainformer
Input: F, Facts; R, Rules; h, Hypothesis.
Output: Inference ∈ {E, C, N, P}; X, the reasoning chain.

FOL Class Naming (Table A-3)

The LogicNLI dataset tags over 23 classes of FOLs. Each class is named using an abbreviation of the rule members, as follows. Given a rule, we denote each FOL connective, viz. conjunction ∧ (C), disjunction ∨ (D), implication → (I), equivalence ≡ (Q), universal quantifier ∀ (A), and existential quantifier ∃ (E), with the bracketed letter, and concatenate the respective letters in the order they appear in the rule. For example, Rule 4 in Figure A-1 would belong to the class 'ACQ'. The accuracy results of all classes are presented in Table A-3.
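A sketch of this naming scheme over a rule's logical form (the connective scan is illustrative; the dataset's own logic representation is assumed):

```python
# Each connective maps to its class letter, per the scheme above.
CONNECTIVE_LETTERS = {
    "∧": "C",  # conjunction
    "∨": "D",  # disjunction
    "→": "I",  # implication
    "≡": "Q",  # equivalence
    "∀": "A",  # universal quantifier
    "∃": "E",  # existential quantifier
}

def fol_class(rule_expr: str) -> str:
    """Concatenate connective letters in order of appearance in the rule.

    e.g. fol_class("∀x (lean(x) ∧ tall(x)) ≡ fast(x)") == "ACQ"
    """
    return "".join(CONNECTIVE_LETTERS[ch] for ch in rule_expr
                   if ch in CONNECTIVE_LETTERS)
```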

Rules:
(1) If someone is fast or tall, then he is athletic.
(2) Someone is fast if he is tall and he is lean.
(3) All those who are athletic will be not slow.
(4) If someone is lean and tall, then he is fast, and vice versa.
Hypothesis: John is not slow.

Analysis of FOLs with Conjunction (Table A-4)
We also analyzed the accuracy over FOLs with conjunctions in the implication rule (before and after →), and similarly in rules with equivalence. The results imply that conjunctions in the consequent are harder for implications. In the case of equivalence, it is even harder, possibly because the implication works both ways.

Sample Instance from LogicNLI
Additionally, for interpretability, we store the rule and the intermediate facts every time a rule is satisfied. If the hypothesis is successfully verified, then the stored rules and facts are assembled to form a reasoning chain and returned. An outline of the complete algorithm (Figure A-2) and an illustration (Figure A-4) are presented in the Appendix, along with the training details.
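Putting the pieces together, a minimal sketch of the overall loop follows, assuming `infer_step` and a hypothesis checker along the lines of the Section 3 sketches; the depth bound and the duplicate guard are our assumptions:

```python
from typing import Callable, List, Tuple

def chainformer(
    facts: List[str],
    rules: List[str],
    h: str,
    infer_step: Callable[[str, List[str]], str],
    check: Callable[[List[str], str], str],
    max_depth: int = 68,
) -> Tuple[str, List[Tuple[str, str]]]:
    """Sketch of the iterative loop: apply rules for 1-step inferences,
    store (rule, inferred facts) pairs for the reasoning trace, and stop
    once the hypothesis is decided or a fixpoint is reached."""
    known, chain = list(facts), []
    for _ in range(max_depth):
        label = check(known, h)     # facts checking against Table 1
        if label != "neutral":      # decided: entailment/contradiction/paradox
            return label, chain
        new = []
        for rule in rules:
            inferred = infer_step(rule, known)
            if inferred != "none" and inferred not in known and inferred not in new:
                chain.append((rule, inferred))   # stored for interpretability
                new.append(inferred)
        if not new:                 # fixpoint: nothing new can be inferred
            return "neutral", chain
        known.extend(new)
    return "neutral", chain
```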

Figure 1: Performance over various 1-step inference training set sizes, where 1-step accuracy is plotted in blue and the final reasoning performance in orange.
Figure A-1 presents a sample instance for illustration.

Algorithm pseudocode: Figure A-2 provides the full pseudocode of our algorithm outlined in Section 3.

Illustration of Output: Figure A-3 presents an illustration of the output of algorithm Chainformer.
Figure A-4 presents a sample instance from LogicNLI as an illustration.

Figure A-3: Sample Output of Chainformer for the instance in Figure A-1.

Table 2: Comparison of Accuracy against Baseline models on Dev/Test

Table 5: Analysis of Accuracy over FOLs w.r.t. Conjunctions ∧ and Disjunctions ∨

Table 3: Accuracy over FOLs w.r.t. Quantifiers

Table 4: Analysis of Accuracy over FOLs w.r.t. Implication → and Equivalence ≡

Table 6: Correctness of Predicted Chains

Table 7: Minimality of Verified Chains
For facts checking, we fine-tuned a transformer base version and used '[CLS] facts [SEP] hypothesis [SEP]' as input to predict the inference relation. The hyper-parameters are as in Table A-1.

Table A-1: Hyperparameters for Experiments

Figure A-1: Sample Instance for Illustration, showing facts, rules, a statement, proofs, the path, and the label.

Table A-3: Performance over FOLs

Inference: Entailment
Reasoning Chain: Tim is not slow. Tim is fast. Tim is athletic. John is tall. John is not slow.