Neural Natural Logic Inference for Interpretable Question Answering

Many open-domain question answering problems can be cast as a textual entailment task, where a question and candidate answers are concatenated to form hypotheses. A QA system then determines if the supporting knowledge bases, regarded as potential premises, entail the hypotheses. In this paper, we investigate a neural-symbolic QA approach that integrates natural logic reasoning within deep learning architectures, towards developing effective and yet explainable question answering models. The proposed model gradually bridges a hypothesis and candidate premises following natural logic inference steps to build proof paths. Entailment scores between the acquired intermediate hypotheses and candidate premises are measured to determine if a premise entails the hypothesis. As the natural logic reasoning process forms a tree-like, hierarchical structure, we embed hypotheses and premises in a Hyperbolic space rather than Euclidean space to acquire more precise representations. Empirically, our method outperforms prior work on answering multiple-choice science questions, achieving the best results on two publicly available datasets. The natural logic inference process inherently provides evidence to help explain the prediction process.


Introduction
Question answering (QA) is an important real-life NLP application but also a challenging task for assessing how well AI systems understand human language and perform reasoning to answer questions. A main challenge of QA is that the answers often do not explicitly exist in a supporting knowledge base but instead need to be inferred from it. Prior work (Angeli et al., 2016) has viewed QA as a textual entailment problem performed on a large premise set, where a question and candidate answers are formulated as hypotheses that need to be proved.
In this paper, we investigate a neural-symbolic QA approach that integrates natural logic reasoning (Lakoff, 1970; Nairn et al., 2006; MacCartney and Manning, 2009) within deep learning architectures for QA, aiming to keep the backbone of inference based on the natural logic formalism while integrating neural networks to make the system powerful and robust. Conventional natural logic has been designed for natural language inference and question answering (MacCartney and Manning, 2009; Angeli and Manning, 2014). As opposed to performing deduction on an abstract logical form, e.g., first-order logic (FOL) or its fragments, for which obtaining representations is known to face many thorny challenges, natural logic provides a formal proof framework based on the monotonicity calculus, or projectivity.
We present the Neural Natural Logic Inference (NeuNLI) framework for question answering. The core idea of NeuNLI is to bridge a hypothesis and candidate premises by following natural logic inference steps and incorporating neural models to help build the proof paths. NeuNLI first converts a question and its candidate answers into declarative sentences, namely hypotheses. It then rewrites these original hypotheses to obtain intermediate hypotheses and repeats this process to construct a proof tree for each question-answer pair.
Since the reasoning process forms a tree-like, hierarchical structure (Angeli and Manning, 2014), learning embeddings for hypotheses and premises in Euclidean space can lead to structural distortion (Sarkar, 2011; Sala et al., 2018). Additionally, natural language text exhibits hierarchical structure in a variety of respects (Dhingra et al., 2018). NeuNLI therefore projects the question and answer embeddings into hyperbolic space. For a proof tree, NeuNLI computes an entailment score between tree nodes and candidate premises in hyperbolic space and uses that score to help select the answer. We demonstrate that modelling entailment scores in hyperbolic space improves performance.
To train the above process in an end-to-end differentiable manner, we utilize the Gumbel-Softmax technique (Jang et al., 2017), which effectively approximates discrete variables, to approximate the non-differentiable selection of candidate mutations. In summary, the contributions of our work are as follows: (1) We introduce a novel framework, NeuNLI, which combines the advantages of natural logic and deep neural networks for question answering.
(2) Our proposed model provides step-by-step explanation for how the prediction was derived. (3) The proposed model achieves new state-of-the-art performance on two QA datasets. We provide detailed analyses demonstrating how the model works to achieve the improvement. The code is released at https://github.com/Shijihao/NeuNLI.

The Problem
Consider a multiple-choice science question from Clark (2015), shown in the following example.
Example-1: Question: The main function of a fish's fins is to help the fish _____.
Knowledge Base: . . . A fish has a flipper or fin that helps them swim. The dorsal fin can help to keep the fish stable in the water. . . .

Given a science question, four candidate answers, and relevant knowledge, a model needs to choose the correct answer supported by the knowledge base. Following Clark et al. (2018), we solve multiple-choice question answering as a textual entailment problem. Specifically, a question and four candidate answers are converted into four declarative sentences, namely target hypotheses h_i, where i ∈ {1, 2, 3, 4}. We retrieve relevant knowledge, a premise set P = {p_1, ..., p_j, ..., p_k}, from the knowledge base and determine whether a premise entails one of the four hypotheses, where k is the number of supporting premises. Central to our approach is the development of a neural-symbolic model that uses natural logic as the backbone prover and leverages the expressiveness of neural models to help construct this proving process.

Table 1: The seven natural logic relations of MacCartney and Manning (2009).

  Relation   Name                  Example
  x ≡ y      equivalence           garbage ≡ rubbish
  x ⊏ y      forward entailment    dog ⊏ animal
  x ⊐ y      reverse entailment    animal ⊐ dog
  x ^ y      negation              usual ^ unusual
  x | y      alternation           monkey | elephant
  x ‿ y      cover                 mammal ‿ nonhuman
  x # y      independence          angry # fridge

Natural Logic
Natural Logic (Lakoff, 1970) is a formal proof theory built on the syntax of human language, which can be traced back to the syllogisms of Aristotle. It aims to capture logical inferences by appealing directly to the structure of language. Specifically, logical inferences are operated directly on the surface form of language based on the monotonicity calculus or projectivity (MacCartney and Manning, 2009; Valencia, 1991), as opposed to running deduction on an abstract logical form such as first-order logic (FOL) or its fragments. For natural language, obtaining representations of abstract logic forms is known to face many thorny challenges. In this research, we investigate developing neural natural logic models for QA, which provide insight into the derivation process while sidestepping the difficulties of translating sentences into FOL. Natural logic proving operates by inserting, deleting, or mutating words following the monotonicity calculus or projectivity (MacCartney and Manning, 2009; Valencia, 1991). MacCartney and Manning (2009) utilize the seven logical relations shown in Table 1. For example, mutating animals to dogs corresponds to a reverse entailment relation, i.e., animals ⊐ dogs. Natural logic then projects the lexical relation based on the monotonicity or projectivity determined by the context. According to the monotonicity calculus, an upward monotone context preserves the logical relation, while a downward monotone context can change it. For example, the quantifier all is downward monotone in its first argument. Accordingly, given animals ⊐ dogs, we know all animals ⊏ all dogs (e.g., as in all animals need water ⊏ all dogs need water).
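As a concrete illustration, the projection step can be sketched as a small lookup table. This is a simplified sketch: the downward-context mappings below follow the examples given in the text, not an exhaustive projectivity table (the paper's full projection function is specified in its Table 2).

```python
# Toy sketch of natural logic projection. Relation names stand in for the
# symbols of Table 1; the DOWNWARD mapping is a partial, illustrative table.
DOWNWARD = {
    "equivalence": "equivalence",
    "forward": "reverse",       # dog (forward-entails) animal flips downward
    "reverse": "forward",       # e.g., under "all ..." first argument
    "negation": "alternation",  # matches the "longest"/"shortest" example
    "independence": "independence",
    # alternation and cover omitted here for brevity
}

def project(lexical_relation: str, polarity: str) -> str:
    """Project a lexical relation to the sentence level, given the
    monotonicity polarity of the mutated word's context."""
    if polarity == "up":  # upward monotone preserves the relation
        return lexical_relation
    return DOWNWARD[lexical_relation]

# "animals" reverse-entails "dogs", but under the downward-monotone first
# argument of "all", the sentence-level relation flips to forward entailment:
print(project("reverse", "down"))  # forward
```

In the running example from the text, mutating "longest" to "shortest" in an upward context keeps the negation relation at the sentence level; in a downward context it would weaken to alternation.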

Natural Logic Inference
Natural logic inference casts inference as a search problem: given a hypothesis and an arbitrarily large corpus of text, it searches through the space of lexical mutations (e.g., eat → consume), with associated costs, until a premise is found (Angeli and Manning, 2014). The entire inference process, constructed in reverse, starts from the hypothesis. An example search using natural logic inference is given in Figure 1. The root denotes one of the hypotheses in our task, and the relations along the edges denote relations between the associated sentences.

Method
In this paper we propose the Neural Natural Logic Inference (NeuNLI) framework, which aims to combine the advantages of natural logic and deep neural networks for question answering: it builds explainability into the model while leveraging the capacity and robustness of neural models. Figure 2 depicts the overall architecture of NeuNLI; the pseudocode of NeuNLI is listed in Algorithm 1. In the following subsections, we discuss NeuNLI in detail.
As the starting point, given a question sentence, "In New York State, the longest period of daylight occurs during which month?", and candidate answers, NeuNLI converts the question and each answer (say, "June") into a declarative hypothesis sentence h_i, i.e., "In New York State, the longest period of daylight occurs during June".

Candidate Premises Retrieval
The knowledge base K consists of unstructured text, which makes a large amount of text available as a knowledge source for question answering. Given a hypothesis, as shown in the right part of Figure 2, NeuNLI first retrieves candidate premises. Specifically, a premise is one of the sentences in the knowledge base K = {p_1, ..., p_n}. Given a hypothesis h_i, we obtain representations of h_i and of each p_j in K by averaging their GloVe word embeddings (Pennington et al., 2014). We then calculate the cosine similarity between h_i and each p_j to find the top k relevant candidate premises (k is tuned on the development set).
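The retrieval step above can be sketched as follows; the tiny embedding table here is a hypothetical stand-in for real 300-dimensional GloVe vectors.

```python
import numpy as np

def sentence_vec(sentence, glove):
    """Average the word vectors of a sentence's tokens (zeros if no hit)."""
    vecs = [glove[w] for w in sentence.lower().split() if w in glove]
    if not vecs:
        return np.zeros_like(next(iter(glove.values())))
    return np.mean(vecs, axis=0)

def retrieve_premises(hypothesis, knowledge_base, glove, k=4):
    """Return the top-k knowledge-base sentences by cosine similarity."""
    h = sentence_vec(hypothesis, glove)
    def cos(a, b):
        denom = np.linalg.norm(a) * np.linalg.norm(b) + 1e-9
        return float(a @ b) / denom
    ranked = sorted(knowledge_base,
                    key=lambda p: -cos(h, sentence_vec(p, glove)))
    return ranked[:k]

# Toy 2-d "embedding table" standing in for GloVe.
glove = {"fish": np.array([1.0, 0.0]), "fins": np.array([0.9, 0.1]),
         "swim": np.array([0.8, 0.2]), "rocks": np.array([0.0, 1.0])}
kb = ["fins help fish swim", "rocks rocks rocks"]
print(retrieve_premises("fish fins", kb, glove, k=1))  # ['fins help fish swim']
```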

Contextualized Neural Natural Logic Prover
Candidate Proof Path Generation. As shown in Figure 2, starting from the hypothesis at the root, the backward proof process generates proof paths to help find supporting premises from the candidate premise pool retrieved above. Intuitively, there is no need to mutate every word in a hypothesis, so we first identify function words in advance. For each word in a hypothesis, we use the NLTK toolkit (Bird et al., 2009) to obtain its part-of-speech tag and apply rules to filter out words that have little influence on the semantics of the hypothesis. Words with the following parts of speech are neglected: prepositions, determiners, coordinating conjunctions, cardinal numbers, personal pronouns, and modal verbs. Punctuation and stop words are also excluded. Subsequently, we conduct inference starting from the original hypothesis h_i, which consists of L words, h_i = (w_{i1}, ..., w_{il}, ..., w_{iL}). We first mask a word in the hypothesis and then feed it into BERT to predict the masked token, as shown in the upper-left subfigure of Figure 2.
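The content-word filter can be sketched as below. In practice NLTK's pos_tag would supply the (word, tag) pairs; the small stop-word list here is a stand-in for NLTK's full list, and the tag set is the Penn Treebank one that NLTK uses by default.

```python
# Penn Treebank tags for the word classes the paper neglects: prepositions,
# determiners, coordinating conjunctions, cardinal numbers, personal
# pronouns, and modal verbs.
SKIP_TAGS = {"IN", "DT", "CC", "CD", "PRP", "PRP$", "MD"}
STOP_WORDS = {"is", "the", "a", "of", "to", "in"}  # stand-in for NLTK's list
PUNCT = set(".,;:!?\"'()")

def mutable_positions(tagged_sentence):
    """Indices of words worth mutating (content words only)."""
    keep = []
    for idx, (word, tag) in enumerate(tagged_sentence):
        if tag in SKIP_TAGS or word.lower() in STOP_WORDS or word in PUNCT:
            continue
        keep.append(idx)
    return keep

tagged = [("The", "DT"), ("longest", "JJS"), ("period", "NN"),
          ("of", "IN"), ("daylight", "NN"), ("occurs", "VBZ"),
          ("during", "IN"), ("June", "NNP"), (".", ".")]
print(mutable_positions(tagged))  # [1, 2, 4, 5, 7]
```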
The probability of the word w at the l'th position of h_i under parameters θ is defined by:

p_l(w | θ) = softmax(f_θ(S_\l))ᵀ w_o,

where w_o is the one-hot vector for the word w at the l'th position, f_θ(·) is a multi-layer bidirectional transformer model (Vaswani et al., 2017), and S_\l = (w_{i1}, ..., w_{i,l-1}, [MASK], w_{i,l+1}, ..., w_{iL}). To narrow the semantic distance, the reverse search also deals with lexical insertion and deletion. For example, the sentence some grey squirrels eat nuts would entail some squirrels eat nuts by lexical insertion. Deleting a noun (or verb) is very likely to result in an incomplete sentence, whereas inserting a noun (or verb) does not guarantee that the resulting sentence is grammatical. Thus, we only insert (or delete) adjectives when conducting inference. Generating candidate words for insertion also utilizes the mask mechanism: we insert a mask token in front of the corresponding noun. The positions of insertions and deletions are tagged to avoid repeated insertion/deletion operations at the same location.
Due to the nature of masked language modeling, we take advantage of the mask mechanism for lexical mutation. In this way, the context of the mutated word w_l can be considered. According to the probability p_l(w | θ), the candidate words are ranked in descending order: the higher the probability, the more relevant the candidate word w'_l is to the original word w_l. So far, we have obtained a list of candidate words.
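The mask-and-rank step can be sketched as follows. The masked_lm_probs function below is a hypothetical stand-in for BERT's masked-token head f_θ, with fixed logits over a toy vocabulary; in the real model these probabilities come from feeding S_\l through the transformer.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

VOCAB = ["longest", "shortest", "greatest", "fish"]

def masked_lm_probs(sentence_with_mask):
    """Stand-in for BERT: fixed hypothetical logits over a toy vocabulary."""
    logits = np.array([2.5, 2.1, 1.0, -3.0])
    return softmax(logits)

def candidate_mutations(sentence_with_mask, top_n=3):
    """Rank vocabulary words by p_l(w | theta), descending."""
    probs = masked_lm_probs(sentence_with_mask)
    order = np.argsort(-probs)[:top_n]
    return [(VOCAB[i], float(probs[i])) for i in order]

cands = candidate_mutations("the [MASK] period of daylight occurs during June")
print([w for w, _ in cands])  # ['longest', 'shortest', 'greatest']
```

Note that "shortest" is ranked highly here despite reversing the meaning, which is exactly why the filtering step below is needed.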
Proof Path Filtering. The mask mechanism does not guarantee semantic coherence with the original hypothesis, as shown in the left part of Figure 2. The original hypothesis is ". . . the longest period of daylight . . . ". Through mutating the word "longest", the candidate words may contain the word "shortest", which fits the grammar and context of the sentence well but changes the semantics of the original hypothesis. To keep a high semantic similarity, we need to judge whether a mutation operation changes the semantics of the original hypothesis using a logical relation prediction module, and filter out incorrect mutations.

Table 2: The projection function φ when the lexical polarity of the mutated word is downward. The input r is the predicted lexical relation between the mutated word and the mutating word. Note that the projection function φ is the identity function when the lexical polarity of the mutated word is upward.

Here, the candidate word w'_l is "shortest", while the mutated word w_l is "longest". First, we use the fine-tuned RoBERTa (Liu et al., 2019) to predict the logical relation between "shortest" and "longest". The input to RoBERTa is the word pair (w_l, w'_l), where w_l is assigned to segment 0 and w'_l is assigned to segment 1. The predicted result is the negation relation (^), computed from the representation of the [CLS] token.
Then, we use the projection function φ to obtain the sentence-level semantic relation from the predicted lexical relation and the lexical polarity of the word w_l. If the lexical polarity is upward, the sentence-level relation is identical to the lexical relation. Otherwise, the projection from the word-level relation to the sentence-level relation is performed as shown in Table 2. We employ the Stanford natlog parser to acquire the lexical polarity of words. For example, as the polarity of the mutated word "longest" is upward and the logical relation between "longest" and "shortest" is ^, the semantic relation between the hypothesis h_i and the intermediate hypothesis h'_i remains ^. If the predicted polarity of "longest" were downward, the sentence-level relation would be |. As we only conduct inference on the sentence-level relations ≡ and ⊐, this mutation would be filtered out.

Entailment Score Estimation in Hyperbolic Space

In the tree-like, hierarchical structure constructed by the reasoning process, the number of intermediate hypotheses grows exponentially, while the volume of Euclidean space grows only polynomially, which leads to structural distortion in Euclidean space (Sarkar, 2011; Sala et al., 2018). Additionally, natural language text itself exhibits hierarchical structure. Thus, we calculate the entailment scores in hyperbolic space, as shown in the right part of Figure 2. Here, we choose the Poincaré ball model (Cannon et al., 1997) to project the candidate premise and intermediate hypothesis into hyperbolic space to acquire more precise representations. We exploit the re-parameterization technique (Dhingra et al., 2018; López et al., 2019; Cao et al., 2020) to implement it, which involves calculating a direction vector m and a norm magnitude μ. Taking the Euclidean premise representation v^E_{p_j} as an example:

m_{p_j} = v^E_{p_j} / ||v^E_{p_j}||,    μ_{p_j} = σ(||v^E_{p_j}||),

where σ is the sigmoid function, ensuring the resulting norm μ_{p_j} ∈ (0, 1). The re-parameterized premise representation is defined as v^H_{p_j} = μ_{p_j} m_{p_j}, which lies in the hyperbolic space B^d_H. The re-parameterization technique avoids the need for stochastic Riemannian optimization (Bonnabel, 2013); instead, we can exploit AdamW (Loshchilov and Hutter, 2019) to update the parameters of the entire model.
The entailment score in hyperbolic space is based on the hyperbolic distance:

d_H(v^H_{h_i}, v^H_{p_j}) = arcosh( 1 + 2 ||v^H_{h_i} - v^H_{p_j}||² / ((1 - ||v^H_{h_i}||²)(1 - ||v^H_{p_j}||²)) ),

which is then mapped to a scalar entailment score s_ij. The maximum entailment score s_i = max_j(s_ij) is used as the supporting probability of the hypothesis h_i, i.e., the probability of the corresponding answer. This is repeated for all answers, and the answer with the highest entailment score max_i(s_i) is selected as the correct answer.
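A minimal sketch of the re-parameterization and the Poincaré distance follows. The exact form of the norm magnitude (here sigmoid of the Euclidean norm) and the use of negated distance as the score are our assumptions, not details confirmed by the paper.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def to_poincare(v_euclidean):
    """Re-parameterize a Euclidean vector into the Poincare ball:
    direction m = v/||v||, norm mu = sigmoid(||v||) in (0, 1).
    (The choice of sigmoid(||v||) for mu is an assumption.)"""
    norm = np.linalg.norm(v_euclidean)
    m = v_euclidean / (norm + 1e-9)
    mu = sigmoid(norm)
    return mu * m

def poincare_distance(u, v):
    """Hyperbolic distance in the Poincare ball model."""
    sq = np.sum((u - v) ** 2)
    denom = (1 - np.sum(u ** 2)) * (1 - np.sum(v ** 2))
    return float(np.arccosh(1 + 2 * sq / denom))

h = to_poincare(np.array([0.5, 1.0]))   # intermediate hypothesis embedding
p = to_poincare(np.array([0.6, 0.9]))   # candidate premise embedding
assert np.linalg.norm(h) < 1 and np.linalg.norm(p) < 1  # inside the ball
score = -poincare_distance(h, p)  # assumed: closer pairs score higher
```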

Gumbel-Softmax Training
Note that the above selection process is not differentiable, so the training signal cannot be passed to the parameters of the pre-trained language model. To address this, we adopt the Gumbel-Softmax technique (Jang et al., 2017) to train the whole process in an end-to-end manner; Gumbel-Softmax has been shown to be an effective approximation of discrete variables. We therefore use

w_j = exp((log(p_l(w_j | θ)) + g_j) / τ) / Σ_i exp((log(p_l(w_i | θ)) + g_i) / τ)   (4)

as the approximation of the one-hot vector of a selected mutating word at the l'th position, where w_i is the i'th token in the vocabulary of the BERT model, the g_j are i.i.d. samples drawn from Gumbel(0, 1), and τ is a constant that controls the smoothness of the distribution.
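Equation (4) can be sketched directly: Gumbel(0, 1) noise is added to the log-probabilities, and a temperature-scaled softmax yields a differentiable soft one-hot vector.

```python
import numpy as np

def gumbel_softmax(log_probs, tau=1.0, seed=None):
    """Differentiable approximation of sampling a one-hot word vector
    (Eq. 4): softmax((log p + g) / tau) with g ~ Gumbel(0, 1)."""
    rng = np.random.default_rng(seed)
    u = rng.uniform(1e-12, 1.0, size=log_probs.shape)
    g = -np.log(-np.log(u))                # Gumbel(0, 1) samples
    y = np.exp((log_probs + g) / tau)
    return y / y.sum()

p = np.array([0.7, 0.2, 0.1])              # p_l(w | theta) over a tiny vocab
w_soft = gumbel_softmax(np.log(p), tau=0.5, seed=0)
assert np.isclose(w_soft.sum(), 1.0)       # a valid soft one-hot vector
```

As τ decreases toward 0, the samples approach true one-hot vectors; larger τ gives smoother, lower-variance gradients.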

Objective Function
We normalize the prediction scores across all candidate answers using the softmax function and train the model with the cross-entropy loss:

L = - Σ_{i=1}^{C} t_i log( exp(s_i) / Σ_{j=1}^{C} exp(s_j) ),

where C is the number of candidate answers, s_i is the entailment score corresponding to answer i, and t_i is 1 when the i'th candidate answer is correct and 0 otherwise.
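A minimal sketch of this loss, assuming four candidate answers:

```python
import numpy as np

def qa_loss(scores, correct_index):
    """Cross-entropy over the softmax-normalized entailment scores s_i."""
    e = np.exp(scores - np.max(scores))    # stable softmax
    probs = e / e.sum()
    return float(-np.log(probs[correct_index]))

s = np.array([2.0, 0.5, 0.1, -1.0])        # entailment scores for 4 answers
loss = qa_loss(s, correct_index=0)
assert loss < qa_loss(s, correct_index=3)  # right answer -> lower loss
```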

Experiment Set-Up
Datasets, Baselines, and Implementation Details. We evaluate the performance of our model on two publicly available datasets (Angeli et al., 2016). Both datasets are made up of non-diagram multiple-choice science questions from the New York Regents 4th Grade Science Exams (NYSED, 2014). We use the same datasets (QA-S and QA-L) and knowledge bases (Barron's and SCITEXT) as the baseline (Angeli et al., 2016). The details of the datasets and knowledge bases can be found in Appendix A. We compare NeuNLI with Solr, Classifier, Evaluation Function, and NaturalLI (Angeli et al., 2016), HyperQA (Tay et al., 2018), SemBERT (Zhang et al., 2020), and NeuNLI-E. Descriptions of the baseline methods are detailed in Appendix B. Additionally, experiment settings are further discussed in Appendix C.

Construction of Lexical Relation Prediction Corpus

To better predict lexical relations between the original word and the candidate mutating word, we build a set of lexical pairs to train the prediction model. These lexical pairs are built upon the lexical knowledge base WordNet (Miller, 1992). We regard words in the same WordNet synset as having the equivalence relation ≡. Words with hypernymy and hyponymy relations in WordNet are cast as having the forward (⊏) and reverse (⊐) entailment relations, respectively. The antonymy relation in WordNet is naturally projected to the negation relation ^ of natural logic. For a synset in WordNet, the relation between its hypernyms (or between its hyponyms) is regarded as the alternation relation | in natural logic. Besides, for a synset in WordNet, the relation between its hyponyms and its antonyms is regarded as the cover relation ‿ in natural logic. As for the independence relation # in natural logic, we randomly extract lexical pairs from WordNet, filter out pairs that hold any of the other six lexical relations, and regard the rest as independent.
The number of lexical pairs for each of the seven natural logic relations is shown in Table 3. We split the pairs of each relation with a ratio of 8:1:1 into training, development, and test sets to fine-tune a pre-trained language model.

Experiment Results
We list the test accuracies of the baseline methods and NeuNLI on the two test sets in Table 4 (QA-S) and Table 5 (QA-L), respectively. In Table 4, we also present results utilizing two different knowledge bases.

Table 4 (excerpt): Test accuracy (%) on QA-S with the two knowledge bases.

  HyperQA (Tay et al., 2018)     54   62
  SemBERT (Zhang et al., 2020)   53   59
  NeuNLI-E (Ours)                57   67
  NeuNLI (Ours)                  64*  72*

We find that: (1) Compared with NaturalLI (Angeli et al., 2016), our method performs better because we consider contextual information during the natural logic-based reasoning process. This helps to reduce unnecessary expansion of irrelevant lexical mutations and makes NeuNLI focus on the right reasoning path.
(2) Comparison between HyperQA (Tay et al., 2018) and NeuNLI shows that natural logic-powered neural networks can achieve better performance on the QA datasets. Moreover, the natural logic reasoning process can serve as an explanation of the results, while HyperQA can hardly give a reasonable explanation for its results.
(3) Our method also performs better than SemBERT (Zhang et al., 2020). Both approaches incorporate contextual semantic information with BERT for QA. In comparison, we achieve this through natural logic, which is the main reason for the improvement.
(4) NeuNLI outperforms NeuNLI-E mainly because we learn embeddings of the candidate premise and hypothesis in Hyperbolic space, which can acquire more precise representations.
(5) NeuNLI achieves the best results on the test set with both knowledge bases, Barron's and SCITEXT. We also notice that the model with the larger knowledge base, SCITEXT, achieves better performance, which coincides with the intuition that more knowledge allows more questions to be answered correctly.
(6) The experimental results on the QA-L test set in Table 5 are consistent with those on the QA-S test set in Table 4, which demonstrates the generalizability of our approach.
Precision of Lexical Relation Prediction. As lexical relation prediction is an important module in NeuNLI and can affect its overall performance, we evaluate this module and show the results in Table 6.
Human Evaluation for Explainability. We quantitatively evaluate the explainability of our model through human evaluation. Specifically, we evaluate NeuNLI on the QA-S dataset with the Barron's knowledge base. We employ three graduate students majoring in natural language processing to rate whether the inference path derived by our model is reasonable, on a scale of {0, 1, 2}: if the semantics of the final intermediate hypothesis and the premise are unrelated, the score is 0; if the two are very close, the score is 2; and if bridging the gap between the two requires the evaluator to imagine a context, the score is 1. For comparison, we set NaturalLI (Angeli et al., 2016) as the baseline, and the significance test is conducted using a paired t-test at a significance level of 0.05. The average scores are shown in Table 7, and the difference is statistically significant. We observe that the score of NeuNLI is significantly higher than that of NaturalLI, mainly because NeuNLI can generate more reasonable words by incorporating contextual semantic information into the natural logic inference process.

Table 7: Average explainability scores.

                              NaturalLI   NeuNLI
  Avg. Explainability Score   1.09        1.31*

For example, consider the hypothesis "in order to survive, all animals need food, water and air". By lexical mutation, NeuNLI obtains the sentence "in order to live, all animals need food, water and air", which is closer to the premise "animals need air, water, and food in order to live and thrive".
Ablation Study. We conduct an ablation study on the QA-S test set with the Barron's knowledge base. The experimental results are shown in Figure 3, from which we observe: (1) Effect of the Number of Relevant Premises. Accuracy increases as the number of relevant premises grows from 1 to 4, mainly because the more knowledge is available to the model, the better the performance. However, when the number of relevant premises exceeds 4, accuracy starts to decrease, as the retrieval method may introduce noisy information.
(2) Effectiveness of Natural Logic-based Reasoning. Comparing NeuNLI with NeuNLI w/o reasoning, we find that the performance improves significantly: the accuracy improves from 57.35% to 64.71% on the QA-S test set with the Barron's knowledge base (with the number of relevant premises set to 4). The same conclusion can be drawn from the comparison between NeuNLI-E and NeuNLI-E w/o reasoning. This indicates that exploiting natural logic-based reasoning is very effective for QA.

Related Work
Question answering systems that integrate deep learning methods have made great progress in recent years (Lukovnikov et al., 2017; Bhandwaldar and Zadrozny, 2018; Jia et al., 2018; Yang et al., 2019). Many works first adopt learnable encoders for sentence representation, such as convolutional encoders (Zhang et al., 2017), recurrent encoders (Tay et al., 2017), and transformers (Yang et al., 2019). An interaction layer is then devised to calculate the semantic similarity, which is the main difference among many models. Severyn and Moschitti (2015) utilize a multi-layer perceptron to combine the CNN-encoded representations. Yang et al. (2016) perform a soft-attention alignment to measure word similarity between the question and the answer.
Though neural network-based models have made great advances in QA, they fall short of illustrating the step-by-step derivation of a prediction, at which logic-based methods are adept (Rocktäschel and Riedel, 2017; Weber et al., 2019; Minervini et al., 2020); this also differs from the widely used attention mechanism (Doshi-Velez and Kim, 2017; Jain and Wallace, 2019). Angeli et al. (2016) proposed a Natural Logic Inference framework that utilizes natural logic to conduct interpretable question answering, viewing open-domain question answering as a textual entailment problem. Our NeuNLI is inspired by natural logic inference but achieves better performance by modeling contextual information during natural logic proving using two pre-trained language models and by training the whole process in an end-to-end fashion.

Conclusion
In this work, we explore the feasibility of combining natural logic with neural networks for interpretable question answering. We present an end-to-end differentiable method for learning the parameters as well as the structure of natural logic rules, which is capable of considering contextual information while conducting natural logic-based reasoning. Experimental results on the Regents Science Exam portion of the Aristo dataset show that our proposed model brings improvements over baseline methods.