Diagnosing the First-Order Logical Reasoning Ability Through LogicNLI

Recently, language models (LMs) have achieved significant performance on many NLU tasks, which has spurred widespread interest in their possible applications in scientific and social domains. However, LMs have faced much criticism over whether they are truly capable of reasoning in NLU. In this work, we propose a diagnostic method for first-order logic (FOL) reasoning with a newly proposed benchmark, LogicNLI. LogicNLI is an NLI-style dataset that effectively disentangles the target FOL reasoning from commonsense inference and can be used to diagnose LMs from four perspectives: accuracy, robustness, generalization, and interpretability. Experiments on BERT, RoBERTa, and XLNet have uncovered the weaknesses of these LMs on FOL reasoning, which motivates future exploration to enhance their reasoning ability.


Introduction
Recently, Transformer-based (Vaswani et al., 2017) language models (LMs), such as BERT (Devlin et al., 2019) and RoBERTa (Liu et al., 2019), have achieved great success on natural language understanding (NLU). However, there are growing concerns about whether LMs can truly understand natural language. Tasks with complex reasoning have provided evidence that LMs lack the expected reasoning abilities (Liu et al., 2020). Even if neural models can make correct predictions, they tend to make decisions through spurious statistical correlations rather than reasoning abilities (Kaushik and Lipton, 2018; Ribeiro et al., 2019; Jiang and Bansal, 2019; McCoy et al., 2019). Therefore, an increasing number of studies have focused on diagnosing specific reasoning abilities of state-of-the-art LMs (Sugawara et al., 2020; Gontier et al., 2020).
First-order logical (FOL) reasoning is one of the most widely used forms of reasoning in natural language (Davis, 2017; Yu et al., 2020). It has a simple paradigm consisting of combinations of seven fundamental logics (FOLs: conjunction ∧, disjunction ∨, negation ¬, implication →, equivalence ≡, universal quantifier ∀, and existential quantifier ∃) with simple propositions (Davis, 2017). Nevertheless, whether LMs can truly perform FOL reasoning is still inconclusive in NLP (Hahn et al., 2021; Clark et al., 2020).
We therefore propose a systematic diagnostic method for FOL reasoning by introducing a novel benchmark, named Logical Natural Language Inference (LogicNLI). The proposed benchmark follows three principles: 1) it includes abundant logical expressions covering all seven FOLs and their commonly used combinations in text; 2) its instances conform to natural language; 3) it introduces as little commonsense as possible to prevent the target FOL reasoning and commonsense inference from being entangled with each other (Clark et al., 2020). Following these principles, LogicNLI is an NLI-style dataset (Bowman et al., 2015; Talmor et al., 2020) consisting of triplets of facts, rules, and a statement. The objective is to determine the logical relation (entailment, contradiction, or neutral in NLI (Bowman et al., 2015)) between the premise (facts and rules) and its corresponding hypothesis (statement) by FOL reasoning, as shown in Figure 1. In practice, we introduce an additional logical relation, "Paradox", to represent the situation where the hypothesis and its negation can both be entailed by the premise simultaneously along different reasoning paths (bottom of Figure 1). This novel logical relation forces the model to search at least two reasoning paths to infer the truth of two opposing propositions, thereby effectively avoiding spurious correlations caused by dataset bias.
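As a sketch, the four-way labeling scheme can be viewed as a function of two oracle queries: whether the premise entails the statement, and whether it entails the statement's negation. Here `entails` verdicts are assumed to come from a hypothetical FOL solver, which is not part of the dataset itself:

```python
def fol_label(entails_s: bool, entails_not_s: bool) -> str:
    """Map two FOL solver verdicts to a LogicNLI label.

    entails_s:     premise |= statement s
    entails_not_s: premise |= negation of statement (not s)
    """
    if entails_s and entails_not_s:
        return "paradox"        # both s and not-s are derivable along different paths
    if entails_s:
        return "entailment"
    if entails_not_s:
        return "contradiction"
    return "neutral"            # neither is derivable (open-world assumption)

assert fol_label(True, True) == "paradox"
assert fol_label(False, False) == "neutral"
```

Note that "paradox" requires both queries to succeed, which is why a model cannot decide it from a single reasoning path.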
[Figure 1: Example LogicNLI instance pairing natural-language sentences with their FOL forms, e.g., "Anna is static." ↔ Static(Anna); "If there is someone who is clever, then Bob is not good." ↔ ∃x Clever(x) → ¬Good(Bob).]

Based on LogicNLI, we propose a systematic diagnostic approach that comprehensively considers four perspectives: accuracy, robustness to irrelevant information, more-hop generalization, and proof-based traceability. We perform the diagnosis on three state-of-the-art LMs: BERT (Devlin et al., 2019), RoBERTa (Liu et al., 2019), and XLNet (Yang et al., 2019). Results reveal that LMs can neither fully understand logical rules nor apply them to reason like humans. In conclusion, our main contributions include: 1) we design a novel benchmark, LogicNLI, following three basic principles, to diagnose LMs' FOL reasoning ability; this method of benchmark construction generalizes to other reasoning types in NLU; 2) based on LogicNLI, we design a diagnostic approach composed of accuracy, robustness, generalization, and traceability, which measures LMs' FOL reasoning ability from different perspectives; 3) results on three LMs show that even the best-performing model on LogicNLI, RoBERTa, cannot fully reason according to logic or generalize to different scenarios. Our analysis can inspire further exploration of the logic that LMs fail to comprehend.
Related Work

FOL Reasoning Benchmark
FOL reasoning is a fundamental NLU ability that has attracted an increasing number of benchmarking studies. LogiQA (Liu et al., 2020) and ReClor (Yu et al., 2020) are two comprehensive datasets with domain knowledge. However, even if a model performs poorly on these datasets, it is inconclusive that the model lacks FOL reasoning ability, because the target ability cannot be disentangled from other reasoning abilities, such as commonsense inference. In addition, these two datasets do not provide proofs to trace back the reasoning process. CLUTRR (Sinha et al., 2019) also requires two FOLs but focuses more on understanding predicate relations (which belongs to commonsense). LTL (Hahn et al., 2021) is a propositional logic benchmark containing five FOLs but does not conform to natural language. Clark et al. (2020) propose a series of novel FOL benchmarks (named SoftReasoner) that introduce as little commonsense as possible, but they concentrate on a specific FOL combination, conjunctive implication with negation, rather than on diverse FOL forms. Inspired by SoftReasoner (Clark et al., 2020), we construct LogicNLI with common combinations of all seven FOLs to diagnose FOL reasoning ability. Compared with other datasets (shown in Table 1), LogicNLI covers the most comprehensive FOL forms and effectively separates logic from commonsense. Furthermore, LogicNLI provides all proofs for each instance so that we can evaluate LMs' FOL reasoning from different perspectives.

Task Definition
In this section, we introduce how the task on LogicNLI is defined and, on this basis, exhibit how FOL reasoning is embodied in LogicNLI. We first define the elements of LogicNLI: facts F = {f_1, f_2, ..., f_n} are simple propositions; rules R = {r_1, r_2, ..., r_m} are compound propositions with FOLs; the statement s is the target proposition; the premise P = (F, R) includes all facts and rules. Based on the above definitions, the final objective of LogicNLI is to determine the logical relation between P and s under two assumptions: 1) the world assumption is open (OWA); 2) the statement s and …

[Figure 2 rules: (R1) If someone is alive, then he is neither grieving nor worrisome. (R2) If there is at least one person who is distinct, then Alan is grieving. (R3) Harold being alive is equivalent to Alan being grieving. (R4) Someone being both worrisome and drab is equivalent to being colorful and distinct.]

Overview
LogicNLI includes more than 30K instances consisting of facts, rules, a statement to be judged, proofs, the reasoning path, and the label (shown in Figure 2). Each instance requires a multi-hop FOL reasoning process to reach the final answer. To simplify the reasoning process, we set two limitations: 1) we only consider reasoning from cause to effect; 2) we neglect the true meanings of predicates. Therefore, LogicNLI is better suited to benchmarking the specific (FOL) reasoning ability than to serving as a comprehensive NLU task. Accordingly, we leave open the question of how LMs perform in real reasoning scenarios with FOLs, because it is difficult to disentangle the multiple influencing factors there.
LogicNLI also provides four kinds of test sets that correspond to the four diagnostic abilities: total accuracy, robustness to irrelevant information, more-hop generalization, and proof-based traceability. Specifically, we attempt to answer the following questions about FOL reasoning ability based on these evaluations. Q1: Do models truly perform FOL reasoning automatically in diverse scenarios? Q2: Do the reasoning results accord with reasonable logic? Accuracy, robustness, and generalization are adopted to answer Q1 under different conditions. Accuracy is the most common in-domain evaluation, measuring the overall performance of LMs. Compared with accuracy, the robustness test offers a scenario that increases or decreases non-proof sentences; as robustness does not change the reasoning process, it can be regarded as an in-domain evaluation. The generalization test offers a scenario that increases the number of reasoning hops and therefore the number of proofs, so it is an out-of-domain evaluation. The traceability test is introduced to answer Q2 by validating the whole reasoning process against the proofs.

Dataset Generation and Statistics
We adopt a semi-automatic method to generate LogicNLI in two steps: 1) logic generation, and 2) natural language generation. For logic generation, we adopt an automatic method to generate each logic expression to ensure the validity of FOL reasoning. Specifically, we first select a list of subjects, S = {s_i}, i ≤ n, and a list of adjectives as predicates, P = {p_j}, j ≤ m, and define a set of logical templates T in advance. For each instance, we randomly select logic expressions from T and the corresponding subjects and predicates from S and P. For natural language generation, we first adopt a rule-based method to generate initial language expressions and then make manual revisions. Manual correction aims to fix grammatical errors and semantic ambiguities, and it also enhances the diversity of expressions. For the test sets targeting different abilities, we add extra constraints to the above generation method to produce data that meets each need.
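The logic generation step can be sketched as follows; the vocabulary and templates below are illustrative stand-ins, not the paper's actual lists:

```python
import random

# Hypothetical vocabulary and templates; the paper's actual S, P, and T differ.
SUBJECTS   = ["Anna", "Bob", "Harold", "Alan"]
PREDICATES = ["static", "clever", "good", "round", "alive"]
TEMPLATES  = [
    "If {x} is {p}, then {y} is {q}.",                 # implication
    "Someone who is {p} is always {q}.",               # universal quantifier
    "{x} being {p} is equivalent to {y} being {q}.",   # equivalence
]

def sample_rule(rng: random.Random) -> str:
    """Instantiate one randomly chosen logic template with random
    subjects and predicates (unused placeholders are simply ignored)."""
    template = rng.choice(TEMPLATES)
    x, y = rng.sample(SUBJECTS, 2)
    p, q = rng.sample(PREDICATES, 2)
    return template.format(x=x, y=y, p=p, q=q)

rule = sample_rule(random.Random(0))
assert any(p in rule for p in PREDICATES)
```

The rule-based natural language expressions produced this way would then be manually revised, as described above.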
Statistics of LogicNLI are listed in Table 2. LogicNLI includes 9 training sets, 9 development sets, and 15 test sets. We adopt different subjects and predicates for the independently constructed training, development, and test sets to avoid spurious correlations between subjects and predicates. To mitigate label bias, we ensure the balance of labels in each dataset.

Diagnosis
Total Accuracy is the most intuitive indicator of a model's performance on most NLU tasks (Storks et al., 2019), but it may not be sufficient, as it cannot rule out the impact of spurious correlations. In this work, the accuracy test set (Test-A) has a distribution similar to the training and development sets, except that the subjects and predicates are zero-shot.
Robustness to Irrelevant Information is an in-domain evaluation that measures the model's ability to extract relevant information from noisy data, which is typically the first step in many NLU tasks. Unlike Sinha et al. (2019), our work focuses on the amount of noise rather than its taxonomy. Therefore, we adopt an elimination method to generate training sets (Train-R), development sets (Dev-R), and test sets (Test-R). Firstly, facts and rules are classified into relevant sentences (R1, R2, and R3 in Figure 2) and irrelevant sentences (R4 in Figure 2). Secondly, we fix the relevant sentences to ensure that the label remains unchanged and gradually eliminate irrelevant ones. We thereby acquire robustness sets with different numbers of facts and rules (from 10 to 24 in steps of 2).
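The elimination procedure can be sketched as below: the relevant sentences are fixed so the label is unchanged, and only the amount of irrelevant padding varies (function and field names are illustrative):

```python
def robustness_sets(relevant, irrelevant, sizes=range(10, 25, 2)):
    """Build robustness variants of one instance: keep all relevant
    sentences (so the label is preserved) and pad with a varying
    number of irrelevant sentences to reach each target size."""
    variants = {}
    for n in sizes:
        n_noise = n - len(relevant)
        if n_noise < 0 or n_noise > len(irrelevant):
            continue  # cannot build a variant of this size
        variants[n] = list(relevant) + list(irrelevant[:n_noise])
    return variants

rel = [f"relevant-{i}" for i in range(6)]
irr = [f"irrelevant-{i}" for i in range(18)]
sets = robustness_sets(rel, irr)
assert sorted(sets) == [10, 12, 14, 16, 18, 20, 22, 24]
assert all(v[:6] == rel for v in sets.values())
```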
More-hop Generalization is an out-of-domain indicator of whether a model truly understands logical rules and applies them to reasoning instances. Following the setting in CLUTRR (Sinha et al., 2019), generalization can be measured by training a model on examples with ≤ k-hop reasoning and evaluating it on examples with > k-hop reasoning. Therefore, we generate a series of more-hop test sets (Test-G) simply by controlling the number of generation iterations during logic generation.
Proof-based Traceability is used to post-verify whether a model infers the correct answer according to human-understandable logic. In multi-hop reasoning tasks, it is reasonable to measure traceability through proofs (Yang et al., 2018; Gontier et al., 2020). Therefore, we propose proof-based traceability (an example of proofs is shown in Figure 2) based on the intuition that if a model infers the correct answer along the right reasoning paths, it will also correctly validate each proof. Specifically, we construct a traceability test set (Test-T) with 6-hop instances to make the final task an out-of-domain evaluation while ensuring that the judgments of proofs are in-domain. Since "Neutral" samples do not provide any proofs, we remove them. To perform the diagnosis, we first train the model on the training set and test it on Test-T. Next, we extract the correctly predicted instances to form the target set. We then revise all proofs of the target set into positive expressions, to avoid the impact of the "negation" logic on the evaluation, and re-annotate them. Finally, inspired by the exact match metric (Yang et al., 2018), we define proof-based exact match (P-EM) as the percentage of instances whose proofs are all correctly predicted. We adopt P-EM and proof accuracy (P-Acc) to measure traceability.
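The two traceability metrics can be sketched as follows, assuming each instance carries one boolean flag per proof step (True = the model validated that proof correctly); the representation is an assumption for illustration:

```python
def proof_metrics(predictions):
    """Compute (P-Acc, P-EM) from per-instance proof verdicts.

    predictions: list of per-instance lists of booleans,
                 one flag per proof step.
    """
    total_steps = sum(len(p) for p in predictions)
    correct_steps = sum(sum(p) for p in predictions)
    p_acc = correct_steps / total_steps                         # proof accuracy
    p_em = sum(all(p) for p in predictions) / len(predictions)  # exact match
    return p_acc, p_em

p_acc, p_em = proof_metrics([[True, True], [True, False], [True, True, True]])
assert p_em == 2 / 3            # only instances 1 and 3 are fully validated
assert abs(p_acc - 6 / 7) < 1e-9
```

P-EM is the stricter of the two: one wrong proof step zeroes out the whole instance, which matches the "completeness of the whole logical chain" criterion used later in the results.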

Degraded LogicNLI
"Paradox" describes a virtual scenario that is uncommon in text, so most classic NLI tasks do not include this relation. To further understand why we introduce "Paradox" into LogicNLI, we construct a degraded dataset, named d-LogicNLI, as a comparison. Compared with LogicNLI, d-LogicNLI only contains premises and hypotheses with the logical relations "Entailment", "Contradiction", and "Neutral". From the perspective of dataset construction, we only need to add a filter in the logic generation stage to remove paradox propositions. The statistics of d-LogicNLI are listed in Table 2.
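The degradation step amounts to a label filter after logic generation; the instance representation here is a hypothetical sketch:

```python
def degrade_dataset(instances):
    """Drop paradox instances to obtain the three-label d-LogicNLI subset.
    Each instance is assumed to be a dict with a 'label' field."""
    return [ex for ex in instances if ex["label"] != "paradox"]

data = [{"label": "entailment"}, {"label": "paradox"}, {"label": "neutral"}]
assert [ex["label"] for ex in degrade_dataset(data)] == ["entailment", "neutral"]
```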

Experimental Settings
We conduct experiments on three state-of-the-art language models (LMs), BERT (Devlin et al., 2019), RoBERTa (Liu et al., 2019), and XLNet (Yang et al., 2019), to systematically measure their FOL reasoning ability. For a fair comparison, the experimental settings are listed in Table 3. We set random selection and human performance as the lower and upper bounds of accuracy. For the human performance evaluation, we employ four Ph.D. students and five post-graduate students from different majors and report average scores on 500 instances randomly selected from the development and test sets. We consider a question correctly answered if one of the students gives the correct answer.

Results
Total Accuracy. From Table 4, all three LMs perform better than random guessing (25.0%) but worse than humans (77.5%). RoBERTa performs the best among the three.

Robustness to Irrelevant Information. Table 4 shows the average results on all Dev-R(s) and Test-R(s). Similar to accuracy, RoBERTa's performance is slightly better than XLNet's, but the gap between the two is not significant. BERT still performs the worst on both Dev-R and Test-R. Average accuracy on Test-R(s) alone cannot reflect robustness directly, so we plot the trend of the results on Test-R(s) against the number of sentences (facts + rules) in Figure 3. All three LMs show downward trends as the number of irrelevant sentences increases. The performances of BERT and RoBERTa decrease evenly as the noise increases, while XLNet's performance fluctuates at first but declines rapidly later. Furthermore, we calculate the degradation rate δ_R from the 10-sentence Test-R to the 24-sentence Test-R to measure robustness. Since the descent is non-linear, we replace the original polylines with their fitting lines (dotted lines in Figure 3). The resulting degradation rates are …, …, and 21.5%, which shows that XLNet's robustness is slightly better than BERT's and RoBERTa's.
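The fitted-line variant of δ_R described above can be sketched with a plain least-squares fit; the accuracy readings below are hypothetical, for illustration only:

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit y = a*x + b (the dotted fitting lines)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    a = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    return a, my - a * mx

def delta_r(num_sentences, accuracies):
    """Degradation rate (%) from the 10-sentence to the 24-sentence Test-R,
    read off the fitted line rather than the raw polyline."""
    a, b = fit_line(num_sentences, accuracies)
    y10, y24 = a * 10 + b, a * 24 + b
    return (y10 - y24) / y10 * 100

# Hypothetical accuracy readings at each Test-R size.
xs = [10, 12, 14, 16, 18, 20, 22, 24]
ys = [60, 59, 57, 56, 54, 53, 51, 50]
assert 0 < delta_r(xs, ys) < 100
```

Reading the endpoints off the fitted line rather than the raw curve smooths out the fluctuations visible for XLNet in Figure 3.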
More-hop Generalization. We plot accuracies on Test-A and each Test-G in Figure 4 and show the total accuracy on Test-G in Table 4. From Figure 4, all three LMs' performances drop dramatically when transferring from in-domain to out-of-domain scenarios. However, their out-of-domain accuracies remain almost stable as the number of hops continues to increase (up to 10). To further compare generalization, we define an indicator, δ_{A→G} = (M1 − M2) / M1 × 100%, to reflect the percentage of performance degradation when transferring from in-domain to out-of-domain scenarios, where M1 is the in-domain result on Test-A and M2 is the average out-of-domain result on Test-G. The performance degradation rates of BERT, RoBERTa, and XLNet are 43.5%, 26.9%, and 34.3%, respectively. Therefore, RoBERTa shows the best generalization when transferring to more-hop reasoning, while BERT cannot effectively understand logical rules and apply them to out-of-domain instances.
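The indicator above is a simple relative drop; as a minimal check (the accuracies here are hypothetical, not the paper's numbers):

```python
def delta_a_to_g(m_in: float, m_out: float) -> float:
    """δ_{A→G} = (M1 − M2) / M1 × 100: relative drop (%) when moving from
    the in-domain Test-A result M1 to the average Test-G result M2."""
    return (m_in - m_out) / m_in * 100

# Hypothetical accuracies for illustration only.
assert delta_a_to_g(80.0, 60.0) == 25.0
assert delta_a_to_g(50.0, 50.0) == 0.0
```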
Proof-based Traceability. Considering P-Acc on Test-T (Table 4), 87.6% of proofs can be validated when RoBERTa makes the prediction; even BERT can explain more than 60% of proofs. However, we usually judge whether an instance is understood logically by verifying the completeness of the whole logical chain rather than the ratio of understandable proofs. Therefore, P-EM is more suitable than P-Acc for measuring traceability. Considering P-EM, RoBERTa can fully validate 53.1% of its correctly predicted instances, while BERT and XLNet can only validate 9.3% and 28.6%, respectively.

[Table 4: Diagnostic results of LMs on LogicNLI. All values are percentages (%) except #Target and #Proof.]

This result means that RoBERTa is the only LM that can perform FOL reasoning to some extent, with significantly better proof-based traceability than BERT and XLNet. However, even the best model, RoBERTa, can only explain approximately half of its predictions, indicating that the overall predictions made by LMs do not conform to human logic.

Overall Diagnosis
Considering the four evaluations comprehensively, RoBERTa has the best FOL reasoning ability in complex scenarios and is the only one of the three LMs that provides a certain degree of traceability. Considering accuracy (in-domain evaluation) and generalization (out-of-domain evaluation), RoBERTa performs significantly better than BERT and XLNet. Especially when transferring from in-domain to out-of-domain scenarios, RoBERTa's degradation ratio is significantly lower than BERT's and XLNet's, which means that RoBERTa is better at understanding logical rules and applying them than the other two LMs. This conclusion is also supported by the traceability test: although BERT and XLNet can make correct predictions to some extent, most of these results cannot be traced back through the validation of proofs, while a certain percentage of RoBERTa's predictions can be explained. Finally, as for robustness, RoBERTa is indeed more susceptible to irrelevant information than XLNet. Even so, RoBERTa still performs better than XLNet on the robustness test, as XLNet's performance drops rapidly after reaching a threshold. In general, even RoBERTa is still a long way from real FOL reasoning. On the one hand, its performance needs to improve in both in-domain and out-of-domain scenarios. On the other hand, even when RoBERTa makes the correct prediction, nearly half of its predictions remain unexplainable. The gap between LMs and humans motivates us to explore more effective ways of reasoning in NLU; neuro-symbolic models may be one solution for FOL reasoning (Kalouli et al., 2020).

[Table 6: Performance on each FOL (%). Except for ¬, other FOLs cannot imply "Paradox", so we remove "Paradox" and the random accuracy is 33.3%.]

Analysis of Each FOL
To further understand the FOL reasoning ability, we analyze how LMs understand each FOL. Specifically, we need to disentangle the target FOL from the other FOLs by adding logical filters during template selection. Among the seven FOLs, only implication and equivalence can be fully disentangled from the others and used directly for reasoning, while the others alone cannot constitute complete reasoning. Therefore, we combine the other five FOLs with the implication logic to make the reasoning process effective. Statistics are shown in Table 5, and results of the per-FOL experiments are shown in Table 6. RoBERTa outperforms the other two LMs on all FOLs. Considering each FOL, LMs hardly surpass humans, except on existential logic. In reality, existential logic is difficult for humans (with the lowest human performance) because it requires traversing all the information to extract what is relevant. However, it is not difficult for LMs, as existential logic provides weak constraints that are easy to satisfy; as a result, most LMs perform better than humans on this logic. On the contrary, LMs' performances on universal logic and negation are significantly worse than humans'. As for universal logic, its complexity may come from its ambiguity in language. For example, comparing ∀x F(x) → G(a) and ∀x (F(x) → G(x)): although both use universal logic for reasoning, the former requires stronger conditions yet provides a simpler conclusion than the latter. This phenomenon makes universal logic difficult to interpret consistently. In terms of negation, many studies (Hossain et al., 2020a,b,c) have shown that negation is critical yet difficult for neural networks to understand, which calls for auxiliary methods to identify and process negation in natural language.

In addition, we find that all LMs perform better on single-FOL datasets than on Test-A, which is evidence that LMs suffer from the coupling of different FOLs. Therefore, this analysis motivates us to improve LMs by 1) focusing on specific logic types (negation and universal logic), and 2) disentangling the different logical forms.
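The ambiguity of universal logic discussed above can be made concrete by evaluating the two readings over a small finite domain; the predicates F, G and constant a below are toy interpretations, not drawn from the dataset:

```python
DOMAIN = ["a", "b"]

def reading_1(F, G):
    """∀x F(x) → G(a): one implication whose antecedent is a universal."""
    return (not all(F[x] for x in DOMAIN)) or G["a"]

def reading_2(F, G):
    """∀x (F(x) → G(x)): the implication sits inside the quantifier."""
    return all((not F[x]) or G[x] for x in DOMAIN)

# The two readings disagree on some interpretations, e.g. F true only of b:
F  = {"a": False, "b": True}
G  = {"a": False, "b": True}
G2 = {"a": True,  "b": False}

assert reading_1(F, G)  is True   # antecedent ∀x F(x) fails, so trivially true
assert reading_2(F, G)  is True   # F(a) false; F(b) → G(b) holds
assert reading_1(F, G2) is True   # antecedent still fails
assert reading_2(F, G2) is False  # F(b) true but G2(b) false
```

The first reading demands a stronger premise (everything is F) and yields only a single fact, while the second licenses an inference per individual, which illustrates why the two surface forms are easy to confuse in natural language.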

Analysis of "Paradox"
In this section, we further analyze why we introduce the virtual label "Paradox" into LogicNLI by comparing d-LogicNLI and LogicNLI. As shown in Figure 5, d-LogicNLI is a special case of LogicNLI under the condition that "Entailment" and "Contradiction" are mutually exclusive. Therefore, although "Paradox" is a virtual label in most scenarios, it is critical for completing the space of logical relations.
In practice, we can summarize two effects of "Paradox": 1) it provides more accurate FOL information for model training, thereby effectively suppressing the impact of spurious correlations caused by dataset bias; 2) it makes the diagnostic scenarios more complete and complex, so it can better distinguish the FOL reasoning abilities of different LMs. We illustrate these two points by comparing important indicators on d-LogicNLI and LogicNLI (shown in Table 7). Firstly, the in-domain results (accuracy and δ_R) of all three LMs (and human performance) on d-LogicNLI are overall better than those on LogicNLI, indicating either that d-LogicNLI provides much simpler evaluation sets than LogicNLI, or that it provides more precise and unbiased training instances. Secondly, we observe that LMs trained on d-LogicNLI are hardly traceable on Test-T (the maximum P-EM, achieved by XLNet, is only 4.1%), while LMs trained on LogicNLI have significantly better traceability. This phenomenon supports the claim that d-LogicNLI does not provide sufficient information for LMs to master FOL reasoning. Finally, the generalization indicators δ_{A→G} of BERT, RoBERTa, and XLNet trained on d-LogicNLI are 44.0%, 40.0%, and 36.8%, respectively, showing that the transfer ability of LMs trained on d-LogicNLI is not as good as that of those trained on LogicNLI. This is implicit evidence that LogicNLI provides more information for LMs to understand FOL rules.

Discussion
From Table 4, RoBERTa performs the best on LogicNLI, while XLNet outperforms the other two LMs on d-LogicNLI. According to the original works (Liu et al., 2019; Yang et al., 2019), XLNet modifies the architecture of BERT, while RoBERTa mainly introduces a larger training corpus. In most simple reasoning scenarios, such as RACE (Lai et al., 2017) and SQuAD (Rajpurkar et al., 2016), XLNet usually performs better than RoBERTa. However, in scenarios that require more complicated reasoning, such as LogiQA (Liu et al., 2020) and datasets in GLUE (Wang et al., 2019b) and SuperGLUE (Wang et al., 2019a), RoBERTa, trained on a larger corpus, usually outperforms XLNet. Based on the above analysis, LogicNLI provides more complex reasoning scenarios than d-LogicNLI, so RoBERTa's advantages are more pronounced on LogicNLI.

Conclusion
In this paper, we propose a diagnostic method for LMs' FOL reasoning ability. The method introduces a newly proposed benchmark, LogicNLI, which disentangles FOL reasoning from commonsense inference, and comprises four evaluations that measure FOL reasoning ability from different perspectives. Results on three LMs show that although some LMs (e.g., RoBERTa) possess a certain interpretable FOL reasoning ability, they still cannot perform sound FOL reasoning like humans. Our detailed analysis motivates enhancing specific reasoning abilities or exploring new methods to make neural models understand more refined logic.