LogicNMR: Probing the Non-monotonic Reasoning Ability of Pre-trained Language Models



Introduction
Non-monotonic reasoning, also called defeasible reasoning, is one of the important reasoning modes in logic and has been extensively studied in classical AI. The term non-monotonic reasoning was first introduced by Minsky (1975). Generally, non-monotonic reasoning refers to the fact that conclusions may be invalidated by new information (Lukaszewicz, 1990). Research on non-monotonic reasoning in traditional AI mainly focuses on formalizing non-monotonic reasoning via different logics, such as default logic (Reiter, 1980), circumscription (McCarthy, 1980), and autoepistemic logic (Moore, 1983). Non-monotonic reasoning is widespread in everyday life and plays a crucial role in both daily decision-making (Benferhat et al., 2000; Szalas, 2019) and legal reasoning (Lawsky, 2017).
Most of what we learn about the world is in terms of generics, properties that hold "in general" but admit exceptions. When we say "birds can fly", we mean "birds can usually fly"; there are exceptional cases such as ostriches or wounded birds. Such rules are called default rules. Figure 1 shows a typical example of non-monotonic reasoning. Suppose we wish to find John on Saturday evening and know that he usually visits his club on Saturday evenings. We thus infer that he will be at his club. However, if we are told John had a car accident yesterday, we would conclude that he is likely to be in hospital. Then, if we learn that John was not injured, we redraw the conclusion that he will be at his club. This example illustrates the dynamic nature of non-monotonic reasoning: a context is constantly updated with new information and queried.
Recently, whether pre-trained language models truly have logical reasoning abilities has received extensive attention. Although pre-trained language models have made significant progress on many natural language understanding tasks, such as knowledge-based question answering (Lv et al., 2020) and commonsense reasoning (Bhagavatula et al., 2020), some research has shown that the predictions of pre-trained language models are easily affected by spurious correlations (Kaushik and Lipton, 2018; Jiang and Bansal, 2019), so it is difficult to judge the logical reasoning abilities of the evaluated models.

Premise: Old man crafting something in his workshop.
Hypothesis: An old man is working.
Update: The man is wearing pajamas and is chuckling.
Type: strengthener / weakener [Answer: W]
It is still at a preliminary research stage to probe whether pre-trained language models have non-monotonic reasoning mechanisms. Rudinger et al. (2020) construct a non-monotonic inference dataset δ-NLI through crowdsourcing based on three existing datasets. For δ-NLI, the authors develop a classification task and a generation task, and demonstrate that the classification task is easily solved by pre-trained language models, whereas the generation task is much more challenging. Table 1 is an example of the classification task. However, δ-NLI entangles non-monotonic reasoning with commonsense reasoning. For instance, to solve the above example, we need the commonsense knowledge that "people usually wear pajamas when they are resting".
To disentangle deductive reasoning from commonsense reasoning and explore the pure deductive reasoning capabilities of pre-trained language models, Clark et al. (2020) introduce a synthetic dataset with explicit rules and facts, and show that fine-tuned language models perform well on this dataset. Table 2 is a simplified example from this dataset. In their work, rules have the semantics of logic programs with negation (Apt et al., 1988). Thus they make the closed-world assumption (CWA) (Reiter, 1977): unless an atomic sentence is known to be true, it can be assumed to be false.

Facts:
Arthur is a bird.

Rules:
If someone is a bird and not abnormal then they can fly.
If someone is a bird and wounded then they are abnormal.

Query: Arthur can fly. True/false? [Answer: T]

In this example, all the facts and rules needed for reasoning are explicitly given. Since we cannot infer that Arthur is abnormal, we assume he is not abnormal, and hence deduce that he can fly. But if we are also given that Arthur is wounded, we would withdraw the conclusion that he can fly. So CWA is only a special case of non-monotonic reasoning.
In this paper, inspired by the research methodology of Clark et al. (2020) but different from the research problem of Rudinger et al. (2020), we explore the pure non-monotonic reasoning abilities of pre-trained language models, disentangled from commonsense reasoning. We propose LogicNMR, a non-monotonic reasoning benchmark with three distinguishing features. First, each context is given by explicit facts and default rules such as "a bird can fly unless it is wounded". So we handle explicit non-monotonic reasoning rather than the implicit kind targeted by δ-NLI, and we deal with non-monotonic reasoning in a more general way than CWA. Second, each context is repeatedly updated with new facts and queried. This is in line with the phenomenon that humans constantly receive new information and redraw conclusions. Third, the labels of the dataset are automatically generated by resorting to a formal non-monotonic reasoning solver and hence guaranteed to be correct. The non-monotonic reasoning ability of pre-trained language models is explored in terms of accuracy, generalization, proof-based traceability, and robustness. The experimental results reveal that even though the fine-tuned language models achieve high accuracy on the in-distribution samples in LogicNMR, they perform unsatisfactorily, with a significant drop, in generalization and proof-based traceability.

Related Work
Many benchmarks involve logical reasoning but entangle it with commonsense reasoning. On the one hand, natural language inference (NLI) is to determine the inference relation between two texts: entailment, contradiction, or neutral. For example, Bowman et al. (2015) present SNLI, a significant NLI benchmark collecting 570k English sentence pairs. Richardson et al. (2020) explore the symbolic and monotonic reasoning abilities of pre-trained language models through semantic fragments. On the other hand, in machine reading comprehension, LogiQA (Liu et al., 2020) and ReClor (Yu et al., 2020) are two popular multiple-choice datasets involving complex logical reasoning, such as deductive and abductive reasoning. Li et al. (2022) and Xu et al. (2022) construct relationship graphs by extracting the basic units in the context, and then combine pre-trained language models and graph neural networks to solve LogiQA and ReClor.
Following Rudinger et al. (2020), several other works focus on non-monotonic reasoning. To further explain why updated information can change the credibility of the original conclusion, Brahman et al. (2021) use distant supervision to generate reasons for δ-NLI. Madaan et al. (2021a) generate influence graphs through transfer learning to effectively improve performance on defeasible reasoning tasks. Madaan et al. (2021b) propose a model that simulates thinking about question scenarios based on influence graphs to further enhance performance on defeasible reasoning tasks.
The work of Clark et al. (2020) initiated a line of research on exploring the logical reasoning capabilities of language models. Tian et al. (2021) present the LogicNLI benchmark for first-order logical reasoning and propose proof-based traceability to more effectively evaluate the logical reasoning abilities of language models. Saeed et al. (2021) propose RuleBERT, a dataset of rules with probabilities, in order to teach pre-trained language models to reason over soft Horn rules. Dalvi et al. (2021) introduce EntailmentBank, a dataset with explanations in the form of entailment trees.

Default Logic and ASP
In this work, we choose Reiter (1980)'s default logic, one of the major formalisms for non-monotonic reasoning, as the logic underlying LogicNMR. A default rule is of the form α : β_1, β_2, ..., β_m / γ, where α, the β_i, and γ are formulas in first-order logic; α is called the prerequisite, β_1, ..., β_m the justifications, and γ the conclusion. The interpretation of the default rule is that if α can be inferred, and each β_i is consistent with what is believed, then infer γ. A default theory is a pair T = ⟨W, D⟩, where W is a set of facts, which are first-order sentences, and D is a set of default rules. Roughly, an extension of T is a minimal deductively closed set E of sentences containing W that is closed under the defaults: for every default α : β_1, ..., β_m / γ in D, whenever E |= α and no ¬β_i belongs to E, also γ ∈ E. Here |= denotes the logic entailment relation for first-order logic. For example, consider the default theory T_0 consisting of W_0 = {fresh(A), afraid(A)} and D_0 = {fresh(x) : ¬afraid(x)/cute(x), fresh(x) : ¬worried(x)/¬worried(x)}. The first default is blocked because afraid(A) ∈ W_0, while the second applies, so T_0 has a unique extension E_0 = W_0 ∪ {¬worried(A)}.
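To make the semantics concrete, the following is a minimal Python sketch of extension computation for ground default theories under the paper's restrictions (W a set of literals, literal-valued rule components, no justification negated by another rule's conclusion). Literals are encoded as strings with a leading `-` for negation; all names and the representation are ours, not from the paper's code, and the second default of T_0 is our reconstruction matching the stated extension.

```python
def neg(lit):
    """Flip classical negation on a literal string: 'p' <-> '-p'."""
    return lit[1:] if lit.startswith("-") else "-" + lit

def extension(W, D):
    """Forward-chain the defaults to a fixed point.

    W: set of ground literals (strings); D: list of
    (prerequisites, justifications, conclusion) triples of ground
    literals.  Under the restriction that no justification is negated
    by another rule's conclusion, this fixed point is the unique
    extension (restricted to literals).
    """
    E = set(W)
    changed = True
    while changed:
        changed = False
        for prereqs, justifs, concl in D:
            # a default fires if all prerequisites are derived and
            # no justification is contradicted
            if concl not in E \
               and all(p in E for p in prereqs) \
               and all(neg(j) not in E for j in justifs):
                E.add(concl)
                changed = True
    return E

# The running example T0: the first default is blocked by afraid(A),
# the second applies and yields -worried(A).
W0 = {"fresh(A)", "afraid(A)"}
D0 = [(["fresh(A)"], ["-afraid(A)"], "cute(A)"),
      (["fresh(A)"], ["-worried(A)"], "-worried(A)")]
```

Running `extension(W0, D0)` reproduces E_0 from the example: the result is W_0 together with ¬worried(A), and cute(A) is not derived.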
In this paper, to reduce the complexity of the default theory, we make a set of restrictions. Variables and constants are terms, and P(t_1, ..., t_n) is an atom when P is an n-ary predicate symbol and t_1, ..., t_n are terms. A literal is an atom or the negation of an atom. W is restricted to be a set of literals. α is the conjunction of at most two literals, there are at most two justifications, each justification is a literal, and γ is a literal. In addition, we require that, for each default theory, the negation of any justification of any default rule does not appear as the conclusion of any other default rule. It is easy to show that under the above restrictions, each theory T has a unique extension, written E(T). Then for any sentence ϕ, we write T ⊢ ϕ if ϕ ∈ E(T). We define Ans(T, ϕ) as follows: Ans(T, ϕ) = T if T ⊢ ϕ; Ans(T, ϕ) = F if T ⊢ ¬ϕ; and Ans(T, ϕ) = M (unknown) otherwise. Chen et al. (2010) show that in the propositional case, each default theory is equivalent to an answer set program. Answer Set Programming (ASP) (Brewka et al., 2011) is an approach to declarative programming with efficient solvers, such as Clyngor. Under our restrictions, a default theory T can be converted to an equivalent answer set program δ(T): each fact in W becomes an ASP fact, and each default α : β_1, β_2 / γ becomes the rule γ ← α, not ¬β_1, not ¬β_2, where "not" is default negation. Then the unique extension of T can be computed by computing the answer set of δ(T) with an ASP solver such as Clyngor.
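As an illustration of the δ(T) translation, a restricted default theory can be rendered as clingo-style ASP text. This is a hypothetical helper in our own encoding (literal strings with a leading `-` for classical negation), not the paper's released code; note that in real clingo syntax constants must be lowercase (uppercase names are variables), so the example uses `a` rather than `A`.

```python
def neg(lit):
    """Flip classical negation on a literal string: 'p' <-> '-p'."""
    return lit[1:] if lit.startswith("-") else "-" + lit

def to_asp(W, D):
    """Translate a restricted default theory into clingo-style ASP text.

    Facts become ASP facts.  A default (prereqs, [b1, b2], g) becomes
    g :- a1, a2, not -b1, not -b2, where 'not' is default negation and
    '-' is classical negation.
    """
    lines = [f"{lit}." for lit in sorted(W)]
    for prereqs, justifs, concl in D:
        body = list(prereqs) + [f"not {neg(j)}" for j in justifs]
        lines.append(f"{concl} :- {', '.join(body)}.")
    return "\n".join(lines)

# The running example with a lowercased constant:
W = {"fresh(a)", "afraid(a)"}
D = [(["fresh(a)"], ["-afraid(a)"], "cute(a)"),
     (["fresh(a)"], ["-worried(a)"], "-worried(a)")]
print(to_asp(W, D))
```

The printed program has one answer set, {fresh(a), afraid(a), -worried(a)}, matching the unique extension of T_0.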

LogicNMR Benchmark
To probe pure non-monotonic reasoning abilities in language models, we generate a non-monotonic reasoning benchmark, LogicNMR, with explicit facts and default rules, and with iterative updates and queries. Our dataset is available at https://github.com/sysulic/LogicNMR.
Overview of Dataset Generation. Figure 2 gives an overview of the generation process for a sample in LogicNMR. First, we generate an initial knowledge base (KB) containing default rules and facts. Then, we generate the iterative updates and queries. Next, we generate the label and associated proof for each update and query. Finally, we convert the initial KB, updates, queries, and proofs into synthetic English using simple templates.
The predicate pool for LogicNMR includes unary and binary predicates, where the unary ones are 529 adjective words from Tian et al. (2021), and the binary ones are 46 adjective words that describe relationships between subjects. Each sample is restricted to a single subject, given by a name.
Figure 3 shows a LogicNMR sample represented with formulas. Here, T refers to the initial KB, in which there is a single fact and multiple default rules. U_i is the new fact for the i-th update, and Q_i is the query for the i-th update. For each query, when the label is T or F, a proof is a sequence of intermediate necessary conclusions in the reasoning process for the query.
In the following, we give detailed descriptions for each step of Figure 2.
Initial KB Generation. To generate the first default rule, we randomly select predicates from the predicate pool as the predicates for the prerequisite, justifications, and conclusion. We then add negations to the prerequisite, justification, and conclusion atoms, each with probability 0.5. Next, for every new default rule, the prerequisite literals are randomly selected from the existing conclusion literals, the justification literals are randomly generated and different from the prerequisite literals, and the conclusion literal is randomly generated so that it is different from the prerequisite literal and its negation does not appear in the justifications of existing default rules. Finally, we generate the initial facts by instantiating the prerequisite literals, or the negations of the justification literals, of default rules with the unique subject. For example, in Figure 3, ¬aggressive(Amery) is generated by negating the justification literal of the first default rule.
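The rule-generation procedure just described can be sketched as follows. This is a simplified illustration under our own encoding (predicate names as strings, a leading `-` for negation, one prerequisite and one justification per rule); the function names are ours, and the generator also checks that a new justification is not negated by an existing conclusion, which the unique-extension restriction requires.

```python
import random

def neg(lit):
    """Flip classical negation on a literal string: 'p' <-> '-p'."""
    return lit[1:] if lit.startswith("-") else "-" + lit

def make_literal(preds, rng):
    """A random predicate, negated with probability 0.5."""
    lit = rng.choice(preds)
    return lit if rng.random() < 0.5 else "-" + lit

def generate_rules(preds, n_rules, rng):
    """Generate chained default rules (prereq, justif, concl)."""
    rules, concls, justifs = [], [], []
    for i in range(n_rules):
        # prerequisite: fresh for the first rule, chained off an
        # existing conclusion afterwards
        pre = make_literal(preds, rng) if i == 0 else rng.choice(concls)
        # justification: differs from the prerequisite; its negation
        # must not be an existing conclusion (unique extension)
        while True:
            jus = make_literal(preds, rng)
            if jus != pre and neg(jus) not in concls:
                break
        # conclusion: differs from the prerequisite and must not
        # negate any justification used so far
        while True:
            con = make_literal(preds, rng)
            if con != pre and neg(con) not in justifs and neg(con) != jus:
                break
        rules.append((pre, jus, con))
        concls.append(con)
        justifs.append(jus)
    return rules
```

A seeded `random.Random` makes the sampling reproducible, and the structural invariants (chained prerequisites, no conclusion negating a justification) hold for any seed.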
Iterative Updates and Queries Generation. Each LogicNMR sample is updated five times. The updates are generated from the prerequisite and justification literals of default rules: if generated from a prerequisite literal, the update is the instantiation of that literal; if generated from a justification literal, the update is the instantiation of its negation. The queries are generated by instantiating the conclusion literals of default rules and are negated with probability 0.5.

Task Definition. As shown in Figure 3, a sample in the LogicNMR dataset is a triple (T, U, Q), where T = ⟨W, D⟩ is the initial KB, U = ⟨U_1, ..., U_5⟩ is the sequence of updates, and Q = ⟨Q_1, ..., Q_5⟩ is the sequence of queries. For i = 1, ..., 5, we let T_i = ⟨W ∪ {U_1, ..., U_i}, D⟩, i.e., the initial KB extended with the first i updates. The task is to decide the answer sequence A = ⟨A_1, ..., A_5⟩, where A_i = Ans(T_i, Q_i).

Labeling and Proofs.
To compute Ans(T_i, Q_i), we convert T_i into an answer set program δ(T_i), call Clyngor to compute the unique answer set, and then check whether Q_i or ¬Q_i is in the answer set. If Q_i or ¬Q_i is in the answer set, we produce a sequence of proofs for it, where each proof is a conclusion of a default rule applied in the reasoning chain. For example, for Q_1 in Figure 3, with U_1, using the third default rule we get ¬worried(Amery); then, using the fourth default rule, we get ¬lucky(Amery); next, using the fifth default rule, we get grumpy(Amery); finally, we obtain Q_1.
Conversion into English. We convert the initial KB, updates, queries, and proofs into English by using simple templates. For example, the default rule "dynamic(X) : petite(X)/cute(X)" is translated into "If someone is dynamic then he is cute, unless he is petite".

Dataset Statistics. Table 3 shows the statistics of LogicNMR. The training, validation, and test sets in LogicNMR contain 5k, 2k, and 2k samples, respectively. The number of default rules in the initial KB is at most 12, and the number of initial facts is at most 2. Avg. Length and Max. Length represent the average and maximum number of words in the initial KB, respectively. To explore the robustness of language models on LogicNMR, each sample contains six irrelevant facts and six irrelevant default rules. To reduce the bias caused by label imbalance, we require the labels in the dataset to be balanced. Also, the predicate pools for the training, validation, and test sets are different, to avoid answering queries according to correlations among predicates.
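The template conversion described above can be sketched as a small rendering helper. This is a hypothetical function mirroring the example sentence pattern; predicate names are given as bare strings with a leading `-` marking negation, and the function name is ours.

```python
def rule_to_english(prereq, justif, concl, subject="someone"):
    """Render a default rule p(X) : j(X) / c(X) as the template
    sentence 'If someone is p then he is c, unless he is j.'"""
    def phrase(lit):
        # '-shy' -> 'is not shy', 'shy' -> 'is shy'
        return ("is not " if lit.startswith("-") else "is ") + lit.lstrip("-")
    return (f"If {subject} {phrase(prereq)} then he {phrase(concl)}, "
            f"unless he {phrase(justif)}.")
```

Applied to the rule from the text, `rule_to_english("dynamic", "petite", "cute")` yields exactly the example sentence "If someone is dynamic then he is cute, unless he is petite."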

Experiments
In this section, following Tian et al. (2021), we explore the non-monotonic reasoning ability of pre-trained language models on LogicNMR in terms of accuracy, generalization, proof-based traceability, and robustness.

Experimental Settings
In this paper, we investigate three mainstream pre-trained language models: BERT-large (Devlin et al., 2019), RoBERTa-large (Liu et al., 2019), and GPT2 (Radford et al., 2019). The hyperparameters of these language models are shown in Table 4.

Accuracy
Table 5 shows the accuracy results of the RoBERTa, BERT, and GPT2 models on the LogicNMR dataset. It is not difficult to see that the language models achieve high accuracy on answering queries in LogicNMR after fine-tuning. Generally, RoBERTa has the top performance. As the number of updates increases, all language models exhibit only a slight drop. The high accuracy on answering non-monotonic reasoning queries suggests that the language models have already mastered non-monotonic reasoning after fine-tuning. However, it is also possible that they perform well only because of their strong ability to fit in-distribution samples. To further probe whether the language models master the ability of non-monotonic reasoning, we still need to evaluate how they perform in terms of generalization and proof-based traceability on LogicNMR.

Generalization
In this paper, generalization is used to measure whether a model truly understands non-monotonic reasoning, i.e., it assesses how the model performs on out-of-distribution samples. In CLUTRR (Sinha et al., 2019), generalization is measured by training a model on samples with inference depths less than or equal to K and then testing it on samples with inference depths greater than K. However, different from monotonic reasoning in first-order logic, non-monotonic reasoning pays attention to the dynamics of the knowledge base. In other words, to evaluate whether a model masters non-monotonic reasoning, we need to test its performance under varying numbers of updates. More formally, if a model has learned non-monotonic reasoning on knowledge bases with K updates, it should also be effective on knowledge bases with different numbers of updates. In this way, not only do we make sure that the samples in the training set are balanced in size, but we can also independently see how well models generalize across different numbers of updates. Therefore, we define the generalization metric of a model over the update number, denoted Avg*, as its average accuracy on the samples with update number U ≠ K when trained on the samples with update number U = K.

Table 6 shows the generalization results of the language models for the number of updates on LogicNMR. First, the generalization performance of RoBERTa is the best among the three models. Specifically, over the whole of LogicNMR, the average generalization metrics Avg* of RoBERTa, BERT, and GPT2 are 75.3%, 61.5%, and 62.9%, respectively. However, for each model at each update number, its Avg* is lower than its average accuracy shown in Table 5. Second, the larger the difference between the number of updates in the test set and in the training set, the worse the generalization performance of the language models. For example, RoBERTa trained on the samples with update number U = 3 has average accuracies of 93.7% and 75.4% on the samples with U = 4 and U = 5, respectively. This reflects that the difference in sample distributions is a significant challenge for language models, causing unsatisfying generalization performance.
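The Avg* metric defined above amounts to a simple average over the off-distribution test splits; a minimal illustrative helper (names are ours):

```python
def avg_star(acc_by_update, K):
    """Avg*: average accuracy over test splits whose update number U
    differs from the training update number K.

    acc_by_update maps U -> accuracy of the model (trained with U = K)
    on test samples having U updates.
    """
    off_dist = [acc for u, acc in acc_by_update.items() if u != K]
    return sum(off_dist) / len(off_dist)
```

For example, a model trained with K = 3 that scores 0.9 on U = 1, 0.8 on U = 2, and 0.95 on U = 3 has Avg* = (0.9 + 0.8) / 2 = 0.85; the in-distribution split U = K is excluded.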

Proof-based Traceability
We use the notion of proof-based traceability to evaluate whether a model can infer the correct answer along the right reasoning path, which yields two metrics: proof-based accuracy (P-AC) and proof-based exact match (P-EM) (Yang et al., 2018; Tian et al., 2021). As some samples' queries (or their negations) have no proof, i.e., those labelled "M", we remove them from the test samples. Also, since samples using only one default rule in the inference process have no intermediate proof, such samples are ignored as well. P-AC is the ratio of test samples for which the model correctly answers the query with proofs, and P-EM is the ratio of test samples for which the model correctly predicts all proofs.
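As an illustration, the two metrics can be computed over pre-filtered samples as below. The precise matching criterion is our reading of the definitions (P-AC: answer correct with at least one proof step recovered; P-EM: answer correct with every proof step recovered); the names and representation are ours.

```python
def proof_metrics(samples):
    """Compute (P-AC, P-EM) over already-filtered samples (label T/F,
    at least one intermediate proof step).

    Each sample is (answer_ok, step_ok): answer_ok is True when the
    query label is predicted correctly, and step_ok is a list of
    booleans, one per gold proof step, marking whether the model
    recovered that step.
    """
    p_ac = sum(ok and any(steps) for ok, steps in samples) / len(samples)
    p_em = sum(ok and all(steps) for ok, steps in samples) / len(samples)
    return p_ac, p_em
```

On four samples where the second misses one proof step, the third answers the query wrongly, and the fourth recovers no proof step, P-AC is 0.5 and P-EM is 0.25.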
Table 7 shows the results of the language models on the in-distribution and out-of-distribution samples, respectively. For the in-distribution samples, the average P-AC of RoBERTa, BERT, and GPT2 is 98.3%, 96.1%, and 96.3%, and the average P-EM is 97.3%, 93.5%, and 92.3%, respectively. Unsurprisingly, the three language models all achieve excellent proof-based traceability on in-distribution samples, as they also perform well in terms of accuracy. On the other hand, for out-of-distribution samples, the average P-AC* of the RoBERTa, BERT, and GPT2 models is 77.7%, 66.8%, and 67.6%, and the average P-EM* is 70.9%, 55.2%, and 57.8%, respectively. This shows that the language models perform worse on out-of-distribution samples. The performance gap between in-distribution and out-of-distribution samples indicates that the three language models cannot generalize their reasoning ability to out-of-distribution samples, further suggesting that it is questionable whether the language models have truly mastered non-monotonic reasoning.

Robustness
We also evaluate the robustness of the language models to irrelevant sentences. Each time, one irrelevant fact and one irrelevant default rule are added to the knowledge base. Figure 4 shows the robustness analysis with respect to irrelevant sentences on the samples with U = 5. As the number of irrelevant facts and default rules in the knowledge base increases, the performance of the language models decreases rapidly. A likely reason for the poor robustness is that the pattern of the generated samples is simple, and the language models make predictions merely by association matching. This further suggests that the language models by no means fully master non-monotonic reasoning after fine-tuning on a large number of non-monotonic reasoning samples.

Case Study
Figure 5 shows an analysis of RoBERTa on a sample. RoBERTa is trained on the dataset with U = 2. The bold black sentences represent the updated facts currently added to the knowledge base. The solid underlined sentences in the proofs are those predicted correctly by the model, and the dotted underlined ones are those predicted incorrectly.
In this example, after adding a new fact to the knowledge base for the third time, RoBERTa still predicts all proofs correctly. However, after the fourth update, although the model answers the query correctly, the proof P2 is predicted incorrectly, indicating that the model does not exactly recover the proofs of the query. After the fifth update, both the query and its proofs are predicted incorrectly. This case shows that as the number of updates to the knowledge base increases, the performance of the language model deteriorates. Even when the query is answered correctly by the language model, the answer is not necessarily obtained via a correct reasoning procedure.

Conclusions
In this paper, we construct a synthetic non-monotonic reasoning benchmark, LogicNMR, with explicit facts and rules, to capture iterative updates to the knowledge base. We probe whether pre-trained language models have truly mastered non-monotonic reasoning. The experimental results show that even though the fine-tuned language models all achieve high accuracy, they perform worse on generalization, proof-based traceability, and robustness to irrelevant information. Consequently, we cannot give a positive answer to the research question of whether language models master non-monotonic reasoning. This suggests exploring better approaches to leveraging language models for non-monotonic reasoning tasks.

Limitations
Although we construct a dataset to probe the non-monotonic reasoning ability of language models and conduct a set of experiments, we have to admit that there are still some limitations. First, only three language models are used in this paper; more language models with different architectures should be evaluated. Second, the synthetic rules of LogicNMR are rather restricted. We will relax some of the restrictions on rule generation, such as the way queries are extracted. Third, we limit each default theory to a single extension to reduce reasoning complexity, resulting in simpler non-monotonic inference patterns. Future work is to probe non-monotonic reasoning ability in a more general and systematic way, such as by allowing multiple extensions.

Figure 1: An example of non-monotonic reasoning in everyday life.

Figure 3: An example in the LogicNMR dataset, represented in default logic.

Figure 5: An example about RoBERTa from the LogicNMR benchmark. In the "Query" column, ✓ indicates that RoBERTa answers the query correctly.

Table 1: An example from the δ-NLI dataset.

Table 2: A simplified example from (Clark et al., 2020).

Table 5: The accuracy results of the three language models on LogicNMR. U represents the number of updates to the KB.

Table 6: The generalization results of the three language models on LogicNMR. Avg* is the average accuracy on U ≠ K.