MT R: A Dataset Fusing Inductive, Deductive, and Defeasible Reasoning

A long-standing difficulty in AI is the introduction of human-like reasoning in machine reading comprehension. Since algorithmic models can already perform as well as humans on simple question answering tasks thanks to the development of deep learning techniques, more difficult reasoning datasets have been presented. However, these datasets mainly focus on a single type of reasoning. A significant gap remains between these studies and the complex reasoning used in daily life, where we unconsciously mix and match different types of reasoning. In this work, we introduce a brand-new dataset, named MT R. It has two subsets: (1) the first combines deductive and inductive reasoning and is mainly used to explore mixed reasoning abilities; (2) the second integrates inductive and defeasible reasoning for detecting non-monotonic reasoning ability. It consists of more than 30k instances, requiring models to infer relations between characters in short stories. Compared with the corresponding single reasoning datasets, MT R is more challenging, highlighting the gap in language models' ability to handle sophisticated inference.


Introduction
Natural language understanding (NLU) has long pursued the goal of working like a human that can perceive information and conduct logical reasoning over knowledge (Du et al., 2022). Deep neural networks (DNNs) have achieved great success recently (Devlin et al., 2019) and have excelled at information perception tasks such as text classification and sentiment analysis (Lee-Thorp et al., 2022; Yang et al., 2019; Schick and Schütze, 2021). However, logical reasoning, which needs to confront a novel environment or complex task, exposes the weakness of DNNs' tendency to make decisions by non-generalizable shortcuts (Du et al., 2022). Although an array of existing datasets are available for exploring different reasoning capabilities of neural networks, such as CLUTRR (Sinha et al., 2019), RuleTaker (Clark et al., 2020), and WIQA (Tandon et al., 2019), most of them primarily highlight monotonic logic (Choi, 2022) and a single form of reasoning. For instance, RuleTaker (Clark et al., 2020) and LogicNLI (Tian et al., 2021) only include deductive reasoning, while CLUTRR (Sinha et al., 2019) is exclusively related to inductive reasoning. As a non-monotonic reasoning dataset, δ-NLI (Rudinger et al., 2020) only contains instances with a simplified form of reasoning. These settings bring two problems: (1) the construction of these datasets does not conform to our daily reasoning habits; (2) they also impair the effectiveness of model evaluation. To solve these problems, we explore combining different reasoning forms.

We first refer to the theory of the human reasoning process, which shows that induction and deduction are the two major monotonic forms of logical reasoning. Based on this theory, we integrate induction and deduction to generate cases. Besides, most of our day-to-day reasoning is accompanied by non-monotonic reasoning (Choi, 2022). Psychological research also shows that human reasoning lacks clarity and does not distinguish between various forms of reasoning in a straightforward manner (Johnson-Laird, 2010b). Different forms of reasoning are interrelated and support each other (Liu et al., 2020; Li et al., 2022).
We also explore introducing non-monotonic logic into the dataset. In particular, inspired by CLUTRR (Sinha et al., 2019), which diagnoses inductive reasoning over relationships, we introduce a new dataset for multiple reasoning combinations, MT R (Multi-Type Reasoning Dataset). The dataset will soon be available. MT R is a semi-automatic extension of CLUTRR and includes two parts, D-MT R and F-MT R. Part I (D-MT R) includes various deductive rules and combines deductive reasoning with inductive reasoning. As a result, D-MT R involves negation logic in relationship understanding, which prevents the reasoning process from using shortcuts. In practice, we introduce an additional relation, "unknown", to represent the situation where the relationship cannot be inferred from the given text. Examples are provided in Figure 1(a). Part II (F-MT R) makes the process of inductive reasoning defeasible. We first construct new inductive reasoning stories. These stories have the characteristic that, given only the inductive reasoning story, two compatible relationships between family members can be inferred. In the example in Figure 1(b), both "father" and "father-in-law" are reasonable without supplementary facts. However, with an additional new fact, the inference tends towards one of the answers. For example, if we subsequently learn that "someone cooks for him since he was a child", the choice of "father-in-law" is largely abandoned.

Table 1 (columns: Dataset, Deductive, Inductive, Defeasible)
We also experiment on MT R with several state-of-the-art neural models developed for NLU. Results show that models' performance on MT R is significantly lower than on CLUTRR. This is evidence that state-of-the-art neural models still lack logical reasoning capabilities in logic-entangling scenarios. Further analysis on D-MT R shows that similar inference rules can significantly interfere with models' hybrid inference, and that models trained on data in non-inferential order have better anti-interference ability. During non-monotonic reasoning tests on F-MT R, neural models cannot benefit from the supplementary facts before answering a defeasible inference query.
Background and Related Work

Reasoning Datasets
Many datasets have been proposed to test the reasoning ability of NLU systems. RuleTaker (Clark et al., 2020) is a well-known deductive reasoning dataset. Many neural methods have been developed for this dataset and have achieved results on it, but only when dealing with deductive reasoning alone. ProofWriter (Tafjord et al., 2021) and LogicNLI (Tian et al., 2021) also focus on deductive reasoning but with richer logical forms. LogiQA (Liu et al., 2020) also includes multiple types of deductive reasoning. In contrast, many datasets such as HotpotQA (Yang et al., 2018), QuaRTz (Tafjord et al., 2019), and CLUTRR (Sinha et al., 2019) deal with inductive reasoning over textual inputs. δ-NLI (Rudinger et al., 2020) is a dataset for defeasible inference in natural language. However, few datasets combine various reasoning types, which makes it difficult to evaluate the logical reasoning ability of models comprehensively.

Reasoning Definition
Traditional logic has two branches: deduction and induction (Gillies, 1994). These classical logic forms can ensure the certainty of reasoning both syntactically and semantically. But in real-world situations, clashes of knowledge frequently appear (Allaway et al., 2022), introducing uncertainty into the daily reasoning process. Nonmonotonic reasoning is therefore regarded as a crucial reasoning technique for artificial intelligence (Strasser and Antonelli, 2019; Ginsberg, 1987). Defeasible reasoning is a type of nonmonotonic logic, in which conclusions that hold under the current premises may no longer hold once new premises are added.
Deductive Reasoning is described as applying broad concepts to specific situations (Johnson-Laird, 2010a; Sanyal et al., 2022). Deductive reasoning relies on stating logical premises and basing a conclusion on those premises. A deduction task starts with a rule, which is then applied to a concrete case. For instance, we can conclude that "Socrates is mortal" from the premises "All men are mortal" and "Socrates is a man" (Johnson-Laird, 1999; Heit and Rotello, 2010).
Inductive Reasoning is described as drawing conclusions by going from the specific to the general. It includes the process of making predictions about novel cases based on past experience or observations (Hayes et al., 2010; Lavrac and Dzeroski, 1994). An induction task begins with facts about individual cases and then generalizes to a general rule, such as inferring from the facts "Swallows can fly" and "Orioles can fly" that "All birds can fly" (Heit, 2000; Hayes and Heit, 2018). So, given "Tweety is a bird", we can conclude that "Tweety can fly".
Defeasible Reasoning is the mode of reasoning where conclusions are modified with additional information (Pollock, 1987). It has been studied by both philosophers and computer scientists (Koons, 2005). The conclusion is not logically guaranteed and could be refuted by new information, such as the clarification that "Tweety is a penguin" in the case above (Lascarides and Asher, 1991).
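The three definitions can be made concrete with the running bird example. The following is a minimal Python sketch and is not part of the dataset or its tooling; the function names (`induce_rule`, `deduce`, `defeasible_deduce`) and the boolean representation are illustrative assumptions.

```python
def induce_rule(observations):
    """Induction: generalize from specific cases to a rule.

    If every observed bird species can fly, propose "all birds can fly".
    """
    return all(can_fly for _species, can_fly in observations)


def deduce(all_birds_fly, is_bird):
    """Deduction: apply the general rule to a specific individual."""
    return all_birds_fly and is_bird


def defeasible_deduce(all_birds_fly, is_bird, known_exceptions, kind):
    """Defeasible reasoning: retract the conclusion when new information defeats it."""
    if kind in known_exceptions:
        return False
    return deduce(all_birds_fly, is_bird)


# "Swallows can fly" and "Orioles can fly" lead to the rule "All birds can fly".
observations = [("swallow", True), ("oriole", True)]
rule = induce_rule(observations)

# "Tweety is a bird", so "Tweety can fly".
tweety_flies = deduce(rule, is_bird=True)

# New information "Tweety is a penguin" defeats the earlier conclusion.
tweety_flies_updated = defeasible_deduce(rule, True, {"penguin"}, kind="penguin")
```

The point of the sketch is only that the last call reverses a conclusion that the first two steps supported, which is what makes the reasoning nonmonotonic.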
Table 2: Statistics of D-MT R and F-MT R.

Overview
To evaluate models' ability under complex mixed-reasoning scenarios, we create a new dataset, MT R, which involves multi-type reasoning for kinship inference. MT R is an extension of the existing natural language inductive dataset CLUTRR (Sinha et al., 2019) and includes the mixed reasoning types of induction, deduction, and defeasibility. As a result, MT R expands both the relation types and the complexity. Specifically, we add two kinds of logic (deduction and defeasibility) to construct two sub-datasets, D-MT R and F-MT R. The detailed statistics are summarized in Table 2.

D-MT R
D-MT R is the subset in which inference is accomplished by combining induction and deduction. Take Type ① in Figure 2 as an example.

We adopt a semi-automatic method to generate D-MT R in three steps: 1) logic generation, 2) logic correction, and 3) natural language generation. For logic generation, we adopt an automatic method to generate each logic expression to ensure the validity of deductive reasoning. Specifically, we define a set of logical templates T in advance. It covers diverse first-order logical forms (including conjunction ∧, disjunction ∨, negation ¬, and implication →). Then we construct the knowledge base (KB) that contains all single rules, such as [grandfather, X, Y] ⊢ [[father, X, Z], [father, Z, Y]]. In this type of knowledge graph, we present the vector computation method and the accompanying [...]

We incorporate different methods to formulate the deductive rules, so as to stop the model from reasoning through erroneous statistical correlations. On the one hand, we introduce two kinds of negation in deductive reasoning. By introducing negation, the model is prevented from drawing relational conclusions from erroneous correlations. (1) The first kind is logical negation, which is introduced with negation words, such as the fact "Colin's grandfather is not Dwight.". (2) The second kind is relation contradiction, which we formulate as contradictory statements without negation cues. Relation contradiction events are not identifiable as negations on their own, but demonstrate reversed semantic or pragmatic meaning when paired with their affirmative counterparts (e.g., the fact "Colin's grandfather is Dwight." vs. "Colin's father is Dwight."). These negated or contradictory statements shift the relation implications of the original premise in nontrivial ways.
On the other hand, we do not make up our deductive rules at random. The five types of deductive rules are all manually designed. For instance, in Type ⑤, we make the reasoning more challenging by allowing both positive and negative derivations to exist simultaneously. The model can only carry out further reasoning after determining whether or not the conditions are true. To avoid spurious reasoning brought by a single rule, we add homologous interference rules to each form of reasoning. In practice, we also introduce an additional relation, "unknown", to represent the situation where the relationship between the two entities cannot be inferred from the given facts or rules. This relation keeps open the option of expanding to more unknown relationships while also increasing uncertainty during model inference.
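How a single KB rule, a deductive rule with an interference rule, and the "unknown" label might interact can be sketched as follows. This is a toy illustration under stated assumptions, not the authors' generation pipeline: the names `KB`, `compose`, `apply_rule`, and `answer` are hypothetical, the grandmother entry is invented for illustration, the premise fact is assumed to have been derived inductively from a story, and negated premises are checked with a closed-world assumption chosen only for simplicity.

```python
# Toy knowledge base: composing two relations along a path yields one relation.
# The grandfather entry mirrors the rule quoted in the text; the grandmother
# entry is an illustrative assumption.
KB = {
    ("father", "father"): "grandfather",
    ("father", "mother"): "grandmother",
}


def compose(path):
    """Fold a non-empty relation path left to right; "unknown" if a step has no rule."""
    relation = path[0]
    for next_relation in path[1:]:
        relation = KB.get((relation, next_relation), "unknown")
        if relation == "unknown":
            break
    return relation


def apply_rule(facts, rule):
    """Apply a rule of the form (not) R1(X1, X2) -> R2(X2, X3).

    A triple (relation, x, y) reads "x's relation is y". A negated premise
    holds when the triple is absent from the known facts (closed-world
    assumption, used here only to keep the sketch short).
    """
    holds = rule["premise"] in facts
    if rule["negated"]:
        holds = not holds
    return rule["conclusion"] if holds else None


def answer(facts, rules, query):
    """Return the relation between the queried pair, or "unknown"."""
    derived = set(facts)
    for rule in rules:
        conclusion = apply_rule(derived, rule)
        if conclusion is not None:
            derived.add(conclusion)
    for relation, head, tail in sorted(derived):
        if (head, tail) == query:
            return relation
    return "unknown"


# One inductively derived fact, one rule whose premise holds, and one
# interference rule of the same form whose premise does not hold.
facts = {("mother", "Sharon", "Debra")}
rules = [
    {"negated": False,
     "premise": ("mother", "Sharon", "Debra"),
     "conclusion": ("sister", "Debra", "Lois")},
    {"negated": False,
     "premise": ("grandmother", "Sharon", "Debra"),
     "conclusion": ("mother", "Debra", "Lois")},
]

path_relation = compose(["father", "father"])
debra_lois = answer(facts, rules, ("Debra", "Lois"))
lois_sharon = answer(facts, rules, ("Lois", "Sharon"))
```

In this sketch the interference rule never fires because its premise is not among the derived facts, and a query about a pair that no fact or rule covers falls through to "unknown".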

F-MT R
F-MT R is designed to evaluate nonmonotonic reasoning ability, specifically by combining defeasible and inductive inference in natural language. As shown in Figure 1(b), we find that the results of the same inductive reasoning text are not unique. Both "father" and "father-in-law" are reasonable. The likelihood of a particular choice changes when a supplementary fact is supplied, which either strengthens or weakens it. In short, neural models can benefit from the supplementary fact before answering a defeasible inference query.

Figure 3: Examples of the defeasible pair and supplementary facts collected for F-MT R. (b) Some supplementary facts are used to update the defeasible reasoning, for example: "[...] help take care of him for a week."; "[...] tells him not to be late for school every morning."; "[...] plan to go on vacation with his parents next week."
To generate F-MT R, we first construct defeasible pairs D. The two relations in a defeasible pair must satisfy the following condition: both can be derived from the same relationship path. Figure 3(a) shows an example of a defeasible pair: both [son, A, C] and [nephew, A, C] can be deduced through the same relational path [[father, A, B], [grandson, B, C]]. Then, for each defeasible pair, we build supplementary facts U that are used to update the reasoning. We employ three post-graduate students to collect the supplementary facts. At least four supplementary facts are provided for each option, from which one is chosen at random. Specifically, given only the premise text, the model's conclusion is not unique. Given a supplementary fact u ∈ U, the model may determine whether a specific relation is less likely or more likely to be true. Given the inference instance in Figure 3(a), the model is unable to distinguish between the two relations of the defeasible pair. When we are given additional information, such as "tells a story every night before bed." (see Figure 3(b)), the model would infer that "son" is most likely true.
We conduct hop extension on inductive inferences to produce the final dataset with defeasible reasoning. The new relational paths are selected from the defeasible pairs D, which guarantees that the outcome of the new reasoning is ambiguous. Then, for each defeasible pair, we select one supplementary fact u ∈ U at random to make one of the inference directions stronger or weaker.
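The construction just described can be sketched in a few lines. This is a hedged illustration rather than the authors' code: `PATH_TO_RELATIONS`, `SUPPLEMENTARY_FACTS`, `defeasible_pairs`, and `build_instance` are hypothetical names, the ambiguous path is the son/nephew example quoted for Figure 3(a), and the assignment of each fact to the relation it strengthens is an assumption made for the example.

```python
import random

# Relation path -> relations that can all be derived from it.
# The first entry is the son/nephew example; the second is an unambiguous path.
PATH_TO_RELATIONS = {
    ("father", "grandson"): ["son", "nephew"],
    ("father", "father"): ["grandfather"],
}

# Supplementary facts U, keyed by the relation they are assumed to strengthen.
SUPPLEMENTARY_FACTS = {
    "son": [
        "tells a story every night before bed.",
        "tells him not to be late for school every morning.",
    ],
    "nephew": [
        "help take care of him for a week.",
        "plan to go on vacation with his parents next week.",
    ],
}


def defeasible_pairs(path_to_relations):
    """Keep only paths whose outcome is ambiguous (two or more compatible relations)."""
    return {
        path: relations
        for path, relations in path_to_relations.items()
        if len(relations) >= 2
    }


def build_instance(path, rng):
    """Pick one relation of the pair and one fact that strengthens it."""
    candidates = defeasible_pairs(PATH_TO_RELATIONS)[path]
    target = rng.choice(candidates)
    fact = rng.choice(SUPPLEMENTARY_FACTS[target])
    return {"path": path, "candidates": candidates, "fact": fact, "label": target}


pairs = defeasible_pairs(PATH_TO_RELATIONS)
instance = build_instance(("father", "grandson"), random.Random(0))
```

Passing an explicit `random.Random` instance keeps the sampling reproducible, which matters when the same defeasible pair is reused across several stories.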

Baselines
We conduct experiments on several natural language understanding systems to systematically measure their reasoning ability. Bidirectional LSTMs (Hochreiter and Schmidhuber, 1997; Graves, 2012; Cho et al., 2014) (with and without attention) are commonly used to reason over unstructured text. Relation Networks (RN) (Santoro et al., 2017) and the Compositional Memory Attention Network (MAC) (Hudson and Manning, 2018) are more recently proposed methods that outperform other systems on relational reasoning. Pre-trained models also give the current state-of-the-art results on machine reading. In particular, we measure the reasoning ability of BERT (Devlin et al., 2018), as well as BERT-LSTM, a trainable one-layer LSTM encoder on top of the pre-trained BERT embeddings. In our task, both BERT and BERT-LSTM use a frozen 12-layer BERT that encodes the sentences into 768-dimensional vectors.

Experimental Setup
The final D-MT R dataset contains 30K questions, split into 25K for training and 5K for testing. During the experiments, we arrange the sentences of each input text in two ways: in logical reasoning order and in an order that does not follow the reasoning (abbreviated as "Sequential" and "Random"). We also compare how these models perform on the single inductive reasoning dataset CLUTRR, tested in the same way. The final F-MT R dataset contains 2K questions. We assess accuracy on F-MT R with and without supplementary facts. When no additional information is provided, we consider a model to be correct if it predicts one of the answers. We adopt a setting similar to Sinha et al. (2019) during training. Specifically, all models are trained for 40 epochs with the Adam optimizer, a learning rate of 1e-3, and a batch size of 8. All experiments were run 5 times with random data classification.
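The two input orderings and the lenient scoring on F-MT R can be sketched as below. This is an illustration under assumptions, not the evaluation code used for the reported numbers: `order_story`, `accuracy`, and `TRAINING_CONFIG` are hypothetical names, and scoring the "with supplementary fact" case against a single favoured relation is an assumption about how such a metric could be set up.

```python
import random

# Hyperparameters as stated in the text, collected in one place.
TRAINING_CONFIG = {
    "epochs": 40,
    "optimizer": "Adam",
    "learning_rate": 1e-3,
    "batch_size": 8,
    "runs": 5,
}


def order_story(sentences, mode, seed=0):
    """Return sentences in reasoning order ("Sequential") or shuffled ("Random")."""
    if mode == "Sequential":
        return list(sentences)
    if mode == "Random":
        shuffled = list(sentences)
        random.Random(seed).shuffle(shuffled)
        return shuffled
    raise ValueError(f"unknown mode: {mode}")


def accuracy(predictions, gold_sets):
    """Fraction of predictions that fall in the set of acceptable answers.

    Without supplementary facts, each gold set holds both relations of the
    defeasible pair. With a supplementary fact, it is assumed here to hold
    only the relation that the fact favours.
    """
    if not predictions:
        return 0.0
    hits = sum(pred in gold for pred, gold in zip(predictions, gold_sets))
    return hits / len(predictions)


story = ["s1", "s2", "s3", "s4"]
sequential = order_story(story, "Sequential")
shuffled = order_story(story, "Random", seed=1)

# First prediction matches one relation of its pair; the second does not match.
score = accuracy(["son", "father"], [{"son", "nephew"}, {"father-in-law"}])
```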

Main Results
Total Accuracy. [...] The performance on Type ⑤ is consistently well above average. The results on Type ① are worse than we expected. Compared with other types, Type ① does not contain complex combinations of first-order logic in deductive reasoning, but it causes great difficulties for models. We undertake a more detailed analysis later.

Table 4 illustrates the performance of different models on F-MT R. We consider a model to be correct if it predicts one of the answers on F-MT R with no supplementary facts associated with the input texts. As a result, we can consider it a single inductive reasoning problem. Results show that models trained on D-MT R (both "Sequential" and "Random") do not have the ability to transfer to F-MT R(o). This suggests that simply adding deductive reasoning to the dataset does not increase the model's inductive reasoning abilities, but rather interferes with them.
When we compare the accuracy with and without the supplementary facts, we find that models can be loosely classified into two groups. The first group includes models without BERT, such as BiLSTM-Attention, BiLSTM-Mean, RN, and MAC, whose performance on F-MT R with supplementary facts is roughly half of that without. These models cannot deal with defeasible reasoning: supplementary facts intended to aid defeasible reasoning actually hinder rather than help them. However, supplementary facts do help the remaining models. Although there is still a performance gap compared to D-MT R, supplementary facts improve their defeasible inference. This phenomenon supports the notion that pre-trained language models contain a plethora of information (Petroni et al., 2019). This knowledge assists the models in distinguishing between the relations in a defeasible pair.

Further Analysis
In this section, we provide further analysis of our dataset. On the one hand, we present additional insight into why the virtual label "unknown" is introduced into D-MT R. On the other hand, more experiments are conducted to investigate the unexpected results on the deductive rule Type ①.

Analysis of "unknown". As shown in Figure 1(a), the "unknown" label indicates that the relationship between the two entities cannot be deduced from the known material. Situations that are unknown or cannot be reasoned about are common in everyday life. Therefore, it is critical to complete the relation space. We can summarize two effects of "unknown": 1) "unknown" provides more accurate relation information for model training, thereby effectively suppressing the impact of spurious correlations caused by dataset bias; 2) "unknown" makes the diagnostic scenarios more complete and complex, so it can better distinguish the relation reasoning abilities of different models.
As shown in Figure 4, we examine the models' accuracy on the new label "unknown" and compare it to the overall dataset average. On "Sequential" data (Figure 4(a)), most models are unable to reason about this new label and perform much worse than the average. Judging the conclusion to be "unknown" requires the models to exclude all inferable relations. RN and BERT do poorly on the full dataset, but predict "unknown" for most of the answers (BERT: from 0.13 on average up to 0.42 on "unknown"). This demonstrates that the two models have not fully developed their reasoning abilities. On "Random" data (Figure 4(b)), the performance of the models on the new label is significantly improved (BiLSTM-Attention from 0.07 up to 0.33, MAC from 0.09 up to 0.26) and remains close to the average. Comparing "Sequential" data with "Random" data, the former calls for a more robust level of model reasoning. Tests demonstrate that models trained on more complicated datasets perform better on new labels. This is implicit evidence that "unknown" demands more precise reasoning abilities from the model.

Analysis of Type ①. To further understand the results on Type ① in Table 3, we analyze why models produce surprising results. The form of Type ① is defined as "(¬)R₁(X₁, X₂) → R₂(X₂, X₃)". It only includes negation (¬) and implication (→) but is almost the hardest type to handle in "Sequential". We discover that models tend to mistakenly select deductive rules that are seemingly similar. When there are just two classes of first-order logic, noise is considerably more deceiving. We illustrate this by contrasting test data that retain noise rules with test data that do not (shown in Figure 5).

Figure 5: Further results on Type ①. In both "Sequential" and "Random" D-MT R, we test on two different Type ① test datasets. "Noise" represents the result on Type ①, where we retain deductive inference rules that have the same form but are unrelated to reasoning. "Clean" represents the improvement of accuracy after removing the interference rules.
After using the data without noise rules, the accuracy is improved, as indicated by the blue "Clean" portion. On the noise-free Type ① data, and without retraining, all models exhibit noticeable performance gains. In particular, models with more inference capability also perform better after eliminating noise (for example, BERT-LSTM and BiLSTM-Attention improve accuracy by 0.31 after removing interference rules in "Sequential" D-MT R). Comparing the results in Figure 5(a) and Figure 5(b), we find that the performance improvement of the models trained on "Random" data is significantly smaller than that of the models trained on "Sequential" data.
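The "Noise" versus "Clean" comparison amounts to testing the same trained model on two versions of each Type ① example. The sketch below shows one way to express that; it is an assumption-laden illustration, not the authors' procedure. The `is_noise` flag, `make_clean`, and `accuracy_gain` are hypothetical, and which of the two example rules is the interference rule is assumed for the example.

```python
def make_clean(example):
    """Drop interference rules, keeping the facts and the rule needed for the answer."""
    kept = [rule for rule in example["rules"] if not rule["is_noise"]]
    return {**example, "rules": kept}


def accuracy_gain(noise_correct, clean_correct):
    """Accuracy improvement of "Clean" over "Noise" for the same, not retrained, model.

    Both arguments are equal-length lists of booleans, one entry per test example.
    """
    total = len(noise_correct)
    return (sum(clean_correct) - sum(noise_correct)) / total


# One Type 1 style example with a target rule and a same-form interference rule.
example = {
    "facts": ["[Shawn] bought a present for his mother [Kathryn]."],
    "rules": [
        {"text": "If [Sharon]'s mother is [Debra] then [Debra]'s sister is [Lois].",
         "is_noise": False},
        {"text": "If [Sharon]'s grandmother is [Debra] then [Debra]'s mother is [Lois].",
         "is_noise": True},
    ],
}

clean_example = make_clean(example)
gain = accuracy_gain([True, False, False, False], [True, True, False, False])
```

Because the filter returns a new dictionary, the original "Noise" example stays available for the paired comparison.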

Case Study
To further understand the defeasible inference process of the model, we perform a case study on F-MT R. Figure 6 compares the predictions made by BiLSTM-Attention with and without additional facts. Two possible relations can be inferred from the existing text in the absence of the supplementary fact. BiLSTM-Attention successfully predicts one of the correct relations (son).
When we provide the new fact "[Josephine] said to visit [Norman]'s father next time.", the model gives the wrong prediction "father". This means that the model fails to capture that [Josephine] is not the "father" of [Norman] and is even misled by the new fact. However, it is easy for us to rule out the "son" relationship between [Josephine] and [Norman] from the provided fact. We examine the model's inference process and discover that the model focuses on the erroneous relationship "father" in the fact. This shows that the model does not capture the information of the entire sentence, but focuses on only part of it. It also demonstrates how far state-of-the-art neural models lag behind humans in complicated reasoning.

Conclusion
In this paper, we propose MT R, a large-scale logical reasoning dataset including deductive, inductive, and defeasible reasoning. It is a more complex relational inference dataset with a mixture of various inferences. In addition to testing the reasoning capacities of state-of-the-art neural models, our dataset helps to re-examine some deficiencies in the research of logical artificial intelligence in the era of deep learning NLP. The results demonstrate that even the most advanced machine readers lag well behind human ability.

Limitations
There are two limitations: (1) Although MT R includes three reasoning types (deductive, inductive, and defeasible reasoning), we only focus on the relation reasoning task. For other tasks, it is also necessary to construct more datasets that fuse multiple reasoning types. (2) Our primary focus remains monotonic reasoning; however, the combined reach of deduction and induction is only the tip of the iceberg of human reasoning (Choi, 2022). This also inspires us to focus on more nonmonotonic reasoning and more logical combinations.
[...] took her daughter [Gladys] to the store to find her some new boots for the cold winter weather. [Thomas] loved his mother, [Constance]. If [Gladys]'s uncle is not [Thomas] and [Patty]'s niece is [Serena] then [Thomas]'s son is [Patty] and [Mona]'s brother is not [Thomas]. [Gladys] brother [Thomas] [Patty] is always around her niece [Serena].
[...] is happy that his brother, [Timothy], is becoming successful. [Timothy]'s mother [Mona] secretly put money away for him to go on a trip next fall. [Colin]'s grandfather is [Dwight]. [...] brought a lot of presents when she first visited [Dwight].
(b) An example of F-MT R.

Figure 1: Examples of different logical combinations in MT R. Named entities are represented by words in bold and in brackets, whereas relationships are represented by words in orange.
[...] enjoyed a homemade dinner with her son [Christopher]. If [Sharon]'s mother is [Debra] then [Debra]'s Sister is [Lois]. If [Sharon]'s grandmother is [Debra] then [Debra]'s mother is [Lois]. [Christopher] took his Aunt [Diana] out for her favorite meal. [Debra] had a daughter named [Diana].
Story: [James] and his brother [Shawn] are constantly trying to one up each other. [Shawn] bought a present for his mother [Kathryn]. If [James]'s mother is [Kathryn] and [Maryann]'s father-in-law is [Gwendolyn] then [Kathryn]'s father is [Maryann]. If [James]'s aunt is [Kathryn] and [Maryann]'s father-in-law is [Gwendolyn] then [Kathryn]'s son is [Maryann]. [Maryann]'s mother [Kathryn] wanted to surprise him for his birthday, so she baked him a cake.
Story: [Kathryn] is playing in the park with her son [Shawn]. [Norman] is calling his sister [Kathryn] to let her know it's going to start to rain. [Timothy] went to the movies with his daughter-in-law [Veronica]. If [Norman]'s brother is [Shawn] or [Geraldine]'s brother is [Alfredo] then [Shawn]'s aunt is [Geraldine]. [Geraldine] bought his brother [Alfredo] a new wallet for his birthday. [Timothy] took his son [James] to school this morning because he missed the bus. [James] and his aunt, [Aurora], went to Disney World. They had a great time! If [Timothy]'s sister is [Aurora] then [Aurora]'s grandfather is [Thomas] and [Timothy]'s mother-in-law is [Brittney]. If [Timothy]'s mother is [Aurora] then [Aurora]'s grandson is [Thomas] and [Timothy]'s father-in-law is [Brittney].

Figure 2: Examples of each type of deductive logical reasoning in D-MT R. Circles with different letters indicate the different entities. Underlined sentences indicate corresponding rules and partially displayed noise rules.
(b), we find that the results of the same inductive reasoning text are (a) shows an example of defeasible pair.Both [son, A, C] and [nephew, A, C] can be deduced through the same relational path ([[father, A, B],[grandson, B, C]] Performance of different models when trained on "Random" D-MT R.

Figure 4: Accuracy on the "unknown" label. The bars represent the average accuracy, and the line represents the accuracy on the "unknown" label.

Figure 6: Case study of BiLSTM-Attention on F-MT R. Model(w) and Model(o) are the model's prediction results with and without additional supplementary facts, respectively. Words with a gray background are those the model attends to.

Table 1: Comparison with existing reading comprehension datasets and our MT R.

Table 3: Results on D-MT R and CLUTRR. "Sequential" means that we train on D-MT R with the input in the logical order of inference and test on the sequential test set. In contrast, "Random" means that we train on D-MT R with the input in random order and test on the random test set.

Table 3
tion ∨.The performance on Type⑤ is all considerably above average.The results on Type① are worse than our expectation.Compared with other types, Type① does not contain complex combinations of first-order logic in deductive reasoning, but

Table 4: Results on F-MT R. We train on D-MT R sequentially and test on F-MT R with and without supplementary facts. "w" indicates that the input texts are supplemented with additional information. "o" indicates that the input texts lack extra facts.