LogiTorch: A PyTorch-based library for logical reasoning on natural language

Logical reasoning on natural language is one of the most challenging tasks for deep learning models. There has been increasing interest in developing new benchmarks to evaluate the reasoning capabilities of language models such as BERT. In parallel, new transformer-based models have emerged that achieve ever better performance on these datasets. However, there is currently no library for logical reasoning that brings together such benchmarks and models. This paper introduces LogiTorch, a PyTorch-based library that includes different logical reasoning benchmarks and models, as well as utility functions such as coreference resolution. This makes it easy to directly use the preprocessed datasets, to run the models, or to finetune them with different hyperparameters. LogiTorch is open source and can be found on GitHub.


Introduction
Machine reasoning over natural language has been an object of research since the 1950s (Newell and Simon, 1956; McCarthy et al., 1960). One prototypical task in the domain is Textual Entailment: given a premise (such as "I ate a cake"), the goal is to determine whether a hypothesis ("I ate something sweet") is entailed or not. Other logical reasoning tasks are question answering, multiple choice question answering, and proof generation.
Lately, deep learning models have shown impressive performance on tasks such as these, in particular transformer-based models such as BERT (Devlin et al., 2019) and GPT-3 (Brown et al., 2020). However, these models are easily distracted by trap words, syntactic variations (Kassner and Schütze, 2020), or negation (Kassner and Schütze, 2020; Ettinger, 2020; Hossain et al., 2020, 2022; Helwe et al., 2021). Hence, the question of whether these models can logically reason on text is still open (Niven and Kao, 2019; Helwe et al., 2021). New models are being created incessantly (e.g., LogiGAN (Pi et al., 2022) and Logiformer (Xu et al., 2022) in 2022), and new datasets are being created to evaluate these models, including, e.g., LogiQA (Liu et al., 2021b) and ProofWriter (Tafjord et al., 2021). The initiative of open-sourcing toolkits has accelerated progress in the field of natural language processing, driven by projects such as Transformers (Wolf et al., 2020) from HuggingFace and Stanza (Qi et al., 2020) from Stanford. However, this progress has not yet arrived in the field of logical reasoning: researchers still have to find and download different models, parameterize them, find the corresponding datasets, bring them into suitable formats, and fine-tune the models. The datasets are maintained on different Web pages, exhibit different formats (JSON vs. full text, numerical vs. textual labels, etc.), and follow different conventions, which makes it cumbersome to apply one model across several sources. The models themselves are implemented in different frameworks, have different input and output formats, require different dependencies, and differ in the way they are run, which makes it burdensome to exchange one model for another. Some models are not even available online, but have to be re-implemented from scratch based on the diagrams in the scientific publications. All of this hinders reproducibility, re-usability, comparability, and ultimately scientific progress in the area.
In this paper, we propose to bring the benefits of open-source libraries to the domain of logical reasoning: we build a Python library, LogiTorch, that includes 14 datasets and 4 implemented models for 3 different logical reasoning tasks. All models can be called in a unified way, all datasets of one task are available in the same standardized format, and all models can be run with all datasets of the same task. All models have been re-implemented from the research papers that proposed them, and they have been validated by subjecting them to the same experiments as the original papers, with comparable results. More models and benchmarks are in preparation. LogiTorch works on top of PyTorch (Paszke et al., 2019) and uses the Transformers library. It also includes utility functions for preprocessing, such as coreference resolution and discourse delimitation.
The rest of the paper is organized as follows. Section 2 discusses the design and components of LogiTorch, and describes the datasets, utility functions, and models. Section 3 shows the experimental results of our implemented models on different logical reasoning tasks. We conclude in Section 4.

LogiTorch
LogiTorch is our Python library for logical reasoning on natural language text. Figure 1 shows the tree structure of our library. It is built on top of PyTorch and consists of 5 parts:

Datasets. We gathered different logical reasoning datasets that allow users to evaluate the reasoning capabilities of deep learning models on natural language. Once a dataset is called from LogiTorch, it is downloaded and wrapped into an object that inherits the Dataset class of PyTorch. This means that all datasets are accessible via the same interface. We describe the datasets in detail in Section 2.1.

Data Collators. Different models require different preprocessing steps for the same data and the same task: one model may work on numerical vectors, another on textual input. Hence, we designed, for each pair of a dataset and a model, a data collator that brings the dataset into the format required by the model.

Utilities. Some models require supplementary features in addition to the input text. For example, the DAGN model (Huang et al., 2021) requires the discourse structure of the input in order to create a logical graph representation of it. For such cases, LogiTorch provides different utility functions, most notably for discourse structure analysis, coreference resolution, and logical expression extraction, which we discuss in Section 2.2.

Models. LogiTorch provides several deep learning models that have been designed to perform logical reasoning tasks such as proof generation and textual entailment. For each model, we either provide an implementation from scratch or a wrapper over its original implementation. For the transformer-based models, we use the Transformers library from HuggingFace. We describe the models in detail in Section 2.3.

PyTorch Lightning Models. For each implemented model, we also provide a PyTorch Lightning version. It includes the model, the optimizer, the training loop, and the validation evaluation. For example, the PRover model (Saha et al., 2020) has a PyTorch Lightning version called PLPRover. This allows users to use features such as multi-GPU and fast low-precision training without modifying the training loop.
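To make the dataset/collator design concrete, here is a minimal sketch of the pattern in plain Python. The class and function names are our own illustration, not LogiTorch's actual API: a PyTorch-style dataset exposes indexable examples with a common schema, and a per-model collator turns raw examples into one model-ready batch.

```python
# Minimal sketch of the dataset/collator pattern. Names are illustrative,
# NOT LogiTorch's actual API.

class BoolQADataset:
    """A PyTorch-style dataset: indexable examples with a common schema."""

    def __init__(self, examples):
        # each example: (context, question, label) with label in {0, 1}
        self.examples = examples

    def __len__(self):
        return len(self.examples)

    def __getitem__(self, idx):
        return self.examples[idx]


def text_pair_collator(batch):
    """Turn raw examples into one model-ready batch. A different model
    would get a different collator for the same data."""
    contexts = [c for c, _, _ in batch]
    questions = [q for _, q, _ in batch]
    labels = [y for _, _, y in batch]
    return {"contexts": contexts, "questions": questions, "labels": labels}


dataset = BoolQADataset([
    ("Bob is smart. If someone is smart then he is kind.", "Bob is kind.", 1),
    ("Alice is tall.", "Alice is kind.", 0),
])
batch = text_pair_collator([dataset[i] for i in range(len(dataset))])
```

In the real library, such an object would inherit PyTorch's Dataset class and be fed to a DataLoader together with the collator.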

Datasets
The currently implemented datasets focus on evaluating the reasoning capabilities of deep learning models. They cover four tasks: Multiple Choice Question Answering (MCQA), Question Answering (QA), Proof Generation, and Textual Entailment (TE). Table 1 shows the task and the number of instances of each dataset. Let us now describe each task and the associated datasets.
Multiple Choice Question Answering (MCQA) is the task of choosing the correct answer to a question from a list of possible answers. Here is an example taken from the LogiQA dataset (Liu et al., 2021b):

Utilities
LogiTorch implements several utility functions that can be used for feature engineering:

Coreference Resolution is the task of finding all mentions in a text that refer to the same entity. For example, in "Zidane is one of the best footballers. He won the World Cup in 1998", the words "Zidane" and "He" refer to the same person. Coreference resolution is used by the Focal Reasoner model (Ouyang et al., 2021) to construct a graph of fact triples, where mentions of the same entity are connected with an undirected edge. In LogiTorch, we implemented a wrapper over a finetuned SpanBERT (Joshi et al., 2020) for coreference resolution.

Logical Expression Extraction is the task of extracting a logical representation from a text, in order to infer new logical expressions. For example, the sentence "If you have no keyboarding skills, you will not be able to use a computer" can be split into α = "you have no keyboarding skills" and β = "you will not be able to use a computer". The sentence can then be rewritten as α → β. From this, we can infer by transposition that ¬β → ¬α, which corresponds to "If you are able to use a computer, you have keyboarding skills". The LReasoner model (Wang et al., 2022) uses this utility function to extend the input with logical expressions. In LogiTorch, we developed a wrapper over the code provided by LReasoner for this purpose.
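The transposition step above can be sketched in a few lines. The string-based representation below is our own toy illustration, not the LReasoner wrapper's actual interface:

```python
# Toy sketch of inferring the contrapositive: from (alpha -> beta),
# derive (not beta -> not alpha). Representation is illustrative only.

def negate(prop: str) -> str:
    """Toggle a leading negation on an atomic proposition."""
    return prop[4:] if prop.startswith("not ") else "not " + prop

def contrapose(implication):
    """From (alpha, beta) meaning alpha -> beta, return the contrapositive."""
    alpha, beta = implication
    return (negate(beta), negate(alpha))

# "If you have no keyboarding skills, you will not be able to use a computer"
rule = ("not keyboarding skills", "not able to use a computer")
print(contrapose(rule))  # ('able to use a computer', 'keyboarding skills')
```

This corresponds to the inferred sentence "If you are able to use a computer, you have keyboarding skills".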
Discourse Delimitation is the task of splitting a text into elementary discourse units (EDUs). EDUs are the leaves of the tree representation used in Rhetorical Structure Theory (RST), whose edges are rhetorical relations. For example, "A signal in a pure analog system can be infinitely detailed, while digital systems cannot produce signals that are more precise than their digital unit" is split into two EDUs: "A signal in a pure analog system can be infinitely detailed" and "digital systems cannot produce signals that are more precise than their digital unit". The DAGN model (Huang et al., 2021) requires EDUs to construct a graph of discourse units.
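For intuition, a deliberately naive splitter on discourse connectives reproduces the example above. DAGN relies on a proper discourse parser, not this heuristic:

```python
import re

# Naive EDU splitter for illustration only: split on a few subordinating
# connectives. Real discourse delimitation uses a trained parser.
CONNECTIVES = r",?\s*\b(?:while|although|because|but)\b\s*"

def split_edus(text: str):
    parts = re.split(CONNECTIVES, text)
    return [p.strip() for p in parts if p.strip()]

sentence = ("A signal in a pure analog system can be infinitely detailed, "
            "while digital systems cannot produce signals that are more "
            "precise than their digital unit")
edus = split_edus(sentence)
```

On this sentence, the heuristic yields exactly the two EDUs from the example.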

Models
LogiTorch currently implements four models:

RuleTaker (QA task) (Clark et al., 2021) is a RoBERTa-Large model (Liu et al., 2019) that has been finetuned first on the RACE dataset (Lai et al., 2017), and then finetuned again for rule-based reasoning. The model takes as input facts and rules and a boolean question. The output is either True or False. The RoBERTa model has a similar architecture to BERT, but performs better on many NLP tasks, because it is pretrained for longer, with larger batches, and on a larger dataset. Its only pretraining task is Masked Language Modeling (MLM), but the masked tokens are re-sampled after each training epoch.
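The dynamic-masking idea behind RoBERTa's pretraining can be sketched as follows. This is a simplified illustration of re-sampling masked positions each epoch, not the actual RoBERTa implementation:

```python
import random

# Sketch of dynamic masking: instead of masking the corpus once during
# preprocessing, masked positions are re-sampled every epoch.

def mask_tokens(tokens, mask_prob=0.15, rng=None):
    rng = rng or random.Random()
    return ["[MASK]" if rng.random() < mask_prob else tok for tok in tokens]

tokens = "Bob is smart and Bob is kind".split()
epoch1 = mask_tokens(tokens, rng=random.Random(1))
epoch2 = mask_tokens(tokens, rng=random.Random(2))
# Across epochs, the model generally sees different masked positions
# for the same sentence.
```

Re-sampling exposes the model to many masking patterns per sentence over the course of training, rather than a single fixed one.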
ProofWriter (QA and proof generation) (Tafjord et al., 2021) is a T5 model (Raffel et al., 2020) finetuned to perform rule-based reasoning. It takes as input facts and rules and a question. The output is either True, False, or Unknown (if the training dataset follows the open-world assumption). T5 is a text-to-text transfer transformer that was pretrained on a variety of NLP problems such as textual entailment, coreference resolution, linguistic acceptability, and semantic equivalence.
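To illustrate the task format (facts + rules + question → True/False/Unknown), here is a toy symbolic forward-chaining reasoner. It only mimics the input/output of the task; ProofWriter itself is a finetuned T5, not a symbolic solver:

```python
# Toy forward-chaining reasoner illustrating the rule-based QA task.
# Facts and rules are over atomic string propositions.

def forward_chain(facts, rules):
    """rules: list of (premises, conclusion); derive until fixpoint."""
    known = set(facts)
    changed = True
    while changed:
        changed = False
        for premises, conclusion in rules:
            if conclusion not in known and all(p in known for p in premises):
                known.add(conclusion)
                changed = True
    return known

def answer(facts, rules, question, closed_world=False):
    """Under the open-world assumption, an underivable fact is Unknown;
    under the closed-world assumption, it is False."""
    known = forward_chain(facts, rules)
    if question in known:
        return "True"
    return "False" if closed_world else "Unknown"

facts = ["Bob is smart"]
rules = [(["Bob is smart"], "Bob is kind")]
```

For example, `answer(facts, rules, "Bob is kind")` is derivable, while an unrelated question is Unknown under the open-world assumption and False under the closed-world assumption.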
PRover (QA and proof generation) (Saha et al., 2020) is built on RoBERTa with three modules: a QA module, a Node module, and an Edge module. The QA module answers a question as either True or False. The Node and Edge modules generate proofs: the Node module predicts the rules and facts that are relevant for the answer, and the Edge module predicts the links between two relevant facts and between a relevant fact and a relevant rule.

BERTNOT (TE task) (Hosseini et al., 2021) is a BERT model that is pretrained with an unlikelihood loss and knowledge distillation on the MLM task in order to model negation, and then finetuned on textual entailment tasks. This model is more robust on examples containing negations, and performs better than the original BERT on the negated NLI datasets.

Future releases will include newer models such as LReasoner (Wang et al., 2022), Focal Reasoner (Ouyang et al., 2021), AdaLoGN (Li et al., 2022), Logiformer (Xu et al., 2022), and LogiGAN (Pi et al., 2022).
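To give intuition for how node and edge predictions such as PRover's can be assembled into a proof graph, here is a toy post-processing sketch. The data and thresholding scheme are illustrative, not PRover's actual output format or decoding procedure:

```python
# Sketch of assembling a proof graph from per-node and per-edge scores,
# in the spirit of PRover's Node and Edge modules. Illustrative only.

def build_proof_graph(candidates, node_scores, edge_scores, threshold=0.5):
    """Keep facts/rules whose node score passes the threshold, then keep
    predicted edges only if both endpoints were kept."""
    nodes = [c for c in candidates if node_scores[c] >= threshold]
    edges = [(u, v) for (u, v), s in edge_scores.items()
             if s >= threshold and u in nodes and v in nodes]
    return {"nodes": nodes, "edges": edges}

candidates = ["fact: Bob is smart", "rule: smart -> kind"]
node_scores = {"fact: Bob is smart": 0.9, "rule: smart -> kind": 0.8}
edge_scores = {("fact: Bob is smart", "rule: smart -> kind"): 0.7}
graph = build_proof_graph(candidates, node_scores, edge_scores)
```

The resulting graph connects the relevant fact to the rule that fires on it, which is the kind of proof structure the Node and Edge modules are trained to predict.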

Library Usage
Listing 1 shows a detailed example of how a model can be trained on a rule-based reasoning dataset for QA. The RuleTaker model is trained on its corresponding dataset. In Lines 9-10, we initialize the training and validation datasets with the RuleTakerDataset. We specify which sub-dataset and which split we want to use. In Line 12, we initialize the RuleTaker data collator for preprocessing the datasets. We then use the DataLoader to pre-load the datasets and use them as batches.

Experiments

All models use the same settings as in the original papers.
Table 2 shows the results of the three different models on the QA task at different reasoning depths. Our model implementations achieve near-perfect accuracies, which are comparable to the performance in the original papers. Table 3 shows the performance on the TE task for each TE training dataset (SNLI, MNLI, and RTE). Again, our model achieves nearly the same results as reported in the original paper (Hosseini et al., 2021) on the MNLI and SNLI datasets. We obtain lower results on the RTE dataset; we assume that this is because the finetuned model has a high variance due to the small size of the RTE training set.

Conclusion
We have introduced LogiTorch, a Python library for logical reasoning on natural language. It is built on top of PyTorch in combination with the Transformers and PyTorch Lightning libraries. LogiTorch includes an extensive list of textual logical reasoning datasets, utility functions, and different implemented models. The library allows researchers and developers to easily use a logical reasoning dataset and train logical reasoning models with just a few lines of code. The library is available on GitHub and is under active development.
For future work, we will add new datasets, and implement models such as DAGN, Focal Reasoner, and LogiGAN together with their utility functions for feature engineering. Finally, we want to invite researchers and developers to contribute to LogiTorch. We believe that such a library will lower the hurdles to research in the area, foster re-usability, encourage comparative evaluation, strengthen reproducibility, and advance the culture of open software and data.

Context: David knows Mr. Zhang's friend Jack, and Jack knows David's friend Ms. Lin. Everyone of them who knows Jack has a master's degree, and everyone of them who knows Ms. Lin is from Shanghai.
Question: Who is from Shanghai and has a master's degree?
Choices: (A) David (B) Jack (C) Mr. Zhang (D) Ms. Lin

We implement the following MCQA datasets: LogiQA (Liu et al., 2021b) assesses the logical deductive ability of language models for the case where the correct answer to a question is not explicitly included in the paragraph. The corpus includes paragraph-question pairs translated from the National Civil Servants Examination of China. ReClor (Yu et al., 2019) is a corpus consisting of questions retrieved from standardized exams such as the LSAT and GMAT. To adequately evaluate a model without allowing it to take advantage of artifacts in the corpus, the testing set is split into two sets: the EASY set, where the instances are biased, and the HARD set, where they are not.

Question Answering (QA): the RuleTaker datasets (Clark et al., 2021) provide facts and rules and a boolean question. The model has to perform logical deductions from the rules and facts in order to answer the question. The datasets include synthetically generated subsets that require different depths of reasoning, i.e., different numbers of deduction steps to answer a question. They also include the Bird dataset (which showcases McCarthy's problem of abnormality (McCarthy, 1986)), the Electricity dataset (which simulates the functions of an appliance), and the ParaRules corpus (where crowd workers paraphrased sentences such as "Bob is cold" to "In the snow sits Bob, crying from being cold"). ParaRules Plus (Bao, 2021) is an improved version of ParaRules (Clark et al., 2021); it has more examples for the instances with larger reasoning depths. AbductionRules (Young et al., 2022) is a dataset that evaluates the abductive reasoning capabilities of language models. It is generated similarly to ParaRules Plus, but in this task, the model has to generate an answer to explain an observation.

Proof Generation: the ProofWriter dataset (Tafjord et al., 2021) contains proofs for the answer of each question. Furthermore, there is a variant of the dataset that considers the open-world assumption.

Textual Entailment (TE, also RTE) is the task of predicting whether a
premise entails or contradicts a hypothesis.

Excerpts from Listings 1 and 2:

import pytorch_lightning as pl
from pytorch_lightning.callbacks import ModelCheckpoint
from torch.utils.data.dataloader import DataLoader
from logitorch.datasets.qa.ruletaker_dataset import RuleTakerDataset
from logitorch.pl_models.ruletaker import PLRuleTaker

context = "Bob is smart. If someone is smart then he is kind."
question = "Bob is kind."

Table 2: Accuracies of different models for the QA task at different reasoning depths. 1 Depth-5 of the testing set of the RuleTaker dataset. 2 Depth-5 of the testing set of the ProofWriter dataset. 3 The original implementation uses a (more powerful) T5-11B model.

We then define a PyTorch Lightning trainer, which supports devices such as CPUs, GPUs, and TPUs. Finally, we train the model with the fit function. Future releases will also provide pre-configured pipelines to train models. Listing 2 shows the code for testing the best saved model of Listing 1. In Line 3, we load the best model. In Line 8, we use the predict function, which takes as input a context and a question, and predicts either 0 (for False) or 1 (for True).

Table 3: Results of our BERTNOT implementation on different textual entailment datasets.
For BERTNOT, we used the original implementation by Hosseini et al. (2021) (included in LogiTorch), finetuned the model on each TE dataset (MNLI, SNLI, and RTE), and tested it on its negated counterparts (Hossain et al., 2020).