Flexible Generation of Natural Language Deductions

An interpretable system for open-domain reasoning needs to express its reasoning process in a transparent form. Natural language is an attractive representation for this purpose — it is both highly expressive and easy for humans to understand. However, manipulating natural language statements in logically consistent ways is hard: models must cope with variation in how meaning is expressed while remaining precise. In this paper, we describe ParaPattern, a method for building models to generate deductive inferences from diverse natural language inputs without direct human supervision. We train BART-based models (Lewis et al., 2020) to generate the result of applying a particular logical operation to one or more premise statements. Crucially, we develop a largely automated pipeline for constructing suitable training examples from Wikipedia. We evaluate our models using out-of-domain sentence compositions from the QASC (Khot et al., 2020) and EntailmentBank (Dalvi et al., 2021) datasets as well as targeted perturbation sets. Our results show that our models are substantially more accurate and flexible than baseline systems. ParaPattern achieves 85% validity on examples of the ‘substitution’ operation from EntailmentBank without the use of any in-domain training data, matching the performance of a model fine-tuned for EntailmentBank. The full source code for our method is publicly available.


Introduction
Developing models that can make useful inferences from natural language premises has been a core goal in artificial intelligence since the field's early days (Bobrow, 1964;Winograd, 1971). Since then, there has been massive progress in automated formal reasoning (De Moura and Bjørner, 2011); in contrast, progress in automated 1 https://github.com/alephic/ParaPattern The Green Bay Packers play in the NFL.
NFL teams play from September to January.

Substitution model
The Green Bay Packers play from September to January. Conditions lacking environmental causes are strictly hereditary.

Contraposition model
Conditions that aren't strictly hereditary have environmental causes. Figure 1: Examples of the natural deduction operations for which we construct models. Note that conclusions involve both lexical inferences (X plays in the NFL → X is an NFL team, ¬[X lacks Y ] → X has Y ) and logical transformations. natural language reasoning has been comparatively slow. Today, 'natural language inference' usually means recognizing textual entailment (RTE), a pairwise sentence classification task. Models have saturated RTE benchmarks (Bowman et al., 2015;Williams et al., 2018) largely through surfacelevel heuristics (Gururangan et al., 2018;Poliak et al., 2018); hill-climbing on these benchmarks has failed to yield robust models (Naik et al., 2018) or systems capable of more complex reasoning.
Following a line of work on multi-hop question answering (Welbl et al., 2018;Yang et al., 2018;Chen and Durrett, 2019;Min et al., 2019), the reading comprehension community has started to make inroads in the area of reasoning. Recent datasets have been explicitly designed to test deduction ability Yu et al., 2020;Holzenberger et al., 2020) and new types of models take inspiration from formal and informal reasoning Saha et al., 2020;Cartuyvels et al., 2020;Betz et al., 2021). Many recent modeling efforts share a common motif of using intermediate fact chains to support their final predictions, but a major shortcoming is that these chains are either retrieved heuristically or generated freely from autoregressive language models, meaning they are not necessarily sound. To enforce soundness, we envision future reasoning systems factoring the deduction process into a set of common operations, analogous to proof rules. Modeling the reasoning process in this way would grant the ability to generalize systematically to any problem that could be decomposed in terms of available operations, among other desirable properties (Rudin, 2019).
In this work, we describe a generative model for single-step deductive reasoning, building towards models capable of generating the range of logical transformations needed for the full reasoning process. We use a BART-based sequence-tosequence model (Lewis et al., 2020) to represent the distribution of valid conclusion statements conditioned on one or more premise statements. To make sound inferences, the model must be fine-tuned on well-formed training data. We describe a pipeline for crafting this data based on syntactic retrieval from Wikipedia, rule-based example construction, and automatic paraphrasing to increase diversity. Our hypothesis is that the logical regularities in the constructed examples will teach models to generate correct deductions, while paraphrasing coupled with the inductive bias from pretraining will regularize models, allowing them to robustly tolerate natural lexical and syntactic variation in their inputs.
We demonstrate our method's effectiveness by using it to create models for two distinct logical operations, substitution and contraposition, examples of which are shown in Figure 1. Through experiments on manually-constructed English perturbation sets, as well as on the English Question Answering via Sentence Composition (QASC) and EntailmentBank datasets (Khot et al., 2020;, we show that our proposed data generation method leads to accurate and robust operation models. While baseline methods tend to default to trivial input copying and fail to assign significant likelihood to valid novel conclusions, we show that our operation models reliably generate consistent inferences. Evaluating our substitution model on fact compositions from the QASC and EntailmentBank datasets, we find that our method produces valid conclusions at rates equivalent to models trained on in-domain, humanannotated data, indicating that our method is a viable substitute for expensive direct supervision.

Methods
We consider an operation G, like our substitution example ( Fig. 1), to be analogous to a proof rule, allowing one or more premise statements to be combined and transformed to yield a valid conclusion statement. A model for G places a distribution p G (y | x 0 , . . . , x n ) over conclusions y conditioned on premises x i .
We would like models to satisfy three criteria: Consistency: Predicted outputs should be valid deductions from the model's inputs.
Robustness: Models should be robust to linguistic variation present in their inputs.
Supervision economy: A minimal amount of manual effort should be needed to construct a model for a new operation.
We choose to parameterize p G by fine-tuning pretrained sequence-to-sequence language models (Lewis et al., 2020;Raffel et al., 2020). Fine-tuning pretrained models allows the resulting operations to successfully handle a wider variety of inputs by leveraging general linguistic knowledge gained during pretraining. The three desired model criteria we have identified lead to two data collection balancing acts: • Model consistency and robustness improve with increased data quantity, quality, and diversity, but collecting a large amount of diverse, high-quality data presents a challenge.
• Variation in the data and even noise will improve model robustness, but too much noise will cause the trained model to be inconsistent.
Directly annotating such data is possible, but requires significant manual labor, either in the form of expert annotation or careful prompting and filtering of crowd annotations. While annotated resources already exist for certain domains (Khot et al., 2020;Hwang et al., 2021), this is not the case for most types of reasoning. Scraping data from free text only works if examples of the desired operation appear in the wild, which is generally not the case for concise well-formed deduction steps. Betz et al. (2020) use templates to generate logically consistent text for training language models; however, there is little need for diversity or naturalism in their data as it is exclusively used during pretraining for the purposes of transfer learning.  use template-based natural language proofs to fine-tune transformer language models for reasoning; we include a model trained on their data as one of our baseline systems.

Data Collection
Our proposed method, PARAPATTERN, combines scraping, template-based generation, and automatic paraphrasing in order to achieve sufficient data diversity and quality with very little manual effort. PARAPATTERN consists of three phases, as shown in Figures 2 and 3.
Phase 1: Source scraping A set of dependency patterns is used to retrieve source sentences suitable for template expansion from a dependency-parsed free text corpus. An example of one of the dependency patterns we use is shown in Figure 3. This template finds sentences exhibiting the Hearst pattern (Hearst, 1992) X such as Y indicating a hypernymy relationship between X and Y. Note that the retrieved sentences do not constitute complete training examples; such examples of logical reasoning are hard to find in the wild. These sentences need to be reshaped in the next step, but they are lexically diverse and semantically suitable as inputs to our templates in terms of the relations they express.
We perform syntactic search over a corpus of cleaned English Wikipedia article text comprising 112M sentences.
We use the off-the-shelf spaCy en-core-web-sm dependency parser (Honnibal et al., 2020), and index the resulting trees by bottom-up dependency chain prefixes in chunks of 160K sentences in order to accelerate the search process. Dependency parsing and index construction for English Wikipedia takes approximately 24 hours on a single CPU core.
We use six pattern variations to gather source sentences for the substitution template and two patterns for the contraposition template. Potential matches are filtered based on a list of disallowed subject modifiers that would result in semantically invalid examples. After filtering, the substitution patterns yield ∼44,000 source sentences and the contraposition patterns yield ∼23,000 source sentences. All dependency patterns we use are listed in Figure 7 in the appendix. Dependency search over the indexed trees takes 30-45 minutes depending on pattern complexity.
Phase 2: Template expansion Source sentences are expanded into generated examples through the application of an operation-specific template. Figure 3 shows an example of a source sentence and its rule-based expansion into a pair of premise statements and a conclusion.
Template outputs are expressed in terms of the source pattern's match variables.
The template expansion algorithm produces examples by breaking out dependency subtrees rooted at each match variable and rearranging them according to the template structure. We also apply simple heuristics for verb reinflection and noun number adjustment during the reconstruction process to maximize the fluency of the resulting text.
Phase 3: Paraphrase augmentation Data from template expansion is augmented by adding copies of each example with paraphrased input sentences. Paraphrases are generated using a version of the PEGASUS model  fine-tuned for paraphrasing. 2 We sample two paraphrases for each original input using nucleus sampling with p = 0.9. See Figure 4 for samples of input sentences after paraphrasing has been applied. These values were determined heuristically in the course of our preliminary experiments; we found that using a higher sampling cutoff or more paraphrases critically reduced the consistency of model inferences, and lowering p or using only a single paraphrase per source example increased the rate of premise copying for examples not matching a training template.
We observe that the resulting paraphrases tend to include a noticeable amount of noise (e.g. the replacement of 'Hibiscus' with 'bing' in Figure 3), but we hypothesize that since we only paraphrase premises, this effectively adds a denoising component to the fine-tuning objective similar to the motivation behind backtranslation In Egypt, herbal teas such as Hibiscus tea are very popular.
Premises: In Egypt, herbal teas are very popular. Hibiscus tea is a herbal tea. Conclusion: In Egypt, Hibiscus tea is very popular.

Template expansion
Premises: Herbal teas are popular in Egypt.
A herbal tea is called bing tea. Conclusion: In Egypt, Hibiscus tea is very popular.
3. Paraphrasing Figure 3: An example of the steps involved in our data generation process for the substitution operation. Phrases in the source sentence and expanded template are colored according to their corresponding pattern variable.

Substitution
Premises: Staphylococcus epidermis is a microorganism.
Microorganisms colonize the skin surface. Paraphrased: Staphylococcus epidermidis is a microorganism.
Microbiological colonization of the skin surface. "Staphylococcus Epidermidis is a Microorganism." The skin surface is colonized by micro organisms. Conclusion: Staphylococcus epidermis colonizes the skin surface.
Premises: During the undergraduate years, seminarians learn the ancient language courses. Latin is an ancient language course. Paraphrased: The seminars know the ancient language courses.
Latin is an old language course. Seminarians learn ancient language during their undergraduate years.
Latin is a language. Conclusion: During the undergraduate years, seminarians learn Latin.

Contraposition
Premise: As such, rivers that have headwaters in the mountains provide water for irrigation in the surrounding lands. Paraphrased: In order for water to be used in the surrounding lands, the rivers in the mountains must have their headwaters there. Conclusion: As such, rivers that do not provide water for irrigation in the surrounding lands do not have headwaters in the mountains.
Premise: Dogs that are especially dirty or hungry are not able to participate in contests. Paraphrased: To participate in a contest, dogs that are dirty or hungry, must be turned away. Conclusion: Dogs that are able to participate in contests are not especially dirty or hungry. in machine translation (Sennrich et al., 2016). Additional samples of the output of our data generation pipeline are shown in Figure 4. These examples demonstrate the ability of automatic paraphrasing to reduce both lexical and syntactic regularities in the original template outputs that can lead models to overfit to the template. In our subsequent experiments, we examine this overfitting by ablating Phase 3 of our data generation pipeline.

Model Training
Once data for an operation has been generated, we use it to fine-tune an instance of BART-Large (Lewis et al., 2020). Premise sentences x 0 . . . x n are concatenated in a random order and provided as input to the model's encoder, and the conclusion sentence y is used as the target sequence for the decoder.
We use model and training algorithm implementations from the transformers library (Wolf et al., 2020). We fine-tune models for a single epoch using the ADAMW optimizer (Loshchilov and Hutter, 2019) with initial learning rate 3e-5 and triangular learning rate decay. In our preliminary experiments, we found that fine-tuning models for more than a single epoch always produced detrimental overfitting. Models are trained using a total batch size of 16 split across two NVidia Titan RTX GPUs; with this configuration, training takes no more than an hour of wall clock time per model.

Baselines
We compare models trained using our proposed method against three baselines. Our first baseline system is an unmodified instance of the pretrained autoregressive GPT2-Large language model (Radford et al., 2019), prompted with operation premises followed by the elicitation prefix "Therefore," (Zero-shot GPT2). This baseline, inspired by the zero-shot premise elaboration employed by Betz et al. (2021), is intended to assess the likelihood of making consistent deductions under a general model of language with no logical specialization. Our second baseline model is an instance of BART-Large fine-tuned to generate hypotheses from the MNLI dataset (Williams et al., 2018) Substitution -Control Premises: RSA is a cryptographic system.
Cryptographic systems let people exchange messages securely. Conclusion: RSA lets people exchange messages securely.
Predicted: RSA lets people exchange messages securely.
Link NP mismatch Premises: RSA is a cryptographic system.
Encryption protocols let people exchange messages securely. Conclusion: RSA lets people exchange messages securely.
Predicted: RSA allows people to exchange messages securely.
Identity VP mismatch Premises: Dominant cryptographic systems include RSA.
Cryptographic systems let people exchange messages securely. Conclusion: RSA lets people exchange messages securely.
Predicted: RSA allows people to exchange messages securely.
NP + VP mismatch Premises: Dominant encryption protocols include RSA.
Cryptographic systems let people exchange messages securely. Conclusion: RSA lets people exchange messages securely.
Predicted: RSA allows people to exchange messages securely.  conditioned on their respective premises (MNLI BART). We train on all instances for which the gold label indicates entailment (≈103K examples) with the same training configuration as our other models, detailed in 2.2. We hypothesize that while this model may assign higher likelihood to valid conclusions than a general language model would, it will place much more probability mass on reemitting premise statements due to the fact that high word overlap tends to be a common feature of RTE examples labeled as 'entailment' (Zhou and Bansal, 2020).

Number agreement
Our third baseline model is an instance of BART-Large fine-tuned to generate proof steps from the ProofWriter dataset  (ProofWriter BART). While this dataset contains a large number of English proof steps (≈135K), the language used in its proofs is automatically generated from a limited template library and is thus highly constrained. We hypothesize that this model will be unable to generalize as a result.
On the QASC and EntailmentBank datasets, we additionally compare to BART-Large models fine-tuned on inference steps from each dataset's respective training split (QASC BART and Ent. Bank BART). The QASC training set contains ∼8K crowd-annotated fact compositions, while the EntailmentBank training set contains ∼3K expert-annotated premise-conclusion steps.

Perturbation sets
First, in order to evaluate the accuracy of our models on the operations they were designed for, and to understand the degree to which they generalize when input statements deviate from their training patterns, we manually construct controlled perturbation sets for each operation.
Our substitution perturbation set consists of 75 examples evenly split across a control condition and four test conditions, and our contraposition perturbation set consists of 60 examples evenly split across a control condition and three test conditions (15 examples per condition).
Each example in a given test condition is constructed by perturbing a corresponding control example. This aligned structure allows us to evaluate the impact of a particular perturbation on model performance without the confounding effect of content variation that would be present if each condition were constructed independently. Samples of each perturbation condition are presented in Figure 5.

QASC and EntailmentBank
Our perturbation sets are not necessarily "indomain" for our models, but they still neatly fit the reasoning patterns we are targeting. To test our approach's applicability to data outside its training, we apply our substitution model to the fact compositions in the validation splits of the QASC and EntailmentBank datasets (Khot et al., 2020;. QASC fact compositions were annotated by crowd workers as rationales for multiple-choice question answering problems. Since annotators combined facts with a certain question in mind, there is some amount of missing context for many QASC fact combinations.
The EntailmentBank dataset consists of a set of expert-annotated natural language proofs for elementary science question-answer pairs involving multi-step reasoning. Thanks to its trained annotators, EntailmentBank contains fewer spurious fact combinations than QASC.

Evaluation Criteria
We evaluate model performance on each dataset primarily through a manual assessment of conclusion validity. The first author placed generated conclusions into one of six categories: Valid: Conclusion is logically consistent with premises but does not trivially repeat them.
Valid with minor grammar errors: Conclusion is valid but includes minor syntactic errors such as bad verb inflection that do not hinder comprehension.
Repeats premises: Conclusion is a near-verbatim copy of one or more premise sentences.
Unsupported: Conclusion is technically true but does not logically follow from premises.

Incompatible: Conclusion contradicts premises or is inherently false.
Incomprehensible: Conclusion is difficult to interpret due to major ungrammaticality.
Model outputs were shuffled and annotated without knowledge of model identity to prevent rating bias. The last author reannotated a subset of QASC examples to validate the first author's annotations; there were minor differences in interpretation of the divisions between non-valid categories, but the relative proportion of conclusions rated as valid remained consistent between annotators. We additionally compute the perplexity of reference conclusions under each model in order to assess the likelihood assigned to desired conclusions by each model's output distribution.
For the perturbation sets, we also report the BLEURT score (Sellam et al., 2020) of generated conclusions with respect to reference conclusions.

Results on Perturbation Sets
Our first question is whether or not we have good generative models of natural language deductions. As Table 1 shows, PARAPATTERN BART outperforms all baselines by a wide margin in terms of the likelihood of desired conclusions (Ref. PPL), the similarity of its outputs to desired conclusions, and its overall rate of valid inference. Additionally, there is a substantial gap in performance between models trained with and without paraphrastic data augmentation (PARAPATTERN vs. Pattern-only, an ablation of our method). We observe that PARAPATTERN allows models to reliably produce valid inferences when given inputs that lie both on and off the training template manifold. In contrast, models trained using generic entailments (MNLI BART) or purely template-derived inferences (ProofWriter BART) are almost never able to produce valid, nonredundant inferences.  Figure 6: BLEURT score profile of ParaPattern BART substitution outputs for fact combinations from the QASC development set. Sampled substitution model outputs and corresponding QASC annotations for a range of scores are shown to the right. Minor grammatical errors are indicated in orange. Note that generated conclusions remain semantically coherent despite diverging from annotated references as BLEURT scores decrease.   Table 2 shows that PARAPATTERN BART generates valid, grammatical inferences at a rate comparable to that of a model with identical parameter budget and pretraining fine-tuned on in-domain data (QASC BART) as well as a model fine-tuned on inferences from EntailmentBank (Ent. Bank BART), an expert-annotated dataset in an adjacent domain. Thanks to our automated data collection pipeline, we are able to achieve this level of fidelity without any direct annotation. For a full profile of model behaviors on QASC, refer to Figure 7. We note that MNLI BART's preference for repeating premises results in a lower reference perplexity than PARAPATTERN  Table 2 despite MNLI BART behaving poorly during generation due to the fact that reference conclusions exhibit high lexical overlap with premises. Pattern-only BART also tends to repeat inputs, but in this case it is a lowconfidence 'fallback' behavior, as evidenced by its high reference perplexity.

BART in
Visualizing Model Generations on QASC Figure 6 depicts a range of PARAPATTERN BART outputs for QASC validation set fact combinations ranked according to their BLEURT scores with respect to the reference combined fact. In the portion of this distribution above 0 BLEURT, we see very close agreement between the content of generated outputs and references. On the   . This set includes 41% of the sampled examples, in agreement with the statistic reported by the dataset's authors. We omit out-of-domain baselines due to their non-competitive performance on QASC. opposite end of the spectrum, we can see the structure of model outputs diverges from that of the reference fact combinations. However, even as our model's predictions grow farther from the reference conclusions, they remain semantically consistent combinations of the premise facts. The prediction for the final example in Figure 6 is a valid inference in spite of an ungrammatical premise, exemplifying one of the benefits of training on data augmented with noisily paraphrased inputs. In agreement with our quantitative results, these outputs confirm that PARAPATTERN generates sound inferences even under domain shift.

Results on EntailmentBank
We capitalize on the known taxonomy of reasoning types present in EntailmentBank to better understand how well our substitution model aligns with the definition of 'substitution' adopted by the dataset's authors. In Table 3 we demonstrate that PARAPATTERN BART matches the performance of the in-domain fine-tuned BART model on the subset of examples where the inputs specify a well-formed substitution as defined in . This indicates that there is agreement between their definition of substitution, our model's representation of the operation, and the aspects of the operation captured by the EntailmentBank training set.
Furthermore, this result shows that PARA-PATTERN can circumvent the need for manual supervision for a given reasoning skill without sacrificing performance in that skill. Figure 8 provides a breakdown of model behaviors on EntailmentBank. Of the 22% of nonsubstitution examples for which our substitution model is able to produce valid inferences, we note that the majority are instances of 'property inheritance', 'sequential inference', or 'inference from rule', according to the taxonomy of . The following are examples of some of these inferences generated by the PARAPATTERN BART substitution model:

Inference from rule
If fossils are destroyed in rock transition, then there will be gaps in the fossil record. Fossils are lost / destroyed when sedimentary rock changes to metamorphic rock. → Fossils are lost / destroyed when sedimentary rock changes to metamorphic rock, leaving gaps in the fossil record.

Property inheritance
United states is located in the northern hemisphere. New york / new york state is a state located in the united states of america. → New york / new york state is in the northern hemisphere.

Sequential inference
Humans changing ecosystems usually has a negative impact on an ecosystem / organisms living in an ecosystem. Humans building homes in an ecosystem causes that ecosystem to change. → Humans building homes in an ecosystem usually has a negative impact on an ecosystem / organisms living in an ecosystem.
We hypothesize that these inferences reflect generalizations of NP substitution to other phrase categories, most likely learned as a side effect of paraphrastic data augmentation.

Related Work
Natural Logic (Bernardi, 2002;Zamansky et al., 2006;MacCartney and Manning, 2009;Angeli et al., 2016) is related to our approach in that it provides a framework for logical reasoning about statements in natural language. Such systems recognize that there is a cat on the dresser entails there is an animal on the dresser because of the hypernymy relationship between cat and animal. These relationships can be formalized into a monotonicity calculus (Icard et al., 2017) and past work has grounded lexical inference tasks into such a formalism (Angeli et al., 2016;Hu et al., 2020). Instead of decomposing entailment into relationships between words, our models learn to map premises to conclusions at the sentence level, allowing our approach to handle relationships not captured by such a formalism.

Multi-Hop Reasoning
Combining facts to form a conclusion overlaps with the idea of multi-hop reasoning, which has been explored in reading comprehension settings (Welbl et al., 2018;Yang et al., 2018). However, training end-to-end models on these datasets does not necessarily teach models to combine facts (Chen and Durrett, 2019;Min et al., 2019). Systems like NLProlog (Weber et al., 2019) attempt to explicitly ground reasoning in logic, but this process still heavily relies on latent representations; in contrast, by grounding reasoning directly in natural language, a system based on natural deduction operations like ours gains inherent faithful natural language explanations and can build on the strengths of pretrained language models. More recent datasets emphasize the ability to actually exhibit correct reasoning chains and form explanations Xie et al., 2020;. Systems like PRover (Saha et al., 2020) and Leap-of-Thought (Talmor et al., 2020) have some broadly similar goals to ours, but only retrieve facts and do not generate novel conclusions.
Generative Reasoning The generative nature of our models resembles generative models used for commonsense inference (Rajani et al., 2019;Latcinnik and Berant, 2020;Shwartz et al., 2020). However, these models do not strongly constrain the nature of what is generated. In contrast, our models reliably perform specific logical transformations, indicating that they can support sound inferences over longer reasoning chains in future work. Arabshahi et al. (2021) also explore generative reasoning in commonsense scenarios, but the domain of their approach is limited. Khot et al. (2021) use generative models to decompose a complex QA problem into a series of elementary steps that can be delegated to simpler models; this idea parallels the notion of decomposing reasoning into simple steps to be performed by generative operation models. Their results support the idea that such decomposition aids systematic generalization by enforcing separation of concerns.

Conclusion
Building systems that use natural language as a medium for reasoning will require operations to logically combine and transform natural language statements. In this work, we present PARAPATTERN, a method for creating such models with minimal manual effort by finetuning pretrained sequence-to-sequence language models on data generated through a three-step process of syntactic retrieval, template expansion, and automatic paraphrasing. Our experimental results show that PARAPATTERN yields operation models capable of generating consistent logical transformations over a diverse range of natural language inputs, matching the performance of models trained with in-domain human supervision.