RuleBERT: Teaching Soft Rules to Pre-Trained Language Models

While pre-trained language models (PLMs) are the go-to solution to tackle many natural language processing problems, they are still very limited in their ability to capture and to use common-sense knowledge. In fact, even if information is available in the form of approximate (soft) logical rules, it is not clear how to transfer it to a PLM in order to improve its performance for deductive reasoning tasks. Here, we aim to bridge this gap by teaching PLMs how to reason with soft Horn rules. We introduce a classification task where, given facts and soft rules, the PLM should return a prediction with a probability for a given hypothesis. We release the first dataset for this task, and we propose a revised loss function that enables the PLM to learn how to predict precise probabilities for the task. Our evaluation results show that the resulting fine-tuned models achieve very high performance, even on logical rules that were unseen at training. Moreover, we demonstrate that logical notions expressed by the rules are transferred to the fine-tuned model, yielding state-of-the-art results on external datasets.


Introduction
Pre-trained language models (PLMs) based on transformers (Devlin et al., 2019; Liu et al., 2020) are established tools for capturing both linguistic and factual knowledge (Clark et al., 2019b; Rogers et al., 2020). However, even the largest models fail on basic reasoning tasks. If we consider common relations between entities, we see that such models are not aware of negation, inversion (e.g., parent-child), symmetry (e.g., spouse), implication, and composition. While these are obvious to a human, they are challenging to learn from text corpora as they go beyond linguistic and factual knowledge (Ribeiro et al., 2020). We claim that such reasoning primitives can be transferred to the PLMs by leveraging logical rules, such as those shown in Figure 1.

Input facts: Mike is the parent of Anne. Anne lives with Mark. Anne is the child of Laure. Anne lives with Mike.

Input rules: (r1, 0.1) Two persons living together are married. (r2, 0.7) Persons with a common child are married. (r3, 0.9) Someone cannot be married to his/her child. (r4, 1.0) Every person is the parent of his/her child.

While there have been initial attempts to teach reasoning with rules to PLMs, such approaches model only a subclass of logical rules. In fact, current solutions focus on exact rules, i.e., rules that hold in all cases. In reality, most rules are approximate, or soft, and thus hold only with a certain confidence of being correct. For example, of the 7,015 logical rules defined on the DBpedia knowledge graph, only 11% have a confidence above 95%. In the example, rules r1-r3 are soft, i.e., they cover knowledge that is not true in all circumstances. Consider rule r2, stating that if two persons have a child in common, they are most likely married. As r2 has a confidence of being correct of 0.7, this uncertainty is reflected in the probability of the prediction.
With the above considerations in mind, here we show how to reason over soft logical rules with PLMs. We provide facts and rules expressed in natural language, and we ask the PLM to come up with a logical conclusion for a hypothesis, together with the probability for it being true.
Unlike previous approaches, we enable deductive reasoning for a large class of soft rules with binary predicates and an unrestricted number of variables. Our model can even reason over settings with conflicting evidence, as shown in Test 3 in Figure 1. In the example, as Anne and Mike live together, they have a 0.1 probability of being married because of soft rule r1. However, we can derive from exact rule r4 that Anne is the child of Mike, and therefore they cannot be married, according to soft rule r3.
To model uncertainty, we pick one flavor of probabilistic logic programming languages, LP^MLN, for reasoning with soft rules (Lee and Wang, 2016). It assigns weights to stable models, similarly to how Markov Logic assigns weights to models. However, our method is independent of the logic programming approach at hand, and different models can be fine-tuned with different programming solutions. Our proposal makes use of synthetic examples that "teach" the desired formal behavior through fine-tuning. In particular, we express the uncertainty in the loss function used for fine-tuning by explicitly mimicking the results for the same problem modeled with LP^MLN.
Our contributions can be summarized as follows:

• We introduce the problem of teaching soft rules expressed in a synthetic language to PLMs through fine-tuning (modeled as binary classification).
• We create and release the first dataset for this task, which contains 3.2M examples derived from 161 rules describing real common-sense patterns with the target probability for the task obtained from a formal reasoner (Section 4).
• We introduce techniques to predict the correct probability of the reasoning output for the given soft rules and facts. Our solution relies on a revised loss function that effectively models the uncertainty of the rules (Section 5). Our approach handles multi-variable rules and nicely extends to examples that require reasoning over multiple input rules.
• We show that our approach enables fine-tuned models to yield prediction probability very close to that produced by a formal reasoner (Section 6). Our PLM fine-tuned on soft rules, RULEBERT, can effectively reason with facts and rules that it has not seen at training, even when fine-tuned with only 20 rules.
• We demonstrate that our fine-tuning approach effectively transfers knowledge about predicate negation and symmetry to the lower levels of the transformer, which benefits from the logical notions in the rules. In particular, RULEBERT achieves new state-of-the-art results on three external datasets.
The data, the code, and the fine-tuned model are available at http://github.com/MhmdSaiid/ RuleBert.

Related Work
PLMs have been shown to have some reasoning capabilities (Talmor et al., 2020b), but fail on basic reasoning tasks (Talmor et al., 2020a) and are inconsistent (Elazar et al., 2021), especially when it comes to negation.
Our work focuses on deductive reasoning. Note that it is different from previous work, e.g., on measuring the factual knowledge of PLMs (Petroni et al., 2019), on probing the commonsense capabilities of PLMs at the token or at the sentence level (Zhou et al., 2020), or on testing the reasoning capabilities of PLMs on tasks such as age comparison and taxonomy conjunction (Talmor et al., 2020a). Our work relates to Task #15 in the bAbI dataset (Weston et al., 2016) and to RuleTakers. However, we differ (i) by using a larger subclass of first-order logic rules (with more variables and various forms), and (ii) by incorporating soft rules.
Our proposal is different from work on Question Answering (QA) with implicit reasoning based on common-sense knowledge (Clark et al., 2019a), as we rely purely on deductive logic from explicitly stated rules.
Our approach also differs from methods that semantically parse natural language into a formal representation on which a formal reasoner can be applied (Liang, 2016), as we directly reason with language. Yet, we are also different from Natural Language Inference (NLI) and textual entailment, which work with text directly, but cannot handle Horn rules (MacCartney and Manning, 2009;Dagan et al., 2013).
Unlike previous work (Hamilton et al., 2018;Yang et al., 2017;Minervini et al., 2020), we do not design a new, ad-hoc module for neural reasoning, but we rely solely on the transformer's capability to emulate algorithms (Wang et al., 2019b;Lample and Charton, 2020).

Background
Language Models. We focus on language models pre-trained with bidirectional transformer encoders using masked language modeling (Devlin et al., 2019). For fine-tuning, we create examples for sequence classification to teach the models how to emulate reasoning given facts and soft rules.
Logical Rules. We rely on existing corpora of declarative Horn rules mined from large RDF knowledge bases (KBs) (Galárraga et al., 2015; Ortona et al., 2018; Ahmadi et al., 2020). An RDF KB is a database representing information with triples (or facts) p(s, o), where a predicate p connects a subject s and an object o. An atom in a rule is a predicate connecting two universally quantified variables. A Horn rule (or clause) has the form B → h(x, y), where h(x, y) is a single atom (the head, or conclusion, of the rule) and B (the body, or premise, of the rule) is a conjunction of atoms. Positive rules identify relationships between entities, e.g., r1, r2, r4 in Figure 1. Negative rules identify contradictions, e.g., r3 in Figure 1. Rules can contain predicates comparing numerical values, such as <. For example, the negative rule r5: birthYear(b,d) ∧ foundYear(a,c) ∧ <(c,d) → negFounder(a,b) states that any person (variable b) with a birth year (d) later than the founding year (c) of a company (a) cannot be its founder. A fact is derived from a rule if all the variables in the rule body are replaced with constants from facts. For r5, the facts foundYear(Ford, 1903), birthYear(E. Musk, 1971), and <(1903, 1971) trigger the rule, which derives the fact negFounder(Ford, E. Musk).
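To make rule triggering concrete, here is a minimal sketch of naive grounding for a Horn rule body over a set of facts. The function name `trigger`, the tuple encoding of atoms, and the direct evaluation of the comparison predicate `<` are our own illustrative choices, not the paper's implementation.

```python
from itertools import product

def trigger(rule_body, head, facts):
    """Naively ground a Horn rule body against a set of facts and return
    the derived head facts. Atoms are (predicate, arg1, arg2) tuples;
    variables are strings starting with '?'. The comparison predicate '<'
    is evaluated directly instead of being looked up among the facts."""
    constants = {a for (_, s, o) in facts for a in (s, o)}
    variables = sorted({a for (_, s, o) in rule_body
                        for a in (s, o) if str(a).startswith('?')})
    derived = set()
    for combo in product(constants, repeat=len(variables)):
        binding = dict(zip(variables, combo))

        def ground(a):
            return binding.get(a, a)

        ok = True
        for (p, s, o) in rule_body:
            gs, go = ground(s), ground(o)
            if p == '<':  # built-in comparison over numeric constants
                ok = isinstance(gs, int) and isinstance(go, int) and gs < go
            else:         # ordinary predicate: must match a given fact
                ok = (p, gs, go) in facts
            if not ok:
                break
        if ok:
            hp, hs, ho = head
            derived.add((hp, ground(hs), ground(ho)))
    return derived
```

Running it on rule r5 with the two facts above derives the single fact negFounder(Ford, E. Musk), matching the head negFounder(a, b) with a = Ford and b = E. Musk.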
Rule Confidence. Exact rules, such as r 4 , apply in all cases, without exception. However, most rules are approximate, or soft, as they apply with a certain likelihood. For example, r 3 in Figure 1 is true in most cases, but there are historical exceptions in royal families. Rules are annotated with a measure of this likelihood, either manually or with a computed confidence (Galárraga et al., 2015).
Probabilistic Answer Set Programming. As we deal with soft rules, we adopt LP^MLN (Lee and Wang, 2016) to create the dataset. LP^MLN is a probabilistic extension of answer set programs (ASP) with the concept of weighted rules from Markov Logic (Baral, 2010). In ASP, search problems are reduced to computing stable models (answer sets), i.e., sets of beliefs described by the program.
A weight (or confidence) is assigned to each rule, so that the more rules a stable model satisfies, the larger weight it gets; the probability of a stable model is computed by normalizing its weight among all stable models. Given a set of soft rules and facts, we measure how much a hypothesis is supported by the stable models.
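The weighting-and-normalizing idea can be illustrated with a toy, Markov Logic-style computation over propositional worlds. This is a deliberate simplification for intuition only: real LP^MLN assigns weights to stable models, not to arbitrary worlds, and the function and encoding below are our own.

```python
from itertools import product
from math import exp

def hypothesis_prob(atoms, facts, soft_rules, hypothesis):
    """Toy Markov Logic-style weighting over propositional worlds.
    Each rule is (weight, body_atoms, head_atom): it is satisfied in a
    world unless all body atoms hold and the head does not. Facts are
    hard evidence (worlds violating them are discarded). A world's
    weight is exp(sum of weights of its satisfied rules); the hypothesis
    probability is the normalized mass of worlds where it is true."""
    total = true_mass = 0.0
    for values in product([False, True], repeat=len(atoms)):
        world = dict(zip(atoms, values))
        if any(not world[f] for f in facts):
            continue  # world contradicts the hard evidence
        w = sum(wt for (wt, body, head) in soft_rules
                if not (all(world[b] for b in body) and not world[head]))
        mass = exp(w)
        total += mass
        if world[hypothesis]:
            true_mass += mass
    return true_mass / total
```

For a single soft rule livesWith → spouse with weight 2.0 and the fact livesWith, the hypothesis spouse gets probability e^2 / (e^2 + 1) ≈ 0.88: the rule's weight directly shapes the prediction probability, which is the behavior the paper asks the PLM to mimic.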

Dataset
We start by defining the reasoning task. We then discuss example generation methods for three scenarios: single rule as input, multiple (possibly conflicting) rules that require reasoning for the same conclusion, and multiple rules that require a sequence (chain) of reasoning steps. Examples of the data generation procedures are in the Appendix.

Reasoning Task
Each example is a triple (context, hypothesis, confidence). Context is a combination of rule(s) and generated facts, such as "If the first person lives together with the second person, then the first person is the spouse of the second person." and "Anne lives with Mike." Hypothesis is the statement to be assessed based on the context, e.g., "Laure is the spouse of Mike." Confidence is the probability that the hypothesis is valid given by the reasoner, e.g., 0.7. As we generate the examples, we know the confidence for each hypothesis.

Single-Rule Dataset Generation
Given a rule, we generate examples of different hypotheses to expose the model to various contexts. Each example contains a context c and a hypothesis h, with the probability of h being true as obtained for the (c, h) pair from the LP^MLN reasoner. The intuition is that the examples show the expected behavior of a formal reasoner for every combination of possible facts for a given rule. This process is not about teaching the model specific facts to recall later, but about teaching it reasoning patterns.
Unlike previous work , our rules allow for multiple variables. This introduces additional complexity as examples must show how to deal with the symmetry of the predicate. For example, child(Alice,Bob) and child(Bob,Alice) are not equivalent since child is not symmetric, while spouse(Alice,Bob) and spouse(Bob,Alice) are equivalent as spouse is symmetric. We assume that metadata about the symmetry and the types is available from the KB for the predicates in the rules.
Given as input (i) a rule r, (ii) a desired number n of examples, (iii) an integer m indicating the maximum number of facts given as context, and (iv) a pool of values for each type involved in r's predicates, Algorithm 1 outputs a dataset D of generated examples. We start at line 3 by generating facts, such as child(Eve, Bob), using the function GenFacts (lines 15-18), which takes as input the rule r, the maximum number m of facts to generate, and the pools. A random integer less than m sets the number of facts in the current context. The generated facts F have predicates from the body of r, their polarity (true or negated atom) is assigned randomly, and variables are instantiated with values sampled from the pools (line 16). Facts are created randomly, as we are not interested in teaching the model specific facts to recall later; instead, we want to teach it how to reason over different combinations of rules and facts. We then ensure that the rule is triggered in every context, possibly adding more facts to F using the function GetRuleFacts (line 17). After obtaining F, we feed rule r along with facts F to the LP^MLN reasoner, and we obtain a set O containing all satisfied facts and rule conclusions (line 4).
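The fact-sampling step (GenFacts) can be sketched as follows. This is our own minimal reconstruction from the description above: the function name, the tuple encoding of facts, and the `seed` parameter are illustrative assumptions, not the paper's code.

```python
import random

def gen_facts(body_predicates, pools, max_facts, seed=0):
    """Sketch of the GenFacts step: sample up to max_facts facts whose
    predicates come from the rule body, with random polarity and
    arguments drawn from typed value pools.
    body_predicates: list of (predicate, (subject_type, object_type)).
    Returns facts as (negated, predicate, subject, object) tuples."""
    rng = random.Random(seed)
    n = rng.randint(1, max_facts)  # random context size below the cap
    facts = []
    for _ in range(n):
        pred, (t1, t2) = rng.choice(body_predicates)
        s = rng.choice(pools[t1])
        o = rng.choice(pools[t2])
        negated = rng.random() < 0.5  # random polarity
        facts.append((negated, pred, s, o))
    return facts
```

The step that guarantees the rule is actually triggered (GetRuleFacts in the paper) would then add the missing body facts on top of this random sample.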
We generate different hypotheses, each of which leads to an example in dataset D. For each context, we add examples whose hypothesis facts differ, with respect to the given rule, along three dimensions: a fact can be (i) for a predicate in the premise or in the conclusion of the rule, (ii) satisfied or unsatisfied given the rule, and (iii) of positive or negative polarity. This makes eight different possibilities, thus leading to the generation of eight different hypotheses for each context.
The first hypothesis h1 is obtained by sampling a fact from the set F (line 5). We then produce the counter-hypothesis h2 by altering that fact (line 6) using the function Alter (lines 19-22). Given a hypothesis p(s, o) (line 19), we return its negated form if p is symmetric (line 20); otherwise, we produce a counter-hypothesis either by negation (line 21) or by switching the subject and the object in the triple (line 22). We rely on a dictionary to check whether a predicate is symmetric.
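The Alter step can be sketched in a few lines. The function name and encoding are our own illustration of the behavior described above (not the paper's implementation): for a symmetric predicate, swapping the arguments would not change the meaning, so the only valid alteration is negation.

```python
import random

def alter(fact, symmetric_predicates, rng=None):
    """Produce a counter-hypothesis for (negated, predicate, subject,
    object). Symmetric predicates are only negated; for non-symmetric
    predicates we either flip the polarity or swap subject and object."""
    rng = rng or random.Random()
    negated, p, s, o = fact
    if p in symmetric_predicates or rng.random() < 0.5:
        return (not negated, p, s, o)  # flip polarity
    return (negated, p, o, s)          # swap arguments (non-symmetric)
```

For example, spouse(Anne, Mike) is altered only to its negation, while child(Anne, Mike) may become either not-child(Anne, Mike) or child(Mike, Anne).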
We then produce hypothesis h3 (line 7), which is the outcome of triggering rule r with the facts added in line 17. The counter-hypothesis h4 is generated by altering h3 (line 8). Moreover, we generate hypothesis h5 by considering any unsatisfied positive fact outside F. Following the closed-world assumption (CWA), we assume that positive triples are false if they cannot be proven, meaning that their negation is true. We sample a fact f_l from the set of all possible positive facts that do not have the same predicate as the rule head (line 9). Thus, h5 will never be in the output O of the reasoner, as it cannot be derived. We then produce h6 by negating h5 (line 10). We further derive h7 by sampling a fact f_r that has the same predicate as the rule head, but does not belong to the output O of the reasoner (line 11). For a positive (negative) rule, such a fact is labelled as False (True). h7 is then negated to obtain the counter-hypothesis h8 (line 12). All generated hypotheses are added to D (line 13), and the process repeats until we obtain n examples.
Finally, we automatically convert the examples to natural language using predefined templates. A basic template for an atom with predicate p over types t1 and t2 is "If the 1st t1 is p of the 2nd t2." ("If the first person is spouse of . . . "). For the single-rule scenario, we release a dataset for 161 rules with a total of 3.2M examples and an 80%:10%:10% split for training, validation, and testing.
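A toy version of this verbalization step is shown below. The helper name, the per-predicate template strings, and the way negation is inserted are all illustrative assumptions; the paper's templates are predefined per predicate but not specified in full here.

```python
def verbalize(fact, templates):
    """Render a (negated, predicate, subject, object) fact into English
    using per-predicate template strings. Negation is inserted naively
    after the first ' is ', which works for the toy templates used here."""
    negated, p, s, o = fact
    text = templates[p].format(s=s, o=o)
    if negated:
        text = text.replace(" is ", " is not ", 1)
    return text + "."
```

For instance, with the template "{s} is the spouse of {o}", the fact spouse(Anne, Mike) becomes "Anne is the spouse of Mike." and its negation "Anne is not the spouse of Mike.".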

Rules with Overlapping Conclusion
When multiple rules are in the context, there could be facts that trigger more than one rule for a given hypothesis. The triggered rules might all be of the same polarity (positive or negative), possibly accumulating their confidence, or could be a mix of positive and negative rules that oppose each other. While the data generation procedure in Section 4.2 can be extended to handle multiple rules, this raises an efficiency problem. Given a set R of rules, it would generate 8^|R| examples for each (facts, rules) pair in order to cover all rule combinations. This is very expensive: e.g., for five rules, it would generate 8^5 = 32,768 examples for a single context.
Given this challenge, we follow a different approach. We first generate data for each rule individually using Algorithm 1. We then generate more examples only for combinations of two or more rules, using all their rule conclusions as hypotheses. For every input context, we produce rule-conclusion hypotheses (positive and negative) while varying the rules being fired. Thus, we generate 2 · Σ_{x=2}^{|R|} C(|R|, x) examples with at least two rules triggered. Adding the single-rule data, we generate 8 · |R| + 2 · Σ_{x=2}^{|R|} C(|R|, x) examples for every (facts, rules) pair, which is considerably smaller than 8^|R| for |R| ≥ 2, by the binomial theorem. For example, for |R| = 5, we generate 92 examples per context. For the overlapping-rules scenario, we release a dataset for 5 rules with a total of 300K examples, and a 70%:10%:20% split for training, validation, and testing.
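The count above is easy to check programmatically (the function below is only a sanity-check of the formula, using Python's `math.comb`):

```python
from math import comb

def num_examples(num_rules):
    """Examples generated per (facts, rules) pair in the
    overlapping-conclusion setting: 8 per single rule, plus 2
    rule-conclusion hypotheses (positive and negative) for every
    combination of at least two triggered rules."""
    return 8 * num_rules + 2 * sum(comb(num_rules, x)
                                   for x in range(2, num_rules + 1))
```

For five rules this gives 8·5 + 2·(10 + 10 + 5 + 1) = 92, far below the 8^5 = 32,768 examples of the naive scheme.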

Chaining of Rule Executions
For certain hypotheses, an answer may be obtained by executing rules in a sequence, i.e., one on the result of the other, forming a chain. To be able to evaluate a model in this scenario, we generate hypotheses that can be tested only by chaining a number of rules (an example is shown in Appendix H). Given a pool of rules over different relations and a depth D, we sample a chain of rules with length D. We then generate hypotheses that would require a depth varying between 0 and D. We generate a rule-conclusion hypothesis (h3) and its alteration (h4) for each depth d ≤ D. A depth of 0 means that the hypothesis can be verified using the facts alone, without triggering any rule. We also generate counter-hypotheses by altering the hypotheses at a given depth, and we further include hypotheses that are unsatisfied given the input.
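The notion of chaining depth can be made precise with a small forward-chaining sketch (our own illustration over propositional rules; the paper's generator samples rule chains rather than computing depths this way):

```python
def chain_depths(facts, rules, max_depth):
    """Forward-chain propositional rules (body_set, head) and record the
    first depth at which each fact becomes derivable. Input facts have
    depth 0; a fact derived by firing one rule on the inputs has depth 1,
    and so on, mirroring how chaining examples are labeled by depth."""
    depth = {f: 0 for f in facts}
    for d in range(1, max_depth + 1):
        new = {head for (body, head) in rules
               if body <= set(depth) and head not in depth}
        if not new:
            break  # fixpoint reached before max_depth
        for f in new:
            depth[f] = d
    return depth
```

For example, with facts {a} and rules a → b and b → c, fact b has depth 1 and fact c has depth 2: a hypothesis about c can only be verified by chaining two rules.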
For the chaining-rules scenario, we start with a pool of 64 soft rules, and we generate hypotheses that need at most five chained rules to verify them. The dataset for D ≤ 5 contains a total of 70K examples, and a 70%:10%:20% split for training, validation, and testing.

Teaching PLMs to Reason
In this section, we explain how we teach a PLM to reason with one or more soft rules. Note that uncertainty stems from the rule confidence. One approach to teach a model to estimate the probability of a prediction is to treat each confidence value (or bucket of confidence values) as a class and to model the problem as k-way classification (or regression), but this becomes intractable when multiple rules are considered. Instead, we keep the problem as a two-class one and alter how the information is propagated in the model to incorporate the uncertainty coming from the rule confidence. Let D = {(x_i, y_i)}_{i=1}^m be our generated dataset, where x_i is one example of the form (context, hypothesis, confidence), y_i is a label indicating whether the hypothesis is validated by the context (facts and rules in English), and m is the size of the training set. A classifier f is a function that maps the input to one of the labels in the label space. Let h(x, y) be a classification loss function. The empirical risk of the classifier f is

R(f) = (1/m) Σ_{i=1}^m h(f(x_i), y_i).

We want to introduce uncertainty into our loss function, using the weights computed by the LP^MLN solver as a proxy for the probability of predicting the hypothesis as being true. To do so, we apply a revised empirical risk:

R_w(f) = (1/m) Σ_{i=1}^m [ w(x_i) · h(f(x_i), 1) + (1 − w(x_i)) · h(f(x_i), 0) ].

That is, each example is treated as the combination of a weighted positive example with a weight w(x_i) provided by the LP^MLN solver and a weighted negative example with a weight 1 − w(x_i).
When trained to minimize this risk, the model learns to assign the weights to each output class, thus predicting the confidence for the true class when given the satisfied rule head as a hypothesis.
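With binary cross-entropy as the per-example loss, the revised risk is a weighted BCE. Below is a from-scratch sketch (pure Python, not the paper's PyTorch implementation; `predicted[i]` is the model's probability for the hypothesis being true and `weights[i]` is the reasoner's probability w(x_i)):

```python
from math import log

def weighted_bce_risk(predicted, weights):
    """Revised empirical risk: each example counts as a positive example
    with weight w(x_i) and as a negative example with weight 1 - w(x_i).
    The risk is minimized when the predicted probability equals w(x_i),
    which is exactly how the model learns to output the reasoner's
    confidence."""
    eps = 1e-12  # clamp to avoid log(0)
    m = len(predicted)
    return sum(-(w * log(max(p, eps)) + (1 - w) * log(max(1 - p, eps)))
               for p, w in zip(predicted, weights)) / m
```

For instance, for a hypothesis whose reasoner-given probability is 0.7, the risk is strictly lower at a prediction of 0.7 than at 0.5 or 0.9, so gradient descent pushes the model's output toward the target confidence.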

Experiments
We first describe the experimental setup (Section 6.1). We then evaluate the model on single (Section 6.2) and on multiple rules (Sections 6.3 and 6.4). We show that a PLM fine-tuned on soft rules, namely RULEBERT, makes accurate predictions for unseen rules (Section 6.5), and it is more consistent than existing models on three external datasets (Section 7). We report the values of the hyper-parameters, as well as the results for some ablation experiments in the Appendix. The datasets for all experiments are summarized in Table 1.

Experimental Setup
Rules. We use a corpus of 161 soft rules mined from DBpedia. We chose a pool of distinct rules with a varying number of variables and predicates, and with diverse rule conclusions and confidences.
Reasoner. We use the official implementation of the LP^MLN reasoner. We set the reasoner to compute the exact probabilities for the triples.
PLM. We use the HuggingFace pre-trained RoBERTa-large (Liu et al., 2020) model as our base model, as it is trained on more data than BERT (Devlin et al., 2019) and is better at learning positional embeddings (Wang and Chen, 2020). We fine-tune the PLM with the weighted binary cross-entropy (wBCE) loss from Section 5. More details can be found in Appendix C.

Evaluation Measures.
For the examples in the test set, we use accuracy (Acc) and F1 score (F1) for balanced and unbalanced settings, respectively. As these measures do not take into account the uncertainty of the prediction probability, we further introduce Confidence Accuracy@k (CA@k), which measures the proportion of examples for which the absolute error between the predicted and the actual probability is at most a threshold k:

CA@k = (1/|D|) Σ_{i=1}^{|D|} 1[ |ŵ_i − w_i| ≤ k ],

where x_i is the i-th example of the dataset, w_i is the actual confidence of the associated hypothesis, as given by the LP^MLN reasoner, ŵ_i is the confidence predicted by the model, and k is a chosen threshold. The measure can be seen as ordinary accuracy where true positives and true negatives are counted only if the above condition is satisfied; lower values of k indicate stricter evaluation.
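The metric is a few lines of code (a straightforward sketch matching the definition above):

```python
def confidence_accuracy_at_k(predicted, actual, k):
    """Confidence Accuracy@k: the fraction of examples whose predicted
    probability is within k of the reasoner-given probability."""
    hits = sum(abs(p - w) <= k for p, w in zip(predicted, actual))
    return hits / len(actual)
```

For example, with predictions [0.72, 0.5, 0.1] against reasoner probabilities [0.7, 0.9, 0.12] and k = 0.1, two of the three predictions fall within the threshold, giving CA@0.1 = 2/3.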

Single Soft Rule
We fine-tune 16 models for 16 different positive and negative rules (one model per rule), using 16k training samples per rule. We compare the accuracy of each model (i) without teaching uncertainty, using binary cross-entropy (RoBERTa), and (ii) with teaching soft rules, using wBCE (RoBERTa-wBCE). Table 2 shows each rule with its confidence, followed by accuracy and CA@k (for k = 0.1 and k = 0.01) for both loss functions. We see that models fine-tuned using RoBERTa-wBCE perform better on CA@k. In terms of accuracy, both models perform well, with RoBERTa-wBCE performing better for all rules. Interestingly, the two best-performing rules involve comparisons of numerical values (birth years against death and founding years), which suggests that our method can handle comparison predicates.

Rules Overlapping on Conclusion
The dataset contains five soft rules with spouse or negSpouse in the rule conclusion, and confidences between 0.30 and 0.87 (shown in Figure 2). We train a model on the dataset and test it (i) on a test set for each of the five rules separately, and (ii) on test sets with U triggered rules, where U ∈ {2, 3, 4, 5}.

Results. Table 3 shows that the model achieves high scores both on the single test sets (top five rows) and on the sets with interacting rules. The test sets with U = 2 and U = 3 are the most challenging, as they contain C(5,2) = 10 and C(5,3) = 10 combinations of rules, respectively, while the one with U = 5 has only one possible rule combination. The high scores indicate that PLMs can actually learn the interaction between multiple soft rules.

Table 3: Results for a model trained on five rules sharing the same predicate, and tested on multiple test sets.

Rule Chaining
Here, we assess models fine-tuned on various chaining depths. We construct six datasets for this scenario with increasing depths (D = 0, D ≤ 1, D ≤ 2, D ≤ 3, D ≤ 4, D ≤ 5), i.e., dataset D ≤ x contains hypotheses that need at most x chained rules. We thus train six models (one per dataset), and we test them (i) on their own test dataset (Test), (ii) on the test set with D ≤ 5 that contains all examples up to depth 5 (All), and (iii) on test sets with a chaining of depth x (Depx).

Results.
The results are shown in Table 4. We can see that the models achieve high F1 scores on their respective test sets and for Depth 0. The red borderline indicates F1 scores for models tested on chaining depths higher than the ones they were trained on. We see that Mod3 and Mod4 do fairly well on Depth 5. However, there is a decrease for higher depths, possibly due to the need for more training examples in order to learn such depths. Moreover, since we sample a chain of rules each time, each model has likely been trained only on certain chains of rules. This yields lower scores on the constant-depth test sets, as the models are tested on unseen rule chains.
Note that Mod0 shows a counter-intuitive increase in the F1 score for higher unseen depths. Chaining soft rules may lead to a low probability for the associated hypothesis, and thus eventually to a False label. However, Mod0 is not trained on chaining and treats a hypothesis that requires chaining as an unsatisfied fact, thus labelling it as False, while in fact it is the chaining of the soft rules that causes this label. This is never the case with hard rules, as the actual label there would be True.

Testing RULEBERT on Unseen Rules
We have seen that a PLM can be successfully fine-tuned with rules. We now study the performance of the PLM after it has been fine-tuned on 161 (single) rules. We call this fine-tuned model RULEBERT.

Table 5: Evaluation on unseen rules (accuracy). The first group contains rules with predicates seen by RULEBERT among the 20 rules used for fine-tuning, while the second group has rules with unseen predicates.

We first evaluate RULEBERT on unseen rules. We fine-tune it with only twenty randomly selected rules (shown in Figure 3) and call the result RULEBERT20. We then select ten new rules divided into two groups: (i) five rules containing predicates that were used in the rules for fine-tuning RULEBERT20, and (ii) five rules that share no predicates with the fine-tuning rules. For each rule in the test sets, we run a model fine-tuned (with 4k examples) only for that rule (FT-PLM), the model fine-tuned on the twenty original rules (RULEBERT20), and the same model fine-tuned again for the rule at hand (FT-RULEBERT20).
Results. Table 5 shows that RULEBERT20 outperforms the fine-tuned model (FT-PLM) on the first group. Even though it was fine-tuned on only 20 rules, it learned enough about (i) symmetric/transitive predicates and (ii) rule confidence to predict correctly, even better than rule-specific models.
For the second rule group, the accuracy of RULEBERT20 is high, but FT-PLM performs better. Applying the same fine-tuning to RULEBERT20 yields the best results in all scenarios.

RULEBERT on External Datasets
As our fine-tuning propagates information into the layers of the encoder, we hypothesize that RULEBERT effectively "learns" logical properties of the concepts represented in the rules, such as negation and symmetry, and thus could perform better on tasks testing such properties of PLMs. To study the negation of predicates, we use the Negated LAMA datasets, which test how well PLMs distinguish a Cloze question from its negation. In most cases, PLMs make the same prediction both for a positive statement ("Relativity was developed by Einstein.") and for its negation ("Relativity was not developed by Einstein."). To test the symmetry relationship between predicates, we use the SRL test in CheckList (Ribeiro et al., 2020), which focuses on behavioral testing of NLP models; we use its test set for the duplicate-question detection task (QQP) (Wang et al., 2019a). Finally, we test deductive reasoning on the bAbI dataset and its Task #15 (Weston et al., 2016).

Negated LAMA Experiments
For Negated LAMA, we do not fine-tune RULEBERT for the task; instead, we replace its original classification layer with an MLM head whose weights are identical to those of RoBERTa (not fine-tuned). Note that this configuration is biased in favor of RoBERTa, as the parameters of the MLM head and of the RoBERTa encoder were trained in conjunction, so good values have been found for this combination, which is not the case for our RULEBERT.

Results. Yet, even in this arguably unfair setting, RULEBERT outperforms RoBERTa on all datasets of Negated LAMA, as shown in Table 7. We can see that RULEBERT performs better on both evaluation measures: it achieves a lower mean Spearman rank correlation (ρ) and a much smaller percentage (%) of overlap between positive and negated answers.

CheckList QQP Experiments
The CheckList tests (Ribeiro et al., 2020) have shown that PLMs fail in many basic cases. We hypothesize that RULEBERT can perform better on tasks and examples that deal with symmetric and asymmetric predicates, if such predicates have been shown to it during pre-fine-tuning. We experiment with the QQP dataset, which asks to detect whether two questions are duplicates. We identify a few rules that can teach a model about symmetric predicates, and we pre-fine-tune RULEBERT on them; then, we fine-tune it on the QQP dataset.
Results. Table 6 shows the results on the challenging CheckList QQP test set: we can see that RULEBERT achieves an accuracy of 0.422 after one epoch, while RoBERTa is at 0.0. However, after three epochs, RULEBERT is also at 0.0 (see footnote 3), i.e., it started to unlearn what it had learned during pre-fine-tuning (Kirkpatrick et al., 2017; Kemker et al., 2018; Biesialska et al., 2020). Learning a new task often leads to such catastrophic forgetting (Ke et al., 2021). While there are ways to alleviate this (Ke et al., 2021), they are beyond the scope of this paper.

bAbI Task #15 Experiments
Finally, we experiment with task #15 of the bAbI dataset, where the goal is to assess whether a model can perform deductive reasoning. However, as mentioned in the original bAbI paper (Weston et al., 2016), it is not only desirable to perform well on the task, but also to use the fewest examples.
Footnote 3: On the much easier QQP test set, RULEBERT achieved 0.89 accuracy after one epoch, and 0.91 after three epochs.

Thus, we use the smallest dataset, consisting of about 2,000 data points. We hypothesize that, under the same conditions and hyper-parameters, RULEBERT should be able to generalize faster and to learn in fewer epochs. As PLMs produce varying scores when fine-tuned on small datasets, we repeat the experiment ten times and report the average scores. We then compare to RoBERTa. Both models contain two classification layers to predict the start and end spans of the input context.

Results
We can see in Table 6 that RULEBERT achieves an accuracy of 0.863 in two epochs, while RoBERTa achieves 0.676. By the third epoch, RoBERTa catches up with an accuracy of 0.827, while RULEBERT starts to overfit (going down to 0.825), indicating that fewer epochs should be used, probably due to catastrophic forgetting.

Conclusion and Future Work
We studied whether PLMs can reason with soft rules over natural language. We experimented with one flavor of probabilistic answer set programming (LP^MLN), but other semantics can also be used with the proposed methodology. We further explored the inference capabilities of Transformer-based PLMs, focusing on positive and negative textual entailment.
We leave non-entailment for future work. We also leave open the development of explainable models. Some approaches use occlusion, which removes parts of the input and checks the impact on the output, or build proofs iteratively using 1-hop inference (Tafjord et al., 2021).

Ethics and Broader Impact
Data Collection While we generated the facts in our examples, the logical rules have been mined from the data in the DBpedia knowledge graph, which in turn has been generated from Wikipedia.
Biases We are aware of (i) the biases and abusive language patterns (Sheng et al., 2019; Bender et al., 2021; Liang et al., 2021) that PLMs can exhibit, and (ii) the imperfection and the biases of our rules, as data from Wikipedia was used to mine the rules and to compute their confidences (Janowicz et al., 2018; Demartini, 2019). However, our goal is to study the capability of PLMs for deductive soft reasoning. For (i), there has been some work on debiasing PLMs (Liang et al., 2020), while for (ii), we used mined rules to have more variety, but one could resort to user-specified rules validated by consensus to mitigate the bias.
Environmental Impact The use of large-scale Transformers requires substantial computation on GPUs/TPUs for training, which contributes to global warming (Strubell et al., 2019; Schwartz et al., 2020). This is a smaller issue in our case, as we do not train such models from scratch; rather, we fine-tune them on relatively small datasets. Moreover, running inference on a CPU, once the model is fine-tuned, is less problematic, as CPUs have a much lower environmental impact.

A More on Reasoning with Soft Rules
Let σ be a signature as in first-order logic. An LP MLN program Π is a finite set of weighted rules of the form

w : A ← B, (1)

where A is a disjunction of atoms of σ, B is a conjunction of literals (atoms and negated atoms) of σ, and w is a real number or the symbol α. When A is ⊥ (the empty disjunction), the rule asserts that B should be false in the stable model. An LP MLN rule (1) is called soft if w is a real number, and hard if w is α. An LP MLN program is ground if its rules contain no variables. An LP MLN program Π that contains variables is identified with the ground LP MLN program gr σ [Π], which is obtained from Π by replacing every variable with every ground term of σ. The weight of a ground rule in gr σ [Π] is the same as the weight of the corresponding rule in Π. By Π̄ we denote the unweighted logic program obtained from Π, i.e., Π̄ = {R | w : R ∈ Π}.
For a ground LP MLN program Π, Π_I denotes the set of rules w : R in Π such that I satisfies R (denoted I |= R), and SM[Π] denotes the set {I | I is a (deterministic) stable model of Π̄_I}. The (unnormalized) weight of I under Π is defined as follows:

W_Π(I) = exp(Σ_{w:R ∈ Π_I} w) if I ∈ SM[Π], and W_Π(I) = 0 otherwise.
The probability of I under Π is the normalized weight, defined as follows:

P_Π(I) = W_Π(I) / Σ_{J ∈ SM[Π]} W_Π(J).
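To make the two definitions concrete, the following sketch scores two candidate stable models of a toy program with a single soft rule. The rule, the atom names, and the confidence of 0.7 are illustrative assumptions, and the stable models are enumerated by hand rather than computed by a solver:

```python
import math

# Toy LP^MLN computation for one soft rule with confidence p = 0.7,
# encoded with weight w = ln(p / (1 - p)).
w = math.log(0.7 / 0.3)

# One soft rule: livesWith -> spouse; a world satisfies it if the world
# either contains "spouse" or does not contain "livesWith".
rules = [(w, lambda world: "spouse" in world or "livesWith" not in world)]

def weight(world):
    # W_Pi(I) = exp(sum of the weights of the soft rules satisfied by I)
    return math.exp(sum(wt for wt, sat in rules if sat(world)))

# The two candidate stable models: one satisfies the rule, one violates it.
models = [frozenset({"livesWith", "spouse"}), frozenset({"livesWith"})]
total = sum(weight(m) for m in models)          # normalization constant
prob = {m: weight(m) / total for m in models}   # P_Pi(I) = W_Pi(I) / total
# The model that satisfies the rule receives probability p = 0.7.
```

With this weight encoding, the single-rule case recovers the rule confidence exactly, which is why confidences can be read off the reasoner output.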
In Answer Set Programming (ASP), search problems are reduced to computing stable models (a.k.a. answer sets), i.e., sets of beliefs described by the program. In the case of a Horn program, the stable models coincide with the minimal models. LP MLN programs are transformed to meet the needs of an ASP solver (Lee et al., 2017; Gebser et al., 2014).

B Rule Support
We designed an experiment to show the impact of increasing the number of overlapping rules on the same target predicate. The goal is to measure how often multiple rules are triggered for the same target triple. We measure this with the support of a rule, i.e., the number of triples in the knowledge base that satisfy all the atoms in the rule.
To compute the support for more than one rule, we combine the premises of the rules. In this experiment, we picked three predicates (spouse, child, and relative), and for each one we selected ten rules at random. Next, we used the DBpedia online endpoint to compute the support for each combination of n (n = 1, 2, ..., 5) rules for each predicate. The results in Figure 4 show that increasing the number of rules decreases the support for all predicates. For combinations of more than three rules, the support is very small.
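The support computation can be sketched as follows. The helper below (its name and the predicate IRIs are illustrative assumptions) only builds the SPARQL COUNT query from the conjoined rule bodies; the query would then be sent to the public DBpedia endpoint with any SPARQL client:

```python
def support_query(premises):
    """Build a SPARQL query counting the bindings that satisfy all body
    atoms of the combined rules.

    premises: list of (subject_var, predicate_iri, object_var) triples
    taken from the bodies of the rules being combined."""
    patterns = " . ".join(f"{s} <{p}> {o}" for s, p, o in premises)
    return f"SELECT (COUNT(*) AS ?support) WHERE {{ {patterns} }}"

# Joint support of two rules sharing a child atom (illustrative premises).
q = support_query([
    ("?x", "http://dbpedia.org/ontology/child", "?z"),
    ("?y", "http://dbpedia.org/ontology/child", "?z"),
])
# q can now be sent to https://dbpedia.org/sparql with any SPARQL client.
```

Conjoining more rule bodies adds more triple patterns to the WHERE clause, which is why the support shrinks as rules are combined.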

C More Experimental Details
For fine-tuning our models, we use Google Colaboratory (Bisong, 2019), which assigns random GPU clusters of various types. The number of parameters of our models is about 355M. We select the values of our hyper-parameters (shown in Table 8) on the development sets, by maximizing accuracy.
The execution times vary widely depending on the GPU at hand and on the scenario, with fine-tuning on a Tesla V100 taking from one hour for a single rule to a few hours for all the chaining experiments. The training/validation/testing splits are shown in Table 1. Table 9 shows the sizes of the test datasets used.

D Ablation
D.1 Impact of the Data Size
Setting. We report the impact of the size of the fine-tuning data on the model performance. As shown in Table 2, the accuracy of the fine-tuned model is higher for rules with higher confidence. We therefore divide the rules into three categories: High contains rules with confidence greater than 0.8, Medium contains rules with confidence between 0.4 and 0.8, and Low contains the rest. There are six rules in the Medium category, and the other two categories have five rules each. For each rule, we fine-tune seven models with 1k, 2k, 5k, 10k, 15k, 20k, and 30k examples.
Results. Figure 5 shows that having more training data improves the accuracy in all scenarios. For all categories, there is a sizable increase when going from 10k to 15k examples; the impact is smaller for higher values. The increase is highest for rules with high confidence, and rules with medium confidence show a larger increase than those with low confidence.

D.2 Impact of the Fact Format
Results. The results in Table 10 show that the model performance does not depend heavily on using the same fact format for training and testing. With examples using letters in training, the results are slightly better in the case with two formats. We ultimately use names for both training and testing in our default configuration, as this yields better results.

E Impact of the Random Seed
Pre-trained transformers often suffer from instability of the results across multiple reruns with different random seeds. This usually happens with small training datasets (Dodge et al., 2020; Mosbach et al., 2021). In such cases, typically multiple reruns are performed, and the average over these reruns is reported. However, the numbers for the main experiments in this paper are not averaged over multiple reruns, as our datasets are fairly large and the models did not suffer from instability due to random seeds. For example, when we reran RULEBERT on a single-rule experiment three times, we obtained accuracies of 0.98959, 0.99551, and 0.99636, with a standard deviation of only 0.003.
Yet, for the small bAbI dataset, we observed a much higher standard deviation of 0.17. Thus, in this case we report results averaged over ten reruns.
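The reported deviation can be reproduced directly from the three rerun accuracies:

```python
import statistics

# The three single-rule rerun accuracies reported above.
accs = [0.98959, 0.99551, 0.99636]

# Population standard deviation across the reruns.
sd = statistics.pstdev(accs)
# round(sd, 3) gives 0.003, matching the reported value.
```

Using the sample standard deviation (`statistics.stdev`) instead would give roughly 0.004; the reported 0.003 corresponds to the population variant.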

F Data Generation Example
We show an example of data generation using Algorithm 1. For simplicity, we use a hard rule, i.e., one whose confidence is implicitly set to one. We begin by setting the values of the input parameters (Algorithm 1 Input): we set n = 8 in order to generate all eight hypotheses. We start by generating a set of facts F (line 3), with predicates from the body of the rule and random polarity, and we ensure that there are facts that trigger the rule; their number should not exceed m. An example of the generated facts F is shown in synthetic English below: four facts are generated in total, and facts f 3 and f 4 trigger rule r. We then feed the rule r and the facts F into the LP MLN reasoner (line 4). Hypothesis h 1 is obtained by sampling from F (line 5), and thus it is a valid hypothesis. Then, hypothesis h 2 is generated by altering h 1 with the function Alter (lines 19-22). In this example, since child is not symmetric, h 2 is produced by switching the subject and the object of h 1 , which yields a false hypothesis (line 6).
Hypothesis h 3 is the outcome of rule r being triggered by facts f 3 and f 4 (line 7). In a similar fashion to h 2 , we produce h 4 (line 8).
Hypothesis h 5 is sampled from the universe of all unsatisfied positive facts having a different predicate than those in the rule body (line 9), which makes it an invalid hypothesis, as it is not found in the output O. Hypothesis h 6 is the negation of h 5 , and, following the CWA, it is a valid hypothesis (line 10).
Finally, hypothesis h 7 is sampled from the universe of unsatisfied rule-head atoms (line 11), and it is negated to produce hypothesis h 8 . We then convert each example to synthetic English using a set of pre-defined templates for the facts and for the rules. Here is the above Example #1, now rewritten in synthetic English:
Example #1 (Synthetic English):
• Rule r = If the child of the first person is the third person, and the parent of the third person is the second person, then the first person is the spouse of the second person.
• Facts F :
f 1 : The parent of Eve is not Carl.
f 2 : The child of Eve is David.
f 3 : The parent of Carl is Bob.
f 4 : The child of Alice is Carl.
• Hypothesis h 3 : The spouse of Alice is Bob.
The Context is defined as the combined set of facts and rule(s). Both the Context and the Hypothesis are fed as input to the model.

Example #1 (Model Input):
• Context : The parent of Eve is not Carl.
The child of Eve is David. If the child of the first person is the third person, and the parent of the third person is the second person, then the first person is the spouse of the second person. The parent of Carl is Bob. The child of Alice is Carl.
• Hypothesis : The spouse of Alice is Bob.
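The flattening of facts, rule, and hypothesis into the model input can be sketched as follows; the function name and the placement of the rule inside the context are assumptions, chosen here to reproduce Example #1 above:

```python
# A minimal sketch of how the verbalized facts, the verbalized rule,
# and the hypothesis are combined into the two model inputs.
def build_model_input(facts, rule_text, hypothesis):
    # The context is the combined set of facts and rule(s); in
    # Example #1 the rule appears after the first two facts.
    context = " ".join(facts[:2] + [rule_text] + facts[2:])
    return context, hypothesis

facts = [
    "The parent of Eve is not Carl.",
    "The child of Eve is David.",
    "The parent of Carl is Bob.",
    "The child of Alice is Carl.",
]
rule = ("If the child of the first person is the third person, and the "
        "parent of the third person is the second person, then the first "
        "person is the spouse of the second person.")
context, hypothesis = build_model_input(facts, rule,
                                        "The spouse of Alice is Bob.")
```

The resulting (context, hypothesis) pair is exactly the two-segment input shown in Example #1 (Model Input) above.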

G Rule Overlap Example
After generating the data for every rule in Figure 2, we generate additional examples using combinations of rules. Below, we show how to handle the interaction of two rules: r 2 and r 3 . We follow the procedure in Algorithm 1 by generating facts that trigger the rules, but we only take into consideration hypotheses that deal with rule conclusions. For example, we generate facts such that f 2 triggers r 2 and f 3 triggers r 3 , so that a single example triggers two rules. Feeding these facts and the rules r 2 and r 3 to the reasoner, we obtain an output in which the numbers in parentheses indicate the likelihood of each triple. We produce hypotheses that trigger both rules together; for example, here we generate two hypotheses coming from o 4 and o 5 . The confidence (weight) of a hypothesis is given by the LP MLN reasoner. Taking o 5 as a hypothesis, we feed the following example to the model:
Example #2 (Model Input):
• Context : The parent of Eve is not Carl.
The child of Eve is David. If the child of the first person is the second person, then the first person is not the spouse of the second person. The relative of Eve is David. If the relative of the first person is the second person, then the first person is the spouse of the second person. The predecessor of Eve is David.
• Hypothesis : The spouse of Eve is not David.
• Weight : 0.55
We also generate an example where three rules are triggered: in addition to r 2 and r 3 , rule r 5 is triggered by f 4 . We then repeat the same procedure to generate the following example:
Example #3 (Model Input):
• Context : The parent of Eve is not Carl.
The child of Eve is David. If the child of the first person is the second person, then the first person is not the spouse of the second person. The relative of Eve is David. If the relative of the first person is the second person, then the first person is the spouse of the second person. The predecessor of Eve is David. If the predecessor of the first person is the second person, then the first person is not the spouse of the second person.
• Hypothesis : The spouse of Eve is not David.
• Weight : 0.6
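As a rough illustration of why the extra negative rule raises the weight of the negated hypothesis, the sketch below scores the two truth values of the head atom by the exponentiated sum of the weights of the soft rules each truth value satisfies. This is a simplification of the full LP MLN computation, and the rule confidences used here are hypothetical, not the ones behind the 0.55 and 0.6 weights above:

```python
import math

# Simplified scoring of spouse(Eve, David): each soft rule contributes
# weight ln(p / (1 - p)) to the truth value its head supports.
def prob_not_spouse(neg_confidences, pos_confidences):
    to_w = lambda p: math.log(p / (1 - p))          # confidence -> weight
    w_neg = sum(to_w(p) for p in neg_confidences)   # rules concluding "not spouse"
    w_pos = sum(to_w(p) for p in pos_confidences)   # rules concluding "spouse"
    return math.exp(w_neg) / (math.exp(w_neg) + math.exp(w_pos))

# Two interacting rules (as in Example #2), then a third negative rule
# added (as in Example #3); the confidences are hypothetical.
p_two = prob_not_spouse([0.55], [0.45])
p_three = prob_not_spouse([0.55, 0.55], [0.45])
# p_three > p_two: the extra negative rule pushes the probability of
# the negated hypothesis up, mirroring the 0.55 -> 0.6 change above.
```

The qualitative effect matches the examples: adding a rule whose head agrees with the hypothesis increases its weight, while disagreeing rules pull it down.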