Empower Nested Boolean Logic via Self-Supervised Curriculum Learning

Beyond the great cognitive power showcased by language models, it is crucial to scrutinize whether their reasoning capabilities stem from strong generalization or merely from exposure to relevant data. Rather than constructing increasingly complex logic, this paper probes boolean logic, the root capability of a logical reasoner. We find that pre-trained language models, including even large language models, behave like random selectors in the face of multi-nested boolean logic, a task that humans can handle with ease. To empower language models with this fundamental capability, this paper proposes a new self-supervised learning method, Curriculum Logical Reasoning (CLR), in which we augment the training data with nested boolean logic chains step by step and program the training to progress gradually from simpler logical patterns to harder ones. This new training paradigm allows language models to effectively generalize to much harder and longer-hop logic, which can hardly be learned through naive training. Furthermore, we show that boolean logic is a great foundation for improving subsequent general logical tasks.

Introduction

Language models have demonstrated impressive capability on a range of complex tasks, such as reading comprehension, open-domain question answering, the Game of 24, and arithmetical reasoning (Ling et al., 2017). While this is charming, these over-parameterized language models are shown to be good at exploiting superficial statistical cues to achieve decent scores on end tasks (Zhou et al., 2021; Sanyal et al., 2022a; Wu et al., 2023b). In the early days of BERT, it was found that simply adding a "not" to claims would fool BERT into becoming a random selector (Niven and Kao, 2019). It is time to go back and scrutinize whether state-of-the-art PLMs master solid logical capability, as truly powerful logical reasoners.
Rather than creating even more complex logic, this paper concentrates on the root level of logical reasoning: boolean logic, as in Figure 1. Any logic can be reduced to a combination of multiple boolean operations, including negation ¬, intersection ∧, and union ∨. In this paper, we introduce a new probing method to quantify the boolean logical reasoning of a language model, fine-grained to different levels of logical difficulty.
However, our results show that none of the PLMs possesses the necessary proficiency to tackle multiple nestings of (multi-nested) simple boolean operations, not even state-of-the-art models like DeBERTa-V3 (He et al., 2021a) and ChatGPT (OpenAI, 2023). Faced with more than three nested boolean operations, they quickly degenerate into random selectors, even with the chain-of-thought prompt (Wei et al., 2022; Zhang et al., 2022b). Conversely, this task is very simple for humans compared to other, more general reasoning tasks. This casts a shadow over the generalizability these models acquire from large amounts of training.
To empower language models with this fundamental capability in nested boolean logic, we propose a new self-supervised training paradigm, Curriculum Logical Reasoning (CLR), inspired by curriculum learning (Bengio et al., 2009). Concretely, we construct nested boolean logic step by step, from simple to hard, on top of the original training samples in a self-supervised manner (Devlin et al., 2019). The model is encouraged to start by learning simple logical patterns and then move on to harder ones gradually, rather than learning hard logic in a single leap. We find that recalling simpler logic while learning harder logic leads to a better outcome. Our experiments demonstrate that CLR significantly enhances the logical learning process. Excitingly, pre-learning boolean logic acts as a great foundation for further enhancing subsequent logical end tasks, like ReClor and DREAM. Figure 2 illustrates CLR vividly.

Introducing Nested Boolean Logic
This section presents our method to introduce multi-nested boolean logic to existing data.
We first present the notations. Let x denote the input text, with its ground truth y, and let p_θ denote the classifier (e.g. a language model) with parameters θ. Given an arbitrary input sample x, suppose that the model accurately predicts p_θ(x) = y. We now define an operation δ on x, which can be regarded as a transformation of the text, denoted as δ ∘ x.

From Simple Boolean Logic to Nested Boolean Logic
We concentrate on the logical operation, which manipulates the underlying logical chain by transforming the text. We present a new form of logical operation that corresponds to only the boolean operators, i.e. intersection ∧, union ∨, and negation ¬. We concentrate on the simplest, negation, first. Suppose that the input statement x entails a fact f, which can be either a true fact or a false fact, represented by y_0. The logical process can be formulated as x ⇒ y_0, where ⇒ means "implies that" and y_0 ∈ {0, 1} (0 for True and 1 for False). We illustrate a toy example of our logical operation in Table 1. First, the model is required to discriminate whether the stated fact in x is true or false. It states a false fact, "the earth is flat", so y_0 = 1 (False). Next, we transfer it to a context-question template and denote the context as S_0. It is still a binary classification, and the answer is limited to True or False. This template can be applied to arbitrary tasks. For instance, a sentiment analysis sentence "cold movie" can be rewritten as a statement like "cold movie expresses a positive movie watching".
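As a concrete sketch of the template conversion (the wording below is our paraphrase of the Table 1 style, not the paper's verbatim prompt):

```python
def to_context_question(fact: str) -> str:
    # Wrap a raw fact statement into the context-question form, denoting
    # the context as S0. The answer remains binary: True or False.
    # The exact phrasing here is an assumption, not the paper's template.
    return f"S0: {fact}\nQuestion: Is the final statement true or false?"

sample = to_context_question("The earth is flat.")
```

The same wrapper applies to arbitrary source tasks once their samples are restated as facts.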
Our idea is to craft a series of statements after S_0. Each statement asserts the truth or falsity (chosen uniformly) of the previous statement. We call such a statement a boolean statement and ask the model to discriminate the final statement. For instance, y_0 = 1 and S_1 asserts that S_0 is false, so y_1 should be negated: y_1 = 0. After deduction, we obtain y_3 = 1.
Logically, an assertion of "true" leaves the current logic unchanged and an assertion of "false" results in a negation. δ can be nested k times without affecting the fact in x:

δ ∘ δ ∘ · · · ∘ δ ∘ x ⇒ y_k, (1)

where δ is applied k times, y_i denotes the intermediate answer after i boolean statements, and y_k denotes the eventual answer. We denote Eq. 1 as multi-nested boolean logic.
Obtaining the final y_k is free of external annotation, as in self-supervised learning, by programming the following recursion:

y_i = y_{i-1} if S_i asserts "true"; y_i = 1 − y_{i-1} if S_i asserts "false". (2)

Such multi-nested boolean logic poses little challenge to humans. One would hope that a strong language model can tackle it as well.
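Programmatically, the recursion is a few lines. The sketch below (names are our own) computes the final label y_k from the initial label y_0 and the sequence of assertions, following the convention that 0 stands for True and 1 for False:

```python
def nested_label(y0: int, assertions: list) -> int:
    # y0: label of the original fact (0 = True, 1 = False).
    # assertions[i]: what statement S_{i+1} asserts about its predecessor,
    # either "true" (keep the running label) or "false" (negate it).
    y = y0
    for a in assertions:
        y = y if a == "true" else 1 - y
    return y
```

For the toy example above, nested_label(1, ["false"]) yields 0, matching y_1 = 0.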
We generalize the negation operation to the other boolean operations, as at the bottom of Table 1. Concretely, we uniformly choose one statement from S_1 to S_k and append to it either "and" or "or", chosen uniformly.

Quantify Boolean Logic
We probe a language model's mastery of nested boolean logic by measuring its performance against our boolean statements. An ideal logical reasoner is supposed to make clear logical transitions between truth and falsity. We are particularly interested in the situation where the model accurately discriminates the original fact, yet falters in delivering the correct answer after k boolean statements. This can be formulated as:

p_θ(δ ∘ · · · ∘ δ ∘ x) ≠ y_k, (3)

where p_θ satisfies:

p_θ(x) = y_0. (4)

Deep neural models are good at exploiting superficial features rather than delving into the entire semantics (Wu et al., 2023a; Sanyal et al., 2022a).
The consequence is that they can get the final result without correctly classifying the original fact. In other words, if the model reasons from a misclassified fact, its final result can be noisy, misleading the analysis. Eq. 3 and 4 exclude this potential threat and focus entirely on the model's capability to handle nested boolean logic. Hence, we are interested in two metrics:
• Clean accuracy (clean%): the general accuracy score.
• Boolean accuracy (boolean%): the accuracy calculated only on those samples where the model accurately discriminates the original fact, as represented in Eq. 3 and 4. This can only be calculated on augmented data.
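Given per-sample predictions on both the original fact and the full chain, the two metrics can be computed as follows (the record layout is an assumption of ours):

```python
def clean_and_boolean_accuracy(records):
    # records: list of dicts with keys
    #   fact_pred / fact_gold  : prediction and label for the original fact (y0)
    #   final_pred / final_gold: prediction and label after the boolean chain (yk)
    # clean%  : accuracy on the final answer over all samples.
    # boolean%: accuracy on the final answer, restricted to samples whose
    #           original fact was classified correctly (Eq. 3 and 4).
    clean = sum(r["final_pred"] == r["final_gold"] for r in records) / len(records)
    kept = [r for r in records if r["fact_pred"] == r["fact_gold"]]
    boolean = (sum(r["final_pred"] == r["final_gold"] for r in kept) / len(kept)
               if kept else 0.0)
    return clean, boolean
```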

Benchmark
To benchmark multi-nested boolean logic, we construct a new dataset, on which the following experiments are based. Unlike other datasets, it is composed of a series of subsets representing different levels of logical complexity. We will release this benchmark for future research.

Data Collection
We collect the raw data from SciTail (Khot et al., 2018), a scientific text entailment dataset with a premise and a hypothesis for each sample, labeled as entail or not entail. We join the premise and hypothesis together to make a "fact", with entailed pairs labeled as True and not-entailed ones labeled as False. Some samples are shown in Appendix A. Eventually, we obtain 6,000 raw samples and randomly sample 1,000 of them as the test set, leaving the rest as the training set.
On top of the raw data, we convert each sample to the context-question format and then impose boolean statements to generate the adversarial set, meaning that the resultant samples are likely to fool the model (Zellers et al., 2018, 2019). Specifically, we uniformly choose a value k from some range and insert k boolean statements following the original sample. The range of k bounds the minimal and maximal nesting of boolean logic in each sample, and a larger value of k means more nesting in the logic chain. For instance, the samples in Table 1 correspond to k = 0 and k = 3 (see Appendix A).
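The construction can be sketched end to end as below (the statement wording and the rng interface are illustrative assumptions, not the released generation code):

```python
def make_boolkill_sample(fact, y0, k_min, k_max, rng):
    # Draw k uniformly from [k_min, k_max], append k boolean statements,
    # and track the label via the recursion: asserting "true" keeps the
    # running label, asserting "false" negates it (0 = True, 1 = False).
    # Returns the sample text and the final label y_k.
    k = rng.randint(k_min, k_max)
    lines, y = [f"S0: {fact}"], y0
    for i in range(1, k + 1):
        a = rng.choice(["true", "false"])
        lines.append(f"S{i}: S{i-1} is {a}.")
        y = y if a == "true" else 1 - y
    lines.append(f"Question: Is S{k} true or false?")
    return "\n".join(lines), y
```

Passing random.Random(seed) as rng gives reproducible samples.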
We denote this benchmark as BoolKill, in which each sample is a logic chain starting with a potential fact and followed by a series of boolean statements. It is worth noting that BoolKill is a group of sets of different levels of logical difficulty, and each level has its own training and test set. We use the following notations:
• raw: the raw data, in which each sample is a statement of a fact;
• u_0: the clean set, in which each raw sample is only transferred to the context-question template, with semantics unchanged;
• u_k: the adversarial set constructed on top of u_0, in which each sample is suffixed by k boolean statements;
• u_k1∼k2: the adversarial set in which each sample is suffixed by k_1∼k_2 boolean statements;
• ũ_k / ũ_k1∼k2: u is negation-only; we use ũ to distinguish the adversarial sets that additionally contain AND and OR.

Data Bias
The first thing to verify is whether u_0 is semantically equivalent to raw. From Table 2, we find that each model achieves very close performance on raw and u_0, suggesting that the context-question template does not induce bias into the original data.
The average sentence length varies with the boolean statements added to the raw data, growing linearly from 36 (u_1) to 88 (ũ_8). The overall statistics of BoolKill are in Appendix A.
To minimize the bias between subsets, we keep the ratio of positive and negative samples at 1:1 in all subsets. Additionally, BoolKill is a semi-annotated dataset, comprising human-annotated facts and synthetic boolean statements. The latter introduces several high-frequency words such as "true", "false", and "statement", which may induce a large bias if these words do not occur in balance in the data. For instance, the model might make its decision based on the relative numbers of "true" and "false" in the sentence. Hence, we also keep the occurrences of "true" and "false" the same for positive and negative samples in all subsets.
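This kind of lexical balance can be checked with a small counter (our own sketch, not the paper's released tooling):

```python
def keyword_balance(samples, word):
    # samples: (text, label) pairs, label 0 (True) or 1 (False).
    # Returns total occurrences of `word` in positive vs. negative samples;
    # a balanced subset should return two (near-)equal counts.
    def count(text):
        return sum(tok.strip(".,!?").lower() == word for tok in text.split())
    pos = sum(count(t) for t, lab in samples if lab == 0)
    neg = sum(count(t) for t, lab in samples if lab == 1)
    return pos, neg
```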

Evaluation Results
We report thorough results on each level of logical difficulty in BoolKill. We sequentially evaluate each model on u_0, u_1, u_2, ..., and u_8 (ũ_8), indicating the number of nested boolean operations.
ChatGPT shows an impressive ability to follow human instructions, so we evaluate it directly on the test sets. For DeBERTa, we first fine-tune it on the u_k training set and evaluate it on the u_k test set.
NOT: We plot the results in Figure 3. Each model exhibits a high performance on u_1, suggesting proficiency in tackling single boolean logic. DeBERTa performs better than ChatGPT, probably due to task-specific fine-tuning. However, as the nesting increases, each model suffers a notable decline regardless of size. For instance, in (a), starting from u_2, in which samples are suffixed by only two boolean statements, DeBERTa-base falls to 53.8% while DeBERTa-large falls to 65.4%. From u_3, even strong DeBERTa-large degenerates into a random selector, with accuracy close to 50%. A similar situation can be seen with ChatGPT, though its degradation is more gentle. This suggests that even state-of-the-art models possess a critical limitation in basic nested boolean logic, being able to handle at most three nested operations. This is far below humans' level.
AND & OR: From (b), it is counter-intuitive that DeBERTa performs better on the sets additionally including AND and OR. We conjecture that the model exploits the inherent bias that AND ⇒ False and OR ⇒ True in the majority of cases. Such a shortcut is particularly useful when k is small. Interestingly, from (d), well-trained ChatGPT appears not to use this shortcut, and its performance drops even faster on ũ. Therefore, we focus on u and ũ with large k in the following experiments.
Chain-of-Thought (CoT) (Wei et al., 2022; Zhang et al., 2022b; Yao et al., 2023) has proven to be an effective prompting method to amplify the reasoning ability of LLMs by asking them to lay out the procedure while performing the reasoning. From Figure 3 (c) and (d), we find that ChatGPT performs better with the assistance of CoT. However, we raise a criticism in this paper: does CoT truly promote logical reasoning? Indeed, our study shows that CoT may introduce new logical concerns. We discuss this further in Sec. 6.1.

Empower Nested Boolean Logic
We now present our new self-supervised learning paradigm.

Self-Supervised Learning
The straightforward method is to fine-tune the model on BoolKill. The concept is to sequentially introduce boolean statements on top of some corpus and let the model learn to tackle multi-nested boolean logic in a self-supervised way.
However, we find that language models struggle to fit the samples in BoolKill when the underlying logic is too hard, remaining random selectors. This indicates that naive training is not the best remedy for learning complex logical patterns.

Curriculum Logical Reasoning
Inspired by curriculum learning (Bengio et al., 2009), in which a machine learning model is encouraged to learn a task starting with easier samples and ending with harder ones, we propose Curriculum Logical Reasoning (CLR) to enhance the process of learning logical reasoning.
There is a natural match between curriculum learning and logical philosophy, because a logic chain is a step-by-step progression from single to complex. With CLR, rather than learning hard logic from scratch, the model starts by learning simpler logic, e.g. single boolean logic, and then moves forward to harder logic gradually, e.g. multi-nested boolean logic.
We give a concrete instance. We first train the model on u_0∼1, which solely includes single boolean operations. Next, we train this model on u_0∼2, which further includes two-nested boolean operations. This gradual progression continues until the model is trained on u_0∼4. The above procedure can be denoted as u_0∼1 → u_0∼2 → u_0∼3 → u_0∼4. We find that reusing the easier samples in each new round of training benefits the eventual performance, potentially reminding the model of what it learned previously. Our ultimate goal is for the model to gradually learn to tackle more complex logic that it has not seen before.
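Schematically, the curriculum is an ordered sequence of training pools, where each stage's pool contains all samples up to the current difficulty level, so easier logic is revisited while harder logic is introduced (an illustrative sketch; the training call itself is elided):

```python
def clr_schedule(levels):
    # levels[i]: the samples with i nested boolean statements (levels[0] = u0).
    # Stage t trains on the union of levels 0..t, yielding the progression
    # u0~1 -> u0~2 -> ...; earlier samples are deliberately reused.
    stages = []
    for t in range(1, len(levels)):
        stages.append([s for lv in levels[: t + 1] for s in lv])
    return stages

# for pool in clr_schedule([u0, u1, u2, u3, u4]):
#     fine_tune(model, pool)  # hypothetical trainer; weights carry over
```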

Empirical Results
In contrast to the prior section, where we evaluate models on each level of logical difficulty, in this section we evaluate each model on BoolKill u_1∼4, u_5∼8, and ũ_5∼8. These sets cover the range from k = 1 to k = 8. u_1∼4 is the simpler one, while u_5∼8 and ũ_5∼8 are highly challenging, since we previously showed that state-of-the-art PLMs are almost powerless against nested boolean logic beyond u_3.
We experiment on DeBERTa-V3-base and DeBERTa-V3-large. Each model is trained for 3,000 steps with a batch size of 16 and a learning rate of 2e-5 / 1e-5 for the base / large model.
To verify CLR, we report two experiments. In the first experiment, we compare different training settings and evaluate the models on BoolKill. In the second, we leverage the boolean logic in BoolKill to benefit other general logical tasks.

Nested Boolean Logic
The results across the various BoolKill sets are summarized in Table 3. We find that naively training the model on u_5∼8 only produces random accuracy scores on all three test sets, even the two simpler ones, u_0 and u_1∼4. On u_0∼4 and ũ_0∼4, we find that DeBERTa-V3-large achieves better outcomes on the simpler u_1∼4 than DeBERTa-V3-base. This suggests that a larger model may have a greater learning capacity for handling more nested boolean operations, but it is still very hard, even for strong DeBERTa-V3-large, to learn the very difficult logical patterns in u_5∼8 in a single leap.
However, CLR brings significant performance gains. Additionally, all models consistently maintain a strong accuracy on u_0 throughout the CLR process, suggesting that they learn to discriminate the original facts and to tackle boolean logic simultaneously. In contrast, naive self-supervised training leads to inferior u_0 results. Moreover, we find that each level of the curriculum brings a considerable improvement. For instance, DeBERTa-V3-base already outperforms all naive baselines once it completes the second level of training on u_0∼2.

Boolean Benefits Complex Logic
Boolean logic acts as the atomic component of logic. Our intuition is that it can solidify more general end tasks requiring complex logical reasoning. We conduct validation on two machine reading comprehension (MRC) datasets:
• ReClor (Yu et al., 2020), a reasoning-required MRC dataset collected from graduate admission exams;
• DREAM (Sun et al., 2019), a dialogue-based MRC dataset.
Concretely, we first train DeBERTa-V3 on BoolKill as an initialization and then fine-tune it on the task-specific data of ReClor and DREAM.
The results are shown in Table 4. We find that learning boolean logic acts as a nice initialization for the subsequent reasoning tasks on both ReClor and DREAM. For instance, initializing with u_0∼1 improves DeBERTa-V3-base by 3.4% over naive fine-tuning on ReClor, and u_0∼1 → u_0∼2 further improves it by 4.4%. It is worth noting that u_0 alone does not provide any useful signal (59.0% on ReClor and 80.2% on DREAM), suggesting that it is the boolean logic we add to the data that enhances the eventual logical performance.
As a comparison, we reverse the order: we first train the model on the task-specific data and then fine-tune it on boolean logic. We find that the more complex logic in ReClor or DREAM does not enable the model to perform any better on u_0∼1, and may even harm it. This confirms our initial idea that a model may ignore basic logic during training, even if it appears to handle more complex problems at times.
Pre-learning boolean logic and then learning complex logic is the generic form of CLR.

Ablation Study
The ablation study is conducted on the negation-only sets. We first discuss the composition of levels that make up the curriculum for CLR. We remove some levels from the full curriculum setting. We also include a strong baseline that merges all the training sets together, e.g. u_0∼1, u_0∼2, u_0∼3, and u_0∼4, and performs naive training on the merged data; the difference is that CLR strategically samples the training data from easy to hard rather than uniformly.
The results are summarized in Table 5. We find that any leap over a level of the full curriculum results in a notable performance drop, highlighting the importance of a complete and gradual progression of logical learning. Interestingly, we also find that learning from the simpler u_0∼1 → u_0∼3 achieves a better outcome than the harder u_0∼2 → u_0∼4. Next, we discuss the composition of samples at each level. We remove the simpler samples belonging to the prior level (u_0∼1 → u_2 → u_3 → u_4) and see whether the model forgets what it has learned before as a result. From Table 5, we find that this removal gives comparable results on u_0 and u_1∼4. However, on the harder u_5∼8, it leads to a performance drop of 6%. These findings underscore the importance of reusing simpler samples when stepping forward to a new level, especially when evaluating on harder or even unseen data like u_5∼8.

Fine-tuning Large Language Models
We also evaluate our method on LLMs. However, fine-tuning LLMs requires a huge amount of resources. As a compromise, recent studies propose several efficient fine-tuning methods that only update a small fraction of the parameters in an LLM. We experiment on three models: GPT2-1.5b (Brown et al., 2020), OPT-7b (Zhang et al., 2022a), and LLaMA2-7b (Touvron et al., 2023). They all belong to the decoder-only architecture, like ChatGPT. We fine-tune GPT2-1.5b with full parameters and fine-tune the 7b models with the low-rank adaptation method (LoRA) (Hu et al., 2022).
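As a toy illustration of the low-rank adaptation idea (not our actual fine-tuning code): the pre-trained weight W is frozen and only a rank-r update BA is trained, so the trainable parameter count drops from d_in·d_out to r·(d_in + d_out):

```python
import numpy as np

def lora_forward(x, W, A, B, alpha=16.0, r=8):
    # y = x W + (alpha / r) * x A B; W is frozen, only A and B are trained.
    return x @ W + (alpha / r) * (x @ A @ B)

d_in, d_out, r = 64, 32, 8
rng = np.random.default_rng(0)
W = rng.normal(size=(d_in, d_out))      # frozen pre-trained weight
A = rng.normal(size=(d_in, r)) * 0.01   # trained, small random init
B = np.zeros((r, d_out))                # trained, zero init: start exactly at W
x = rng.normal(size=(4, d_in))
y = lora_forward(x, W, A, B, r=r)
```

With B initialized to zero, the adapted model starts at the pre-trained behavior, a standard LoRA choice.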
From Table 6, we find that CLR works very well on GPT2-1.5b, achieving a boolean accuracy of 79.4% on u_5∼8 and outperforming naive training by a notable margin of 13.8%. However, the larger OPT-7b does not yield better results as expected. Specifically, it achieves comparable results on the simpler u_1∼4, while greatly lagging behind the much smaller GPT2-1.5b on the harder u_5∼8. We conjecture that parameter-efficient fine-tuning may compromise the acquisition of complex reasoning capabilities, e.g. multi-nested boolean logic, leading to a non-negligible performance drop.
6 Further Discussion

Chain-of-Thought
We discuss CoT in more detail; this part is particularly geared toward current LLMs. It has been shown that when asked to lay out the procedure, a model can perform the reasoning more precisely. In the prior section, we showed that CoT can assist ChatGPT in achieving better performance on BoolKill. We note that the intermediate thinking procedure exposed by CoT is equally important: an ideal reasoner should produce not only the final answer but also sound intermediate results.
However, we find that ChatGPT tends to fall into inconsistent deductions when giving intermediate results, as we illustrate in Table 7.
For [a], we can first determine that S_3 is false from the previous statements, since S_0 is true. The model's deduction up to this step is correct. In the next step, however, the model draws an incorrect conclusion from the fact that S_3 is false, namely that S_4 is true. In fact, S_4 should be false since it does not match S_3, incurring a wrong final answer. Similar cases can be found in [b]. These cases indicate that over longer reasoning chains, ChatGPT can make mistakes at some logical step, even though each step is very easy when taken individually.

True or False
We take a further look at true-or-false questions, a specific and common question type in MRC and logical end tasks. Specifically, we extract the samples whose questions contain the keyword "true" or "false". In ReClor, there are 173 such samples out of the 500 in its development set. The evaluation results on true-or-false questions are shown in Table 8. We find that both DeBERTa models struggle with these seemingly simple true-or-false questions, showing lower accuracy than their overall performance. However, the models pre-trained with nested boolean logic show a significant improvement, achieving gains of 6.4% and 6.3% points respectively.
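The filtering step is simple (our own sketch of the selection criterion):

```python
def true_or_false_subset(samples):
    # Keep samples whose question mentions the keyword "true" or "false",
    # mirroring the selection of true-or-false questions from ReClor.
    return [s for s in samples
            if any(k in s["question"].lower() for k in ("true", "false"))]
```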

Related Work
The study of boolean operations is a fundamental requirement for a series of challenging tasks, e.g. arithmetical reasoning (Ling et al., 2017), commonsense reasoning (Zellers et al., 2019), reading comprehension (Yang et al., 2018), and dialogue comprehension (Sun et al., 2019). We concentrate on multi-nested boolean logic by augmenting text with boolean statements. Previous studies analyze more general logical reasoning, e.g. RICA (Zhou et al., 2021), RobustLR (Sanyal et al., 2022a), and FaiRR (Sanyal et al., 2022b), via logical paraphrases or contrast sets. Self-supervised learning methods typically generate learnable inputs on top of unlabeled corpora, e.g. by masking (Devlin et al., 2019), insertion (Wu et al., 2022), sentence reordering (Lan et al., 2020), or contrastive learning (Gao et al., 2021), while our method imposes a series of sentences as a suffix, in effect generating learnable logic. We introduce curriculum learning (Bengio et al., 2009), which allows the model to learn step by step, to further facilitate self-supervised learning. Curriculum learning is under-discussed in the context of language processing (Xu et al., 2020; Lee et al., 2022).
While deep neural networks are capable of handling very complex tasks, in reality they tend to exploit spurious cues (Goodfellow et al., 2015; Madry et al., 2018; Wu et al., 2023a) and can consequently be powerless against very simple perturbations. Our work discloses that language models are poorly skilled at basic boolean logic. In parallel, studies show that language models can be easily fooled by some naive patterns within text, e.g. lexical overlap (McCoy et al., 2019; Wu et al., 2023c), entity boundaries (Yang et al., 2023), and word order (Zhang et al., 2019).
We also release a challenging benchmark to evaluate boolean logical reasoning. There is a series of works focusing on constructing challenging logic, e.g. ReClor (Yu et al., 2020), HotpotQA (Yang et al., 2018), and ANLI (Nie et al., 2020).

Conclusion
This paper provides a quantified analysis of multi-nested boolean logic. We flag a deficiency of state-of-the-art language models in this basic capability, which will inevitably cause pitfalls when dealing with more complex reasoning tasks. To address this, we propose Curriculum Logical Reasoning, a new self-supervised learning method to empower language models with foundational logical capability. We also show that our idea can act as a cornerstone learning method for general logical reasoning.

Limitations
We cannot exhaust all arrangements of the curriculum for CLR, some of which could potentially achieve even better performance. We have discussed a potential risk of chain-of-thought as a secondary contribution of our work, which will be interesting to study in the future. Our method for introducing nested boolean logic is general, but our experiments are based on a single data source. Collecting data from more general corpora or specific domains of interest is another promising option. Finally, we did not have enough resources to run large language models above 7b.

Figure 1: While language models are capable of handling a range of complex logical tasks, they do not perform well on more basic nested boolean logic.

Figure 3: Boolean accuracy of different models with increasing numbers of nested boolean operations (u_k/ũ_k).

Table 1: Method to augment arbitrary samples with nested boolean logic.

Table 2: Performance on the raw data and its templated u_0.

Table 3: Results on BoolKill, comparing CLR with naive training. We use "→" to denote the curriculum setting, where the model inherits the trained weights from the previous level. We highlight the step-by-step performance gains CLR brings with "↑".

Table 5: Ablation study on DeBERTa-V3-base. We omit the notations of u_0∼2 and u_0∼3 in "• • •".

Table 7: ChatGPT case study. S_4 in [a] should be false.

Table 8: Results on true-or-false questions in ReClor.