IDOL: Indicator-oriented Logic Pre-training for Logical Reasoning

In the field of machine reading comprehension (MRC), existing systems have surpassed average human performance on many tasks such as SQuAD. However, there is still a long way to go when it comes to logical reasoning. Although some methods have been proposed for it, they are either designed in quite complicated ways or rely heavily on external structures. In this paper, we propose IDOL (InDicator-Oriented Logic Pre-training), an easy-to-understand but highly effective further pre-training task that logically strengthens pre-trained models with the help of 6 types of logical indicators and a logically rich dataset, LGP (LoGic Pre-training). IDOL achieves state-of-the-art performance on ReClor and LogiQA, the two most representative benchmarks in logical reasoning MRC, and is shown to generalize to different pre-trained models and to other types of MRC benchmarks such as RACE and SQuAD 2.0, while maintaining competitive general language understanding ability on GLUE tasks. Besides, at the beginning of the era of large language models, we compare IDOL with several of them, including ChatGPT, and find that IDOL still shows its advantage.


Introduction
With the development of pre-trained language models, a large number of tasks in the field of natural language understanding have been handled quite well. However, those tasks emphasize assessing basic abilities of models, such as word-pattern recognition, while caring less about advanced abilities like reasoning over texts (Helwe et al., 2021).
In recent years, an increasing number of challenging tasks have been put forward. At the sentence level, there is a great variety of benchmarks for natural language inference, such as QNLI (Demszky et al., 2018) and MNLI (Williams et al., 2018). Although their construction processes differ, nearly all of these datasets evaluate models with binary or three-way classification tasks that require reasoning over two sentences. At the passage level, the most difficult benchmarks are generally recognized to be those related to logical reasoning MRC, which requires question-answering systems to fully understand the whole passage, extract information related to the question, and reason among different text spans to draw new logical conclusions. In this area, the most representative benchmarks are machine reading comprehension datasets like ReClor (Yu et al., 2020) and LogiQA (Liu et al., 2020).
Considering that there are quite few optimization strategies for the pre-training stage, and that the existing methods are designed in rather complex ways that make them difficult for other researchers to follow and extend, we propose an easy-to-understand but highly effective pre-training task named IDOL, which helps strengthen pre-trained models in terms of logical reasoning. We apply it with our customized dataset LGP, which is full of logical information. Moreover, we experimented with various pre-trained models and plenty of different downstream tasks, showing that IDOL is competitive while remaining model- and task-agnostic.
Recently, ChatGPT has attracted a lot of attention all over the world due to its amazing performance in question answering. Thus, we also arranged an experiment to let IDOL compete with a series of LLMs (large language models), including ChatGPT.
The contributions of this paper are summarized as follows:
• We put forward definitions of 5 different types of logical indicators. Based on these, we construct the dataset LGP for logical pre-training, and we probe the impact of the different types of logical indicators through a series of ablation experiments.
• We design an indicator-oriented further pre-training method named IDOL, which aims to enhance the logical reasoning ability of pre-trained models. It achieves state-of-the-art performance in logical reasoning MRC and shows progress in general MRC and in general language understanding evaluation.
• We are the first to provide a pilot test comparing fine-tuning traditional pre-trained models with prompting LLMs in the field of logical reasoning MRC.
Related Work

Logical Reasoning
To help reasoning systems perform better on reading comprehension tasks focused on logical reasoning, a great many methods have been put forward by research institutions all over the world. Unsurprisingly, the majority of these optimization approaches revolve around the fine-tuning phase, while far fewer methods are designed for further pre-training.
In the aspect of pre-training, to the best of our knowledge, there are only two approaches presented in published papers: MERIt and LogiGAN. The MERIt team generated a dataset from the one provided by Qin et al. (2021), which contains passages from Wikipedia with annotations about entities and relations, and then optimized the model on it with the help of contrastive learning (Jiao et al., 2022). The researchers behind LogiGAN use a statement-recovery task to enhance the logic understanding ability of generative pre-trained language models like T5 (Pi et al., 2022).
For optimizing models at the fine-tuning phase, there are dozens of proposed methods as far as we know. For example, LReasoner put forward a context extension framework with the help of logical equivalence laws, including contraposition and transitive laws (Wang et al., 2022a). Another example is Logiformer, which introduced a two-stream architecture containing a syntax branch and a logical branch to better model the relationships among distant logical units (Xu et al., 2022).

Pre-training Tasks
As NLP enters the era of pre-training, more and more researchers are diving into the design of pre-training tasks, especially different masking strategies. For instance, in Cui et al. (2020), the authors applied Whole Word Masking (WWM) to Chinese BERT and achieved great progress. WWM changes the masking strategy of the original masked language modeling (MLM) to mask all the tokens that constitute a word with complete meaning, instead of just one single token. In addition, Lample and Conneau (2019) extended MLM to parallel data as Translation Language Modeling (TLM), which randomly masks tokens in both source and target sentences in different languages simultaneously. The results show that TLM helps improve the alignment among different languages.

Text Logical Unit
It is widely accepted that a single word is the most basic unit of a piece of text, but its meaning varies with context. In Xu et al. (2022), the authors define logical units as the split sentence spans that contain independent and complete semantics.
In this paper, since we introduce much more abundant logical indicators of different types, which link not only clauses but also more fine-grained text spans, we extend this definition to shorter text pieces like entities.

Logical Indicators
By analyzing the passages in logical reasoning MRC and reasoning-related materials like debate scripts, we found that the relations between logic units (such as entities or events) can be summarized into the 5 main categories below, and that all these relations are usually expressed via a series of logical indicators. After consulting previous work such as Pi et al. (2022) and the Penn Discourse TreeBank 2.0 (PDTB 2.0) (Prasad et al., 2008), we constructed an indicator library for each category.
For detailed examples of the indicators we used, please refer to Table 1.

• Premise/Conclusion Indicator (PMI/CLI)
The first two types of logical indicators pertain to premises and conclusions. These indicators signal the logical relationship between statements. For instance, premise expressions such as "due to" indicate that the logic unit following the keyword serves as the reason or explanation for the unit preceding it. Conversely, conclusion phrases like "result in" suggest an inverse relationship, implying that the logic unit after the keyword is a consequence or outcome of the preceding unit.
• Negative Indicator (NTI) Negative indicators, such as "no longer", play a crucial role in text logic by negating affirmative logic units. They can significantly alter the meaning of a statement. For example, consider the sentences "Tom likes hamburgers." and "Tom no longer likes hamburgers." These two sentences have nearly opposite meanings, solely due to the presence of the indicator "no longer".
• Adversative Indicator (ATI) Certain expressions, such as "however", are commonly employed between sentences to signify a shift or change in the narrative. They serve as valuable tools for indicating the alteration or consequence of a preceding event, which helps cover this frequent kind of relation among logic units.
• Coordinating Indicator (CNI) The coordinating relation is undoubtedly the most prevalent type of relationship between any two logic units. Coordinating indicators are used to convey that the units surrounding them possess the same logical status or hold equal importance. These indicators effectively demonstrate the coordination or parallelism between the connected logic units.
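As a rough illustration of the five categories above, an indicator library can be organized as a simple lookup table. The phrases below are only the examples quoted in this section plus a few hypothetical additions; the real libraries come from Table 1 and are much larger.

```python
# Hypothetical sketch of an indicator library keyed by category code.
# Only "due to", "result in", "no longer", "however", and "and" appear
# in the text above; the remaining phrases are illustrative guesses.
INDICATOR_LIBRARY = {
    "PMI": ["due to", "because"],       # premise indicators
    "CLI": ["result in", "therefore"],  # conclusion indicators
    "NTI": ["no longer", "not"],        # negative indicators
    "ATI": ["however", "but"],          # adversative indicators
    "CNI": ["and", "as well as"],       # coordinating indicators
}

def indicator_type(phrase):
    """Return the logical category of a phrase, or None if it is not
    a known logical indicator."""
    for category, phrases in INDICATOR_LIBRARY.items():
        if phrase.lower() in phrases:
            return category
    return None
```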

LGP Dataset Construction
For the sake of further pre-training models with IDOL, we constructed the dataset LGP (LoGic Pre-training) based on the most popular unannotated corpus, English Wikipedia. We first split the articles into paragraphs and discarded those whose lengths (after tokenization) were no longer than 5.
To provide as much logical information as possible, we used the logical indicators listed in Table 1 to filter the Wiki paragraphs. During this procedure, we temporarily removed indicators with extremely high frequency, like "and"; otherwise, there would be too many paragraphs with unacceptably low logical density. Then, we iterated over every logical keyword and replaced it with our customized special token [LGMASK] with a probability of 70%.
To model the ability to distinguish whether a certain masked position is logic-related or not, we introduced a sixth logical indicator type: the Logic Unrelated Indicator (LUI). Based on this, we then randomly replaced 0.6% of the tokens other than logical indicators with [LGMASK]. Afterward, the labels for the logical category prediction (LCP) task were generated based on the corresponding logic types of all the [LGMASK] tokens. In the end, taking RoBERTa (Liu et al., 2019) as an example, our logic dataset LGP contains over 6.1 million samples; for the quantities of logical indicators of each type, please refer to Figure 2.
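The masking procedure described above can be sketched as follows. This is a minimal illustration under stated assumptions (the function name and the per-token representation of indicator positions are our own; the actual pipeline operates on tokenized Wikipedia paragraphs with the full Table 1 libraries):

```python
import random

LGMASK = "[LGMASK]"

def build_lgp_sample(tokens, indicator_positions, p_indicator=0.7, p_lui=0.006):
    """Replace logical indicators (and a few unrelated tokens) with [LGMASK].

    tokens: list of tokens in a paragraph.
    indicator_positions: dict mapping token index -> logical category
                         (one of PMI/CLI/NTI/ATI/CNI).
    Returns the masked token list and the LCP labels; logic-unrelated
    masks receive the sixth category, LUI.
    """
    masked, labels = [], []
    for i, tok in enumerate(tokens):
        if i in indicator_positions and random.random() < p_indicator:
            masked.append(LGMASK)                     # mask a logical indicator
            labels.append(indicator_positions[i])
        elif i not in indicator_positions and random.random() < p_lui:
            masked.append(LGMASK)                     # mask an unrelated token
            labels.append("LUI")
        else:
            masked.append(tok)
            labels.append(None)                       # no LCP label here
    return masked, labels
```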

Logical Category Prediction
As introduced in Section 3.2 and Section 4.1, we defined a logic-related special mask token, [LGMASK], which takes the place of the 6 types of logical indicators: PMI, CLI, NTI, ATI, CNI, and LUI. During the forward pass, the models need to predict the corresponding logical categories, as in the token classification setting of standard Masked Language Modeling (MLM) (Devlin et al., 2019).
When trying to predict the correct logical type of a certain [LGMASK], the models learn to analyze the relationships among the logical units around the special token and to judge, with the help of the whole context, whether some kind of logical relation holds. Therefore, the pre-trained models are gradually equipped with a stronger ability to reason over texts.
Moreover, we use cross-entropy loss (CELoss) to evaluate the performance of predicting the logical categories. The loss function for LCP is described in Equation (1), where n is the number of samples, m is the number of [LGMASK] tokens in the i-th sample, y_{i,j} denotes the model prediction for the j-th [LGMASK] in the i-th sample, and ŷ_{i,j} denotes the corresponding ground-truth value.
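The equation itself did not survive extraction; a standard averaged cross-entropy consistent with the symbols defined in this paragraph (writing m_i for the per-sample mask count) would presumably be:

```latex
\mathcal{L}_{\mathrm{LCP}}
  = -\frac{1}{n}\sum_{i=1}^{n}\frac{1}{m_i}\sum_{j=1}^{m_i}
    \hat{y}_{i,j}\,\log y_{i,j} \tag{1}
```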

IDOL
To avoid catastrophic forgetting, we combine the classic MLM task with the LCP task introduced above to form IDOL, a multi-task further pre-training method for enhancing the logical reasoning ability of pre-trained models. To balance the effects of the two pre-training tasks, we introduce a hyper-parameter λ as the weight of the LCP loss (the proper λ depends on the pre-trained language model used; the empirical range is between 0.7 and 0.9). The IDOL pre-training loss function is given in Equation (2). Figure 3 presents an example of IDOL pre-training, where tokens and the classes of logical indicators are predicted simultaneously.
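Equation (2) is likewise missing from the extracted text; given the description of λ as the weight on the LCP loss, the combined objective is presumably:

```latex
\mathcal{L}_{\mathrm{IDOL}}
  = \mathcal{L}_{\mathrm{MLM}} + \lambda\,\mathcal{L}_{\mathrm{LCP}} \tag{2}
```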

Baselines
With the rapid development of pre-training technology in recent years, we have various choices of backbone models. In this paper, we apply IDOL to BERT-large (Devlin et al., 2019), RoBERTa-large (Liu et al., 2019), ALBERT-xxlarge (Lan et al., 2020) and DeBERTa-v2-xxlarge (He et al., 2021), and evaluate the models in the three different aspects described in Section 5.4 to better verify the performance of IDOL. In terms of logical reasoning MRC, we compare IDOL with several previous but still competitive methods, including DAGN (Huang et al., 2021), AdaLoGN (Li et al., 2022), LReasoner (Wang et al., 2022b), Logiformer (Xu et al., 2022) and MERIt (Jiao et al., 2022). More interestingly, we also let IDOL compete with ChatGPT in a small-scale setting.

Datasets
First and foremost, since the aim of IDOL is to improve the logical reasoning ability of pre-trained models, the two most representative benchmarks, ReClor and LogiQA, act as the primary examiners.
Following this, RACE (Lai et al., 2017) and SQuAD 2.0 (Rajpurkar et al., 2018), two classic machine reading comprehension datasets that are not targeted at assessing reasoning ability, come on stage; they help determine whether IDOL also benefits other types of reading comprehension abilities.
Last but not least, we also tested the models pre-trained with IDOL on MNLI (Williams et al., 2018) and STS-B (Cer et al., 2017), two tasks from GLUE (Wang et al., 2018), to make sure that general language understanding abilities are retained to a great extent during the process of logical enhancement. The evaluation metrics on STS-B are the Pearson correlation coefficient (Pear.) and Spearman's rank correlation coefficient (Spear.) on the development set. For MNLI, we use the accuracy on the MNLI-m and MNLI-mm development sets.
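For readers unfamiliar with the two STS-B metrics, they can be computed in a few lines of plain Python (this is a generic sketch of the standard definitions, not the evaluation script used in the experiments; ties in the rank computation are ignored for simplicity):

```python
def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

def spearman(xs, ys):
    """Spearman's rank correlation: Pearson applied to the ranks.

    Ties are not averaged here, unlike a full implementation.
    """
    def ranks(vs):
        order = sorted(range(len(vs)), key=lambda i: vs[i])
        r = [0.0] * len(vs)
        for rank, i in enumerate(order):
            r[i] = float(rank)
        return r
    return pearson(ranks(xs), ranks(ys))
```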
ReClor The problems in this dataset are collected from two American standardized tests, the LSAT and the GMAT, which guarantees the difficulty of answering the questions. Moreover, ReClor covers 17 classes of logical reasoning, including main idea inference, reasoning flaw detection, sufficient but unnecessary conditions, and so forth. Each problem consists of a passage, a question, and four answer candidates, like the one shown in the green section of Figure 1. There are 4,638, 500, and 1,000 data points in the training, development, and test sets respectively. Accuracy is used to evaluate system performance.
LogiQA The main difference from ReClor is that the problems in LogiQA are generated based on the National Civil Servants Examination of China. Besides, it incorporates 5 main reasoning types, such as categorical reasoning and disjunctive reasoning. 7,376, 651, and 651 samples are gathered for the training, development, and test sets respectively.

IDOL
During pre-training with IDOL, we ran the experiments on 8 Nvidia A100 GPUs. Since IDOL was applied to multiple different pre-trained models, we provide a range for the main hyperparameters. The whole training process consists of 10k~20k steps, with the warm-up rate kept at 0.1. The learning rate is warmed up to a peak value between 5e-6 and 3e-5 for different models, and then linearly decayed. As for batch size, we found that 1024 or 2048 is appropriate for most models. Additionally, we use AdamW (Loshchilov and Hutter, 2017) as our optimizer, with a weight decay of around 1e-3. For details of the software packages we used, please see the Appendix.
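The warm-up-then-linear-decay schedule described above can be sketched as a pure function of the training step. The default values below are picked from the ranges given in the text (15k steps, warm-up rate 0.1, peak learning rate 1e-5) purely for illustration:

```python
def linear_warmup_decay(step, total_steps=15_000, warmup_rate=0.1, peak_lr=1e-5):
    """Learning rate that rises linearly to peak_lr over the warm-up
    phase, then decays linearly to 0 at total_steps."""
    warmup_steps = int(total_steps * warmup_rate)
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    return peak_lr * (total_steps - step) / (total_steps - warmup_steps)
```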
With respect to the hyperparameters for finetuning models on downstream tasks, we follow the configurations provided in the original paper of either the corresponding model or the dataset.

LLM
To compare IDOL with LLMs, we randomly sampled 30 pieces of data from the development sets of ReClor and LogiQA separately (named Dev-30). As for models, we chose GPT-3.5, ChatGPT and GLM-130B (Zeng et al., 2022) for this pilot test.
To better evaluate the performance of LLMs, we tested them in three settings: zero-shot prompting, few-shot prompting, and chain-of-thought prompting. For zero-shot prompting, we designed a template to wrap up the MRC problem. For few-shot prompting, we insert 3 examples, with their correct answers, into the same template ahead of the target question. For chain-of-thought prompting, the template is similar, but only one example is given ahead, and before its answer we provide sentences describing how a human would reason through the problem. For more details about the templates and a test example, please refer to Table 6 and Figure 4.
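The prompt construction described above can be sketched with a small helper. The template string follows the wording shown in Table 6; the function name, the choice separator, and the way examples are concatenated are our own assumptions:

```python
ZERO_SHOT_TEMPLATE = (
    "The passage is {passage}. The question is {question}. "
    "Here are 4 choices for it and they are {choices}. "
    "Which one should I choose? Thanks."
)

def build_prompt(passage, question, choices, examples=()):
    """Wrap an MRC problem in the zero-shot template; prepend
    already-formatted worked examples for few-shot or chain-of-thought
    prompting."""
    target = ZERO_SHOT_TEMPLATE.format(
        passage=passage,
        question=question,
        choices="; ".join(choices),
    )
    return "\n\n".join(list(examples) + [target])
```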

Logical Reasoning MRC
Fine-tuning To evaluate model performance on logical reasoning MRC, we experimented with the baseline models mentioned above on ReClor and LogiQA, the two most representative benchmarks in this field. Since the majority of previous researchers applied their methods to RoBERTa, IDOL meets the most competitors in this setting, as shown in Table 2. In spite of this, IDOL surpassed all the existing strong systems by an obvious margin on nearly every evaluation metric except accuracy on the LogiQA test set. From the results on BERT and ALBERT in Table 2 and the results on DeBERTa in Table 3, we can see that IDOL has significant advantages over the other competitors as well.
In summary, IDOL is highly effective in logical reasoning MRC with state-of-the-art performance, and this benefit generalizes to different pre-trained models, even to recent large-scale, strong ones.
Prompting Although the scale of Dev-30 for the pilot test on LLMs is small, the results displayed in Table 5 are inspiring to some extent. Generally, IDOL is still competitive in the era of LLMs. On ReClor, it achieved an accuracy of 80%, while the best result from the LLMs is 70% (ChatGPT with chain-of-thought prompting). Even though GLM-130B surprisingly reaches an accuracy of 50% on LogiQA in the zero-shot setting (slightly higher than the 43.3% of IDOL), IDOL has an obvious advantage over the other settings and other LLMs. Additionally, there is an interesting phenomenon that chain-of-thought prompting brings negative effects for the LLMs, except for ChatGPT on ReClor, which is not consistent with the findings of Wei et al. (2022).

Other MRC Datasets
To test whether IDOL also benefits other types of MRC tasks and maintains the original abilities, we conducted a series of experiments with RoBERTa as the backbone model. The results are displayed in the middle part of Table 4, where we compare the original model, the model further pre-trained with only MLM on LGP, and the model further pre-trained with IDOL. We evaluate the models in each setting with 4 different seeds and report the average value. It is apparent that IDOL performs better on both RACE and SQuAD 2.0 in every evaluation metric (although the effects are not as large as those on ReClor or LogiQA), which implies that IDOL indeed helps on general MRC tasks while achieving significant improvement in logical reasoning ability.

General Understanding Ability
Following the experiment configuration in Section 5.4.2, we planned to find out what effect IDOL would have on other types of natural language understanding tasks, which help reflect the general understanding ability of pre-trained language models. We evaluate the models in each setting with 4 different seeds and report the average value. From the results presented in the right part of Table 4, we can easily find that although IDOL falls behind on MNLI and exceeds the other two competitors on STS-B, the differences in all the evaluation metrics are quite small. Therefore, we conclude that IDOL successfully retains the general language understanding ability of the original pre-trained model during the process of becoming stronger in logical reasoning.

Ablation Study
In this section, we conduct a series of ablation experiments on the multiple logical indicators used in both the fine-tuning and pre-training phases.
We evaluate the models based on RoBERTa with 4 different seeds and report the average value.

Indicators in Fine-tuning
As introduced in Section 3.2, we defined 5 classes of logical indicators that reflect various logical relations among text logical units, and we make use of all of them in IDOL. To figure out whether the 5 types are of equal importance in logical reasoning MRC, we conducted a set of controlled experiments in which certain types of indicators were removed from the ReClor training set used for fine-tuning in each setting.
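The ablated training sets can be produced with a simple deletion pass over the text. This is an illustrative sketch (the helper name and the whitespace clean-up are our own; the actual preprocessing may differ):

```python
import re

def remove_indicators(text, indicator_phrases):
    """Delete each indicator phrase from the text (case-insensitive,
    whole-phrase matches only), then collapse leftover double spaces."""
    for phrase in indicator_phrases:
        text = re.sub(r"\b" + re.escape(phrase) + r"\b", "", text,
                      flags=re.IGNORECASE)
    return re.sub(r"\s{2,}", " ", text).strip()
```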
From the results displayed in Table 7, it is obvious from the last column that logical indicators indeed play an important role in logical reasoning-related text understanding, since the loss of all indicators decreases accuracy by 4 to 7 points. In detail, by comparing the gaps between fine-tuning on the original train set and on the versions without individual types of indicators, we can conclude that the negative and adversative indicators have the largest influence.

Indicators in Pre-training
Now that logical indicators have been proven effective in the fine-tuning stage, we believe they also help in the pre-training stage. Therefore, we arranged a series of experiments that gradually incorporate more logical indicators: from not leveraging any indicators (MLM), to only making use of PMI and CLI (LCP-2), to adding LUI to LCP-2 (LCP-3), to taking advantage of all 6 types of logical indicators (LCP).
From the lines displayed in Figure 5, it is clear that models perform better when leveraging a greater variety of logical indicators, since the red line (IDOL) is positioned significantly higher than the green and yellow lines representing pre-training tasks that utilize fewer types of logical indicators. According to the results in Table 7, PMI and CLI brought the least difference in model performance on ReClor. LCP-2 and LCP-3 rely mainly on these two types, and introducing a new special token [LGMASK] inevitably brings noise during model training and further widens the gap between pre-training and downstream tasks, so that they perform no better than the original MLM. Additionally, in terms of overall trends, the model pre-trained with IDOL becomes gradually stronger during the process of pre-training, which certifies the effectiveness of our designed task targeted at logical indicators.

Conclusion and Future Work
In this paper, we proposed an easy-to-understand further pre-training method, IDOL, which fully exploits the logical information provided by 6 types of logical indicators; it is proven effective on different pre-trained language models while keeping them competitive on many other kinds of downstream tasks. In particular, IDOL achieves state-of-the-art performance on logical reasoning machine reading comprehension tasks. With respect to future work, we plan to leverage sentence-level or passage-level logical features and integrate them with IDOL to produce a stronger multi-task further pre-training method for improving the logical reasoning ability of pre-trained language models. Moreover, we plan to redesign the IDOL task to find out whether logical indicators also play an important role in generative pre-trained models. Furthermore, we will explore ways of combining IDOL with prompting to find a better method to elicit the reasoning abilities of LLMs.
In Cui et al. (2022), the authors presented a pre-training task named PERT, which requires models to recover the original token sequences, exploiting the observation that token permutation within a certain range does not affect Chinese text understanding. This method depends only on the original texts, whereas IDOL introduces one more special token, which widens the gap between pre-training and fine-tuning to some extent.

Figure 1 :
Figure 1: A diagram illustrating the three steps of our method: (a) construct the logically rich dataset LGP from Wikipedia, (b) further pre-train models to improve logical reasoning ability, and (c) answer logical reasoning MRC questions with the help of logical indicators appearing in both the context and the choices. See Section 4 for more details on our method.

Figure 2 :
Figure 2: The numbers of 6 types of logical indicators in LGP for RoBERTa.

Figure 3 :
Figure 3: An example of pre-training with IDOL. The model needs to recover the tokens replaced by [MASK] (MLM) and predict the category of each logical indicator masked by [LGMASK] (LCP) at the same time.
The passage is [PASSAGE]. The question is [QUESTION]. Here are 4 choices for it and they are [CHOICES]. Which one should I choose? Thanks.

Figure 4 :
Figure 4: An example of ChatGPT answering an MRC question.
Few-Shot: [Example A] [Example B] [Example C] The passage is [PASSAGE]. The question is [QUESTION]. Here are 4 choices for it and they are [CHOICES]. Which one should I choose? Thanks.
Chain-of-Thought: The passage is [PASSAGE]. The question is [QUESTION]. Here are 4 choices for it and they are [CHOICES]. You can analyze like this, [Thought Process]. So the answer is [Answer]. The passage is [PASSAGE]. The question is [QUESTION]. Here are 4 choices for it and they are [CHOICES]. Which one should I choose? Thanks.

Figure 5 :
Figure 5: Results on the ReClor development set for models pre-trained with different tasks on RoBERTa. LCP-2: LCP with only PMI and CLI. LCP-3: LCP with only PMI, CLI, and LUI. Baseline: fine-tuning the original RoBERTa.

Table 1 :
Libraries and examples of all types of logical indicators.

Table 2 :
Results on the logical reasoning MRC benchmarks ReClor and LogiQA. In each block, the previous methods listed for comparison and IDOL take the pre-trained model in the first line as their backbone model.

Table 3 :
Results of IDOL with DeBERTa and other publicly available data. ♣: top results from the official leaderboard of ReClor (as of January 19, 2023). ♡: the performance of the original DeBERTa from Jiao et al. (2022) for reference (the majority of the top submissions and the IDOL entries in this table take DeBERTa as the backbone model).

Table 4 :
Results of RoBERTa with different pre-training tasks on logical reasoning MRC, other types of MRC, and other types of NLU tasks.


Table 6 :
Templates and examples for LLM prompting in different settings.

Table 7 :
Results of fine-tuning with datasets obtained by removing certain types of logical indicators from the original ReClor train set, testing on the development set. The first row under "ReClor Train set" in each column indicates which indicators are removed. "-": the original train set. "PMI&CLI": both premise and conclusion indicators are removed. "ALL": no logical indicators left.