APOLLO: A Simple Approach for Adaptive Pretraining of Language Models for Logical Reasoning

Logical reasoning over text is an important ability that requires understanding the semantics of the text and reasoning over it to arrive at correct inferences. Prior works on pretraining language models to improve logical reasoning ability require complex processing of training data (e.g., aligning symbolic knowledge to text), yielding task-specific data augmentation that is not easy to adapt to any general text corpus. In this work, we propose APOLLO, a simple adaptive pretraining approach to improve the logical reasoning skills of language models. We select a subset of Wikipedia for adaptive pretraining using a set of logical inference keywords as filter words. Further, we propose two self-supervised loss functions for training. First, we modify the masked language modeling loss to mask only specific parts-of-speech words that likely require higher-order reasoning to predict. Second, we propose a sentence-level classification loss that teaches the model to distinguish between entailment and contradiction types of sentences. The proposed pretraining paradigm is both simple and independent of task formats. We demonstrate the effectiveness of APOLLO by comparing it with prior baselines on two logical reasoning datasets. APOLLO performs comparably on ReClor and outperforms baselines on LogiQA.


Introduction
Logical reasoning is an important human ability that helps us make rational decisions based on known information. It is also important for text understanding across various downstream tasks, e.g., open-domain question answering (Yang et al., 2018; Zhu et al., 2021) and machine reading comprehension (MRC) (Baradaran et al., 2022). Recently, there has been an increasing focus on evaluating the logical reasoning abilities of language models using MRC tasks that specifically require a significant amount of logical reasoning to obtain the correct answer (Liu et al., 2021). In these datasets, the model needs to understand a given context, reason logically about a question to infer new conclusions, and then select the correct answer from a set of options. With the advent of large pre-trained language models (PLMs) in NLP (Devlin et al., 2019; Radford et al., 2019; Raffel et al., 2020), understanding and improving the logical reasoning abilities of these models has become even more important, as they are increasingly used across a wide variety of real-world tasks.

Figure 1: The blue words are logical keywords that are essential for understanding the overall logical structure of the context. The blue and pink words are some of the words we mask during pretraining, as they pertain more to logical reasoning (as opposed to the non-highlighted words, which mainly carry factual knowledge or grammar).
There have been some recent works on improving the logical reasoning abilities of PLMs (Wang et al., 2022;Ouyang et al., 2022;Jiao et al., 2022). These works typically generate a dataset containing symbolic structures such as logical graphs from text, logical contrast sets, etc., and then train the LM using custom loss objectives to learn logical reasoning abilities. While the performance improvements achieved by these methods are encouraging, the proposed solutions generally require complex data processing to generate the additional structural information (graphs, contrast data, etc.) required for logical reasoning. Further, the loss functions proposed in these works are very specifically designed in accordance with their respective data augmentation technique and widely differs from the typical masked language modeling loss used for LM pretraining (Devlin et al., 2019). These complex processing steps usually require task-specific design choices, which are not necessarily learning generalizable logical reasoning ability that is reusable across different task formats. For example, Wang et al. (2022) parses symbolic logical structures from the training data of a specific dataset, which might not generalize to a new dataset or task. Jiao et al. (2022) constructs synthetic context and options from entities in Wikipedia, which is specific to the multiplechoice MRC setting. Overall, it is unclear if these highly specific inductive biases are indeed essential for improving the logical reasoning abilities in language models, or if a simple approach is sufficient.
Prior works (Gururangan et al., 2020) have shown that continued domain-adaptive pretraining of PLMs leads to performance gains on downstream tasks. Inspired by this, we propose APOLLO, a continued pretraining-based approach to inject logical reasoning abilities into language models that requires minimal data processing and loss function modifications. First, we use a set of logical inference keywords to select a subset of sentences involving logical reasoning from a large text corpus, such that every sentence in the subset contains at least one of the keywords. These keywords are chosen such that sentences containing them are more likely to elicit reasoning when the PLM fills in the masked tokens during pretraining. We note that, in contrast to previous works (Gururangan et al., 2020), our method only requires selecting sentences from a text corpus, eliminating the need for any domain-specific corpus.
Second, we find that selectively masking specific types of words in masked language modeling (MLM) can benefit the model's logical reasoning skills, in line with previous works on task-guided finetuning (Lad et al., 2022). More specifically, we mask words based on their parts-of-speech tags, choosing those tags whose words are more likely to require logical reasoning to predict. As motivation, in Figure 1 we highlight these words in an instance of the ReClor dataset, a popular logical reasoning benchmark. We observe that these words are more aligned with logical reasoning than the non-highlighted words, which mainly involve knowledge about specific nouns.
Lastly, we add a sentence-level classification loss to predict if the reasoning in the sentence describes an entailment in the reasoning process or a contradiction. This enables the model to better understand the differences between positive and negative implications in a sentence, thus improving logical reasoning. Overall, compared to prior works that developed solutions specifically for MRC datasets, our proposed pretraining paradigm for APOLLO is both task format and downstream dataset agnostic, which is highly desirable.
To test APOLLO, we evaluate it on two downstream logical reasoning tasks, ReClor and LogiQA (Liu et al., 2021), and compare it with other baselines. We achieve state-of-the-art performance on LogiQA and comparable performance on ReClor. We demonstrate that our method generalizes across different model types. Further, we show that using our proposed loss functions does not induce any catastrophic forgetting (Kirkpatrick et al., 2017) of the original language modeling skills. This demonstrates that our simple continued pretraining approach generalizes to different datasets and enables the PLM to learn strong logical reasoning abilities.

Method
In this section, we describe the details of our proposed approach. In APOLLO, we use a keyword-based dataset selection strategy to collect a dataset of reasoning-related sentences called IMPLICATION (§2.1), and then continue training a pretrained model checkpoint using two loss functions jointly (§2.2). This model is then fine-tuned on the training dataset of each task separately. A detailed overview of the pipeline is shown in Figure 2.

Figure 2: We filter Wikipedia using specific logical keywords to create the IMPLICATION dataset. This is then used for continued pretraining of a model with two loss objectives: the selective masked language modeling (S-MLM) loss and the entailment classification (E-CLS) loss. Please refer to Section 2 for more details.

Dataset Selection
PLMs are typically trained on data from the internet, which helps them learn a general language model; they are then finetuned on specific downstream datasets to specialize in a task (Devlin et al., 2019; Radford et al., 2018; Raffel et al., 2020).
Here, instead of focusing on a specific task, we want to teach the PLM generalizable logical reasoning abilities. We hypothesize that using training data that contains more logical sentences, rather than generic internet data, should help improve the reasoning ability of the PLM. Although creating such a dataset automatically is a challenging task in itself, in APOLLO we explore a simple and intuitive way to create it. First, we select specific keywords that are typically encountered in sentences with some logical implication. Broadly, we categorize these keywords into two types:
• Positive implication (Entailment): These keywords are present in sentences where the reason generally entails the inference. Examples of such keywords are "therefore", "accordingly", etc.
• Negative implication (Contradiction): Keywords in this category are usually present in sentences where the reason contradicts the inference. For example, keywords such as "but", "although", etc., fall under this category.
Next, we select sentences from Wikipedia that contain at least one of the keywords. We name this filtered version of Wikipedia the IMPLICATION dataset. While this keyword-based filtering does not guarantee that a sentence has a logical implication, it increases the probability of logically rich sentences being present in the training data, which helps teach the required logical reasoning skills to the PLM. Please refer to Appendix A for the full list of keywords used to build the IMPLICATION dataset.
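The filtering step above can be sketched as follows. The keyword lists shown are illustrative subsets (the full lists are in Appendix A of the paper), and the word-level tokenization is a simplification:

```python
import re

# Hypothetical subsets of the implication keywords, for illustration only;
# the paper's full lists are given in its Appendix A.
POSITIVE_KEYWORDS = {"therefore", "accordingly", "consequently", "thus", "hence"}
NEGATIVE_KEYWORDS = {"but", "although", "yet", "nevertheless", "however"}
ALL_KEYWORDS = POSITIVE_KEYWORDS | NEGATIVE_KEYWORDS

def contains_keyword(sentence: str) -> bool:
    """Return True if the sentence contains at least one implication keyword."""
    tokens = re.findall(r"[a-z]+", sentence.lower())
    return any(tok in ALL_KEYWORDS for tok in tokens)

def filter_corpus(sentences):
    """Keep only sentences with at least one keyword (the IMPLICATION subset)."""
    return [s for s in sentences if contains_keyword(s)]

corpus = [
    "The glacier melted; therefore, sea levels rose.",
    "Paris is the capital of France.",
    "Although it rained, the match continued.",
]
print(filter_corpus(corpus))  # keeps the first and third sentences
```

Matching on whole lowercase words (rather than substrings) avoids false hits such as "butter" matching "but".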

Loss Function Design
Selective masked language modeling loss (S-MLM) is a modified version of the masked language modeling (MLM) loss used in BERT (Devlin et al., 2019). In the MLM loss, tokens in a sentence are masked at random and the model learns to predict the masked tokens. While this helps in learning a good language model, not all masked tokens require a similar degree of reasoning to predict. For example, in the sentence shown in Figure 3, words such as "were" and "the" are determined more by the structure of the English language than by any form of reasoning. In contrast, predicting words such as "more" and "drop" would require logical reasoning. Similarly, connector words such as "and" and "hence" define the logical structure of the sentence. These types of logically relevant words are highlighted in blue. Thus, we hypothesize that masking these logical tokens is likely to teach the model to perform reasoning more effectively than masking random tokens.
While finding the exact logical words in a given sentence is itself a hard problem, in APOLLO we simplify it with a heuristic: we consider tokens that belong to a specific set of parts-of-speech (POS) tags. We select candidate tokens with the following spaCy POS tags (Honnibal and Montani, 2017): ADJ, ADV, CONJ, CCONJ, PART, SCONJ, and VERB. Please refer to Section 4.3 for empirical results that further justify this choice.
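A minimal sketch of the S-MLM masking policy, assuming POS tags are already available for each token (the paper obtains them from spaCy; here they are supplied by the caller). The 15% masking rate mirrors standard MLM practice and is an assumption, not a detail confirmed by the paper:

```python
import random

# POS tags whose tokens are mask candidates under the S-MLM policy (from the paper).
LOGICAL_POS_TAGS = {"ADJ", "ADV", "CONJ", "CCONJ", "PART", "SCONJ", "VERB"}

def mask_candidates(tagged_tokens):
    """Indices of tokens eligible for masking under the S-MLM policy.

    `tagged_tokens` is a list of (token, pos) pairs; in the paper the tags
    come from spaCy, here we assume the caller provides them."""
    return [i for i, (_, pos) in enumerate(tagged_tokens) if pos in LOGICAL_POS_TAGS]

def apply_selective_mask(tagged_tokens, mask_prob=0.15, mask_token="[MASK]", seed=0):
    """Replace a random subset of the candidate tokens with the mask token."""
    rng = random.Random(seed)
    chosen = {i for i in mask_candidates(tagged_tokens) if rng.random() < mask_prob}
    return [mask_token if i in chosen else tok for i, (tok, _) in enumerate(tagged_tokens)]

# The Figure 3 example sentence, tagged by hand for illustration.
sentence = [("If", "SCONJ"), ("Earth", "PROPN"), ("were", "VERB"),
            ("frozen", "ADJ"), ("entirely", "ADV"), ("and", "CCONJ"),
            ("hence", "ADV"), ("more", "ADV"), ("reflective", "ADJ")]
print(mask_candidates(sentence))  # [0, 2, 3, 4, 5, 6, 7, 8]
```

Note that "Earth" (PROPN) is never a candidate: proper nouns are deliberately excluded, since predicting them requires world knowledge rather than logical reasoning.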
Figure 3: If Earth were frozen entirely and hence more reflective, the temperature would drop.

Entailment classification loss (E-CLS) Prior works have shown that a semantics-aware sentence-level classification loss can help the model learn semantic information (Sun et al., 2020). Inspired by this, in addition to S-MLM, we use an auxiliary loss function that predicts whether a sentence portrays a sense of entailment or contradiction. For example, the sentence in Figure 3 is classified as "Entailment", because the phrase "more reflective" is entailed by the phrase "frozen entirely". A model would ideally require strong logical reasoning abilities to understand the sentence and then predict whether it expresses an entailment or a contradiction. The labels for this loss are bootstrapped using the heuristic of checking which type of implication keyword is present in the sentence (refer to Section 2.1 for details). We note that although the keyword is a correlated feature that could be used to predict the label, on average the keyword is masked out by our selective masking policy, forcing the model to learn some logical semantics to minimize the loss. Additionally, the classification loss adds a stronger inductive bias about the positive and negative implication words than the S-MLM loss does.
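The label-bootstrapping heuristic can be sketched as follows. The keyword sets are illustrative subsets, and giving positive keywords priority when both types appear in one sentence is a simplifying assumption of this sketch:

```python
# Hypothetical keyword subsets used to bootstrap sentence-level labels;
# the paper's full lists appear in its Appendix A.
POS_KEYWORDS = {"therefore", "accordingly", "consequently", "hence"}
NEG_KEYWORDS = {"but", "although", "nevertheless", "however"}

def bootstrap_label(sentence: str):
    """Weak E-CLS label, derived from which implication keyword type appears."""
    tokens = set(sentence.lower().replace(",", " ").split())
    if tokens & POS_KEYWORDS:
        return "entailment"      # positive-implication keyword found
    if tokens & NEG_KEYWORDS:
        return "contradiction"   # negative-implication keyword found
    return None                  # sentence would not be in IMPLICATION

print(bootstrap_label("It was cold; therefore, we stayed in."))    # entailment
print(bootstrap_label("Although it rained, the match continued.")) # contradiction
```

During pretraining the keyword itself is often masked by the S-MLM policy, so the model cannot simply read the label off the keyword and must instead learn the sentence's semantics.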

Continued Pretraining
The loss function used for continued pretraining in APOLLO is a multi-task loss with the S-MLM and E-CLS losses weighted equally, as depicted in Figure 2. Unlike prior works (Jiao et al., 2022), we do not need to add the standard MLM loss to avoid catastrophic forgetting, since S-MLM stays quite close to the standard MLM objective.

Finetuning
As our loss functions are task-format agnostic, we follow Devlin et al. (2019) and add a randomly initialized MLP layer on top of the pretrained model. Then, we finetune the combined model on the downstream dataset using the cross-entropy loss over all the options.
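The multiple-choice finetuning objective can be made concrete as follows: each (context, question, option) encoding is scored by the MLP head, and a softmax cross-entropy is taken over the per-option scores. The scalar logits below are stand-ins for the model's actual outputs:

```python
import math

def cross_entropy_over_options(option_scores, correct_index):
    """Multiple-choice loss: softmax over per-option logits, then NLL of the gold option.

    `option_scores` holds one scalar logit per answer option; in the actual model
    each logit comes from the MLP head over the pretrained encoder's output."""
    m = max(option_scores)                                # subtract max for stability
    exps = [math.exp(s - m) for s in option_scores]
    total = sum(exps)
    probs = [e / total for e in exps]
    return -math.log(probs[correct_index])

# Four answer options; the model scores the first (gold) option highest.
loss = cross_entropy_over_options([2.0, 0.5, 0.1, -1.0], correct_index=0)
print(round(loss, 3))  # 0.352
```

A low loss here means the softmax puts most of its probability mass on the gold option; the loss approaches -log(1/4) ≈ 1.386 when the four logits are indistinguishable.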

Experimental Setup
In this section, we describe the details of the datasets on which we evaluate APOLLO, the baselines we compare it with, and some implementation details of our training procedure.

Datasets
Following prior works (Jiao et al., 2022), we evaluate APOLLO on two logical reasoning datasets, ReClor and LogiQA.
ReClor is a reading comprehension dataset created from logical reasoning questions in standardized graduate admission examinations. The test set is divided into two subsets, EASY (test-E) and HARD (test-H), where the EASY set contains instances whose options can be answered correctly without knowing the context and question. The train/dev/test split consists of 4,638/500/1,000 instances, respectively.
LogiQA (Liu et al., 2021) is developed using publicly available logical examination papers for reading comprehension. The train/dev/test split consists of 7,376/651/651 instances, respectively.

Baselines
We compare the accuracy of APOLLO with three prominent baselines: LRReasoner (Wang et al., 2022), FOCAL REASONER (Ouyang et al., 2022), and MERIt (Jiao et al., 2022). All these models train a PLM using some additional data to improve logical reasoning abilities.

Implementation Details
For creating the IMPLICATION dataset, we use the Wikipedia version provided under HuggingFace Datasets (Wolf et al., 2020).

Overall Results
In this section, we compare the performance of APOLLO with prior baselines on the two logical reasoning datasets for different base architectures.
The results of using pretrained RoBERTa-Large as the starting checkpoint for our method are shown in Table 1. We observe that APOLLO outperforms all baselines on LogiQA and trails two baselines on ReClor. This demonstrates that our simple continued pretraining approach is strong enough to perform well on logical reasoning tasks compared to models that use much more complex training data and loss functions. Next, we observe that using our continued pretraining process as a complementary technique on top of an existing baseline, MERIt, leads to further performance improvements. Specifically, MERIt + APOLLO achieves the best performance on both LogiQA and the hard set of ReClor (test-H). This shows that our pretraining strategy effectively improves the logical reasoning abilities of language models and, owing to its simple pretraining process, can easily be layered on top of existing solutions.
To test the generality of our approach across architectures, we use pretrained DeBERTa-v3 and DeBERTa-v2-xxlarge as base models for continued training. The results of using these models are shown in Table 2. We find that APOLLO outperforms both baselines on both datasets. Further, we observe that APOLLO performs 1.5%

Performance on GLUE Benchmark
In APOLLO pretraining, we do not include the standard MLM loss (Devlin et al., 2019), which might raise concerns about the general language modeling abilities of the model. To examine this, we finetune APOLLO on each dataset of the GLUE benchmark (Wang et al., 2019) and evaluate the finetuned checkpoint on the dev set. The results are compared with the dev set results of the RoBERTa model (Liu et al., 2019b) in Table 3. Following Devlin et al. (2019), we omit evaluation on the problematic WNLI set. Overall, we observe that APOLLO slightly improves overall performance on the GLUE benchmark. This demonstrates that our proposed continued pretraining strategy learns better logical reasoning abilities without any catastrophic forgetting of general-purpose language modeling skills, and that these logical reasoning capabilities are also beneficial for general natural language understanding.

Ablation Studies
In this section, we present ablation studies of APOLLO's design choices; unless stated otherwise, results are reported on the validation set of the downstream task.

Effect of dataset and loss functions
To study the effect of using IMPLICATION for continued pretraining along with the proposed loss functions, we first create RANDOM, a random subset of Wikipedia of size similar to IMPLICATION, and also consider the standard masked language modeling (MLM) loss (Devlin et al., 2019), in which any token can be masked at random. The results of the ablation are shown in Table 4. We observe that using the IMPLICATION dataset leads to consistent improvements on both datasets compared to the RANDOM dataset. Additionally, we find that both the S-MLM and E-CLS losses lead to improvements over the MLM loss. This empirically justifies our choice of dataset and loss functions.
Effect of keyword category In this ablation, we study the effect of the keyword categories that we use for filtering Wikipedia. For this, we create two different pretraining datasets, IMPLICATION-Positive and IMPLICATION-Negative, using the positive and negative implication keywords, respectively (refer to Section 2.1). These datasets contain 7.5M and 11.3M sentences, respectively, and the combined IMPLICATION dataset has a total of 18.3M sentences (slightly less than the sum of the two, since sentences containing both keyword types are counted only once). The results of the ablation are shown in Table 5, under the section "Keyword Category". We observe that IMPLICATION-Positive, although smaller in size, leads to better performance on both downstream tasks than IMPLICATION-Negative. One reason for this is that sentences with positive keywords are more likely to involve reasoning than their negative counterparts, because the negative keywords are used in many diverse scenarios in English. Overall, we observe that the combined IMPLICATION dataset leads to the best performance, demonstrating that both the positive and negative implication keywords are essential for improving logical reasoning.
Effect of POS tag category In this ablation, we analyze the effect of the parts-of-speech (POS) tags we use to mask tokens in our S-MLM loss. We consider the following categories:
• Base: The POS tags used in APOLLO, i.e., ADJ, ADV, CONJ, CCONJ, PART, SCONJ, and VERB.
• Nouns: Tags referring to nouns and pronouns, i.e., NOUN, PRON, and PROPN.
• Random: The remaining categories, such as ADP, INTJ, DET, PUNCT, etc.
To study the effect of the POS tags, we incrementally add the "Nouns" and "Random" categories to the base case and evaluate the effect of pretraining with the S-MLM loss. The results of this ablation are shown in Table 5, under the section "POS Category". We observe that masking nouns and pronouns ("Nouns") leads to a significant performance drop. We attribute this drop to the fact that predicting a correct noun in a sentence likely requires more world knowledge than logical reasoning. Using the remaining categories for selective masking ("Random"), which effectively makes the loss equivalent to random MLM, also leads to some drop in performance, indicating that our set of POS tag categories is indeed more useful for learning logical reasoning.
Effect of the number of trainable layers To study the effect of training different numbers of parameters of the RoBERTa model, we vary the number of trainable layers of the transformer architecture between 1 and 24 (the latter being full-model training). The results are shown in Figure 4. The blue solid line shows the performance of APOLLO, and the purple dashed line denotes the average performance of RoBERTa-Large when all layers are finetuned. From the plot, we observe that performance improves as the number of trainable layers increases up to two layers, and then degrades as more layers are trained. Prior works (Tenney et al., 2019) have shown that PLMs learn syntactic-level information in the lower layers of the transformer and semantic-level information in the upper layers. Thus, we hypothesize that the logical reasoning task initially benefits from an increasing number of trainable layers, as the semantic information needed to understand logic is being captured, but the lower layers that contain syntactic information benefit less from the same training data, as they are less related to high-level logical reasoning. Full-model finetuning performs surprisingly well, as all model layers along with the token embeddings are trained specifically for the logical reasoning task, but it takes significantly more compute. Overall, we find that training the topmost two layers of the model achieves the best performance on both datasets, and hence we follow this across all variants of APOLLO.
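The layer-selection setup above can be sketched as follows. The `encoder.layer.{i}` naming mirrors HuggingFace-style RoBERTa parameter names and is an assumption for illustration; in practice one would set `requires_grad = False` on the parameters of every layer not returned here:

```python
def trainable_layer_names(num_layers_total, num_trainable):
    """Names of the top-k transformer layers to leave trainable (freeze the rest).

    E.g., APOLLO's best setting trains only the top 2 of RoBERTa-Large's 24 layers.
    The "encoder.layer.{i}" prefix is a hypothetical HuggingFace-style name."""
    start = num_layers_total - num_trainable
    return [f"encoder.layer.{i}" for i in range(start, num_layers_total)]

print(trainable_layer_names(24, 2))  # ['encoder.layer.22', 'encoder.layer.23']
```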

Qualitative Analysis
In this section, we analyze the effect of continued pretraining on the model's overall faithfulness. Post-hoc interpretability methods such as Integrated Gradients (Sundararajan et al., 2017) determine the importance of input words toward predicting a particular class; these importance scores are referred to as attribution scores. To approximate the impact of continued pretraining, we compute the overall change in attribution scores for the implication keywords before and after pretraining the model with our proposed dataset and loss functions. Specifically, we compute the sum of the attribution scores of the keywords present in each instance of the validation set. The results are shown in Figure 5. We observe that our proposed pretraining increases the overall attribution score by a significant margin, indicating that the model intrinsically learns to rely on these important logical keywords, which is desirable.
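The keyword attribution measure reduces to summing per-token attribution scores over the implication keywords. In this sketch the per-token scores are supplied directly rather than computed with Integrated Gradients, and the token-level alignment is simplified to exact word matching:

```python
def keyword_attribution_sum(tokens, attributions, keywords):
    """Sum per-token attribution scores over implication keywords.

    `attributions` would come from an attribution method such as Integrated
    Gradients over the model's prediction; here they are given for illustration."""
    return sum(a for t, a in zip(tokens, attributions) if t.lower() in keywords)

# Hypothetical per-token attribution scores for one validation instance.
tokens = ["The", "glacier", "melted", "therefore", "sea", "levels", "rose"]
attributions = [0.01, 0.10, 0.20, 0.45, 0.05, 0.04, 0.15]
print(keyword_attribution_sum(tokens, attributions, {"therefore", "hence"}))  # 0.45
```

Comparing this sum before and after continued pretraining (averaged over the validation set) gives the quantity plotted in Figure 5.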

Related Works
Reasoning in natural language has been a prevalent problem in NLP. In recent years, logical reasoning in textual data has seen an increasing focus.
ReClor and LogiQA (Liu et al., 2021) are reading comprehension-style datasets focused on questions that require reasoning over information from a given context. Wang et al. (2022) proposed LRReasoner, which parses symbolic logical structures from the training data of ReClor for data augmentation using logical context extensions. Ouyang et al. (2022) constructed logical graphs using the chain of facts present in a task instance and used GNNs to reason over the graph. Jiao et al. (2022) proposed MERIt, which uses Wikipedia to generate logically related sentence pairs for contrastive learning and trains the PLM with a contrastive loss. Both LRReasoner and FOCAL REASONER use data augmentation specific to the task being solved, making the pretraining process specific to the downstream dataset and thus not generalizable across tasks. While MERIt addresses this issue by using Wikipedia to generate logical graphs, its contrastive loss formulation requires counterfactual data augmentation, which can distort the factual knowledge present in the pretrained model. Additionally, its approach is restricted to Wikipedia as the data source, since it relies heavily on forming entity graphs from Wikipedia text. In contrast, we propose a simple continued pretraining strategy that modifies the masked language modeling loss (Devlin et al., 2019) and adds a sentence classification loss to improve the logical reasoning ability of language models. Our approach is simple to integrate during pretraining, does not depend on complex data processing, and generalizes well across datasets. Along a related line, Clark et al. (2020) used synthetically generated rulebases to show that PLMs can perform complex deductive reasoning to predict the entailment of a given hypothesis.
This led to some recent developments in trying to build systems that can generate step-by-step reasoning chains to prove the model's entailment prediction (Saha et al., 2020;Tafjord et al., 2021;Sanyal et al., 2022b). While progress on these datasets is encouraging, the use of synthetic data for training the models limits the generality of the logical reasoning skills learned by these models. Some works have questioned if these models are indeed learning to perform logical reasoning in a robust manner or just learning some shortcuts from training data (Zhang et al., 2022;Sanyal et al., 2022a).

Conclusion
In this paper, we proposed APOLLO, an adaptive pre-trained language model with logical reasoning abilities. We use a subset of Wikipedia sentences for continued pretraining of the model using two self-supervised loss functions. The choice of the training dataset and loss functions are guided by the objective to include more reasoning-related sentences and training signals, respectively. Through experiments on two logical reasoning datasets and ablation studies, we demonstrate the effectiveness of our proposed approach. Overall, we show that APOLLO is a generalized solution to improving logical reasoning in language models.
An important advantage of APOLLO is that the pretraining steps are independent of both the dataset used to train the model and the downstream task format. This opens the scope for using a larger text corpus for training, such as C4 (Raffel et al., 2020). Additionally, expanding the keywords beyond positive and negative implications, e.g., to conditionals such as "if-then" and "either-or", could benefit the training pipeline as well.
A limitation of this approach is the trade-off between coverage and noise in the training data. While our keyword-based extraction from Wikipedia is effective, IMPLICATION likely contains many sentences that do not improve the model's logical reasoning ability. A better rule-based or neural filter might extract a cleaner corpus, though at a potentially higher computational cost.