Hence, Socrates is mortal: A Benchmark for Natural Language Syllogistic Reasoning

Syllogistic reasoning, a typical form of deductive reasoning, is a critical capability widely required in natural language understanding tasks, such as text entailment and question answering. To better facilitate research on syllogistic reasoning, we develop a benchmark called SYLLOBASE that differs from existing syllogistic datasets in three aspects: (1) it covers a complete taxonomy of syllogistic reasoning patterns; (2) it contains both automatically and manually constructed samples; and (3) it involves both generation and understanding tasks. We automatically construct 50k template-based syllogism samples by mining syllogism patterns from Wikidata and ConceptNet. To improve our dataset's naturalness and challenge, we apply GPT-3 to paraphrase the template-based data and further manually rewrite 1,000 samples as the test set. On SYLLOBASE, state-of-the-art pre-trained language models achieve a best generation ROUGE-L of 38.72 (T5) and a best multi-choice accuracy of 72.77% (RoBERTa), which indicates the great challenge of learning the diverse syllogistic reasoning types in SYLLOBASE. Our datasets are released at https://github.com/casually-PYlearner


Introduction
Reasoning, as a typical way for human beings to obtain new knowledge and understand the world, is also an ultimate goal of artificial intelligence (Newell and Simon, 1956; Lenat et al., 1990). Reasoning skills, i.e., examining, analyzing, and critically evaluating arguments as they occur in ordinary language, are required by many natural language processing tasks, such as machine reading comprehension (Liu et al., 2020; Yu et al., 2020), open-domain question answering (Kwiatkowski et al., 2019; Huang et al., 2019), and text generation (Dinan et al., 2019). According to different mental processes, reasoning can be categorized as deductive, inductive, abductive, etc. (Copi et al., 2016). In Piaget's theory of cognitive development (Huitt and Hummel, 2003), these logical reasoning processes are necessary to manipulate information, which is required to use language and acquire knowledge. Therefore, the study of logical reasoning deserves attention because it is so prevalent and essential in our daily lives.
In this study, we focus on syllogism, a typical form of reasoning that has been studied for a long time (it was initially defined in Aristotle's logical treatises, the Organon, composed around 350 BCE). As shown in Table 1, a syllogism often contains two premises and a conclusion, where the conclusion can be inferred from the given premises through a deductive reasoning process. Though reasoning-required tasks (such as question answering) have been widely studied, thorough studies that test the deductive reasoning capabilities of a model or system are rare. In the study of syllogism, there are only a few datasets, and they have several limitations: (1) They focus merely on categorical syllogism (shown in Figure 1) (Dames et al., 2020; Dong et al., 2020; Aghahadi and Talebpour, 2022). Even though it is the most common type, syllogisms come in a variety of forms; the other forms involve different reasoning processes and are also worth studying.
(2) Some datasets (Dames et al., 2020;Dong et al., 2020) are not in natural language, which are difficult to adapt to inference requirements in real natural language scenarios. (3) More severely, all of them have less than 10k samples, which are insufficient for training deep neural networks.
To support further study on syllogistic reasoning, in this work, we build a new natural language syllogistic reasoning benchmark, SYLLOBASE. It has the following advantages (Table 2): First, it is a more complete benchmark that covers five types of syllogisms. Therefore, it can support more fine-grained research on certain types, their interrelationships, and their combined effect on other tasks. Second, all premises and conclusions are written in natural language. This more closely resembles real-world application settings, in which natural language descriptions rather than categorized inputs are provided. In addition, the power of large-scale pre-trained language models can be harnessed effectively. Third, with our proposed automatic construction process, we collect a large number of samples (50k in total), which can support the training of deep neural networks. To validate performance on actual human syllogisms, we also manually annotate 1,000 samples as the test set. This test set may also be used independently to assess the reasoning capability of models in a zero-/few-shot manner. Finally, to promote a more comprehensive investigation of syllogistic reasoning, we provide both a generation and an understanding task.
The experimental results indicate that there is a great deal of room for improvement in the syllogistic reasoning capabilities of existing models. Our additional experiments demonstrate the efficacy of transferring knowledge learned from our automatically constructed syllogism to actual human reasoning.

Syllogism
Syllogism is a common form of deductive reasoning. Basic syllogisms can be categorized as categorical, hypothetical, and disjunctive syllogisms, which can be further combined into polysyllogisms. In this section, we use the most common type, the categorical syllogism, to introduce the terms and structure of syllogism. Other types of syllogism will be introduced in Section 3.
The left side of Figure 1 shows a well-known categorical syllogism about "Socrates is mortal". We can see that a categorical syllogism usually contains two premises and a conclusion. A common term (e.g., "human") links the two premises, and the premises respectively define the relationship between "human" and "mortal" or "Socrates". The reasoning process is to draw a conclusion based on the premises. A syllogism can also be described by a pattern, as shown in the middle of Figure 1.

Related Work
Syllogistic Reasoning Datasets Several syllogistic reasoning datasets have been introduced to promote the development of this field. CCOBRA (Dames et al., 2020) is a dataset with around 10k triplets (major premise, minor premise, conclusion). The task is formed as a single-choice question, where the ground-truth conclusion is shuffled with several distractors. ENN (Dong et al., 2020) is another similar dataset, but its syllogisms are constructed from WordNet (Miller, 1995). SylloFigure (Peng et al., 2020) and Avicenna (Aghahadi and Talebpour, 2022) are two natural language text-based syllogistic reasoning datasets, but they are designed for different tasks. SylloFigure annotates the data in SNLI (Bowman et al., 2015), restores the missing premise, and transforms each syllogism into a specific figure; the target is to predict the correct figure type of a syllogism. Avicenna is a crowdsourced dataset whose syllogisms are extracted from various sources, such as books and news articles. These syllogisms are used for both natural language generation and inference tasks. Different from existing datasets that focus only on categorical syllogism, our SYLLOBASE covers more types and patterns of syllogism and is significantly larger than existing datasets. More detailed comparisons are shown in Table 1.
Logical Reasoning in NLP There are several tasks and datasets related to logical reasoning in NLP. The task of natural language inference (NLI) (Bos and Markert, 2005; Dagan et al., 2005; MacCartney and Manning, 2009; Bowman et al., 2015; Williams et al., 2018), also known as recognizing textual entailment, requires models to classify the relationship type (i.e., entailment, contradiction, or neutral) between a pair of sentences. However, this task only focuses on sentence-level logical reasoning, and the relationships are constrained to a few types. Another NLP task related to logical reasoning is machine reading comprehension (MRC). Several MRC datasets are designed specifically for logical reasoning, such as LogiQA (Liu et al., 2020) and ReClor (Yu et al., 2020): a paragraph and a corresponding question are given, and the model is asked to select the correct answer from four options. This task requires models to conduct paragraph-level reasoning, which is much more difficult than NLI.
The above logical reasoning NLP tasks attempt to improve models' general logical reasoning capability, but they pay little attention to the different types of reasoning processes, such as deductive reasoning or inductive reasoning. In this work, we study a specific form of deductive reasoning, i.e., syllogism. We hope our benchmark can support more in-depth studies of the reasoning process.

Data Construction
Our target is to develop a large-scale benchmark that supports research on several typical kinds of syllogistic reasoning. It is straightforward to collect data through human annotation, as most existing datasets have done (Dames et al., 2020; Aghahadi and Talebpour, 2022). However, this method is impracticable for obtaining large-scale data due to the high cost of human annotation. Therefore, we propose constructing a dataset automatically from existing knowledge bases and manually rewriting 1,000 samples as the test set.

Data Source
Inspired by existing studies (Dong et al., 2020) that collect data from knowledge bases, we choose Wikidata (Vrandecic and Krötzsch, 2014) and ConceptNet (Speer et al., 2017) as our data sources because they contain large-scale, high-quality entities and relations.
Wikidata is an open-source knowledge base, serving as central storage for all structured data of the Wikimedia projects. The data model of Wikidata consists of two main components: items and properties. Items represent things in human knowledge; each item corresponds to an identifiable concept or object, or to an instance of a concept or object. We use entities in the top nine categories: human, taxon, administrative territorial entity, architectural structure, occurrence, chemical compound, film, thoroughfare, and astronomical object. Then, we use the relations instance of, subclass of, and part of to extract triplets.
ConceptNet is another open-source semantic network. It is a large knowledge graph that connects words and phrases of natural language with labeled edges (relations). Its knowledge is collected from many sources, and two entities are connected by a closed class of selected relations such as IsA, UsedFor, and CapableOf. We use ConceptNet to extract descriptive attributes of the entities obtained from Wikidata. By this means, we obtain another group of triplets, which are also used for constructing syllogisms.

Data Processing
In this section, we introduce the construction processes of the five types of syllogism data. Some examples are shown in Table 2.

Categorical Syllogism
As shown in Table 1, a categorical syllogism is composed of a major premise, a minor premise, and a corresponding conclusion. We first construct premises and then use them to infer the conclusion and form syllogisms.
The premise in a categorical syllogism can be summarized as four propositions according to different quantifiers and copulas: (1) All S are P; (2) No S are P; (3) Some S are P; (4) Some S are not P, where S and P are two entities. With different combinations of the four propositions, categorical syllogisms can be categorized into 24 valid patterns. The first part of Table 2 shows an example of the Dimatis syllogism, one of the valid patterns (the other patterns are listed in Appendix A). To construct premises, we use the extracted triplets from Wikidata and ConceptNet. To obtain a proposition that contains a negative relationship, we use the Antonym and DistinctFrom relations in ConceptNet. Taking the triplets (chemical compound, subclass of, pure substance) and (chemical compound, Antonym, mixture) as an example, we have: (1) All chemical compounds are pure substances; (2) No chemical compounds are mixtures; (3) Some pure substances are chemical compounds; (4) Some pure substances are not mixtures.
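The mapping from triplets to propositions can be sketched as follows. This is a minimal illustration, not the exact templates or relation sets used to build the dataset; the relation groupings are assumptions for the example.

```python
# Sketch: turning knowledge-base triplets into categorical propositions.
# The relation groupings and sentence templates are illustrative
# assumptions, not the exact ones used to build SYLLOBASE.

POSITIVE_RELATIONS = {"IsA", "instance of", "subclass of"}
NEGATIVE_RELATIONS = {"Antonym", "DistinctFrom"}

def propositions(triplet):
    """Expand one (subject, relation, object) triplet into propositions."""
    s, rel, p = triplet
    if rel in POSITIVE_RELATIONS:
        return [f"All {s} are {p}.", f"Some {p} are {s}."]
    if rel in NEGATIVE_RELATIONS:
        return [f"No {s} are {p}.", f"Some {p} are not {s}."]
    return []

print(propositions(("chemical compounds", "subclass of", "pure substances")))
# -> ['All chemical compounds are pure substances.', 'Some pure substances are chemical compounds.']
```
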
By this means, we can obtain various premises, which will be used for constructing syllogisms.
Considering the example in Table 2, which is a Dimatis syllogism, we first sample a triplet (carbon dioxide, IsA, chemical compound). Then, we use the middle term chemical compound to sample another triplet (chemical compound, subclass of, pure substance), which forms the minor premise. Finally, we can generate a conclusion based on the pattern definition. All other different patterns of syllogisms can be constructed in a similar way.
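The assembly of a Dimatis syllogism from two triplets sharing a middle term can be sketched as below; the sentence templates are illustrative assumptions.

```python
# Sketch of assembling a Dimatis syllogism (Some P are M; All M are S;
# therefore Some S are P) from two triplets that share a middle term.
# The sentence templates are illustrative assumptions.

def dimatis(triplet1, triplet2):
    p, _, m = triplet1    # e.g. (carbon dioxide, IsA, chemical compounds)
    m2, _, s = triplet2   # e.g. (chemical compounds, subclass of, pure substances)
    assert m == m2, "the two premises must share the middle term"
    return {
        "major": f"Some {p} are {m}.",
        "minor": f"All {m} are {s}.",
        "conclusion": f"Some {s} are {p}.",
    }

syl = dimatis(("carbon dioxide", "IsA", "chemical compounds"),
              ("chemical compounds", "subclass of", "pure substances"))
print(syl["conclusion"])  # -> Some pure substances are carbon dioxide.
```
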

Hypothetical Syllogism
Similar to a categorical syllogism, a hypothetical syllogism has two premises and a conclusion. The difference is that the premises contain one or more hypothetical propositions. A hypothetical syllogism has three valid patterns (the full list is in Appendix A), and we use five relations (i.e., Causes, HasSubevent, HasPrerequisite, MotivatedByGoal, and CausesDesire) in ConceptNet to construct hypothetical propositions.
The following pattern is used as an example to illustrate the data construction process:
Premise 1: If P is true, then Q is true.
Premise 2: If Q is true, then R is true.
Conclusion: If P is true, then R is true.
Specifically, we extract a triplet pair where the tail entity of one triplet is the head entity of another, e.g., (success, CausesDesire, celebrate) and (celebrate, CausesDesire, have a party). This triplet pair can construct the premises success makes you want to celebrate and celebration makes you want to have a party. Then, we can build a hypothetical syllogism according to the pattern, and the corresponding conclusion is success makes you want to have a party. Hypothetical syllogisms with other patterns can be constructed in a similar way.
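The chaining step can be sketched as follows. The verbalization template for CausesDesire is an illustrative assumption, not the exact one used for the dataset.

```python
# Sketch: chaining two ConceptNet-style triplets whose tail and head
# entities match into a hypothetical syllogism. The verbalization
# template for CausesDesire is an illustrative assumption.

TEMPLATES = {"CausesDesire": "{head} makes you want to {tail}"}

def hypothetical(t1, t2):
    (p, r1, q), (q2, r2, r) = t1, t2
    assert q == q2, "the consequent of t1 must be the antecedent of t2"
    return {
        "premise1": TEMPLATES[r1].format(head=p, tail=q),
        "premise2": TEMPLATES[r2].format(head=q, tail=r),
        "conclusion": TEMPLATES[r2].format(head=p, tail=r),
    }

syl = hypothetical(("success", "CausesDesire", "celebrate"),
                   ("celebrate", "CausesDesire", "have a party"))
print(syl["conclusion"])  # -> success makes you want to have a party
```
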

Disjunctive Syllogism
A disjunctive syllogism has two premises: one is a compound proposition stating that at least one of two propositions is true; the other states that one of these propositions is false. Then, we can infer that the other proposition is true. For example, if P and Q are two propositions, a disjunctive syllogism can be described as:
Premise 1: P is true or Q is true;
Premise 2: P is not true;
Conclusion: Q is true.
According to whether the two propositions can both be true, a disjunctive syllogism can be categorized as compatible or incompatible. We use ten relations in ConceptNet to construct disjunctive syllogisms: eight of them (such as PartOf and HasA) are used for compatible disjunctive syllogisms, and the remaining two (i.e., Antonym and DistinctFrom) are used for incompatible disjunctive syllogisms (all relations we used are listed in Appendix B). Here, we use the incompatible disjunctive syllogism as an example to illustrate the construction process.
We first sample two triplets for an entity, such as (newspapers, CapableOf, come weekly) and (newspapers, CapableOf, come daily). Then, we can construct a premise as newspapers can come weekly or come daily. Next, we obtain another premise, such as some newspapers cannot come weekly. Finally, we have the conclusion some newspapers come daily. In this way, we can automatically construct various disjunctive syllogisms based on the triplets in ConceptNet.
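This construction can be sketched as follows; the sentence templates are illustrative assumptions.

```python
# Sketch: building an incompatible disjunctive syllogism from two triplets
# that share a subject. Sentence templates are illustrative assumptions.

def disjunctive(t1, t2):
    (s1, _, a), (s2, _, b) = t1, t2
    assert s1 == s2, "both triplets must describe the same entity"
    return {
        "premise1": f"{s1} can {a} or {b}.",
        "premise2": f"Some {s1} cannot {a}.",
        "conclusion": f"Some {s1} {b}.",
    }

syl = disjunctive(("newspapers", "CapableOf", "come weekly"),
                  ("newspapers", "CapableOf", "come daily"))
print(syl["conclusion"])  # -> Some newspapers come daily.
```
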

Polysyllogism
A polysyllogism is a combination of a series of syllogisms. It usually contains three or more premises and a conclusion. We construct polysyllogisms based on categorical syllogisms, and the construction process can be summarized as the following steps: (1) We sample a categorical syllogism from our categorical syllogism repository (built in Section 3.2.1).
(2) According to the form of the conclusion, we can get its predicate term and subject term.
(3) We use these terms to traverse the repository and select a premise/conclusion that contains them.
(4) We use the conclusion of the syllogism sampled in the first step and the premise/conclusion selected in the third step as two new premises. Then, we can infer a conclusion and check whether the generated syllogism follows a valid pattern.
(5) Repeat the above process, and we can obtain a series of syllogisms.
(6) We use both premises of the first syllogism and the minor premise of every other syllogism as the premises of the polysyllogism. The conclusion is the last syllogism's conclusion. By this means, we can construct a polysyllogism.
We provide an example in the fourth row of Table 2 to illustrate the construction process.
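In highly simplified form, steps (1)-(6) amount to chaining premises whose conclusion terms link up. The sketch below reduces the repository to a subject-to-predicate map of "All X are Y" premises and omits the validity check; all names are illustrative.

```python
# Simplified sketch of polysyllogism construction: chain "All X are Y"
# premises by reusing each intermediate conclusion's predicate as the
# next subject. The repository lookup and pattern-validity check of the
# real pipeline are simplified away; names are illustrative.

def polysyllogism(repository, start):
    """Chain premises starting from `start` until no link is found."""
    premises, subject = [], start
    while subject in repository:                 # step (3): traverse the repository
        predicate = repository[subject]
        premises.append(f"All {subject} are {predicate}.")
        subject = predicate                      # step (4): reuse as a new premise term
    conclusion = f"All {start} are {subject}."   # step (6): final conclusion
    return premises, conclusion

repo = {"dogs": "mammals", "mammals": "animals", "animals": "living things"}
premises, conclusion = polysyllogism(repo, "dogs")
print(conclusion)  # -> All dogs are living things.
```
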

Complex Syllogism
In addition to the previous four types of syllogism, we investigate another, new type, which we call complex syllogism. A complex syllogism contains two premises and a conclusion, where the premises and the conclusion are compound propositions containing one or more logical connectives (i.e., not, and, or, and if-then). These logical connectives significantly increase the difficulty of the syllogism. An example of a complex syllogism is shown in the last row of Table 2. The construction steps can be summarized as: (1) We randomly sample a pattern from the hypothetical and disjunctive syllogisms as a basic pattern.
(2) We replace the simple propositions in the basic pattern (such as P, Q, and R) with compound propositions built with the logical connectives not, and, and or (e.g., not P, P or Q, and P and Q).
(3) After the replacement, we can infer the conclusion (according to the pattern we derived, as shown in Appendix A) and construct a complex syllogism.
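Step (2) can be sketched as a simple substitution; the pattern string and connectives below are illustrative, not the exact templates behind the dataset.

```python
# Sketch of step (2): replacing a simple proposition in a basic pattern
# with a compound one. The pattern string and connectives are
# illustrative assumptions, not the exact templates used for SYLLOBASE.

def complicate(pattern, variable, alternative, connective="or"):
    """Replace every occurrence of `variable` (e.g. 'P') in `pattern`
    by a compound proposition such as '(P or Q)'."""
    compound = f"({variable} {connective} {alternative})"
    return pattern.replace(variable, compound)

base = "If P is true, then R is true."
print(complicate(base, "P", "Q"))  # -> If (P or Q) is true, then R is true.
```
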

Rule of Replacement
To replace a simple proposition with a compound proposition, we use the Synonyms relation in ConceptNet. For example, considering the proposition something that might happen as a consequence of eating ice cream is pleasure, we use a synonym of the entity ice cream, i.e., cone, and construct a compound proposition as something that might happen as a consequence of eating ice cream and cone is pleasure.

Rewriting
With the above process, we obtain a large number of syllogisms. However, these syllogisms are constructed from predefined patterns, so they have fixed structures and may contain grammatical faults. In our preliminary study, we find that models trained on such pattern-based data have poor robustness, potentially because they overfit to the patterns rather than learning the real reasoning process. To alleviate this problem, we apply GPT-3 (Brown et al., 2020) for rewriting, which has been shown to be effective (Ding et al., 2022). Specifically, we use a prompt with some human-rewritten examples to ask GPT-3 to change the expression of a syllogism while keeping its original meaning and pattern. The generated results have good quality in fluency, diversity, and logic, and are suitable for training models (some examples are shown at the bottom of Figure 1, and the detailed process is described in Appendix C).
Furthermore, to test the models' performance on real syllogisms and facilitate future in-depth research, we manually rewrite 1,000 samples from our collected data as a test set. The rewriting process includes filtering out noise, correcting grammatical faults, and paraphrasing (the detailed process is described in Appendix D). Our experiments (see Section 4.4) will show that the test data are very challenging, whereas training on our automatically collected data is still effective.
So far, we have obtained 50k samples via GPT-3 rewriting, which are used for training and validation, and 1k samples via further human annotation, which are used for testing. All of them are evenly distributed over the five types.

Task Formalization
Based on our collected data, we design two tasks:

Conclusion Generation This is a natural language generation task: the model should generate the correct conclusion based on two given premises. Premises and conclusions are natural language text, represented as sequences of tokens. Formally, given two premises $P_1 = \{w^{P_1}_1, \cdots, w^{P_1}_m\}$ and $P_2 = \{w^{P_2}_1, \cdots, w^{P_2}_n\}$, the model is asked to generate the conclusion $C = \{w^C_1, \cdots, w^C_l\}$, where each $w$ is a token. As in other text generation tasks, the generation probability of the conclusion is the product of the probabilities of its tokens:

$$P(C \mid P_1, P_2) = \prod_{t=1}^{l} P(w^C_t \mid w^C_{<t}, P_1 \oplus P_2),$$

where $\oplus$ is the concatenation operation. More premises can be handled by concatenating all of them into one long sequence.
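As a toy numeric illustration of this product-of-token-probabilities factorization (the per-token probabilities below are made up, not model outputs):

```python
# Toy illustration: the probability of a generated conclusion is the
# product of per-token probabilities, usually accumulated in log space
# for numerical stability. The probabilities here are made up.

import math

def sequence_log_prob(token_probs):
    """Sum of per-token log-probabilities = log of their product."""
    return sum(math.log(p) for p in token_probs)

# Hypothetical per-token probabilities for a 4-token conclusion.
log_p = sequence_log_prob([0.9, 0.8, 0.95, 0.7])
print(round(math.exp(log_p), 4))  # -> 0.4788
```
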
Conclusion Selection This is a natural language understanding task: the model is asked to select the correct conclusion from four options, three of which are distractors. The detailed construction process is given in Appendix F. With the above notation for premises and conclusions, the conclusion selection task can be defined as

$$\hat{C} = \arg\max_{i} P(C_i \mid P_1, P_2), \quad P(C_i \mid P_1, P_2) = \mathrm{softmax}_i\big(M(P_1 \oplus P_2, C_i)\big),$$

where $P(C_i \mid P_1, P_2)$ is the predicted probability of $C_i$ being the correct conclusion, and $M(\cdot, \cdot)$ is the output logit of the model. The statistics of our dataset for both tasks are given in Appendix G.
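A toy illustration of the selection step: the model produces one logit per (premises, option) pair, and the option probabilities are a softmax over the four logits. The logit values below are made up.

```python
# Toy illustration of conclusion selection: softmax over the per-option
# logits produced by the model, then pick the argmax. Logits are made up.

import math

def select(logits):
    exp = [math.exp(z) for z in logits]
    total = sum(exp)
    probs = [e / total for e in exp]
    best = max(range(len(probs)), key=probs.__getitem__)
    return best, probs

best, probs = select([2.1, -0.3, 0.4, -1.0])
print(best)  # -> 0  (the first option is predicted as the conclusion)
```
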

Baseline and Evaluation Metrics
We compare the performance of several models. As for evaluation metrics, following previous studies (Aghahadi and Talebpour, 2022), we use ROUGE-1/2/L (Lin, 2004), BLEU-1/2 (Papineni et al., 2002), and BERT-Score (Zhang et al., 2020) to evaluate the performance on the conclusion generation task. ROUGE and BLEU are commonly used metrics for text generation; they measure the n-gram overlap between the generated text and the ground-truth text. BERT-Score is a recently proposed model-based metric: it leverages the pre-trained contextual embeddings from BERT and matches words in the generated and ground-truth texts by cosine similarity. For the conclusion selection task, we use Accuracy to evaluate the models' performance. The implementation details are provided in Appendix H.

Table 3: Results of the conclusion generation task. "R-1/2/L" stands for ROUGE-1/2/L, "B-1/2" stands for BLEU-1/2, and "BS" denotes BERT-Score.
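For intuition, ROUGE-L can be sketched as an F-measure over the longest common subsequence of tokens. This is a simplified version for illustration only; the reported scores use the standard implementations.

```python
# Minimal ROUGE-L F1 sketch: longest common subsequence (LCS) over
# whitespace tokens. For intuition only; reported scores come from the
# standard ROUGE implementations.

def lcs_len(a, b):
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            dp[i + 1][j + 1] = dp[i][j] + 1 if x == y else max(dp[i][j + 1], dp[i + 1][j])
    return dp[-1][-1]

def rouge_l(candidate, reference):
    c, r = candidate.split(), reference.split()
    lcs = lcs_len(c, r)
    if lcs == 0:
        return 0.0
    prec, rec = lcs / len(c), lcs / len(r)
    return 2 * prec * rec / (prec + rec)

print(round(rouge_l("all dogs are animals", "all dogs are living things"), 3))  # -> 0.667
```
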

Experimental Results
The results of all models on the conclusion generation task are shown in Table 3, while those on the conclusion selection task are reported in Table 4. For the conclusion generation task, the overall performance in terms of word-overlap metrics (such as ROUGE and BLEU) is poor. Given that conclusions are often brief (11.84 tokens on average), these results show that the task is fairly challenging. In contrast, the BERT-Score is high, indicating that models are able to generate some semantically correct content but cannot organize it into a reasonable conclusion. Furthermore, the pre-trained language models perform significantly better than the vanilla Transformer. We attribute this to the natural language nature of our dataset, and these results suggest that our dataset can help future research on leveraging pre-trained language models to generate logically sound text. Finally, we notice that the performance on the human-written test set is close to that on the automatically generated validation set (in Table 15), reflecting the good quality of GPT-3 rewriting.
For the conclusion selection task, the overall accuracy is around 70%, which is still far from perfect. In Table 4, the model for a single type of syllogism is trained solely on the corresponding type of data; therefore, the result for the type "All" is not the average of the five types. We notice that ELECTRA achieves the highest results in almost all settings, yet it reaches only 70.89 for the type "All". We speculate that ELECTRA is not robust when trained with mixed data, and that the different types of syllogisms might confuse it. Intriguingly, the performance on categorical syllogisms is notably poor. A potential reason is that this type contains more patterns (categorical syllogisms have 24 valid patterns). As a comparison, the performance on hypothetical syllogisms is significantly higher, since there are only three patterns. We also notice that the performance on polysyllogisms is higher than on categorical syllogisms, despite the fact that the former is derived from the latter. We speculate that polysyllogisms carry richer information in their premises (i.e., multiple premises), which helps pre-trained language models conduct reasoning.

Further Analysis
We also explore the following research questions. To save space, we report the results on the conclusion generation task; similar trends can be observed on the conclusion selection task, as shown in the Appendix.

Effect of Automatically Constructed Data In our benchmark, the training data are automatically constructed from knowledge bases, while the test data are human-annotated. To reveal the relationship between them, we conduct an additional experiment: we split the test set into new training, validation, and test sets with a ratio of 8:1:1 (i.e., 800, 100, and 100 samples, respectively). Then, we train models on the new training data and test their performance on the new test data. As a comparison, we also train models that have been pre-trained on the original (automatically constructed) training data. The results are illustrated in Table 5. It is clear that training on automatically constructed data is beneficial for learning manually rewritten data. This is because the original dataset is large and contains sufficient training signals. It also validates a benefit of our dataset: the knowledge acquired from large-scale data can be transferred to more difficult problems.

Transfer Learning SYLLOBASE supports the study of five types of syllogisms. We explore their internal relationships through a transfer learning experiment. Besides, we also investigate whether the knowledge learned on SYLLOBASE can improve performance on other syllogism datasets (e.g., Avicenna). The results are shown in Table 6. In this experiment, we first train a BART model on one dataset (denoted as "pre-training"), then further train it on another dataset (denoted as "fine-tuning") and report the results.
In the first group of experiments (the first two rows), we can see that learning categorical syllogisms contributes less to learning hypothetical and disjunctive syllogisms. This confirms our concern that merely studying categorical syllogisms is not enough, and it underlines our contribution to syllogism study. From the results in rows (3)-(9), we can generally conclude that learning basic syllogisms is beneficial for learning combined syllogisms, and vice versa. One exception is row (9), which indicates that the knowledge learned from complex syllogisms does not help in learning hypothetical syllogisms. We speculate the reasons are: (a) complex syllogisms have significantly more patterns than hypothetical syllogisms (42 vs. 3), and (b) the premises/conclusions of complex syllogisms are too complicated to form effective knowledge for hypothetical syllogisms. Finally, comparing the results in rows (15) and (16), we can see that models trained on SYLLOBASE generalize well to other syllogism datasets, demonstrating once again the value of SYLLOBASE for general syllogism research.

Effect of Context in Premises Existing machine reading comprehension datasets often provide a paragraph for reasoning. Inspired by these tasks, we expand the premises in our generated syllogisms by adding more informative context so as to validate the models' capability of extracting effective clues and inferring conclusions. Specifically, for each premise in the manually rewritten dataset, we ask the annotators to collect some relevant information through search engines and add it as context. After this step, both premises are hidden in paragraphs, which makes it more difficult to infer a correct conclusion (as shown in Table 13).
Results of both tasks shown in Table 7 indicate: (1) Existing models are still far from tackling reasoning problems in real life; and (2) Extracting clues (such as premises in our case) before reasoning is a promising solution for reasoning tasks, which could be explored in the future.
Appendix I shows a case study with some modelgenerated conclusions of syllogisms.

Conclusion
In this work, we built a large-scale benchmark for natural language syllogistic reasoning. It covers five types of syllogism. The data were automatically constructed from knowledge bases by our proposed construction methods. To evaluate models' performance on real human syllogisms, we manually rewrote 1,000 samples as the test set. Experiments show that syllogistic reasoning is a very challenging task for existing pre-trained language models. Moreover, our further study indicates that existing models are even farther from tackling syllogistic reasoning in real scenarios.

Ethical Statement
This work constructs a new benchmark for syllogistic reasoning. The main dataset is automatically constructed using entities and their relations from Wikidata and ConceptNet. The construction templates are predefined and manually reviewed, so ethical concerns are largely avoided. For the human rewriting process, we hired five annotators and required them to avoid any social bias and privacy issues in the rewritten material. The results were randomly shuffled and sent back to them for an ethical review. We paid them roughly $15 per hour for annotation.

Limitations
We build a new benchmark for syllogistic reasoning. The limitations are mainly in the experimental part: (1) Due to limited human resources, our test set is quite small, which may not support training large models directly. (2) We evaluate all models by comparing their predictions with the ground-truth conclusions, but human performance is not evaluated. As a benchmark, it may be better to provide human performance and show the performance gap of existing models. (3) We have not tested the performance of pre-trained models in terms of logical correctness. Such automatic metrics have rarely been studied, which can be a potential direction for our future work.

Table 8: An example of the premise paraphrasing process.
Original premise of a hypothetical syllogism: "Something that might happen as a consequence of attending a classical concert is going to sleep."
After retrieval and manual check: "I probably spend more concert time asleep than awake."
After rewriting: "When attending classical concerts, people probably spend more concert time asleep than awake."

A Patterns in Syllogism
We list all valid patterns in categorical (shown in Table 9), hypothetical (shown in Table 10), and complex syllogisms (shown in Table 11).

B Relations from Wikidata and ConceptNet
We list all relations that are used for constructing syllogisms in Table 12. For Wikidata, we use 16 relations, which are all used for constructing categorical syllogisms. As for ConceptNet, we use 15 relations, and they are used for constructing categorical, hypothetical, and disjunctive syllogisms.

C GPT-3 Rewriting
GPT-3 is a well-known pre-trained language model that has demonstrated impressive few-shot performance on a wide range of natural language processing (NLP) tasks. Recently, researchers have tried to use GPT-3 to annotate data for NLP tasks (Ding et al., 2022). Inspired by this, we choose GPT-3 for the rewriting task. In our case, we use a prompt that asks GPT-3 to change the expression of a syllogism while keeping its original meaning and pattern. We also append some human-rewritten examples to the prompt as few-shot input. The generated results have good quality in fluency, diversity, and logic, and are thus suitable for training models. The prompts used for rewriting are listed in Tables 16-20.

D Human Rewriting
First, 500 samples are randomly collected from each type of syllogism. Then, we examine the semantics and filter out illogical syllogisms. Next, for the remaining ones, we correct any grammatical problems. Finally, each premise/conclusion is carefully paraphrased. The paraphrasing process is illustrated in Algorithm 1, and an example is given in Table 8. After rewriting, the samples are more diverse, fluent, and closer to real human language.

E Annotation of Automatic Data
To evaluate the quality of our automatically generated data, we conduct a human annotation on 100 random samples (20 for each type of syllogism). The annotators are asked to label whether the samples contain grammatical faults or incorrect logic. The overall accuracy is 73%, i.e., the average of the per-type accuracies: 70%, 90%, 70%, 65%, and 70% for categorical syllogisms, hypothetical syllogisms, disjunctive syllogisms, polysyllogisms, and complex syllogisms, respectively. This result reflects: (1) Our automatic data have fairly good quality; our experiments in Section 4.4 also validate this. (2) Polysyllogisms are hard to construct as they involve multiple syllogisms.

F Distractor Construction in Conclusion Selection Task
In the conclusion selection task (introduced in Section 4.1), we mix the correct conclusion with three distractors. Basically, these distractors are generated from the ground-truth conclusion by changing its quantifier, adding negative words, or exchanging its subject and object. Below, we illustrate the distractor generation process for different kinds of syllogisms with examples.

Categorical Syllogism This kind of syllogism follows the pattern:
Premise 1: All m are p.
Premise 2: All s are m.
Conclusion: All s are p.
A natural-language instance of this pattern is:
Premise 1: Carbon dioxide is a chemical compound composed of two oxygen atoms covalently bonded to a single carbon atom. CO2 exists in the earth's atmosphere as a gas, and in its solid state it is known as dry ice.
Premise 2: In a scientific context, "pure" denotes a single type of material. Ostensibly, compounds contain more than one type of material. Therefore, chemical compounds are considered pure substances. Pure compounds are created when elements combine permanently, forming one substance.
Conclusion: Pure substances include carbon dioxide.

Polysyllogism This kind of syllogism is built on several categorical syllogisms. Therefore, we use the same distractor construction method as for categorical syllogisms.

Complex Syllogism This kind of syllogism is constructed by adding one or more logical connectives to the original premises and conclusions. Therefore, to generate the distractors, we can (1) add or remove the negative connective (i.e., not); (2) replace the connectives in the original proposition with others (e.g., and → or). For example, given a syllogism as follows:
Premise 1: If P is true or if Q is true, then R is true.
Premise 2: If R is true, then S is true.
Conclusion: If P is true or if Q is true, then S is true.
We can generate distractors of the conclusion as:
(1) If P is true or if Q is true, then S is not true. (add negative words)
(2) If P is true or if S is true, then Q is true. (change a proposition)
(3) If P is true and if S is true, then Q is true. (change the logical connective words)
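The distractor rules for categorical conclusions described above (change the quantifier, add a negative word, exchange subject and object) can be sketched as follows. The function is an illustrative reconstruction, not the authors' code, and assumes conclusions of the schematic form "quantifier subject are object".

```python
# Illustrative sketch of categorical-conclusion distractor generation:
# change the quantifier, add a negative word, or exchange subject and object.

def make_distractors(quantifier, subject, obj):
    """Given a conclusion '<quantifier> <subject> are <obj>', build three distractors."""
    flipped = {"All": "Some", "Some": "All", "No": "Some"}[quantifier]
    return [
        f"{flipped} {subject} are {obj}",         # change the quantifier
        f"{quantifier} {subject} are not {obj}",  # add a negative word
        f"{quantifier} {obj} are {subject}",      # exchange subject and object
    ]

print(make_distractors("All", "s", "p"))
# ['Some s are p', 'All s are not p', 'All p are s']
```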

G Dataset Statistics
The statistics of our SYLLOBASE are given in Table 14.

H Implementation Details
We use PyTorch (Paszke et al., 2019) and Transformers (Wolf et al., 2019) to implement all models. They are trained on 8 Tesla V100 GPUs with 32GB memory. All hyperparameters (e.g., learning rate) are tuned according to the performance (BLEU-1/Accuracy) on the validation set.

In the conclusion generation task, for the decoder-only model GPT-2, the major premise and minor premise are concatenated as one long sequence and fed into the model (decoder) to generate the conclusion. For the encoder-decoder structures (Transformer, T5, and BART), the two premises are concatenated and input to the encoder, while the conclusion is input to the decoder and used for generation. The maximum generation length is set to 128, and the training batch size is 32. The AdamW (Loshchilov and Hutter, 2019) optimizer is applied with a learning rate of 5e-5 and a learning rate decay mechanism. All models are trained for 10 epochs, and the total training time is around 1.22 hours.

In the conclusion selection task, we concatenate the two premises as one sequence, use the conclusion as another sequence, and transform them into the text-pair input format, which is commonly supported by pre-trained language models. For example, the input for BERT is: X = [CLS] P1 P2 [SEP] C [SEP]. The representation of [CLS] is used for option selection. The maximum sequence length is set to 256, and the training batch size is 64. AdamW is also used here, with a learning rate of 2e-5 and a decay mechanism. All models are trained for 10 epochs, and the total training time is around 3.29 hours.
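The two input formats described above can be sketched as follows. The helper names are illustrative, and in practice the special tokens for the text-pair format are inserted by the model's tokenizer rather than by hand.

```python
# Illustrative sketch of the two input formats described above
# (hypothetical helper names, not the authors' code).

def generation_input(p1, p2):
    # Conclusion generation: the two premises are concatenated into one
    # sequence and fed to the model (or its encoder).
    return f"{p1} {p2}"

def selection_input(p1, p2, conclusion):
    # Conclusion selection: text-pair format X = [CLS] P1 P2 [SEP] C [SEP].
    # In practice, the special tokens are added by the tokenizer.
    return f"[CLS] {p1} {p2} [SEP] {conclusion} [SEP]"

print(selection_input("All m are p.", "All s are m.", "All s are p."))
# [CLS] All m are p. All s are m. [SEP] All s are p. [SEP]
```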

I Case Study
We show some results of BART on the conclusion generation task as a case study. We list a good case and a bad case for each type of syllogism in Table 21. We can see: (1) The model can generate conclusions that differ from the ground-truth but are also correct in logic. This indicates that pre-trained language models can indeed learn some logical reasoning skills from syllogisms rather than merely "remembering" some fixed patterns. (2) Syllogistic reasoning is still difficult for existing models, and the errors stem from several different aspects. As shown in the hypothetical syllogism, the model generates a semantically correct conclusion, but it is irrelevant to the premises. This problem is identified as "hallucination" of pre-trained language models (Nie et al., 2019), i.e., the model cannot decide whether to generate a conclusion based on its learned parameters or the given context. We believe our dataset can contribute to the study of hallucinations in logical reasoning. As for the last case, the model generates a conclusion opposite to the ground-truth. This indicates that existing models may need additional reasoning modules to handle complex reasoning problems.

Table 15: Results of the conclusion generation task on the validation set. "R-1/2/L" stands for ROUGE-1/2/L, "B-1/2" stands for BLEU-1/2, and "BS" denotes BERT-Score.

Rewrite the following sentences to standard English. Keep the meaning and pattern of the original sentences, but change the expression of the sentences.
pattern: P is true or Q is true. P is not true. [Therefore], Q is true.
original sentences: Is the meal hot or cool. The meal are not hot. [Therefore], the meal are cool.
rewritten sentences: The meal is warm or cold when the man gets home from work. The food is not warm when the man stays late at work. [Therefore], the meal is cold when the man comes home late.

pattern: P is true or Q is true. P is not true. [Therefore], Q is true.
original sentences: The ocean is gas or liquid. The ocean is not gas. [Therefore], the ocean is liquid.
rewritten sentences: The ocean can exist in either liquid or gaseous form. The ocean is not gaseous. [Therefore], oceans do not exist in a gaseous condition, as far as we know.

pattern: P is true or Q is true. P is not true. [Therefore], Q is true.
original sentences: Memories are good or sad. Memories are not good. [Therefore], memories are sad.
rewritten sentences: People like being engrossed in memories, whether good or sad. Old memories are not always pleasant. [Therefore], memories of the past may cause sadness.

pattern: P is true or Q is true. P is not true. [Therefore], Q is true.
original sentences: You can use an audience to performing in front of or boost your ego. You can not use an audience to boost your ego. [Therefore], you can use an audience to performing in front of.
rewritten sentences: When you're in front of an audience, you can put on a show or increase your self-esteem. You cannot exaggerate your ego in front of an audience. [Therefore], you can give a performance in front of an audience.

pattern: P is true or Q is true. P is not true. [Therefore], Q is true.

Rewrite the following sentences to standard English. Keep the meaning of the original sentences, but change the expression of the sentences.
original sentences: No hypothesis is fact. Some proposition are hypothesis. Some proposition are not fact. All proposition are abstract object. [Therefore], some abstract object are not fact.
rewritten sentences: A hypothesis is a proposed explanation that differs from fact. Some propositions are hypotheses. Some propositions are proven not to be facts. Every proposition is an abstract object. [Therefore], some abstract objects do not exist as facts.

original sentences: Applied science is science. No Science is art. Human science is science. Some Behavioral genetics are not human science. Behaviour genetics is psychology. Genetics is biology. [Therefore], some applied science are not biology.
rewritten sentences: Applied science is science in every sense of the word. Science and art are two distinct forms of scholarship. Human science is a branch of science. Behavioral genetics does not involve any human science. Behavioral genetics is a branch of psychology. Genetics is the study of biology. [Therefore], applied science encompasses more than just biology.

[Therefore], You don't get tired.
rewritten sentences: If you do not exercise, you might remain energetic. When you don't workout occasionally, you will not become exhausted. [Therefore], If you are not exercising you will not get tired.