LogiCoT: Logical Chain-of-Thought Instruction Tuning



Introduction
Instruction tuning Large Language Models (LLMs) has become a popular paradigm for Natural Language Processing (NLP) in recent years (Ouyang et al., 2022; Sun et al., 2022), where LLMs are taught to follow natural language instructions for real-world tasks. A prominent line of research is the development of OpenAI's ChatGPT and GPT-4 (OpenAI, 2023). LLMs demonstrate multi-step chain-of-thought (CoT) reasoning ability with proper prompting (Kojima et al., 2022; Huang and Chang, 2022). CoT instruction tuning has drawn attention for its potential to encourage complex, step-by-step reasoning in these models. Wei et al. (2023) and Kojima et al. (2022) have demonstrated the ability of LLMs to generate a coherent sequence of reasoning steps leading to the final answer through CoT prompting. Moreover, ChatGPT and GPT-4 have shown remarkable zero-shot complex reasoning abilities on several logical reasoning datasets (Liu et al., 2023a).
[Figure 1 example:
Premises: 1. Jessica is playing a game if and only if it is cloudy. (Biconditional) 2. If it is late, then Jessica is playing a game. (Conditional) 3. If Jessica plays a game, then it is cloudy. (Conditional)
Rationale: We can break down the biconditional in premise 1 via Biconditional Elimination: 1a. If it is cloudy, then Jessica is playing a game. 1b. If Jessica is playing a game, then it is cloudy. Now, from premise 2 (If it is late, then Jessica is playing a game) and premise 1b (If Jessica is playing a game, then it is cloudy), we can infer via Hypothetical Syllogism: If it is late, then it is cloudy. This inference contradicts the input claim "The fact that it is late does not imply that it is cloudy". Therefore, we cannot infer the input claim from the provided premises. Hence, the answer is no.]

Yet, developing such proprietary models as GPT-4 and ChatGPT often necessitates intensive instruction engineering, a process that has, thus far, been largely kept private. Recent research endeavours have begun to explore the distillation of instruction data using self-instruct techniques (Wang et al., 2022; Peng et al., 2023), where GPT-3 or GPT-4 are harnessed to generate instruction-following examples. This technique represents a promising avenue for reducing the human labour involved in instruction tuning, offering a more economical way to produce community models trained with instructional data.

arXiv:2305.12147v1 [cs.CL] 20 May 2023
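The derivation in the figure can be reproduced mechanically. The sketch below (our own illustration, not part of the paper's pipeline) encodes the premises as implications and applies the two named inference rules:

```python
# Illustrative sketch: Biconditional Elimination and Hypothetical Syllogism
# applied to the Figure 1 example. Atoms: "cloudy", "game" (Jessica is
# playing a game), "late".

def biconditional_elimination(p, q):
    """A <-> B yields the two conditionals A -> B and B -> A."""
    return [(p, q), (q, p)]

def hypothetical_syllogism(implications):
    """From A -> B and B -> C, derive A -> C (one closure pass)."""
    derived = set(implications)
    for (a, b) in implications:
        for (b2, c) in implications:
            if b == b2:
                derived.add((a, c))
    return derived

# Premise 1 (biconditional), split into 1a and 1b:
implications = biconditional_elimination("cloudy", "game")
# Premise 2: if it is late, then Jessica is playing a game.
implications.append(("late", "game"))
# Premise 3: if Jessica plays a game, then it is cloudy.
implications.append(("game", "cloudy"))

closure = hypothetical_syllogism(implications)
# ("late", "cloudy") is derivable, contradicting the input claim that
# lateness does not imply cloudiness -- hence the answer "no".
print(("late", "cloudy") in closure)  # True
```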
A paradigmatic example is the pipeline established by Wang et al. (2022) for cultivating instruction data. This pipeline has been used to produce multiple open-sourced, instruction-tuned models. The instruction tuning data share a common pool of instructions and fixed task templates. Initial instructions are authored by humans, and LLMs are then used to extend this instruction data. However, the scope of these instructions is limited and does not encapsulate diverse, complex reasoning scenarios such as multi-step logical reasoning.
While GPT-4 has demonstrated its ability to produce high-quality CoT reasoning output, the potential of generating CoT instruction tuning data using this model remains largely unexplored. The need to cover more diverse and complex reasoning scenarios, particularly multi-step logical reasoning, represents a significant gap in the current instruction-tuning landscape. The current research aims to address this gap by scaling up the instruction set (Chung et al., 2022), paving the way for more nuanced and sophisticated instruction-tuned models.
Logical reasoning represents a fundamental aspect of human cognition, embodying the ability to infer conclusions based on a structured progression of premises. The dearth of such abilities in community models presents a significant gap, inhibiting the evolution of more advanced AI systems. To bridge this gap, we introduce LogiCoT, a chain-of-thought (CoT) instruction-tuning dataset designed explicitly for logical reasoning.
Our approach to developing LogiCoT involves repurposing existing logical reasoning datasets, constructing logical reasoning CoT instructions from these resources, and leveraging the capabilities of GPT-4 to generate high-quality outputs. The resulting instruction data features both symbolic reasoning and multi-step CoT reasoning, providing a comprehensive and nuanced resource for enhancing the logical reasoning abilities of AI models.
This work builds on recent research indicating that smaller language models can achieve competitive multi-step reasoning abilities when specialized on targeted CoT tasks. Examples of these tasks include the execution of SQL commands, mathematical CoT reasoning, and generating code snippets (Gao et al., 2022; Liu et al., 2023b; Fu et al., 2023). By applying a similar specialization approach to the broader and more complex domain of logical reasoning, we aim to bring these capabilities into the mainstream, furthering the development of AI systems with advanced logical reasoning skills. The LogiCoT data are available at github.com/csitfun/LogiCoT.
The LogiCoT dataset represents a significant contribution to the field, providing a resource that facilitates the training of AI models to handle complex reasoning tasks. By enhancing the capabilities of AI systems in this manner, we take a step towards more advanced, more human-like artificial intelligence.

Related Work
Instruction tuning LLMs. Instruction tuning of Large Language Models (LLMs) has become a thriving research area in Natural Language Processing (NLP), aiming to enable zero-shot generalization on unseen tasks (Zhong et al., 2021; Ouyang et al., 2022; Wei et al., 2022). This involves fine-tuning LMs to perform diverse tasks by following a set of instructions, making the task source an essential component of instruction tuning (Longpre et al., 2023). Most existing instruction tuning methods rely heavily on human-crowdsourced tasks or model-generated tasks. Human-crowdsourced tasks originate from a variety of sources, such as T0 (Sanh et al., 2022), FLAN (Wei et al., 2022), and NaturalInstructions (Mishra et al., 2022). These tasks, although high-quality, rely on substantial human effort and are often limited in quantity. In contrast, model-generated tasks involve leveraging a powerful LM, such as GPT-3 and GPT-4, to generate a diverse set of instructions, task inputs, and task outputs based on a seed set (Wang et al., 2022; Peng et al., 2023).
Our work is distinct as it seeks to leverage GPT-4's chain-of-thought reasoning capabilities in instruction tuning. By introducing the LogiCoT dataset and incorporating symbolic tasks, we aim to advance the quality and scalability of instruction-following data and thereby enhance the overall performance of instruction-tuned LLMs.
Chain-of-thought rationales. Large Language Models (LLMs) can conduct complex reasoning tasks by generating intermediate reasoning steps through a process called chain-of-thought (CoT) prompting (Wei et al., 2023). Zero-Shot-CoT prompting uses a simple instruction (like "Let's think step by step") to elicit step-by-step reasoning before answering a question. LLMs have exhibited reasonable zero-shot reasoning capabilities, generating outputs that inherently reflect CoT reasoning (Zhou et al., 2023). This notion inspired researchers to use self-generated rationales for demonstrations. In particular, Zelikman et al. (2022) demonstrated the practicality of using LLMs to generate rationales. They prompted GPT-J (Wang and Komatsuzaki, 2021) to generate rationales and then selected the ones leading to the correct answer. We adopt this method for GPT-4 generation. Our approach, however, tackles complex logical reasoning scenarios utilizing questions with annotated answers.
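The rationale-selection recipe of Zelikman et al. (2022) can be sketched as follows; `generate` is a hypothetical placeholder for an LLM call, not an actual API:

```python
# Sketch of rationale filtering: keep only self-generated rationales whose
# final answer matches the gold label. generate() is a hypothetical
# stand-in for prompting an LLM.

def generate(question):
    # Placeholder: a real implementation would prompt an LLM for a
    # step-by-step rationale ending in an answer letter.
    return {"rationale": "step-by-step reasoning ...", "answer": "A"}

def filter_rationales(examples, n_samples=4):
    kept = []
    for ex in examples:
        for _ in range(n_samples):
            out = generate(ex["question"])
            if out["answer"] == ex["gold"]:  # keep only correct rationales
                kept.append({"question": ex["question"],
                             "rationale": out["rationale"],
                             "answer": out["answer"]})
                break  # one correct rationale per example suffices
    return kept

data = filter_rationales([{"question": "Q1", "gold": "A"},
                          {"question": "Q2", "gold": "B"}])
print(len(data))  # only the example whose sampled answer matched is kept
```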
Logical reasoning. Logical reasoning is a key aspect of human cognition and a critical capability for AI systems. Researchers have been exploring various approaches to achieve this goal, including rule-based methods, symbolic systems (MacCartney and Manning, 2007), fine-tuning large language models (Wang et al., 2018), and combining both neural and symbolic approaches (Li and Srikumar, 2019). Logical reasoning tasks often require multi-step, complex reasoning, which makes them an ideal target for CoT instruction tuning. By integrating logical reasoning tasks into CoT instruction tuning, we can push the boundaries of what AI systems can achieve and get closer to systems that can understand and reason about the world in a human-like way.

Seminal Data Selection
Selecting the seminal data for CoT instruction tuning of logical reasoning models involves choosing high-quality datasets that adequately cover the range of skills required for logical reasoning. The datasets should present challenges representing real-world logical reasoning tasks and be designed to support CoT instruction tuning. Below are the seminal instruction data we select: LOGICINFERENCE (Ontanon et al., 2022) is a synthetically generated sequence-to-sequence dataset teaching models to perform logical inference using propositional logic and a subset of first-order logic. The input is a question; problem types range from language-to-logic translation to multi-step inference chains. The output provides the answer, including the reasoning chain used to generate it. In some cases, the output even names the inference rule used in each step.
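As an illustration, a sequence-to-sequence instance of this kind might look like the following (the field names and wording are our own invention, not the dataset's exact schema):

```python
# Hypothetical sketch of a LOGICINFERENCE-style seq2seq instance.
# Field names and wording are illustrative, not verbatim dataset entries.
instance = {
    "input": (
        "Consider the following premises: p -> q. q -> r. p. "
        "What can you infer? Name the inference rule."
    ),
    "output": (
        "From p -> q and p we can infer q via modus ponens. "
        "From q -> r and q we can infer r via modus ponens."
    ),
}
print(instance["input"].startswith("Consider"))  # True
```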
FOLIO (Han et al., 2022) is an open-domain, logically complex and diverse dataset equipped with first-order logic (FOL) annotations. What sets FOLIO apart is the parallel FOL annotations for each premise and conclusion, which are automatically verified by a FOL inference engine. This aspect provides a clear, precise standard for logical reasoning. In addition, the human-annotated nature of the dataset ensures high-quality data input. This dataset can be easily converted into a sequence-to-sequence structure, serving as instruction-following data for symbolic logic reasoning.
ReClor (Yu et al., 2020) and LogiQA (Liu et al., 2020) are datasets derived from verbal reasoning examinations, demanding various types of logical reasoning for answering multi-choice questions. These datasets are especially valuable as they represent realistic human reasoning processes. Further, the real-world nature of the questions in these tests, which often require a mix of common sense and logical reasoning, ensures that the model is trained to tackle problems with varying degrees of complexity. We use the training sets of the two datasets, keeping the test sets out of the instruction tuning data.
Choosing these seminal datasets for CoT instruction tuning offers a balanced, comprehensive, and challenging training environment. This approach ensures that the model gains exposure to a broad range of logical reasoning tasks, thus enhancing its ability to effectively handle similar tasks in real-world applications.

Instruction Types
We incorporate a comprehensive system for instructing language models in various aspects of logical reasoning. Each type is designed to engage the model with logical inference tasks at different levels of abstraction and complexity, with both natural language and symbolic language.
To our knowledge, no similar instruction types exist in other instruction-following data.
We classify the instruction types into general inference and multi-choice reading comprehension tasks.

General Inference Task
This category includes instruction types that demand general reasoning and inferential skills, often involving an understanding of logical structures and principles. The model may need to perform operations such as translating natural language to formal logic, predicting possible inferences from given premises, or tracing inference chains. These tasks are designed to enhance the model's ability to think critically and logically, without relying too heavily on specific contexts or domain knowledge.
Table 1 exemplifies the instruction types for general inference, offering an example to illustrate each instruction type.
Language to Logic: This instruction involves translation from natural language into a more formal logical notation. It presents a foundational task of understanding and interpreting logical statements expressed in natural language and converting them into a formalized logical representation.
One-Step Inference: In this case, the model is presented with a set of premises and tasked with predicting all the potential inferences that can be derived from them in a single step. This type of instruction encourages the model to exercise deductive reasoning based on the provided premises. The premises and inferences can be in natural language or symbolic language. Symbolic language encourages precise and abstract reasoning, while natural language context simulates real-world language use scenarios.
Inference Chains: This instruction type takes logical reasoning a step further by requiring the model to establish whether a potential inference can be proven from a set of premises. The model must then provide the chain of reasoning leading to the answer. This type encourages deeper logical reasoning and the ability to construct logical arguments. The examples are crafted in either symbolic or natural language.
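Under formats of our own devising (not verbatim dataset entries), the three general-inference types can be illustrated as instruction/response pairs:

```python
# Illustrative instruction/response pairs for the three general-inference
# instruction types. All wording is invented for illustration.
examples = {
    "language_to_logic": {
        "instruction": "Translate to logic: If it rains, the grass is wet.",
        "response": "rain -> wet_grass",
    },
    "one_step_inference": {
        "instruction": "Premises: rain -> wet_grass. rain. "
                       "What can be inferred in one step?",
        "response": "wet_grass, via modus ponens.",
    },
    "inference_chain": {
        "instruction": "Premises: p -> q. q -> r. Can we prove p -> r?",
        "response": "Yes: from p -> q and q -> r, infer p -> r "
                    "via hypothetical syllogism.",
    },
}
print(sorted(examples))  # the three instruction types, alphabetically
```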

Multi-choice Reading Comprehension Task
This category involves tasks that require a deep understanding of a given text, often demanding that the model identify, extract, or infer information from the text. The model might be asked to resolve a situation described in the text, to pinpoint a flaw in an argument presented, or to identify information that would either strengthen or weaken an argument.
Table 2 shows the instruction types and running examples for logical multi-choice reading comprehension. Multi-choice reading comprehension is the go-to task for testing LLMs' reasoning ability, where a model needs to select the correct option given a problem and a set of options.

[Excerpt from Table 2, weaken an argument:
Answer: B. In reading newspapers, most people not only look for stories on specific topics but also like to idly browse through headlines or pictures for amusing stories on unfamiliar or unusual topics.
Reasoning: The argument is based on the assumption that these personalized electronic news services will replace newspapers because they provide information more quickly and efficiently. However, if most people also enjoy browsing through newspapers for amusing stories on unfamiliar or unusual topics (option B), then these electronic services may not provide the same experience for their users. This means that even if the electronic news services are more efficient in delivering specific news, they might not fully replace newspapers because people still enjoy the browsing experience that newspapers provide. This weakens the argument that newspaper sales will decline drastically if these services become widely available.]

[Excerpt from Table 2, identify information that would explain or resolve a situation:
Passage: On May first, in order to reduce the number of overdue books, a children's library instituted a policy of forgiving fines and giving bookmarks to children returning all of their overdue books. On July first there were twice as many overdue books as there had been on May first, although a record number of books had been returned during the interim.
Question: Which of the following, if true, most helps to explain the apparent inconsistency in the results of the library's policy? A. The bookmarks became popular among the children, so in order to collect the bookmarks, many children borrowed many more books than they usually did and kept them past their due date. B. The children were allowed to borrow a maximum of five books for a two-week period, and hence each child could keep a maximum of fifteen books beyond their due date within a two-month period. C. Although the grace period enticed some children to return all of their overdue books, it did not convince all of the children with overdue books to return all of their books. D. Although the library forgave overdue fines during the grace period, the amount previously charged the children was minimal; hence, the forgiveness of the fines did not provide enough incentive for them to return their overdue books.
Answer and reasoning step by step: A. The bookmarks became popular among the children, so in order to collect the bookmarks, many children borrowed many more books than they usually did and kept them past their due date.
Reasoning: The situation presents an inconsistency: despite forgiving fines and giving bookmarks as incentives to return overdue books, the number of overdue books doubled. We need to identify which statement helps explain this inconsistency. A. If the bookmarks became popular and children borrowed more books than usual to get them, it is possible that they couldn't read and return all those books on time, causing an increase in overdue books. This option explains the inconsistency, as the policy would have an unintended consequence: children borrowing too many books. B. This option only specifies the borrowing limit and duration, but it does not explain why the number of overdue books would increase despite the incentives. C. This option implies that not all children returned their overdue books, but it doesn't explain the doubling of the overdue books in the given period. It only suggests that the policy didn't work to its full potential. D. This option suggests that the fines were minimal, but it doesn't explain why the number of overdue books increased. If fines were minimal, we could assume that the situation would remain stable, not worsen. Therefore, option A most logically helps to explain the inconsistency in the results of the library's policy.]

[Excerpt from Table 2, identify a flaw in an argument's reasoning:
Passage: If you study history, then you will appreciate the vast differences among past civilizations, and you will appreciate these differences provided that you reflect on your own civilization.]
The instructions below further cultivate the model's critical thinking and argument analysis skills. They are geared towards refining the model's capacity to dissect arguments, understand their structures, and evaluate their strength based on additional information.
Identify the Necessary Claim: This instruction tasks the model to pinpoint the claim that must be true or is required for an argument to work. It is essentially training the model to identify essential assumptions or premises in an argument, thus honing its ability to understand argument structures.
Strengthen an Argument: Under this instruction, the model must identify information that would strengthen an argument. It requires the model to not just understand the argument, but also anticipate what additional information could make the argument more convincing. This helps the model to improve its capability to enhance logical arguments.
Weaken an Argument: This type is the opposite of the previous one. Here, the model is tasked with identifying information that would weaken an argument. This helps the model develop a nuanced understanding of argument structures and cultivate the ability to critique and dismantle arguments.
Resolve a Situation: This instruction requires the model to identify information that would explain or resolve a situation. This is about identifying missing information or finding potential solutions to a problem, further expanding the model's problem-solving capabilities.
Identify a Flaw in an Argument's Reasoning: In this type, the model must identify a flaw in an argument's reasoning. This instruction cultivates the model's critical thinking skills, as it needs to scrutinize the argument and pinpoint any logical fallacies or inconsistencies.
By incorporating these instruction types, the data generation scheme is broadened to more complex logical reasoning tasks, particularly in the realm of argumentation and critical thinking, thereby enhancing the language model's ability to engage with more sophisticated and nuanced logical reasoning tasks.
The distinctiveness of these instruction types lies in their combination of logical reasoning tasks with natural language processing, providing a robust framework for training language models in logic-infused language understanding and generation. This comprehensive approach is unique to this data generation scheme and offers an innovative pathway for improving the logical reasoning capabilities of large language models.
The multi-choice reading comprehension task is derived from LogiQA and ReClor. These two datasets are not sequence-to-sequence; they do not offer step-by-step reasoning outputs. However, GPT-4 scores well on these two datasets (Liu et al., 2023a) without in-context examples.
We were granted early access to the GPT-4 API, which provides a unique opportunity to leverage the advanced capabilities of this model for generating high-quality rationales. Using the API, we pass the logical reasoning tasks derived from our seminal datasets (LogiQA, ReClor, LOGICINFERENCE, and FOLIO) to GPT-4 and collect the model's responses.
These responses, or rationales, which essentially represent GPT-4's reasoning process, are valuable resources for building the Chain-of-Thought (CoT) instruction-tuning data. GPT-4, with its advanced language understanding and generation capabilities, can provide diverse and detailed reasoning chains. These chains not only answer the logical reasoning tasks but also provide insights into the underlying reasoning process, facilitating an understanding of how the model arrives at its conclusions.
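The collection step amounts to a loop over seed instances; in this sketch, `query_gpt4` is a hypothetical placeholder for the actual API call, whose details we do not reproduce:

```python
# Sketch of the rationale-collection loop. query_gpt4 is a hypothetical
# stand-in for a GPT-4 API call; the prompt wording is illustrative.

def query_gpt4(prompt):
    # Placeholder: a real implementation would call the GPT-4 API here.
    return "Answer and reasoning step by step: ..."

def collect_rationales(seed_instances):
    collected = []
    for ex in seed_instances:
        prompt = (f"{ex['instruction']}\n{ex['input']}\n"
                  "Answer and reason step by step.")
        collected.append({"instruction": ex["instruction"],
                          "input": ex["input"],
                          "output": query_gpt4(prompt)})
    return collected

data = collect_rationales([
    {"instruction": "Identify the flaw in the argument.",
     "input": "Passage: ... Question: ... Options: A-D"},
])
print(len(data))  # one collected rationale per seed instance
```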

Data Statistics
We collect 604,840 data instances in total, distributed as follows. Language to Logic Tasks: These tasks, derived from the LOGICINFERENCE and FOLIO datasets, require the model to translate a natural language representation of premises and potential inferences into formal logical notation. We have 78,663 instances of this task.
One-Step Inference Tasks: These tasks, also derived from the LOGICINFERENCE and FOLIO datasets, require the model to predict all possible one-step inferences from a set of given premises. We have 258,646 instances of this task.
Inference Chain Tasks: In these tasks, the model is given a set of premises and a potential inference, and it must determine whether the inference can be proven from the premises, providing the inference chain in either case. We have 262,895 instances of this task.
Multi-choice Reading Comprehension Tasks: These tasks, derived from the LogiQA and ReClor datasets, require the model to choose the correct answer from multiple options based on the provided text. We have collected 4,636 instances for these tasks.
Together, these tasks cover a wide range of logical reasoning abilities, providing comprehensive training data for CoT instruction tuning.
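The per-task counts above sum exactly to the stated total of 604,840, and the distribution can be checked in a few lines:

```python
# Sanity check of the reported task distribution.
counts = {
    "language_to_logic": 78_663,
    "one_step_inference": 258_646,
    "inference_chain": 262_895,
    "multi_choice_rc": 4_636,
}
total = sum(counts.values())
print(total)  # 604840, matching the stated total
for task, n in counts.items():
    # e.g. inference chains and one-step inference each make up
    # roughly 43% of the data; multi-choice RC is under 1%.
    print(f"{task}: {n / total:.1%}")
```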

Conclusion
In this paper, we have explored a new approach to collecting Chain-of-Thought (CoT) instruction-tuning data through the lens of logical reasoning tasks. We have developed a rich corpus of instruction types and collected over 600,000 instances using datasets like LOGICINFERENCE, FOLIO, LogiQA, and ReClor. These datasets offer diverse challenges in logical reasoning, from one-step inferences to more complex inference chains, and from translating natural language to formal logical notation to selecting the correct answer from multiple options. Future work will utilize the data to instruction-tune open-sourced LLMs and test their reasoning abilities. As we continue to push the boundaries of what LLMs can achieve, instruction tuning with logical reasoning tasks presents a promising path. It allows models to exhibit more transparency in their decision-making process by providing intermediate reasoning steps. Our work presents a solid foundation for future research in this direction.

Figure 1: A showcase of utilizing GPT-4 and existing inference data to generate CoT rationales for logical reasoning.
From the fact that if it rains, then Patricia is curious, and that if Patricia were curious, then Charles is taking a plane, we can infer that if it rains, then Charles is taking a plane via transitivity. From the fact that if it rains, then Charles is taking a plane, and that if Charles takes a plane, then it is raining, we can infer that it is raining if and only if Charles is taking a plane via biconditional introduction. From the fact that if Charles takes a plane, then John is not reading a book, and that John is reading a book, we can infer that Charles is not taking a plane via modus tollens. Finally, from the fact that it is raining if and only if Charles is taking a plane, and that Charles is not taking a plane, we can infer that it is not raining via biconditional elimination. Therefore, the answer is yes.

Table 1: The instruction types and illustrative examples for generating general inference CoT rationales.
Passage: Almost all of the books published in the past 150 years were printed on acidic paper. Unfortunately, every kind of acidic paper gradually destroys itself due to its very acidity. This process of deterioration can be slowed if the books are stored in a cool, dry environment. Techniques, which are now being developed, to deacidify books will probably be applied only to books with historical significance. Question: If all of the statements in the passage above are true, which one of the following must also be true? A. If a book was published in the past 150 years and is

Table 2: The instruction types and examples of generating CoT rationales for multi-choice reading comprehension.