Assessing Step-by-Step Reasoning against Lexical Negation: A Case Study on Syllogism

Large language models (LLMs) take advantage of step-by-step reasoning instructions, e.g., chain-of-thought (CoT) prompting. Building on this, their ability to perform CoT-style reasoning robustly is of interest from a probing perspective. In this study, we inspect the step-by-step reasoning ability of LLMs with a focus on negation, a core linguistic phenomenon that is difficult to process. In particular, we introduce several controlled settings (e.g., reasoning with fictional entities) to evaluate the logical reasoning abilities of the models. We observed that dozens of modern LLMs were not robust against lexical negation (e.g., plausible → implausible) when performing CoT-style reasoning, and the results highlight unique limitations in each LLM family.


Introduction
Few-shot learning (Brown et al., 2020) has led to remarkable performance in large language models (LLMs). In particular, instructions to generate a reasoning process along with the answer, i.e., chain-of-thought (CoT) prompting (Wei et al., 2022; Kojima et al., 2022), have improved the performance of LLMs. Building on this, the ability of LLMs to perform CoT-style reasoning robustly is of interest from a probing perspective: how correctly do these models perform step-by-step reasoning? However, to the best of our knowledge, deeper analyses have yet to be explored fully. To address this question, this study investigates the step-by-step reasoning ability of LLMs with a special focus on robustness against (lexical) negation. Historically, negation has been challenging for neural models (Socher et al., 2013; Kassner and Schütze, 2020), and determining whether the step-by-step reasoning of LLMs overcomes this limitation is important in the natural language processing (NLP) community.
Our controlled experiments using dozens of LLMs, including GPT-4 (OpenAI, 2023), demonstrate that such models deteriorate in performance substantially when processing questions involving words with just a negative prefix, e.g., implausible, unreasonable (Figure 1). In addition, the results show that each LLM family has its unique biases against lexical negation, which suggests that different LLM training settings produce substantial differences under certain conditions, and the problems to be addressed are model-dependent. These issues clarify the exact weaknesses of modern LLMs.

Reasoning Against Lexical Negation
Given a chain of the reasoning process, we expect the LLM to derive a logically valid conclusion even when the problem involves lexical negation (Figure 1). In Section 2.1, we introduce the task format, and Section 2.2 elaborates on the controlled task settings to elucidate the abilities of the models. Note that our task is similar to CoT reasoning; however, we provide the models with predefined reasoning chains to facilitate controlled analyses.

Format: Syllogism
We evaluated the LLMs' ability to judge the validity of particular types of syllogisms. Here, we utilized three settings to ensure the robustness of the results (Section 3); however, we consider the following SPORTS TASK (SP) format as an example to explain the settings. The base format of the syllogism is as follows:

Premise 1: PERSON is a SPORT player.
Premise 2: ACTION happens in the SPORT.
Conclusion: PERSON does ACTION.
The above syllogism is converted into instances, as shown in Table 1, comprising a question about the validity of a particular conclusion (Is a sentence...plausible?), a chain of the reasoning process (premises), and a yes/no answer part. In the experiments, few-shot exemplars (Table 1, column 2) were first input to a model, and then the model completes the answer for the target example (__ in Table 1, column 3) with yes/no. Here, the correct answer depends on whether the SPORT entities mentioned in the chain (premises 1 and 2) are the same. The exact input to the models is described in Appendix A.
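As a concrete illustration of this instance construction, the following sketch assembles one SP-style question, reasoning chain, and gold label. The function name and its exact string templates are our own, not the paper's released code; the gold-label logic (valid iff the two sports match, flipped when the question is negated) follows the description above.

```python
def build_sp_instance(person, premise_sport, action_sport, action, negated=False):
    """Build (question, reasoning chain, gold answer) for one SP-style instance."""
    adjective = "implausible" if negated else "plausible"
    question = f'Is a sentence "{person} does {action}" {adjective}?'
    chain = (f"{person} is a {premise_sport} player. "
             f"{action} happens in the {action_sport}.")
    # The conclusion is valid iff both premises mention the same sport;
    # negating the question flips the gold label.
    valid = premise_sport == action_sport
    gold = ("no" if valid else "yes") if negated else ("yes" if valid else "no")
    return question, chain, gold

q, chain, gold = build_sp_instance("May", "turboglide", "turboglide", "a stepover")
# gold -> "yes"; with negated=True the gold label flips to "no"
```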

Controlled Task Modification
To analyze how the models struggle with negation, we gradually introduce presumably challenging properties into the task (see the examples shown in Table 1). Strictly speaking, the answer should also be unknown when the two sports differ; in our experiments, our prompts explicitly instruct the model to answer no in such cases.

BASE setting:
In this setting, premises and conclusions are aligned with the facts, e.g., Messi did a stepover is plausible, whereas Messi performed a triple axel is implausible.

Fictional setting (FIC):
A model may derive an answer directly from its knowledge without considering the reasoning chain, which is not the ability we focus on. To eliminate such a solution from the BASE setting and to ablate the effect of factuality, we replace the PERSON and SPORT entities with fictional entities (see Appendix C.1 for details about the fictional entities), where the correct conclusion can only be derived from the premise information in the given reasoning chain. Note that these fictional entities are also used in the subsequent settings.
In-domain negation setting (FICNEG): With this setting, we test the model's robustness against lexical negation. Here, we first design an in-domain setting where both the few-shot exemplars and the target example involve lexical negation. Specifically, we turn the original question into one that involves a word with a negative prefix, e.g., plausible→implausible (see Appendix B for the word list). Thus, the correct answer to the question should be flipped from yes/no to no/yes.

Out-domain negation setting (FICNEG-O):
We design an out-domain setting where the few-shot exemplars do not involve lexical negation, but the target example does. If a model fails only in this setting, this implies that the model overfits to the domain of the few-shot exemplars in terms of lexical negation. In addition, FIC and FICNEG-O differ only in the existence of negation in the target example (this point can also be isolated by comparing their results). Note that we also adopt a setting involving real entities and negation (NEG) in Appendix D; the results there are generally competitive with or slightly better than those in the FICNEG setting. The SP task was originally intended to evaluate commonsense knowledge about sports; in contrast, we used it to assess pure reasoning ability by providing the necessary facts to derive a conclusion in a reasoning chain.

Data: While we created 1,000 instances for the OC task, 100 instances were created for the WT task since this task is regarded as a supplementary one; nevertheless, quite similar results to the other tasks were obtained. We employed 10 variants of the base words and their corresponding negated expressions, e.g., plausible/implausible, reasonable/unreasonable. Average and standard deviation scores across these runs were reported.
Inference: Three exemplars are given to the model along with general task instructions, e.g., Let's think step by step (Appendix A). Note that the exemplars have at least one yes and one no answer. We also examined different exemplar orders, yielding consistent results independent of the exemplar orders (Appendix F). Here, the answer with the higher probability between yes and no in the model's outputs for the target example is considered the answer. See Appendix E.1 for additional technical details.
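The answer-selection step above can be sketched as a simple comparison of the scores the model assigns to the yes and no tokens; the function name and toy logits below are ours, for illustration only.

```python
def pick_answer(next_token_logits, yes_id, no_id):
    """Return 'yes' or 'no', whichever next-token score is greater."""
    return "yes" if next_token_logits[yes_id] >= next_token_logits[no_id] else "no"

# Toy example: logits indexed by (hypothetical) token id.
logits = {17: 2.3, 42: -0.5}
answer = pick_answer(logits, yes_id=17, no_id=42)  # "yes" wins here
```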
Metrics: To evaluate the LLMs, the accuracy of each model's binary answers was measured (see Appendix G for the F1-score results). In addition, to quantify the output bias, we also calculated a no-ratio, i.e., how many out of 1,000 instances the models answered no. Note that the chance rate of the accuracy and the expected no-ratio are both 0.5 because the dataset is balanced in terms of the gold answer distribution (see Appendix C.3).
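The two metrics can be computed directly from the lists of predictions and gold labels; this helper is a sketch of ours, not the paper's evaluation code.

```python
def accuracy_and_no_ratio(preds, golds):
    """Accuracy of binary answers plus the fraction of 'no' predictions."""
    acc = sum(p == g for p, g in zip(preds, golds)) / len(golds)
    no_ratio = preds.count("no") / len(preds)
    return acc, no_ratio

preds = ["no", "no", "yes", "no"]
golds = ["no", "yes", "yes", "no"]
acc, no_ratio = accuracy_and_no_ratio(preds, golds)  # 0.75, 0.75
```

A model that constantly answers no on a balanced dataset scores about 0.5 accuracy with a no-ratio near 1.0, which is exactly the failure pattern discussed in the results below.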

Results, Analyses, and Discussions
Tables 2 and 3 show the average and standard deviation of accuracy and no-ratio of each model in the SP, OC, and WT tasks.

Consistent degradation against negation:
We found that all models demonstrated performance degradation in the FICNEG-O setting; only the GPT-4 model performed above chance (Table 2). In other words, the considered LLMs failed to address lexical negation in CoT-style reasoning. We also observed a notable trend whereby the LLMs preferred to answer no regardless of the gold answer in the FICNEG-O setting (Table 3). Note that LLMs with accuracy rates of approximately 50% tended to respond continuously with no (or yes). This finding was particularly noticeable in the FICNEG-O setting, where the LLMs that exhibited higher accuracy were those that constantly answered no (with the exception of the GPT-4 model).
These results indicate that the models do not behave randomly but exhibit systematic error patterns. Such consistent degradation was also observed in the case of BASE→FIC, which suggests that CoT-style prompting is supported by factors aligned with factuality along with the (insufficient) pure reasoning ability of the model.

Differences across model families:
Interestingly, we also found that different LLM families struggled under different settings (the green to purple patterns in Table 2). For example, the LLaMA models performed well on the FICNEG task, whereas the OPT models did not (Table 2). In particular, although the GPT-3.5, OPT-175B, and BLOOM(Z) models have approximately the same parameter scale, they exhibited contrastive trends. Similar trends were also observed for the no-ratio. For example, in the FICNEG and FICNEG-O settings, the GPT-3.5, LLaMA-7B, and BLOOM models demonstrated extreme statistics approaching 0 or 100, and their behavior flipped completely under the different types of prompting between the FICNEG and FICNEG-O tasks. The performance difference between, for example, LLaMA-65B and OPT-66B also demonstrates that factors other than parameter size induce a substantial gap in performance on certain linguistic phenomena.
Scaling law breaks: A scaling law in LLMs has generally been reported (Gordon et al., 2021; Ivgi et al., 2022); however, the improvement with model scale broke down, specifically in the FICNEG-O setting, which confirms that our introduced task is challenging. Figure 2 shows this tendency for the LLaMA models.
In summary, we generally found that including lexical negation in the tasks caused a drastic performance reduction in the compared LLMs. The results of the controlled experiments further revealed that different LLMs exhibited substantially different limitations and biases. Notably, we further tested the robustness of our findings with different prompt configurations and obtained consistent results (Appendix F).

Related Work
Negation and neural models: Negation is a core operation in natural language and logic, and previous studies have investigated and attempted to improve neural models in terms of addressing negation (Socher et al., 2013; Warstadt et al., 2019; Kim et al., 2019; Kassner and Schütze, 2020; Ettinger, 2020; Hossain et al., 2020; Hosseini et al., 2021; Truong et al., 2023). We situate these challenges in the context of CoT-style prompting and the scaling of LLMs. The work closest to ours reported an inverse scaling law of LLMs' performance against negated prompts (Joel et al., 2023). In addition, we further elucidated the exact limitations and inter-model differences under controlled task settings.
Step-by-step reasoning: Generating an inference process with neural models has received increasing attention in terms of both performance improvement and model explainability (Ling et al., 2017; Sun et al., 2019; Rajani et al., 2019; Shwartz et al., 2020; Madaan et al., 2021; Gu et al., 2022; Aoki et al., 2023). Recently, instructing LLMs to generate intermediate reasoning steps (i.e., CoT prompting) has led to improvements in model performance (Wei et al., 2022). In this study, we attempted to elucidate the LLMs' reasoning ability implicitly assumed in CoT-style prompting and clarify that this success does not entail robust logical reasoning abilities (particularly against lexical negation). Note that the deterioration in the fictional settings also elucidates that LLMs work well only in domains frequent in the training data (McCoy et al., 2023).
Logical reasoning with LLMs and artificially controlled experiments: Integrating logical reasoning ability into neural models is a pivotal goal in the artificial intelligence field (Marcus, 2003). With this aim, exposing the models' exact weaknesses with artificially controlled data has been actively conducted in our field (Betz et al., 2021; Clark et al., 2020; Lu et al., 2021; Kudo et al., 2023); we show the peculiar case that just the flip of one word (adding a negation prefix) causes drastic effects in modern LLMs.

Conclusions
In this study, we have investigated the ability of LLMs to derive valid conclusions given a reasoning chain with (lexical) negation, a historically tough phenomenon for neural models. The results of multi-difficulty controlled experiments revealed that LLMs with CoT-style prompting struggled to address negation; a simple flip of one word (e.g., plausible→implausible) significantly hurt their performance. In addition, we have found consistent, systematic failure patterns unique to each LLM family. For example, some models always answered no across different question settings. In the future, we plan to analyze the models' internals and explore the source of this weakness.

Limitations
First, although we considered up to 31 LLMs, several other LLMs could not be evaluated due to computational limitations, e.g., PaLM-540B (Chowdhery et al., 2022) and PaLM 2 (Anil et al., 2023). Thus, evaluating the performance of these models is left to future work. Second, in terms of the generality of the obtained results, the examined prompt variations were limited, although we did examine prompts with different formats and orders (Appendix F). Third, in the current study, we adopted a somewhat peculiar setting from the perspective of the original CoT setting, in that the chain-of-reasoning process is given. Therefore, exploring the limitations of inference based on reasoning chains generated by the model itself will be an interesting direction from a practical perspective. Fourth, our analysis was limited to behavior-based probing; however, there are other paradigms to investigate (Lasri et al., 2022). In particular, inspecting the inner workings of the models would be important to understand the mechanism of the models' failure. However, this was difficult because some model parameters were not open, and the vast number of layers/heads/parameters in large models made it difficult to track the precise patterns of their inner workings. Finally, this study only considered lexical negation in English and was further confined to specific task formats and a certain type of syllogism. Therefore, extending the experimental scope will help further elucidate the exact limitations of the models.

Ethics Statement
Our findings demonstrate that LLMs struggle to address lexical negation under step-by-step CoT-style reasoning settings. This problem is generally related to the problem of hallucinations in LLMs. We hope that our findings help to understand this issue by highlighting the models' exact weakness against negation.
The synthetic dataset utilized in the current study was created using automatic rules; thus, there were no ethical concerns regarding human workers or annotators during the dataset creation processes. In addition, the entity distribution of the dataset is fully balanced, most entities are fictional, and there are no intended biases, e.g., a stereotypical relationship between gender and occupation.

B Lexical Negation
We created instances involving lexical negation by replacing an adjective in a question with one with a negative prefix. For example, the question Is the proposition "Messi did a stepover" plausible? is converted to Is the proposition "Messi did a stepover" implausible? Specifically, we used the terms listed in Table 4 to achieve this conversion. Note that the original SP task only adopts the word plausible. Here, we enhanced the diversity of the prompts to ensure the generality of our findings. The lexical negation list was created as follows: (i) GPT-4 was employed to generate nine synonyms of the word plausible, and then (ii) we manually added proper negation prefixes to each synonym to form the lexically negated terms.
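The conversion step can be sketched as a simple lookup-and-replace over base/negated word pairs. Only the two pairs explicitly named in this paper are included below; the full list is in Table 4, and the function name is ours.

```python
# Base -> negated pairs; illustrative subset of the Table 4 word list.
NEGATION_PAIRS = {
    "plausible": "implausible",
    "reasonable": "unreasonable",
}

def negate_question(question):
    """Replace the first matching base adjective with its negated form."""
    for base, negated in NEGATION_PAIRS.items():
        if base in question:
            return question.replace(base, negated, 1)
    return question

negate_question('Is the proposition "Messi did a stepover" plausible?')
# -> 'Is the proposition "Messi did a stepover" implausible?'
```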

C Task Details
Here, we describe the task settings in detail. To ensure the robustness of our findings, we conduct additional experiments on two tasks (in addition to the SP task), i.e., the OCCUPATION (OC) and weight transitivity (WEIGHT TRANS.; WT) tasks.
The results across the tasks support the overall conclusion derived in the main part of this paper (See Appendix D for additional information).

C.1 Fictional Names/Information
In the fictional settings, we used fictional entities in all tasks. Here, we used GPT-4 to generate the names of the fictional sports, occupations, and animals. We used five fictional sports and 1,217 fictional animals (see Table 5 for specific examples). In terms of people's names, we initially collected 50 typical male and female first names from the "Name Corpus: List of Male, Female, and Pet names", which is available in the CMU Artificial Intelligence Repository. We then randomly collected 100 family names from the "Telligent-CommunitySample 0.1.1" dataset, which is accessible via the PowerShell Gallery. Finally, we created a list of 100 fictional names for each sport by randomly combining the first and last names (examples for the SPORTS task are shown in Table 6).
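The final name-combination step might look like the sketch below; the name lists are toy stand-ins for the corpora named above, and the function name and seeding are our own choices.

```python
import random

# Toy stand-ins for the first-name and family-name lists described above.
FIRST_NAMES = ["Judy", "May", "Tom"]
FAMILY_NAMES = ["Tate", "Brook"]

def fictional_names(n, seed=0):
    """Randomly combine first and family names into n fictional full names."""
    rng = random.Random(seed)  # fixed seed for reproducibility
    return [f"{rng.choice(FIRST_NAMES)} {rng.choice(FAMILY_NAMES)}"
            for _ in range(n)]

names = fictional_names(100)  # one list of 100 names per fictional sport
```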
We also used the weight data of Mammals to generate the gold labels in the BASE (non-fictional) setting of the WEIGHT TRANS. task.

C.2 Task Formats
SPORTS Task: See Section 2 and Table 1.
OCCUPATION Task: The task format in the FICNEG-O setting is described as follows:

Few-shot exemplar:
Q: Is a sentence "PERSON is a TITLE" plausible?
A: PERSON is a OCCUPATION1. Only OCCUPATION1/2 are TITLE. So the answer is yes/no.

Target example:
Q: Is a sentence "PERSON is a TITLE" implausible?
A: PERSON is a OCCUPATION1. Only OCCUPATION1/2 are TITLE. So the answer is __

Put simply, the underlying reasoning flow is similar to that of the SP task; however, here, the entities (i.e., the occupation and property names) differ.
WEIGHT TRANS. Task: The task format in the FICNEG-O setting is as follows:

Few-shot exemplar:
Is a sentence "ANIMAL1 is heavier than ANIMAL2" plausible?
ANIMAL1/2 is heavier than ANIMAL3. ANIMAL3 is heavier than ANIMAL2/1. So the answer is yes/no.

Target example:
Is a sentence "ANIMAL1 is heavier than ANIMAL2" implausible?
ANIMAL1/2 is heavier than ANIMAL3. ANIMAL3 is heavier than ANIMAL2/1. So the answer is __

Here, the transitivity of reasoning (A>B, B>C, then A>C) is targeted.

C.3 Answer Distribution
Essentially, the yes:no ratio of the gold labels was approximately 1:1. Strictly speaking, the distribution differed slightly from 1:1 due to the random seed used in the dataset creation process. For example, for the SPORTS task, the BASE dataset included 496 yes labels and 504 no labels, and the FIC dataset included 495 yes labels and 505 no labels. The FICNEG dataset included 504 yes labels and 496 no labels, and the FICNEG-O dataset included 505 yes labels and 495 no labels.

D Full Results
All results for the SP, OC, and WT tasks are shown in Tables 9, 10, and 11, respectively. Note that the WT experiment was conducted at a 1/10 scale (1,000 instances = 100 seed instances × 10 negated words) as a supplementary experiment.
We also examined the NEG setting, where real (not fictional) entities were used but the questions involved negation, as an intermediate setting between the BASE and FICNEG settings. The performance of all models is shown in Table 8. As can be seen, the results are generally competitive with or slightly better than those obtained in the FICNEG setting. In other words, the models cannot handle negation even in natural text, and abstract reasoning over negation is even more difficult.
The experiments for the other (non-OpenAI) models were conducted using Huggingface Transformers (Wolf et al., 2020) with the 8-bit option (Dettmers et al., 2022). For the LLaMA (Touvron et al., 2023) models, we received the model weights from the LLaMA Release Team on May 25, 2023. In addition, we recovered the Vicuna and Alpaca (Taori et al., 2023) models based on the provided LLaMA weights. For the OPT models ranging from 1.3B to 66B, the OPT-IML models, and the OPT-IML-Max models (Iyer et al., 2022), we employed the models available from the Huggingface Community Model Hub. We received the model weights for the OPT-175B (Zhang et al., 2022) model from Meta on May 28, 2022. We also used the BLOOM (Scao et al., 2022), BLOOMZ (Muennighoff et al., 2022), and NeoXT-Chat-Base-20B (Together Computer, 2023) models available from the Huggingface Community Model Hub.

E.1 Model Settings During Generation
To ensure that the models only output yes or no, we applied some changes during the answer generation process. Specifically, for the OpenAI models, we introduced an equal logit bias for yes and no using the provided logit_bias option, while setting temperature = 0.0 and max_tokens = 1. For the other (non-OpenAI) models, we manually ascertained the logits of yes and no, ultimately using the greater of the two as the model's final response under the same settings as for the OpenAI models, i.e., temperature = 0.0 and max_new_tokens = 1.
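For the OpenAI-model case, the request parameters described above might be assembled as follows. This is a sketch, not the paper's code: the token ids are placeholders (the real ids depend on the tokenizer), and logit_bias, temperature, and max_tokens are standard OpenAI Completions parameters.

```python
# Hypothetical token ids for the "yes" and "no" tokens; look these up
# with the actual tokenizer in a real run.
YES_ID, NO_ID = 3763, 645

request_params = dict(
    temperature=0.0,                        # deterministic decoding
    max_tokens=1,                           # single-token answer
    logit_bias={YES_ID: 100, NO_ID: 100},   # equally boost yes/no tokens
)
```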

F Robustness over Different Prompts
To ensure the robustness of our results across different settings, we conducted supplementary experiments to investigate both prompt order and format. These experiments were conducted using the SP and OC tasks. Note that these supplementary experiments were conducted at a 1/10 scale (1,000 instances = 100 seed instances × 10 negated words).
Fictional prompt: The few-shot exemplars in the main experiments consistently involved real entities. Thus, we conducted supplementary experiments in which the few-shot exemplars pertained to fictional entities. These experiments were implemented under the FIC setting, and the results are presented in Table 7, where the values are the averages from the SP and OC tasks.
Prompt format: We explored the influence of the prompt format in both few-shot exemplars and target examples. Here, we used the following format for questions with gold labels designated as no (the format for questions with gold labels of yes was unaltered). A corresponding example is shown as follows:

Is a sentence "PERSON does ACTION" plausible? PERSON is a SPORTS player. ACTION happens/does not happen in SPORTS. So the answer is yes/no.
Is a sentence "PERSON does ACTION" implausible? PERSON is a SPORTS player. ACTION happens/does not happen in SPORTS. So the answer is no/yes.
Compared to the original format (Section 2), premise 2 changes. Here, the task is not to identify the consistency of the sports/occupation names; rather, the conclusion depends on the presence of does not.
The results are shown in Table 12. Note that both the accuracy and no-ratio values are the averages obtained from the SP and OC tasks.
Prompt order: We investigated the impact of the prompt order with a specific focus on the position of the no label in the three exemplars. The order of the three exemplars in the main experiments was yes, no, yes; thus, we conducted supplemental experiments where the gold label sequences were altered to yes, yes, no and no, yes, yes. The results of the prompt order experiments are shown in Table 13, which shows the averages from the SP and OC tasks.

G F1 Score
Certain models (e.g., the BLOOMZ and OPT families in Table 9) predominantly registered an accuracy of approximately 50% by consistently responding with no (or yes). Note that this pattern was particularly evident in the FICNEG-O setting, with the GPT-4 model being a significant outlier. To highlight these models, we provide the macro-averaged F1-scores in Table 14.

Figure 1 :
Figure 1: Overview of our experiments conducted to evaluate each model's reasoning ability against lexical negation. The model must answer no to the latter question about the implausibility of the sentence based on the valid logical flow. Here, to evaluate the robust logical skill separately, the controlled reasoning chain is given, and the model must derive the answer based solely on the validity of the logical flow without commonsense knowledge due to fictional entities, e.g., turboglide.

Figure 2 :
Figure 2: Relationship between model size (x-axis) and performance (y-axis) of the LLaMA models for each task. Each point indicates the average accuracy of the corresponding model and setting. The colored area indicates the standard deviation.

Table 1 :
General task format in each setting. Few-shot exemplars are first shown to a model, and then the model answers the target example given its question and reasoning chain. Symbols, e.g., A and α, are replaced with certain real or fictional entities in the actual input. The REAL setting indicates that the entity choices reflect factual reality, and FIC. indicates that the entity choices do not reflect factual reality, e.g., Is "Judy Tate was safe at first." plausible? Judy Tate is a turboglide player. Getting out at first happens in turboglide. So the answer is yes. Refer to Appendix A for the exact input.

Table 3 :
Average and standard deviation of the models' no-ratio in the model outputs for each setting in the SPORTS TASK, OCCUPATION TASK, and WEIGHT TRANS. TASK (scores are multiplied by 100).

Table 4 :
Full list of base words and lexically negated words used in the experiments.

Table 5 :
Examples of fictional sports, occupations, and animals we used.

Table 6 :
Examples of fictional person names we used for each fictional sport.

Table 7 :
Average and standard deviation of the models' accuracies and the no-ratio of the model outputs with the fictional prompt.

Table 8 :
Average and standard deviation of model accuracies and no-ratio for the NEG setting (i.e., real entities and questions with negation).

Table 9 :
Average and standard deviation of model accuracies and the no-ratio of the model outputs at each setting for the SPORTS TASK.

Table 12 :
Average and standard deviation of model accuracies and the no-ratio at each setting for the different prompt formats.

Table 14 :
Models' macro-averaged F1 scores in each setting.