Say What You Mean! Large Language Models Speak Too Positively about Negative Commonsense Knowledge

Large language models (LLMs) have been widely studied for their ability to store and utilize positive knowledge. However, negative knowledge, such as "lions don't live in the ocean", is also ubiquitous in the world but rarely mentioned explicitly in text. What do LLMs know about negative knowledge? This work examines the ability of LLMs on negative commonsense knowledge. We design a constrained keywords-to-sentence generation task (CG) and a Boolean question answering task (QA) to probe LLMs. Our experiments reveal that LLMs frequently fail to generate valid sentences grounded in negative commonsense knowledge, yet they can correctly answer polar yes-or-no questions. We term this phenomenon the belief conflict of LLMs. Our further analysis shows that statistical shortcuts and negation reporting bias from language modeling pre-training cause this conflict.


Introduction
Most of the world knowledge exists in a positive and affirmative form (Molnar, 2000; Barker and Jago, 2012; Vrandečić and Krötzsch, 2014; Speer et al., 2017). As a result, large language models (LLMs) pre-trained on a colossal amount of text, such as GPT-3 (Brown et al., 2020; Ouyang et al., 2022) and PaLM (Chowdhery et al., 2022), have demonstrated remarkable abilities for storing and utilizing positive knowledge in downstream tasks. In contrast, negative knowledge, such as the commonsense statement that "lions do not live in the ocean", is rarely mentioned in the textual world (Hossain et al., 2022). Such negative knowledge also exists in the real world, and is important for cognitive skills such as knowing what is not true or what not to think (MacDonald, 1965; Minsky, 1997; Barker and Jago, 2012). Therefore, we ask this question: do LLMs (such as GPT-3 models) acquire such implicit negative knowledge through extensive language modeling pre-training?

Figure 1: An example of the probing tasks studied in this paper. For the same negative commonsense knowledge <lion, located at, ocean>, which is false, we find LLMs often fail to generate texts grounded in such negative knowledge while knowing its validity according to question answering.
One important way of probing LLMs, which are mostly generative models, is checking whether the generated texts are knowledge-grounded, because the generation of texts is a direct manifestation of a model's internal beliefs about world knowledge (Kassner et al., 2021; Sumers et al., 2021; Tafjord et al., 2022). Our definition of belief is derived from Kassner et al. (2021): the assignment of a truth value to a proposition. In our study, the context for the proposition is the world knowledge that models learned; we therefore define a model's belief about such knowledge as its prediction of the truth value of a certain piece of world knowledge. Knowledge-grounded text generation has been a focus of NLP research (Yu et al., 2022). For example, the COMMONGEN benchmark (Lin et al., 2020) evaluates generative commonsense reasoning that organizes concepts as keyword input and generates a sentence grounded in commonsense knowledge. However, previous work does not consider negative knowledge, nor does it probe the consistency between what models know and what they generate. Another line of work on probing (Petroni et al., 2019; Ettinger, 2020; Kassner and Schütze, 2020; Cao et al., 2021) is conducted through the mask-infilling task. However, this task mainly evaluates bidirectional models (Devlin et al., 2019) and is not natural for unidirectional LLMs. It also suffers from the open-world problem in evaluation, i.e., there could be multiple valid answers to fill the mask. This is critical for evaluating negative knowledge, which has an infinite answer space, e.g., lions don't live in the sky, water, desk, car, etc.
In this study, we investigate the belief of LLMs about negative commonsense knowledge through the lens of text generation. Since LLMs have become a foundational service (Bommasani et al., 2021) and cannot be easily trained, we apply in-context learning (Brown et al., 2020) for the probing tasks, which is tuning-free. We design a Constrained Sentence Generation (CG) probing task following Lin et al. (2020), where the model must generate a knowledge-grounded sentence based on a given triple <s, r, o>. For example, given the triple <lion, located at, ocean>, a model should generate "lions do not live in the ocean". This task is simple and clear: the output sentence contains essentially the same information as the input keywords, so the generated texts are easy to evaluate according to the appearance of negation. We also add a Boolean Question Answering (QA) task that asks LLMs whether a knowledge triple is valid, which reveals their belief about this piece of knowledge. An example is given in Figure 1.
In our experiments, we find that LLMs of different sizes and shapes often produce hallucinated claims of negative knowledge, even when they answer yes-or-no questions about it correctly. We term this phenomenon the belief conflict, i.e., a model's actions (generating texts with a piece of knowledge) conflict with its belief (answering a question about it). Hallucinated generation of negative knowledge is seen both in our probing tasks and in downstream tasks, such as explanation generation (Jung et al., 2022), where negative knowledge plays an important role in the argumentation of refutation. Further analysis shows that this problem stems from statistical shortcuts and reporting bias of negation during pre-training. Moreover, such implicit biases can be alleviated through explicit reasoning with Chain-of-Thought prompting (Wei et al., 2022b), such as syllogistic deduction and related fact comparison.
The main contributions of this paper are summarized as follows: 1) We are the first to investigate LLMs' belief about negative knowledge in the commonsense domain, which may shed light on a previously unstudied aspect of LLMs' abilities.
2) We propose to probe generative LLMs through constrained sentence generation, which is effective for evaluating generated texts grounded in positive and negative knowledge. 3) Through extensive experiments, we identify and analyze LLMs' belief conflict phenomenon on negative commonsense knowledge, and provide insights on the causes of and solutions to such problems.

Related Work

Negative Knowledge Negative knowledge refers to knowledge about what is false or does not hold in the real world (Barker and Jago, 2012). It plays an important role in the human reasoning process, because to think effectively, we need to know what "not to think" (Minsky, 1997). Current research on negative knowledge in NLP mainly focuses on developing negative knowledge bases that store relational negative commonsense knowledge (Arnaout et al., 2021; Safavi et al., 2021; Arnaout et al., 2022) and utilizing negative knowledge within arguments or explanations to refute a candidate (Camburu et al., 2018; Aggarwal et al., 2021). This paper builds on these resources to probe the belief of LLMs about relations between everyday concepts that are not true.

Understanding Negation in Texts
The manifestation of negative knowledge in texts is the phenomenon of negation (Horn and Wansing, 2022), which is difficult for pre-trained LMs to understand, e.g., filling "birds cannot [MASK]" with "fly" (Kassner and Schütze, 2020). Negation has been shown to be spuriously correlated with negative or contradictory labels due to the data distribution (Gururangan et al., 2018; Ettinger, 2020; Lai et al., 2021; Branco et al., 2021; Tian et al., 2022), raising doubts about the performance of previous models. Furthermore, LMs may ignore the existence of negative words when understanding texts (Kassner and Schütze, 2020) or processing prompts (Jang et al., 2022), which can be alleviated with an unlikelihood training objective (Welleck et al., 2020) during training (Hosseini et al., 2021) or by specifying pragmatic contexts (Gubelmann and Handschuh, 2022). While most current research focuses on NLU, this work fills a gap in the investigation of the negation phenomenon in the context of text generation.
Knowledge-Grounded Language Models A major goal of NLP has been to ground LMs in world knowledge, such as factual knowledge (Vrandečić and Krötzsch, 2014) and commonsense knowledge (Speer et al., 2017). A line of work (Petroni et al., 2019; Kassner and Schütze, 2020; Cao et al., 2021) directly probes the knowledge implicitly learned by LMs through mask-infilling. However, such a probing paradigm only works for contextual LMs such as BERT (Devlin et al., 2019), leaving generative ones, especially modern LLMs, understudied. Another line of work focuses on making LM-generated sentences grounded in knowledge (Petroni et al., 2020; Liu et al., 2021). Lin et al. (2020) designed a constrained text generation task, COMMONGEN, which asks a model to generate a sentence given a set of concepts, testing the generative commonsense reasoning of LMs. However, these studies do not investigate text generation grounded in negative knowledge, which is the focus of this work.
In-Context Learning In-context learning (ICL; Brown et al., 2020) has become a prevailing paradigm for deploying LLMs (e.g., the GPT-3 family; Brown et al., 2020; Chen et al., 2021; Ouyang et al., 2022) for downstream tasks. Through ICL, LLMs can solve tasks directly based on input-output examples without parameter updates (Min et al., 2022a; Rubin et al., 2022). Furthermore, recent work (Wei et al., 2022b; Wang et al., 2022) reveals that the ceiling performance determined by the scaling law can be beaten with ICL by generating intermediate rationales, i.e., Chain-of-Thought (CoT) prompting. Since LLMs are becoming a foundational service that does not require fine-tuning, our probing of LLMs is based on ICL.

Probing Protocol
In this section, we set up an evaluation protocol to understand what LLMs know about (negative) commonsense knowledge of everyday concepts.

Data
We limit the scope of the knowledge probed to relational knowledge between commonsense concepts, i.e., relational knowledge triples, which exist widely in knowledge graphs and are commonly studied by the community (Auer et al., 2007; Vrandečić and Krötzsch, 2014; Speer et al., 2017). Given a triple in the form of <s, r, o> with a subject concept s, a relation r and an object concept o, we define a negative fact as ¬r(s, o) if the truth value of r(s, o) is False according to commonsense knowledge, and a (positive) fact otherwise.

Dataset Statistics
We build the probing dataset (denoted as CSK-PN) based on the knowledge triples filtered by Safavi et al. (2021), which are the challenging ones sourced from ConceptNet (Speer et al., 2017). We also remove invalid triples with pronouns, negation, and adjectives as subjects or objects. The final dataset contains a total of 4,000 triples with six pairs of positive or negative relations (e.g., ISA and NOTISA), and the positive and negative splits have the same size (1:1). Detailed information of CSK-PN is shown in Figure 2.

Probing Task Formulation
The most commonly used probing task for understanding whether LMs have certain types of knowledge is mask-infilling (Devlin et al., 2019; Petroni et al., 2020; Kassner and Schütze, 2020). However, this task is not suitable for generative LMs, since unidirectional models can only fill a mask at the end of a sequence.
We argue that LLMs, which are mainly autoregressive text generation models (Radford et al., 2019; Brown et al., 2020; Ouyang et al., 2022; Scao et al., 2022), should be investigated through text generation, with decoding over a large sentence space. Therefore, we propose to use Constrained Sentence Generation (CG) as the primary task to investigate LLMs, coupled with Boolean Question Answering (QA) for comparison, which is a common approach to probing the belief of models (Tafjord et al., 2022; Richardson et al., 2022).
Task 1: Boolean Question Answering (QA) The Boolean QA task requires LLMs to express their belief about a fact by answering a yes-or-no question. We first transform every triple <s, r, o> into a yes-or-no question q, where we remove the negation in r for negative facts. For example, a prompt goes like this:

    Answer commonsense questions with yes or no:
    (Examples for in-context learning)
    Question: do lions live in the ocean?
    Answer: no

where underlined texts are completed by LLMs. To generate the questions, we adopt InstructGPT using in-context learning (§4.1). The questions are 94% valid according to a manual inspection of 50 random cases.

Task 2: Constrained Sentence Generation (CG) Generating texts is a direct manifestation of a model's belief. However, evaluating generated texts is notoriously difficult in NLP, especially without references. Therefore, we design a keyword-to-sentence task to make the probing more controllable, similar to COMMONGEN (Lin et al., 2020). Given a triple <s, r, o>, models need to generate sentences grounded in (negative) knowledge, i.e., add negation cues (e.g., not, unable) to the sentence if necessary, e.g.,

    Write a short and factual sentence according to commonsense based on the keywords:
    (Examples for in-context learning)
    Keywords: lion, located at, ocean
    Sentence: lions don't live in the ocean.
We remove the NOT prefix from the negated relations. Note that we allow the paraphrasing of the input keywords, making it a soft-constrained sentence generation task.
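As a minimal sketch of how the two probing inputs can be assembled from a triple (the helper names are ours; the templates copy the prompt examples above):

```python
def build_qa_prompt(question, examples):
    """Assemble a Boolean QA prompt: instruction, k in-context examples,
    then the query question, leaving the answer for the LLM to complete."""
    lines = ["Answer commonsense questions with yes or no:", ""]
    for q, a in examples:
        lines += [f"Question: {q}", f"Answer: {a}", ""]
    lines += [f"Question: {question}", "Answer:"]
    return "\n".join(lines)

def build_cg_prompt(triple, examples):
    """Assemble a CG prompt; the NOT prefix is stripped from negated
    relations so the model itself must decide whether to negate."""
    s, r, o = triple
    r = r[4:] if r.startswith("not ") else r
    lines = ["Write a short and factual sentence according to commonsense "
             "based on the keywords:", ""]
    for kw, sent in examples:
        lines += [f"Keywords: {kw}", f"Sentence: {sent}", ""]
    lines += [f"Keywords: {s}, {r}, {o}", "Sentence:"]
    return "\n".join(lines)
```

The completion returned by the LLM for the trailing "Answer:" or "Sentence:" field is then scored by the metrics below.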

Evaluation Metrics
Metric for QA The QA task can be easily evaluated by checking whether the generated token is yes or no (cased or uncased). As with the CG task, we also use TP and TN as accuracy metrics. For rare scenarios (< 1%) where LLMs generate neither a yes nor a no token, we compare the conditional probabilities of these two tokens.
Metric for CG Due to the controlled task setting, which essentially forces LLMs to decide whether and how to add a negation cue during decoding, the CG task can be efficiently evaluated by detecting the existence of negation cues (e.g., not, unable, etc.) in the generations. We define TP and TN as the accuracy on the positive and negative splits of CSK-PN, and Acc as the accuracy on the whole dataset (i.e., Acc = (TP + TN)/2, since the positive and negative splits have equal size). To implement this metric, we first use keywords-based matching for negation cues, followed by a RoBERTa model (Liu et al., 2019) as a token classifier looking for unmatched negation cues. This metric produces 1 or 0 based on the finding of negation cues in a sentence. After manual inspection of 200 cases, we find that this metric is correct 97% of the time, which is reliable for evaluating such a constrained probing task. Errors are mostly due to double negations and ambiguous negative cues (e.g., less, opposite, etc.), which are quite rare.
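The keyword-matching part of this metric can be sketched as follows; the cue list here is illustrative, and the RoBERTa-based classifier for unmatched cues is omitted:

```python
import re

# Illustrative cue list; the paper's full list is longer and is backed
# up by a RoBERTa token classifier for cues the keywords miss.
NEGATION_CUES = {"not", "no", "never", "cannot", "unable", "without", "none", "nor"}

def has_negation(sentence):
    """Keyword-based detection of negation cues in a generated sentence."""
    tokens = re.findall(r"[a-z']+", sentence.lower())
    return any(t in NEGATION_CUES or t.endswith("n't") for t in tokens)

def cg_scores(generations, is_negative):
    """TP/TN: accuracy on the positive/negative splits; Acc is their mean
    (the two splits have equal size in CSK-PN)."""
    pos = [g for g, neg in zip(generations, is_negative) if not neg]
    neg = [g for g, neg in zip(generations, is_negative) if neg]
    tp = sum(not has_negation(g) for g in pos) / len(pos)
    tn = sum(has_negation(g) for g in neg) / len(neg)
    return tp, tn, (tp + tn) / 2
```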
Can we trust negation detection as the metric to evaluate CG? We manually evaluate the factuality of generated texts based on commonsense knowledge and check whether the CG metric (detection of negation) correlates well with humans on this task. Note that only sentences that make common sense and adhere to the keyword constraints are accepted as true during manual annotation. After examining 100 cases, we find that the agreement between human judgment and this metric reaches 95%. This is expected, since the task is rather easy and constrained; yet LLMs do not solve it well, and in particular are not very consistent with the QA task. Errors made by the metric mostly arise because 1) generated sentences use uncertain adverbs (e.g., may, some, etc.) to hedge, or 2) the dataset contains noisy triples. Overall, we consider this metric trustworthy, and it evaluates this task far better than most popular text generation metrics.

Probing LLMs with In-Context Learning
To execute the probing tasks without fine-tuning, we exploit few-shot in-context learning (Brown et al., 2020). We probe a series of LLMs, including GPT-3 (davinci) and its 6.7B variant (curie), Codex (code-davinci-002), and the InstructGPT models: the 6.7B variant (text-curie-001) and the ≥175B variants, i.e., text-davinci-001 (tuned on instructions), text-davinci-002 (tuned on code and instructions), and text-davinci-003 (further tuned with reinforcement learning from human feedback, RLHF). For deterministic predictions, all models use greedy decoding (temperature set to 0.0). We use InstructGPT 002 as the default LLM for experiments due to its powerful capability and the fact that it has been extensively researched and applied as of the time of writing this paper. We also include the recent ChatGPT (OpenAI, 2022), which is built upon InstructGPT and trained with dialogue data and RLHF.

The Belief Conflict
We report the results of the probing tasks in Table 1 for LLMs with 2- and 10-shot in-context learning.
Based on the results, we discover a clear conflict within LLMs: they behave inconsistently on negative commonsense knowledge across the QA and CG tasks, which we term the belief conflict. The conflict manifests itself in two ways: the gap between TP and TN on the CG task, and the gap in TN between the QA and CG tasks. In general, belief conflicts exist across LLMs of various sizes and structures. Ablated results per relation are presented in Appendix B.3. When specifically asked, LLMs can distinguish between positive and negative commonsense knowledge, as evidenced by stable and balanced scores on the positive and negative splits of the QA task. For CG, LLMs seem to accurately generate sentences grounded in positive knowledge according to TP. However, they perform poorly on negative knowledge, even for the best-performing LLMs, i.e., Codex 002 and InstructGPT 002, 003, as shown by the lower bars of the CG on the negative split. Also, the inconsistency between QA and CG reflects this conflict, as the content generated by a trustworthy AI system should be consistent with and faithful to what it believes. We present a case study and error analysis in Appendix B.5.
Among these LLMs, InstructGPT 003 and ChatGPT achieve much better results than others. We assume that such improvements are probably a result of training LLMs with human feedback (e.g., RLHF), based on the differences between them disclosed by OpenAI (https://beta.openai.com/docs/model-index-for-researchers). Another piece of evidence is that the recent ChatGPT also shows a strong capability for generating negative knowledge, even better than InstructGPT 003 in this regard. We hypothesize that this is because negative knowledge and rebuttal statements are frequently used in human feedback to steer the model, e.g., admitting errors or instructing the model not to do something. To validate this claim, future work could conduct more rigorous comparisons on publicly available LLMs; tracing certain abilities of LLMs to a specific period of training would be an interesting research problem. (Our findings in the experiments are consistent across different temperatures, according to Appendix B.1. The only exception among the probed models is GPT-3 (davinci): it scores poorly on the positive split with 10-shot learning, with TN exceeding TP. This happens when k ≥ 4, while its 6.7B variant (curie) behaves consistently with the others. Detailed results for GPT-3 are in Appendix B.2.)
Sensitivity to the Number of In-Context Examples To find out whether adding more examples helps solve the probing tasks, we increase the number of in-context examples from 0 to 32. Figure 3(a) shows a finding consistent with previous results: LLMs are so good at answering yes-or-no questions that the number of examples does not affect QA performance much. Figure 3(b) shows that adding more examples helps generate both positive and negative commonsense knowledge. However, the gap between TP and TN in the CG task still exists.
Analysis on the Belief Conflict

Could keywords as task input hinder the manifestation of LLMs' belief?
The difference in task input between CG and QA leads to a concern that LMs may find it easier to understand natural questions (QA) than keywords (CG); hence, the belief conflict. In response to this concern, we swap the input formats of the two tasks. For example, the keywords-to-answer task takes the form:

    Can these keywords form a truthful common sense fact? Answer with yes or no.
    Keywords: lion, located at, ocean
    Answer: no

As for the question-to-sentence task:

    Answer the question by writing a short sentence that contains correct common sense knowledge.
    Question: do lions live in the ocean?
    Sentence: lions don't live in the ocean.

Results
In Figure 4(a), we see a 4-point performance decrease given keywords as input for QA, which is not significant in comparison, and the results on the positive and negative splits remain as balanced as before. This implies that LLMs' imbalanced performance in CG is not due to the use of keywords as input. In Figure 4(b), CG performance is greatly improved given questions as input, approximating QA results. Our assumption is that CG is essentially transformed into QA, because the textual corpus contains many negated sentences that follow a Boolean question and rephrase it, e.g., "...? No, lions do not live in the ocean." To validate this, we provide LLMs with zero-shot question-to-sentence instructions and check whether the output sentences start with yes or no given an input question. If our assumption is correct, models without examples will be biased toward QA even with a question-to-sentence instruction. The results of models optimized for instructions show that 84.58% of sentences generated by InstructGPT 002 begin with yes or no, and 80.28% for InstructGPT 003.
With 10 examples, this number drops to less than 4%. Thus, these results confirm that question-to-sentence generation degenerates into the QA task. As a result, we conclude that keyword-to-sentence generation (CG) is an appropriate and challenging task for probing generative LLMs. Employing keywords as input does not impair LLMs' grasp of the task (Figure 4(a)), while using questions as input may introduce shortcuts that obscure whether LLMs can generate texts expressing negative commonsense knowledge (Figure 4(b)). Even with different instruction wordings (listed in Appendix A.2), none escapes the belief conflict, as shown by the error bars in Figure 4. Additionally, this experiment raises the question of how LLMs encode commonsense knowledge. According to this experiment, commonsense knowledge seems to be stored in LLMs in the same form as it appears in the corpus. LLMs struggle to generalize it, as evidenced by keyword inputs for negative knowledge, which lack a statistical shortcut from pre-training.

Will the keyword co-occurrence within corpus affect LLMs' generation?
LLMs are essentially statistical models. In this experiment, we investigate the influence of word co-occurrence in the corpus on the CG task, which is one of the most common statistical factors. We categorize the dataset into buckets based on keyword co-occurrence in naturally existing corpora such as OMCS (706K sentences, Singh et al., 2002) and Wikipedia (1M, a subset built by Gao et al. (2021)). The co-occurrence for each triple is calculated as Σ_{i,j} cooccur(w_i, w_j) / (l_s · l_o), where w_i ∈ s, w_j ∈ o, and l_s, l_o denote the word counts of subject s and object o, discarding stopwords.
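This statistic can be sketched as below, assuming a precomputed pair-count table from the reference corpus (the function names and bucket edges are ours):

```python
from itertools import product

def cooccurrence_score(subject, object_, pair_counts, stopwords=()):
    """Mean pairwise co-occurrence between content words of the subject
    and object: sum_{i,j} cooccur(w_i, w_j) / (l_s * l_o)."""
    s_words = [w for w in subject.split() if w not in stopwords]
    o_words = [w for w in object_.split() if w not in stopwords]
    if not s_words or not o_words:
        return 0.0
    # pair_counts maps an unordered word pair to its sentence-level count.
    total = sum(pair_counts.get(frozenset((wi, wj)), 0)
                for wi, wj in product(s_words, o_words))
    return total / (len(s_words) * len(o_words))

def bucket(score, edges=(10, 100, 1000)):
    """Assign a triple to a co-occurrence bucket, e.g. 0~10, 10~100, ..."""
    for i, e in enumerate(edges):
        if score <= e:
            return i
    return len(edges)  # the "> 1000" bucket
```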
From Figure 5, we have an interesting finding: three of the best-performing LLMs from Table 1 suffer a performance drop in TN on the > 1000 bucket of the negative split, the most frequent data bucket. In contrast, LLMs achieve their best TP in this bucket on the positive split. We conclude that the negative knowledge that is hard for LLMs to generate tends to involve subjects and objects that they have often seen together. For example, worm and bird usually co-occur in sentences, but models tend to generate "worms can eat birds." Such statistical shortcuts hinder the generation of negative knowledge. This is also validated by the TP results, where LLMs find it easy to generate sentences with frequently co-occurring entities in a positive fact.

How does the balance of positive and negative examples affect negation bias?
Figure 5: 10-shot CG results (TN, %) of the three best-performing LLMs on different co-occurrence buckets. a ∼ b denotes that keyword co-occurrence in a bucket ranges from a to b; n is the number of triples in a bucket.

A possible explanation for the difference between CG and QA is that LMs suffer from reporting bias of negation during pre-training, while answering questions with yes or no is quite balanced in the corpora. We validate this hypothesis by mitigating the negation bias through adjusting the ratio of positive and negative in-context examples. With more E−s, LLMs are encouraged to generate more negations.
Results In Figures 6(a) and 6(b), we adjust the ratio η = |E−| / k while fixing k. Figure 6(a) shows that InstructGPT 002 is very resilient to the example ratio in the QA task, except for extreme cases where only E+s or E−s are presented (i.e., η ∈ {0, 1}). This also demonstrates the robustness of adopting QA results as LLMs' belief. In Figure 6(b), the CG performance on the negative split improves as η grows. The turning point appears somewhere near η ∈ (0.9, 1), when E−s take over all the examples. Also, TP drops as E+s become fewer. What if we add E−s without dropping E+s? In Figures 6(c) and 6(d), we keep E+ constant (|E+| = 5) and increase |E−| from 5 to 15. With a sufficient amount of E+, TN in CG continues to increase without sacrificing TP.
Overall, Figure 6 presents the possibility that we can overcome the belief conflict brought about by reporting bias by increasing negated texts in the training data or in-context examples. However, this is not always feasible in practice.
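The example-ratio manipulation above can be sketched as follows (pool construction and the helper name are ours):

```python
import random

def sample_examples(pos_pool, neg_pool, k, eta, seed=0):
    """Draw k in-context examples with a fraction eta = |E-| / k of
    negative ones, shuffled so order does not correlate with polarity."""
    rng = random.Random(seed)
    n_neg = round(eta * k)
    chosen = rng.sample(neg_pool, n_neg) + rng.sample(pos_pool, k - n_neg)
    rng.shuffle(chosen)
    return chosen
```

Sweeping eta from 0 to 1 with k fixed reproduces the setting of Figures 6(a) and 6(b); holding the positive count fixed while growing the negative pool reproduces Figures 6(c) and 6(d).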

Does Chain-of-Thought prompting help generate texts with negative commonsense knowledge?
Can the implicit reporting bias be overcome by explicit reasoning? Recent studies (Wei et al., 2022b,a) show that LLMs benefit from generating intermediate reasoning steps in natural language, extending <input, output> to <input, chain-of-thought, output>. We adopt two instances of CoT: deductive reasoning and fact comparison, whose examples are manually written and given in Appendix A.1.
Deductive Reasoning Prompting We instantiate CoT with deductive argumentation in the form of syllogism (two premises and one conclusion). The prompt is extended into <input, "Let's think step by step: ...", output> with intermediate steps. A natural way to identify a negative proposition is deductive reasoning with modus tollens, i.e., denying the consequent (Speranza and Horn, 2010; Bobzien, 2020): "If P then Q. Not Q. Therefore, Not P." For example, "If something is an intelligent being (P), then it must have the ability to think (Q). Computers cannot think (Not Q). Therefore, computers are not intelligent beings (Not P)." To reason about positive propositions, we use modus ponens logic, i.e., affirming the antecedent (Bobzien, 2020): "If P then Q. P. Therefore, Q." For example, "Things with lightweight bodies and strong wing muscles (P) can usually fly (Q). Birds have these physical characteristics (P). Therefore, birds can fly (Q)." Notice that the deduction is not strictly logical, but it suffices to arrive at commonsense knowledge.

Table 2: CG results (TP, TN, Acc) of CoT prompting with k = 2 and k = 10 in-context examples at a 1:1 positive-to-negative ratio.

Fact Comparison Prompting Deduction emphasizes the intensional aspects of a fact, whereas fact comparison highlights the extensional comparison between counterpart facts (Fitting, 2006). For example, the related fact for "lions do not live in the ocean" is "lions live in the land". A negative fact often comes with a core fact that is true, which has been shown to be useful in explaining why a claim is wrong (Cheng et al., 2022). Therefore, we extend the <input, output> in each example to <input, "Related fact: ...", output>. For positive cases, we write a related fact to keep the examples consistent.
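Both CoT variants amount to inserting a rationale line between the input and the output of each in-context example; a sketch (field layout ours, following the prompt formats described above):

```python
def cot_example(keywords, rationale, sentence, variant="deduction"):
    """Extend an <input, output> CG example to <input, chain-of-thought,
    output>. 'deduction' uses a step-by-step syllogism; 'fact' inserts a
    related true fact before the target sentence."""
    middle = ("Let's think step by step: " if variant == "deduction"
              else "Related fact: ") + rationale
    return f"Keywords: {keywords}\n{middle}\nSentence: {sentence}\n"
```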

Results Table 2 displays the results of Codex 002 and InstructGPT 002. Both CoT instances improve LLMs' performance on TN, showing the benefit of explicit reasoning for deriving negative knowledge; different models prefer different rationales. However, the increase in TN comes at the expense of a performance drop in TP. This is mostly because the models previously predicted most cases to be positive, making TP irrationally high. Overall, these results suggest that, even though LLMs pick up implicit bias during pre-training, it can be overcome by making the reasoning chain explicit. Nevertheless, deductive reasoning seems to be more rigid about confirming commonsense knowledge, with a lower TP. This can be attributed to the fact that commonsense knowledge contains exceptions (Allaway et al., 2022), e.g., birds can fly but penguins can't. Thus, LLMs with deductive reasoning may hold concerns about exceptions when confirming a commonsense fact, leading to a significantly lower TP than fact comparison. We conduct a simple experiment on exceptions in Appendix B.4, which shows that adding adverbs of degree (e.g., usually, generally) to the texts alleviates the belief conflict, but the problem still exists.
Conclusion

In this study, we explored and quantified the limitations of LLMs in generating texts grounded in negative commonsense knowledge that they seem to know, a phenomenon we term the "belief conflict". To investigate it, we probe LLMs with a constrained sentence generation (CG) task, coupled with a QA task. Our experiments demonstrate the existence of the belief conflict in all LLMs when it comes to negative knowledge, which is mostly brought about by quantifiable statistical shortcuts such as keyword co-occurrence. We also show that it can be lessened by giving more in-context examples of negative knowledge or by using chain-of-thought (CoT) prompting to make explicit the reasoning process for deriving negative knowledge.
With the rapid growth of research on language-based reasoning (Clark et al., 2020; Tafjord et al., 2021; Wei et al., 2022b), there would be cause for concern if LLMs had trouble generating proofs or reasoning steps involving negative knowledge. For all the good scores they achieve on QA tasks, whether they can be trusted with the knowledge they express during generation, one of the most prominent modes of human-AI interaction, is still questionable. In this sense, the study of negative knowledge creates a good testbed for assessing real language-based reasoning skills of LLMs beyond the statistical heuristics they memorized. We hope that the findings in this work raise the community's awareness of negative knowledge for LLMs in downstream text generation tasks.

Limitations
In this work, the probing tasks are placed in the commonsense domain and cover knowledge that is generally acknowledged by people in most situations. We do not consider exceptions to commonsense knowledge, which have gradually drawn research attention (Do and Pavlick, 2021; Allaway et al., 2022). Exceptions are important for negative knowledge and are widely used in tasks such as argumentation or deductive reasoning. However, in the experiments, we find that such exceptions might make models generate commonsense statements with uncertain adverbs (e.g., may, some, etc.) in rare cases.
Another limitation of this work is that the probing task is based only on relational commonsense knowledge from commonsense knowledge bases such as ConceptNet. We design the keyword-to-sentence task mostly for the convenience of evaluating text generation, which is notoriously difficult. Probing and evaluating LLMs' beliefs about negative knowledge in more complex tasks are beyond the scope of this work, but remain interesting and challenging. Other types of knowledge could also be studied in a similar way, such as negative social, temporal, and spatial knowledge, to name but a few.
In this paper, we identify the belief conflict problem in LLMs through extensive experiments. Future work could explore more advanced training or prompting-based methods to improve the consistency between a model's belief and its actions (text generation for various tasks), especially for negative knowledge.

Ethical Statement
The commonsense knowledge triples from ConceptNet may include offensive and biased sentences, which may also exist in the dataset that we use in this work. As stated before, the identification of negative commonsense knowledge may vary slightly among people from different cultural and social backgrounds when considering exceptions. Table 6.

A.2 Example Prompts for the Probing Tasks
The task inputs to the LLMs are presented in Table 3. Note that the instructions are interchangeable: LLMs with in-context learning are known to be sensitive to the wording and examples in the prompts (Min et al., 2022b). Therefore, we manually write 4 interchangeable instructions for each probing task. For the QA task, the instructions include:
1. Answer the commonsense questions with yes or no.
2. Choose "yes" or "no" to indicate whether you agree or disagree with the commonsense questions.
3. Respond to the questions using "yes" or "no".
4. Indicate whether the commonsense questions are correct or incorrect by writing "yes" or "no".
For the CG task, the instructions include:
1. Write a short and factual sentence according to commonsense based on the keywords:
2. Use the keywords to create a short and factual sentence that accurately reflects commonsense knowledge.
3. Create a short, factual sentence based on the keywords and what is generally accepted as true.
4. Construct a factual and concise statement based on the provided keywords and commonsense knowledge.
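As a concrete illustration, the few-shot prompts for the two probing tasks could be assembled as below. The instruction strings are taken from the paper; the example triples, the "Question:/Answer:" and "Keywords:/Sentence:" formatting, and the helper names are our own illustrative assumptions, not the paper's exact templates.

```python
# Sketch of few-shot prompt assembly for the QA and CG probing tasks.
# Instruction texts are from the paper; the formatting is an assumption.

QA_INSTRUCTION = "Answer the commonsense questions with yes or no."
CG_INSTRUCTION = ("Write a short and factual sentence according to "
                  "commonsense based on the keywords:")

def build_qa_prompt(examples, question):
    """examples: list of (question, answer) pairs used as in-context shots."""
    lines = [QA_INSTRUCTION]
    for q, a in examples:
        lines.append(f"Question: {q}\nAnswer: {a}")
    lines.append(f"Question: {question}\nAnswer:")
    return "\n\n".join(lines)

def build_cg_prompt(examples, keywords):
    """examples: list of (keyword-triple, sentence) pairs."""
    lines = [CG_INSTRUCTION]
    for kw, sent in examples:
        lines.append(f"Keywords: {', '.join(kw)}\nSentence: {sent}")
    lines.append(f"Keywords: {', '.join(keywords)}\nSentence:")
    return "\n\n".join(lines)

prompt = build_qa_prompt(
    [("Do lions live in the ocean?", "no")],  # one in-context shot
    "Do fish live in water?",
)
print(prompt)
```

The same skeleton works for any of the four interchangeable instructions: swapping the instruction string is enough to test the models' sensitivity to prompt wording.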

B Additional Results
B.1 Sensitivity to Temperature Tuning
Figure 7 shows that temperature has little influence on performance; thus, the findings of this paper are not sensitive to temperature tuning.

B.2 Abnormal Results of GPT-3 (davinci)
Different from the trends of other LLMs reported in § 4.2, GPT-3 davinci shows a confusing pattern of results on the CG task. A more detailed experiment in Figure 8(a) shows that, when k < 4, GPT-3 (davinci) performs similarly to its sibling LLMs, with TP greatly surpassing TN. TN continues to grow as k increases, eventually surpassing TP. In terms of Acc over the whole dataset, GPT-3 does not achieve results as good as other GPT-3 derivatives. However, a smaller version of GPT-3 (i.e., curie, 6.7B) does not exhibit such a pattern, according to Figure 8(a). We do not have a proper explanation for this finding, but further training on code and instruction tuning (i.e., Codex and InstructGPT) seems to fix this problem.

B.3 Results of Different Relation Types
What types of relations do LLMs find the most difficult to verbalize? As seen in Figure 9, LLMs achieve good results on the positive split. On the negative split, LLMs unanimously find NOTHASPROPERTY to be the most difficult relation.

B.4 Do LLMs hold concerns about exceptions for commonsense knowledge?
Commonsense knowledge usually comes with exceptions. Could the reason that LLMs answer or generate commonsense knowledge incorrectly be that they are thinking about exceptions? For example, "birds can fly, but penguins cannot" (Allaway et al., 2022). So when asked "can birds fly?", LLMs may think of a counterexample and thus arrive at the answer no. We rephrase the in-context examples by adding adverbs of degree (e.g., typically, generally, usually, most, etc.) to make the tasks about general commonsense instead of exceptions. For instance, we rewrite "can birds fly?" into "can most birds fly?" or "can birds generally fly?", and "lions don't live in the ocean." into "lions don't usually live in the ocean." In this way, we make the language explicitly convey uncertainty (Reiter, 2019) and try to rule out exceptions in the tasks.
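The rewriting trick above could be sketched programmatically as follows. The placement rules (adverb before the final verb of a question, or after the negation word) are a simplification we assume for illustration; the paper rewrites its prompts manually.

```python
# Illustrative sketch of inserting adverbs of degree to turn universal
# claims into general ones. The heuristics below are assumptions, not
# the paper's procedure.

def soften_question(question, adverb="generally"):
    """'Can birds fly?' -> 'Can birds generally fly?'
    Naive heuristic: place the adverb before the last word (the verb)."""
    words = question.rstrip("?").split()
    return " ".join(words[:-1] + [adverb, words[-1]]) + "?"

def soften_negation(sentence, adverb="usually"):
    """"lions don't live ..." -> "lions don't usually live ..."
    Insert the adverb right after the first negation cue found."""
    words = sentence.split()
    negations = {"don't", "doesn't", "cannot", "can't", "not"}
    for i, w in enumerate(words):
        if w.lower() in negations:
            return " ".join(words[: i + 1] + [adverb] + words[i + 1 :])
    return sentence  # no negation found; leave unchanged

print(soften_question("Can birds fly?"))                  # Can birds generally fly?
print(soften_negation("lions don't live in the ocean."))  # lions don't usually live in the ocean.
```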
Based on the results in Table 4, we find that adding adverbs of degree to the texts does improve LLMs' performance on both CG and QA. This suggests that LLMs do hold a certain amount of concern about exceptions when dealing with commonsense reasoning, especially for negative knowledge. However, considering exceptions with this trick still does not resolve the belief conflict. This approach could also serve as a useful trick for future commonsense research.

Table 5 presents some examples generated by InstructGPT 002 (10-shot). In the 1st case, the model correctly generates a negative commonsense sentence. The 2nd one suffers from the problem of weak negation, i.e., for a negative triple, the model sometimes uses "may" or "some" as weak negation, which is not detected by the negation cue detector metric. The 3rd one is unfaithful to the constraints: the model generates information outside the input triple to avoid generating negation. The 4th one is wrong due to noise in the dataset. In the 5th case, probably due to the high co-occurrence of the concepts worms and birds, the model generates a positive sentence.
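A lexicon-based negation cue detector of the kind mentioned above could look like the sketch below. The cue and hedge lists are our own assumptions; the paper's actual detector may use a different lexicon or a learned model. The second function shows why "weak negation" (e.g., "may", "some") slips past a strict cue detector.

```python
# Keyword-based negation cue detection, sketched for illustration.
# The cue lists are assumptions, not the paper's actual lexicon.
import re

NEGATION_CUES = {
    "not", "no", "never", "none", "nobody", "nothing", "neither", "nor",
    "cannot", "can't", "don't", "doesn't", "didn't", "won't", "isn't",
    "aren't", "wasn't", "weren't", "without",
}
WEAK_CUES = {"may", "might", "some", "sometimes", "rarely"}

def _tokens(sentence):
    # lowercase word tokens, keeping apostrophes so "don't" survives
    return re.findall(r"[\w']+", sentence.lower())

def has_negation(sentence):
    """True if the sentence contains an explicit negation cue."""
    return any(t in NEGATION_CUES for t in _tokens(sentence))

def has_weak_negation(sentence):
    """Flags hedged sentences like 'some lions may ...' that the strict
    cue detector misses (the weak-negation failure mode)."""
    return any(t in WEAK_CUES for t in _tokens(sentence))

print(has_negation("Lions don't live in the ocean."))  # True
print(has_negation("Lions live in the ocean."))        # False
```

Under this sketch, a generation like "some lions may live near water" passes `has_negation` as a false negative, which is exactly the weak-negation case in the 2nd example.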