Is ChatGPT a Good Causal Reasoner? A Comprehensive Evaluation

Causal reasoning ability is crucial for numerous NLP applications. Despite the impressive emerging abilities of ChatGPT on various NLP tasks, it is unclear how well ChatGPT performs in causal reasoning. In this paper, we conduct the first comprehensive evaluation of ChatGPT's causal reasoning capabilities. Experiments show that ChatGPT is not a good causal reasoner, but a good causal explainer. Besides, ChatGPT exhibits serious hallucination in causal reasoning, possibly due to the reporting biases between causal and non-causal relationships in natural language, as well as ChatGPT's upgrading processes, such as RLHF. The In-Context Learning (ICL) and Chain-of-Thought (CoT) techniques can further exacerbate such causal hallucination. Additionally, the causal reasoning ability of ChatGPT is sensitive to the words used to express the causal concept in prompts, and close-ended prompts perform better than open-ended prompts. For events in sentences, ChatGPT excels at capturing explicit causality rather than implicit causality, and performs better in sentences with lower event density and smaller lexical distance between events. The code is available at https://github.com/ArrogantL/ChatGPT4CausalReasoning.


Introduction
Causal reasoning ability is crucial for numerous NLP applications. Recent causal reasoning systems are mainly based on fine-tuned pre-trained language models (PLMs) such as BERT (Devlin et al., 2019) and RoBERTa (Liu et al., 2019). However, their causal reasoning abilities rely on supervised training with large amounts of annotated data.
Most recently, ChatGPT has achieved remarkable performance in various NLP tasks without the need for supervised training. However, there is currently no work that comprehensively evaluates ChatGPT's ability in causal reasoning.
Figure 1: The forms of three causal reasoning tasks and the prompts we use. The content that requires ChatGPT to reply is marked in red. The multiple-choice CD task also involves samples that ask for selecting the result of the input event. For such samples, we modify the "cause" in the question to "result".
Firstly, we utilize the Event Causality Identification (ECI) task as a comprehensive causal reasoning benchmark. As shown in Figure 1, the ECI task aims to determine whether there is a causal relationship between two events in a sentence. This requires ChatGPT not only to use extensive commonsense knowledge but also to understand the complex context composed of multiple entities and events. Finally, ChatGPT must combine all of this information to identify causal relationships.
Secondly, we employ the Causal Discovery (CD) task for evaluation, which requires ChatGPT to possess broader and more specialized knowledge, yet does not necessitate consideration of complex contexts. As shown in Figure 1, two CD task formats are used: 1) multiple-choice, which aims to select the cause or effect of the input event from two options; 2) binary classification, which aims to determine whether there is a causal relationship between the two input events. For binary classification, we convert each multiple-choice example into two binary-classification examples by pairing the input event with each of the two options. Our experiments indicate that binary classification is a more reliable evaluation method for ChatGPT.
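This conversion can be sketched as follows (a minimal illustration; the field names and data layout are our own, not the paper's actual preprocessing code):

```python
def to_binary_examples(input_event, option1, option2, correct_idx, ask_for="cause"):
    """Convert one multiple-choice CD example into two binary-classification
    examples by pairing the input event with each of the two options.

    correct_idx (0 or 1) marks the gold option; only that pairing is
    labeled as causal.
    """
    return [
        {"event": input_event, "candidate": option, "ask_for": ask_for,
         "label": i == correct_idx}
        for i, option in enumerate([option1, option2])
    ]
```

For a COPA-style item asking for the cause, this yields one positive and one negative binary example from a single multiple-choice question.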
Furthermore, as shown in Figure 1, we conduct the Causal Explanation Generation (CEG) task to test whether ChatGPT can generate explanations for causal relations between events. This is typically used to test whether machines truly understand the principles behind causality, which is crucial for building a reliable causal reasoning system.

Key takeaways
The key findings and insights are summarized as follows:
• ChatGPT is not a good causal reasoner, but a good causal explainer.
• ChatGPT has a serious causal hallucination issue, where it tends to assume causal relationships between events, regardless of whether those relationships actually exist.
• The main reason for ChatGPT's causal hallucination may be the reporting biases between causal and non-causal relationships in natural language, as well as ChatGPT's upgrading processes, such as RLHF (Ouyang et al., 2022b). Besides, the ICL and CoT (Wei et al., 2022) prompts can further exacerbate the causal hallucination of ChatGPT.
• The causal reasoning ability of ChatGPT is sensitive to the words used to express the causal concept in the prompt.
• As the number of events in a sentence increases, and the lexical distance between events becomes greater, ChatGPT's causal reasoning performance decreases. Besides, ChatGPT is better at identifying explicit causality than implicit causality.
• Open-ended generation prompts cannot improve ChatGPT's causal reasoning ability.
Related Work

Causal Reasoning in NLP
Causal reasoning ability is important in NLP. Previous work has made significant efforts to improve the causal reasoning ability of machines, such as incorporating external knowledge (Liu et al., 2020; Du et al., 2021; Liu et al., 2023; Wang et al., 2023), conducting causal-specific pre-training (Li et al., 2021; Zhou et al., 2022), or applying data augmentation techniques (Li et al., 2020; Zuo et al., 2021a,b). However, these methods are highly dependent on annotated training data in specific domains and task formats, and they perform poorly in scenarios where annotated data is scarce. In this paper, we evaluate ChatGPT, which does not require training data.

Evaluation of ChatGPT's Capabilities
Recently, a large amount of work has evaluated ChatGPT's various capabilities. However, ChatGPT's causal reasoning capabilities have not been fully evaluated. Qin et al. (2023) and Chan et al. (2023) only employed the multiple-choice CD format on the COPA dataset to evaluate ChatGPT, which consists of only 1,000 examples primarily focused on simple everyday causality. Besides, our experiments show that this multiple-choice format leads to an overestimation of ChatGPT's performance. Furthermore, Kıcıman et al. (2023) claimed that ChatGPT achieved 97% accuracy on the causal discovery task. However, they only required ChatGPT to determine the causal direction of 583 causal event pairs, without requiring it to predict whether causality exists, which does not constitute a complete causal discovery task. In summary, previous evaluations only involved small-scale datasets and simple task formats, which overestimated ChatGPT's causal reasoning abilities. In contrast, we conduct a comprehensive and objective evaluation of ChatGPT's causal reasoning abilities, involving four different task forms and five widely-used causal reasoning datasets.

Causal Discovery
We conduct experiments on two widely used CD datasets: 1) COPA (Roemmele et al., 2011), which is a classic dataset for causal reasoning and consists of 1,000 multiple-choice questions that primarily focus on everyday life scenarios; 2) e-CARE (Du et al., 2022), which contains 21,324 multiple-choice questions covering a wide range of domains. We adopt Accuracy as the evaluation metric.

Causal Explanation Generation
We conduct experiments on e-CARE, which contains human-annotated causal explanations for 21,324 causal event pairs. Following the evaluation settings of e-CARE, we first adopt average-BLEU (n=4) (Papineni et al., 2002) and ROUGE-L (Lin, 2004) as the automatic evaluation metrics. Secondly, we sample 100 explanations generated by each version of ChatGPT on e-CARE for human evaluation. Specifically, we label whether each generated explanation can explain the corresponding causal fact to calculate the accuracy.
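As a reference point for how ROUGE-L scores a generated explanation against a gold one, a minimal LCS-based sketch is shown below (the β weighting follows the common ROUGE-L convention; the official e-CARE evaluation script may differ in tokenization and parameters):

```python
def lcs_length(a, b):
    # Dynamic-programming longest common subsequence length over token lists.
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]

def rouge_l(hypothesis, reference, beta=1.2):
    """ROUGE-L F-score between a hypothesis and a reference string."""
    hyp, ref = hypothesis.lower().split(), reference.lower().split()
    lcs = lcs_length(hyp, ref)
    if lcs == 0:
        return 0.0
    prec, rec = lcs / len(hyp), lcs / len(ref)
    return (1 + beta ** 2) * prec * rec / (rec + beta ** 2 * prec)
```

Because the score is driven by the longest common subsequence with the reference, longer but still-correct ChatGPT explanations tend to preserve recall while losing n-gram precision, consistent with the BLEU/ROUGE gap discussed below.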

Experiment Setting
For ChatGPT, we follow an instruction-prompt scheme for the ECI, CD and CEG tasks. Figure 1 shows the prompts employed for these three causal reasoning tasks. We evaluate ChatGPT's performance under zero-shot settings. Additional prompts and settings are discussed in §5.
We conduct our experiments using OpenAI's official API, covering four progressive SOTA versions of ChatGPT: text-davinci-002, text-davinci-003, gpt-3.5-turbo and gpt-4. Specifically, text-davinci-002 was further trained using RLHF to obtain text-davinci-003, which was subsequently further trained on conversational data to obtain gpt-3.5-turbo. Although OpenAI has not disclosed the details of gpt-4, gpt-4 has shown superior reasoning capabilities in various NLP tasks. For gpt-4, we sample 1,000 instances from each dataset for evaluation. We set the temperature parameter to 0 to minimize randomness.
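For reference, a zero-shot ECI query can be assembled roughly as below (the wording paraphrases the prompt shown in Figure 1 and is illustrative only; the resulting string would be sent through OpenAI's chat/completions API with temperature set to 0):

```python
def build_eci_prompt(sentence, event1, event2):
    """Assemble a zero-shot ECI prompt asking for a Yes/No answer about a
    pair of event mentions in the sentence."""
    return (
        f"Input: {sentence}\n"
        f'Question: is there a causal relationship between '
        f'"{event1}" and "{event2}"? Answer (Yes/No):'
    )
```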

Baselines
In this paper, all baseline methods for the three causal reasoning tasks are based on PLMs fine-tuned on the full training dataset.
For the ECI and the CD tasks, we compare ChatGPT with vanilla classification models based on BERT-Base (Devlin et al., 2019) and RoBERTa-Base (Liu et al., 2019). Their framework and training process are consistent with previous work (Liu et al., 2020; Li et al., 2021).
Besides, we compare ChatGPT with two SOTA ECI methods: KEPT (Liu et al., 2023), which is based on BERT-Base and incorporates background and relational information for causal reasoning; and DPJL (Shen et al., 2022), which is based on RoBERTa-Base and introduces information about causal cue words and the relations between events into the ECI model.

Event Causality Identification
Table 1 shows the results on the three ECI datasets: ESC, CTB and MAVEN-ERE. We find that: Firstly, ChatGPT, even gpt-4, is comprehensively outperformed by baseline methods based on fine-tuned small PLMs. This indicates that ChatGPT is not a good causal reasoner in complex causal reasoning tasks like ECI. Secondly, the recall of ChatGPT is high, but the precision is low, indicating that a large number of non-causal event pairs are falsely identified as causal pairs. This is also why ChatGPT performs particularly poorly on the CTB dataset, which contains more non-causal event pairs. The main reason for this may be that natural language contains a large number of descriptions of causal relationships, mainly indicated by causal cue words such as "lead to" and "therefore". However, natural language generally does not express which events are not causally related. Since ChatGPT's ability comes from training on massive amounts of natural language text, such reporting bias between causal and non-causal event pairs makes ChatGPT good at identifying causal event pairs but poor at recognizing non-causal ones.
Besides, it can be observed that the fine-tuned small PLMs do better at identifying non-causal event pairs than causal ones. This is because there are many more negative examples than positive examples in the ECI datasets, and the fine-tuned models have learned this data distribution.

Causal Discovery
Table 2 shows the results on the two CD datasets: COPA and e-CARE. We find that: Firstly, although ChatGPT performs well in multiple-choice, its performance is poor in binary classification. The main reason is that in multiple-choice, ChatGPT only needs to consider the option that shows the more obvious causal or non-causal relationship with the input event, while the other, more difficult option can be ignored. However, previous work (Qin et al., 2023; Chan et al., 2023) only used multiple-choice to evaluate ChatGPT's causal reasoning ability, leading to a misrepresentation that ChatGPT is good at causal reasoning.
Secondly, compared to the ECI task, ChatGPT achieves higher accuracy on non-causal pairs in the CD task. This is mainly because the non-causal pairs in the e-CARE and COPA datasets are generated manually given an input event, and they have a simple structure and weak correlation with the input events, making them easier to identify. This is also the reason why the fine-tuned small PLMs do better at identifying non-causal event pairs than causal ones. Besides, compared to the ECI task, ChatGPT achieves slightly lower accuracy on causal pairs in the e-CARE dataset. This is because e-CARE requires ChatGPT to grasp a wider range of knowledge, involving not only commonsense knowledge in more scenarios, but also professional knowledge in certain fields, such as biology.
More importantly, we notice that the upgrading process of ChatGPT (text-davinci-003 → gpt-3.5-turbo → gpt-4) leads ChatGPT to become increasingly inclined to classify events as having a causal relationship, regardless of whether that is actually correct. This may be due to the alignment tax (Ouyang et al., 2022a) of RLHF. This indicates that while OpenAI (2023) mentioned that ChatGPT's upgrading process reduces the hallucination issue in various other tasks, it also makes ChatGPT better at fabricating causal relationships. A preliminary analysis of the impact of RLHF on causal reasoning is provided in Appendix A.

Causal Explanation Generation
Table 3 shows the experimental results on the CEG task. It can be observed that: Firstly, according to the human evaluation results, the accuracy of causal explanations generated by ChatGPT is close to that of explanations generated by humans. This indicates that ChatGPT is a good causal explainer.
Secondly, compared to "Human Generation", ChatGPT achieves a better ROUGE-L score, a text generation metric similar to "recall" in text classification. This is because ChatGPT tends to generate explanations that are more complete and detailed, which was confirmed by the evaluators during our human evaluation process. This is also the reason why ChatGPT received a lower AVG-BLEU score, a text generation metric similar to "precision" in text classification.
Thirdly, in the manual evaluation, we find that the explanations generated by LLaMA and FLAN-T5 are highly correlated with the input events. However, these explanations might be mere repetitions of the input events, or provide relevant but uninformative descriptions that cannot serve as explanations. This is also a reason for the poor performance of both LLaMA and FLAN-T5 in human evaluations.
Besides, compared to ChatGPT, the explanations provided by LLaMA and FLAN-T5 are noticeably shorter, as the gold explanations provided by e-CARE are very concise. However, ChatGPT excels at providing more comprehensive and detailed explanations in the zero-shot setting. This shows the advantage of ChatGPT's causal explanations compared to traditional fine-tuning methods.
Moreover, it is worth noting that fine-tuned LLaMA, FLAN-T5 and ChatGPT achieve similar ROUGE-L scores, but the two fine-tuned LLMs perform worse in our human evaluation. This is because the fine-tuned LLaMA and FLAN-T5 may generate less informative explanations, e.g., mere repetitions of the input events, whereas ChatGPT may offer valuable explanations for the input causal event pairs, but from different perspectives or in distinct syntactic forms compared to the gold explanations in e-CARE.

In-Context Learning
As shown in Table 4 and Table 5, we analyze ChatGPT under different In-Context Learning settings: 1) "x pos + y neg": we randomly select x positive and y negative training examples as demonstrations for in-context learning, and all test cases share the same demonstrations; 2) "top k similar": for each test case, we retrieve the top k most similar training examples as its demonstrations. To compute the similarity, we first encode the examples with BERT-large, then compute the cosine similarity of their embeddings. We further analyze the impact of the order and the label distribution of ICL demonstrations in Appendix B and C.
Firstly, when both x and y are less than or equal to 4, ICL mainly improves ChatGPT's accuracy on causal pairs, but decreases the accuracy on non-causal pairs. This may be because, although ICL can stimulate ChatGPT's abilities, ChatGPT is better at identifying causal event pairs. Therefore, ICL further exacerbates the imbalance of ChatGPT in identifying causal and non-causal pairs.
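The "top k similar" retrieval reduces to ranking training examples by cosine similarity of sentence embeddings. A self-contained sketch follows (plain-Python vectors stand in for the BERT-large embeddings used in the paper):

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def top_k_similar(test_vec, train_vecs, k=4):
    """Return the indices of the k training examples whose embeddings are
    most similar to the test embedding (descending cosine similarity)."""
    ranked = sorted(range(len(train_vecs)),
                    key=lambda i: cosine(test_vec, train_vecs[i]),
                    reverse=True)
    return ranked[:k]
```

In practice the vectors would come from an encoder such as BERT-large, and the returned indices select the demonstrations placed before each test case.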
In addition, "4 pos + 48 neg" achieves higher Full accuracy. However, this is because it improves Neg accuracy at the expense of Pos accuracy: the ESC dataset contains a larger proportion of non-causal pairs, so Neg accuracy has a greater impact on Full accuracy. A substantial improvement in overall performance should improve both the Pos and the Neg accuracy, not rob Peter to pay Paul.

Chain-of-Thought Prompting
As shown in Table 6, we analyze ChatGPT under different Chain-of-Thought settings: 1) "-w/ CoT zero-shot": we employ zero-shot CoT (Kojima et al., 2022) by adding "Let's think step by step" after the prompt; 2) "-w/ CoT x pos + y neg": we manually annotate the reasoning chains for x positive and y negative training examples, which are selected as demonstrations for in-context learning, and all test cases share the same demonstrations.
We further analyze the error types of ChatGPT in Appendix D. Examples of the demonstrations we used and the reasoning chains generated by ChatGPT are presented in Appendix E.
Firstly, "-w/ CoT zero-shot" cannot effectively improve the performance of ChatGPT on the ECI task. This is because the quality of the reasoning chains generated by zero-shot CoT is not high enough to effectively guide the model.
Secondly, "-w/ CoT x pos + y neg" improves ChatGPT's accuracy on causal pairs, but decreases its accuracy on non-causal pairs. Observing the reasoning chains generated by ChatGPT, we find that ChatGPT generates lower-quality chains for non-causal pairs than for causal pairs. This difference may worsen the imbalance of ChatGPT in identifying causal and non-causal event pairs.

Ways of Expressing Causality in Prompts
As shown in Figure 2, we analyze the performance of ChatGPT on the ECI task using prompts that express the causal concept in different ways: 1) "counterfactual", a prompt based on the counterfactual causality view of Pearl (2009); 2) "one-step", where we add the constraint words "one-step" to alleviate the issue of identifying non-causal event pairs as causal; and 3) "trigger(<X>)", where we use different causal cue words <X> (e.g., "lead to") to construct prompts. Results are shown in Table 7.
Firstly, the "counterfactual" prompt causes almost all non-causal pairs to be identified as causal. This is mainly because ChatGPT's counterfactual reasoning results are not accurate.
Secondly, "one-step" improves ChatGPT's accuracy on non-causal pairs, but lowers its accuracy on causal pairs. This is because, while constraint words such as "one-step" can make the model more likely to predict event pairs as non-causal, they do not truly enhance ChatGPT's causal reasoning abilities.
Moreover, the performance of "trigger(<X>)" with different causal cue words varies significantly. This may be because, during pre-training, ChatGPT mainly learns causal knowledge triggered by causal cue words, but the distributions

Lexical Distance between Events
As shown in Figure 3, we analyze the performance of ChatGPT on pairs of events with different lexical distances in the ECI task. The "lexical distance" refers to the number of words that separate two events within a sentence. Firstly, we find that as the event distance increases, ChatGPT is more inclined to predict event pairs as non-causal. This may be because, in natural language, the larger the distance between events, the less likely there is a causal relationship, and ChatGPT has learned this pattern.
Secondly, as the event distance increases, the F1 scores of ChatGPT decrease. This indicates that ChatGPT is not good at identifying long-distance causal relationships. An outlier is the F1 score of gpt-4 in the interval [25, 30). This is because, out of the 1,000 test instances for gpt-4, only 35 examples fall within the interval [25, 30), leading to more random performance. However, all other results demonstrate that ChatGPT's performance decreases as the event distance increases.
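Under the definition above, lexical distance can be computed as follows (a simplified sketch assuming whitespace tokenization, single-token event triggers, and each trigger's first occurrence):

```python
def lexical_distance(sentence, event1, event2):
    """Count the words separating two event mentions in a sentence.

    Assumes single-token triggers; raises ValueError if a trigger
    is absent from the sentence.
    """
    tokens = sentence.split()
    i, j = tokens.index(event1), tokens.index(event2)
    return abs(i - j) - 1
```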

Density of Events
As shown in Figure 4, we analyze the performance of ChatGPT in sentences with different numbers of events in the ECI task. We find that: Firstly, as the event density increases, most versions of ChatGPT are more inclined to predict event pairs as non-causal. This is mainly because, as the event density increases, the context of events becomes more complex, making it more difficult to capture the correlations between events.
Secondly, as the event density increases, the F1 scores of ChatGPT decrease. This indicates that ChatGPT is not good at handling complex situations involving multiple events.

Types of Causal Relationship
As shown in Figure 5, we analyze the accuracy of ChatGPT on pairs of events with different types of causal relationships in the ECI task: 1) Explicit Causality, which refers to causal relationships explicitly triggered by causal cue words (e.g., "lead to"); 2) Implicit Causality, which refers to causal relationships expressed without causal cue words.
It can be observed that, compared to implicit causality, ChatGPT performs better at capturing explicit causality. This is mainly because identifying explicit causality only requires recognizing causal cue words, whereas identifying implicit causality requires reasoning with contextual information and commonsense knowledge.
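The explicit/implicit split can be approximated with a cue-word heuristic like the following (the cue list is illustrative and our own, not the paper's annotation scheme):

```python
# An illustrative, non-exhaustive set of causal cue words.
CUE_WORDS = ("lead to", "leads to", "led to", "because", "therefore",
             "as a result", "due to", "caused by")

def causality_type(sentence):
    """Label a sentence 'explicit' if any causal cue word appears,
    otherwise 'implicit' (a coarse surface heuristic)."""
    s = sentence.lower()
    return "explicit" if any(cue in s for cue in CUE_WORDS) else "implicit"
```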

Prompts in the Form of Open-Ended Generation
Recently, Arora et al. (2023) revealed that open-ended prompts ("Who went to the park?") tend to yield better results for ChatGPT than prompts that restrict ChatGPT's outputs ("John went to the park. True or False?"). As shown in Table 8, we analyze ChatGPT with open-ended prompts: 1) "open-ended A.1/2/3", which requires ChatGPT to generate all the causal event pairs in the input sentence;
We designed three different prompts of this form to fully evaluate ChatGPT's performance.
2) "open-ended B", which gives a target event in the input sentence and requires ChatGPT to generate the events in the input sentence that have causal relations with the target event. The formats of these prompts are shown in Figure 6. We employ a relaxed P, R, and F1 calculation for open-ended prompts. Specifically, a predicted cause-effect pair is considered correct if at least one token is shared between the predicted and the labeled cause, as well as between the predicted and the labeled effect.
It can be observed that the open-ended prompts decrease the performance of ChatGPT. This is because open-ended prompts require ChatGPT to jointly perform event extraction and the ECI task. However, previous studies (Gao et al., 2023; Wei et al., 2023) show that ChatGPT is not good at extracting events.
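The relaxed matching described above can be sketched as follows (a simple reading of the rule; normalization and tie-breaking details in the paper's scorer may differ):

```python
def overlap(pred, gold):
    """True if the two phrases share at least one (lowercased) token."""
    return bool(set(pred.lower().split()) & set(gold.lower().split()))

def relaxed_prf(pred_pairs, gold_pairs):
    """Relaxed P/R/F1 over (cause, effect) pairs: a predicted pair counts
    as correct if some gold pair overlaps it on both the cause slot and
    the effect slot."""
    tp = sum(any(overlap(pc, gc) and overlap(pe, ge)
                 for gc, ge in gold_pairs)
             for pc, pe in pred_pairs)
    p = tp / len(pred_pairs) if pred_pairs else 0.0
    r = tp / len(gold_pairs) if gold_pairs else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f
```

Note that this sketch counts matched predictions for both numerators, which is one simple way to realize the relaxed criterion.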

Conclusion
In this paper, we conduct a comprehensive evaluation of ChatGPT's causal reasoning capabilities. Experiments show that: 1) ChatGPT is not a good causal reasoner, but is good at causal explanation generation; 2) ChatGPT has a serious causal hallucination issue, possibly due to causal reporting biases and ChatGPT's upgrading processes; 3) the ICL and CoT techniques can further exacerbate such causal hallucination; 4) ChatGPT is sensitive to the words used to express the causal concept in prompts, and open-ended causal reasoning prompts are not suitable for ChatGPT; 5) for events in sentences, ChatGPT excels at capturing explicit causality, and performs better in sentences with lower event density and smaller event distances.
Although there may be more delicate prompts that can further surpass our reported results, we believe that relying solely on prompts cannot fundamentally solve the issues that ChatGPT faces in causal reasoning. We hope that this study can inspire future work, such as addressing the causal hallucination issue of ChatGPT or further evaluating ChatGPT in scenarios involving multi-factor and multi-modal causal reasoning.

Limitations
This work is a comprehensive evaluation of the causal reasoning ability of ChatGPT, and it has several limitations. Firstly, ChatGPT's capabilities are constantly being updated, and current test results may change as ChatGPT evolves. Secondly, although OpenAI has provided rough introductions to the different versions of ChatGPT, the implementation details are unclear, making it difficult to analyze in depth why different versions of ChatGPT perform differently, and how each dataset and training technique affects ChatGPT's performance. Finally, there may still be prompts that can further outperform the results we reported, such as different questioning formats and more advanced ICL techniques, but we believe that relying solely on prompts cannot fundamentally solve the hallucination problem that ChatGPT currently faces in causal reasoning.

It can be found that RLHF also exacerbates the causal hallucination issue of FLAN-T5-Large. Through our analysis of the Anthropic dataset, this may be due to:

1. Among the questions about "Why", only 10.29% include the word "not". This indicates that the majority of RLHF data guides the LLMs about causality rather than non-causality.
2. In the gold responses, the frequency of the word "yes" is approximately twice that of the word "no". This could potentially increase the likelihood of the LLMs producing the positive label rather than the negative label.
Both of these two characteristics in the RLHF data could potentially exacerbate ChatGPT's causal hallucination, leading it to assume causal relationships between events, regardless of whether those relationships actually exist.

B Effect of the Label Distribution in ICL Demonstrations
As shown in Table 10, we analyze the impact of the labels of ICL demonstrations on the performance of "top k similar" (described in §5.1). For k=4, we first divide the ESC dataset into five subsets, each containing instances that use only 0, 1, 2, 3, or 4 causal demonstrations among their top 4 similar demonstrations, respectively. Then, we present the performance of "top 4 similar" on these five subsets. "Proportion+" is the proportion of causal instances in the corresponding subset. "x pos y neg" indicates the subset that uses only x causal and y non-causal demonstrations.
Firstly, we can observe that when using only causal or only non-causal demonstrations, the model achieves a higher Pos accuracy. This might be because including only one class prevents the model from contrasting the meanings of different labels, thus potentially confusing the model's understanding of the task objectives.
Additionally, when using both causal and non-causal demonstrations, there is a smaller change in Pos accuracy, while Neg accuracy increases as the number of causal demonstrations rises. This might be because having more causal demonstrations helps the model understand the situations that truly involve causality, thus avoiding misclassifying non-causal instances as causal.

C Effect of the Order of ICL Demonstrations
As shown in Table 12, we analyze few-shot ChatGPT's performance under different orders of ICL demonstrations: 1) Causal first: from causal demonstrations to non-causal demonstrations; 2) Non-causal first: from non-causal demonstrations to causal demonstrations; 3) Random: we conduct the experiment three times under random demonstration orders. Besides, Zero-shot indicates the performance of ChatGPT under the zero-shot setting.
Firstly, we find that Causal first is more inclined to classify event pairs as causal compared to Non-causal first. This might be because demonstrations located earlier have a stronger impact on ChatGPT.
Secondly, despite the different orders, all of these few-shot settings make ChatGPT more inclined to classify event pairs as causal compared to the zero-shot setting.

D Error Analysis for the Causal Reasoning of ChatGPT
As shown in Table 11, we analyze the error types of ChatGPT's causal reasoning by observing the reasoning chains generated in the CoT setting. Specifically, we randomly select 100 error instances for each model on the ESC dataset, and then manually annotate them to analyze the types of errors among different models. Common error types include: 1) fabricating additional fake conditions to establish causality, even if these conditions are not described or are incorrect in the input sentence; 2) misunderstanding basic event relationships such as sub-event and temporal relationships between events; 3) failing to accurately identify which two events the causal question refers to; 4) introducing incorrect commonsense knowledge.
Firstly, causal reasoning is a comprehensive skill that requires commonsense knowledge, the ability to understand basic event relationships, and the ability to perform logical reasoning based on the given information. However, ChatGPT is still not entirely reliable in these aspects, leading to an accumulation of errors. On the other hand, ChatGPT has encountered numerous causal event pairs in its pre-training data, enabling it to associate many event pairs with potentially causal contexts. However, such a context might not align with the input.
Besides, compared to text-davinci-003, which is not fine-tuned on dialog data, gpt-3.5-turbo and gpt-4 show a clearer tendency to fabricate additional fake conditions to establish causality. This could be due to dialog data guiding them to produce longer and more divergent responses, which deviate from the context provided in the original input. Additionally, gpt-4 introduces less incorrect commonsense knowledge, as it has a better grasp of knowledge.

E Details of CoT Experiments
Figure 7 shows examples of the demonstrations utilized in our few-shot CoT experiments. Figure 8 shows examples of the reasoning chains generated by ChatGPT.

Figure 3 :
Figure 3: Performance of ChatGPT on pairs of events with different lexical distances in the ESC dataset.

Figure 4 :
Figure 4: Performance of ChatGPT on sentences with different numbers of events in the ESC dataset.

Figure 5 :
Figure 5: Accuracy of ChatGPT on pairs of events with explicit and implicit causality.

If there is a causal relationship between two events in the input sentence, extract the causal pair at the word level. If there are multiple causal pairs, add AND between them, otherwise answer None. For example: (accuse of) cause (death) AND (kill) cause (death)
Minutes after … taken into custody.
Question: Is there a token-level causal relationship in the sentence? If so, please extract it into this form: cause->effect. If there are multiple causal relationships, add AND between causal pairs, and display No if there is no causal relationship.
Minutes after … taken into custody.

Figure 6 :
Figure 6: Prompts in the open-ended form. The content that requires ChatGPT to reply is marked in red.

Table 1 :
Experimental results (%) on the ECI task. P, R and F1 indicate Precision, Recall and F1-score, respectively. Pos, Neg and Full indicate accuracy on the causal pairs, non-causal pairs and all test data, respectively.

Table 2 :
Experimental results (%) on the CD task. Pos, Neg and Full indicate accuracy on the causal pairs, non-causal pairs and all test data, respectively.

Table 4 :
Performance of ChatGPT on the ECI task with ICL. "none" indicates ChatGPT without ICL.

Table 5 :
Performance of ChatGPT on the binary-classification CD task with ICL. "none" indicates ChatGPT without ICL.

Table 6 :
Performance of ChatGPT on the ECI and the binary-classification CD task with the Chain-of-Thought prompts. "none" indicates ChatGPT without CoT.
trigger(<X>)
Input: Minutes after … taken into custody.
Question: does "suspended" <X> "injuring"? Answer: Yes

one-step
Input: Minutes after … taken into custody.

Figure 2: Prompts that express causal concepts in various ways. The content that requires ChatGPT to reply is marked in red.

Table 7 :
Performance of ChatGPT on the ECI task using prompts that express causality in different ways.

Table 9 :
Experimental results (%) on the ECI task. P, R and F1 indicate Precision, Recall and F1-score, respectively. Pos, Neg and Full indicate accuracy on the causal pairs, non-causal pairs and all test data, respectively.

A Effect of RLHF on Causal Reasoning

It is necessary to explore why RLHF enhances the causal hallucination issue of ChatGPT. Due to OpenAI's decision not to open-source ChatGPT, we lack access to specific details about the data and experimental setup of ChatGPT's RLHF. As an alternative approach, we analyze the effect of RLHF with the Anthropic RLHF dataset (Bai et al., 2022), which is an open-source RLHF dataset. As shown in Table 9, we test the zero-shot performance of FLAN-T5-Large (Chung et al., 2022) with/without the RLHF process on the Anthropic dataset.

Table 10 :
Performance of "top 4 similar" in §5.1 with different label distributions of demonstrations on the ESC dataset.


Table 11 :
The distribution (%) of causal reasoning error types of ChatGPT.

Table 12 :
Experimental results (%) of "top 4 similar" in §5.1 with different orders of demonstrations on the ESC dataset.