LEGO: A Multi-agent Collaborative Framework with Role-playing and Iterative Feedback for Causality Explanation Generation



Introduction
Causality explanation generation is a generative task that aims to explain, in natural language, why a given cause-effect pair is true. For example, given the cause clause C and effect clause E in Figure 1, the corresponding explanation X to be generated is "A monsoon was striking Ceylon or southern India at the time and the fleet of the seventh voyage of the Ming Treasure Voyage did not want to be caught in the storm". Causality explanation generation can facilitate various applications, including explainable question answering (Yang et al., 2018), complex reasoning (Dalvi et al., 2021) and future event prediction (Zhou et al., 2022).
Input (cause-effect pair)
Cause: The fleet of the seventh voyage of the Ming Treasure Voyages was sailing under favorable conditions and running before the southwest monsoon on their homeward voyage.
Effect: The fleet of the seventh voyage of the Ming Treasure Voyages did not stop at Ceylon or southern India.

Output (explanation)
Explanation: A monsoon was striking Ceylon or southern India at the time and the fleet of the seventh voyage of the Ming Treasure Voyage did not want to be caught in the storm.

Figure 1: An example of causality explanation generation from the WIKIWHY dataset (Ho et al., 2023).
Despite the extensive applications of causality explanation generation, this task is highly challenging, as it demands rigorous and explicit rationales demonstrating that a model not only "knows" the causality in the task but also "understands" the underlying mechanics of why that is the case. Previous studies (Du et al., 2022; Ho et al., 2023) leverage a single language model with in-context learning to tackle the task; however, language models often generate spurious causal associations based on the given cause-effect pair, leading to deviation from the correct reasoning path. As depicted in Figure 2(a), when observing the information that "The fleet was sailing under favorable conditions" in cause C, GPT-3 and GPT-3.5 both generate the spurious causal association without exception, i.e., "under favorable conditions" ⇒ "they were making good time", which leads to incorrect reasoning directions and hinders language models from taking a step further to explore the causal mechanism.
Through in-depth analysis, we attribute this kind of error to the limitation of unidirectional reasoning in large language models (LLMs). As illustrated in Figure 2(a), both GPT-3 and GPT-3.5 only conduct unidirectional reasoning based on the content of the cause. When encountering "favorable conditions", they directly associate it with "making good time", which suggests that this spurious causal association is ingrained in the parameters of large language models. As a result, they inevitably deviate from the correct reasoning path, overlook crucial information present in the effect, and fail to autonomously search for broader world knowledge beyond the given text.

Unidirectional LLM outputs, both exhibiting the spurious association:
"Since the fleet was sailing under favorable conditions, they were able to make good time and didn't need to make any stops for supplies or rest. Ceylon and southern India were typical stopping points for these voyages, but since the fleet didn't need to stop, they simply sailed past them."
"The fleet of the seventh voyage of the Ming Treasure Voyages was sailing under favorable conditions and running before the southwest monsoon on their homeward voyage. This meant that they were making good time and didn't need to stop at Ceylon or southern India."

Bidirectional reasoning draws on the climate of Ceylon or southern India ("The southwest monsoon from June to September accounts for most of the rainfall in the region…") to reach the correct explanation: "A monsoon was striking Ceylon or southern India at the time and the fleet of the seventh voyage of the Ming Treasure Voyage did not want to be caught in the storm."

Figure 2: The explanations generated by large language models and by a human for the task mentioned above.
It is widely accepted that human memory is characterized by bidirectional associative and parallel processing capabilities (Kosko, 1988; Hattori and Hagiwara, 1995; Anderson and Bower, 2014). The ability of bidirectional reasoning may help alleviate the phenomenon of spurious causal associations. We demonstrate the bidirectional thought process required to arrive at the gold explanation in Figure 2(b). On the one hand, we begin our reasoning from the information on the cause side and progressively search for "Ming Treasure Voyages", "the seventh voyage", and "homeward voyage". Eventually, we obtain the key fine-grained world knowledge: "The fleet departed from Hormuz on 9 March 1433, ... On 22 July 1433, they arrived in the capital Beijing". On the other hand, we reason from the information on the effect side and search for "Ceylon or southern India", then associate it with the route of "the seventh voyage" mentioned on the cause side (pass through). Subsequently, from the regional climate data, we learn another key piece of fine-grained world knowledge: "The southwest monsoon from June to September accounts for most of the rainfall in the region". Finally, by linking these two pieces of fine-grained world knowledge obtained by the bidirectional reasoning process, we discover that the return time of the seventh voyage coincides with the arrival of the southwest monsoon in southern India. Therefore, we can conclude ⇒ "A monsoon was striking Ceylon or southern India at the time". Combining this with task-specific commonsense knowledge, such as that the fleet needed to avoid storms at sea, we can deduce ⇒ "the fleet did not want to be caught in the storm". In summary, to successfully complete this task, the system needs the abilities of bidirectional reasoning and knowledge retrieval to effectively integrate fine-grained world knowledge. Furthermore, it necessitates the capacity for commonsense induction to augment task-specific commonsense knowledge.
Based on the detailed analysis above, it can be observed that this complex task demands multiple abilities, such as bidirectional reasoning, knowledge retrieval, and commonsense induction. Although LLMs demonstrate a wide range of capabilities, a single language model is unable to provide all of them simultaneously. Therefore, we propose a novel Multi-agent Collaborative Framework with Role-playing and Iterative Feedback (LEGO) to effectively combine the different abilities of multiple large language models for causality explanation generation. Specifically, we exploit LLMs as character-malleable LEGO blocks and utilize role-playing to assign specific roles to five LLMs. We first devise a Fine-grained World Knowledge Integration Module to augment information about the task and alleviate the phenomenon of spurious causal associations. The module consists of three LLMs: two LLMs are designated as the Cause Analyst and the Effect Analyst, reasoning around the Cause and the Effect respectively to simulate the process of bidirectional inference, and posing questions to a third LLM, which acts as the Knowledge Master, to autonomously mine fine-grained world knowledge. Then, we leverage an Iterative Feedback and Refinement Module to improve the explanation through multi-aspect feedback. The module utilizes one LLM as the Explainer to generate an initial explanation, which iteratively receives Observation feedback and Commonsense feedback from a Critic LLM to refine its explanation.
Overall, the main contributions of this work can be summarized as follows: 1) We propose a novel multi-agent collaborative framework with role-playing and iterative feedback (LEGO) to effectively combine different abilities of multiple LLMs for causality explanation generation; 2) We devise a Fine-grained World Knowledge Integration Module to augment information about tasks through the interaction of three agents, i.e., the Cause Analyst, Effect Analyst and Knowledge Master, and we leverage an Iterative Feedback and Refinement Module to improve the generated explanation through multi-aspect feedback involving two LLMs, i.e., the Explainer and Critic; 3) Extensive experiments on WIKIWHY and e-CARE show the superiority of our multi-agent framework in terms of reasoning about the causality between cause and effect.

Methodology
In this section, we introduce our multi-agent collaborative framework LEGO. As shown in Figure 2, our framework consists of two major components involving five LLMs: (1) the Fine-grained World Knowledge Integration Module, which augments information about tasks through the interaction of three agents; (2) the Iterative Feedback and Refinement Module, which utilizes one LLM serving as the Explainer to generate an initial output, which iteratively receives Observation and Commonsense feedback from a Critic LLM to refine its explanation.

Fine-grained World Knowledge Integration
We devise this module to precisely augment information about the task in order to alleviate the phenomenon of spurious causal associations.
Cause-Effect Analyst Role Assignment After receiving the task, the Cause Analyst role and the Effect Analyst role are assigned to two LLMs respectively via an inception prompt (Li et al., 2023).
In practice, system message prompts are passed to the LLMs before the conversation starts to assign them their corresponding roles. We denote the Cause Analyst message prompt by P_C and that of the Effect Analyst by P_E. Let L1 and L2 denote two large language models; when the system messages are passed to these models respectively, we obtain M ← L1(P_C) and A ← L2(P_E), which are referred to as the Cause Analyst role and the Effect Analyst role. In addition, the third agent, which serves as the Knowledge Master answering the questions of the reasoners, does not require a specific prompt.

Reasoning Towards Causality After the roles are assigned, the Cause Analyst M and the Effect Analyst A collaborate in reasoning about the fine-grained world knowledge through thought and action (Yao et al., 2022). The Cause Analyst is responsible for reasoning about the information in the Cause and directing the conversation around the causality in the task. Meanwhile, the Effect Analyst is designed to reason about the information in the Effect and follow the reasoning trace of the Cause Analyst. In one example of the Effect Analyst reasoning about the task presented in Figure 1, the Knowledge Master returns its Observation:

Observation: Ceylon is an island country located in the Indian Ocean, off the southern coast of India. It experiences a tropical climate and is greatly influenced by the southwest monsoon... The southwest monsoon typically occurs between June and September each year...
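As a minimal illustration of this role-assignment step, the sketch below wires the two inception prompts into two agent wrappers. The `Agent` class and its `_call_llm` stub are hypothetical stand-ins for a real chat-LLM API, not the implementation used in the paper.

```python
# Minimal sketch of inception-prompt role assignment: each Analyst is one LLM
# whose conversation is seeded with a fixed system message defining its role.

class Agent:
    """Wraps one LLM with a fixed system message that defines its role."""

    def __init__(self, system_prompt: str):
        self.system_prompt = system_prompt
        self.history = [{"role": "system", "content": system_prompt}]

    def send(self, message: str) -> str:
        """Append a user message, get the role-conditioned reply, keep history."""
        self.history.append({"role": "user", "content": message})
        reply = self._call_llm(self.history)
        self.history.append({"role": "assistant", "content": reply})
        return reply

    def _call_llm(self, messages) -> str:
        # Stand-in: a real implementation would call a chat-completion API here.
        role_name = self.system_prompt.split(":")[0]
        return f"[{role_name}] reply to: {messages[-1]['content']}"


# P_C and P_E are the inception prompts; M and A are the two Analyst roles,
# i.e., M <- L1(P_C) and A <- L2(P_E). The prompt texts are illustrative.
P_C = "Cause Analyst: reason over the Cause clause and lead the conversation."
P_E = "Effect Analyst: reason over the Effect clause and follow the trace."
M = Agent(P_C)
A = Agent(P_E)
```

Each agent keeps its own history, so the role declared by the system message conditions every later turn of the conversation.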
Formally, we denote the Cause Analyst message obtained at time t by M_t and the corresponding observation from the Knowledge Master by O_{m_t}; the Effect Analyst message is A_t with corresponding observation O_{a_t}. The historical messages up to time t are

H_t = {M_1, O_{m_1}, A_1, O_{a_1}, ..., M_t, O_{m_t}, A_t, O_{a_t}}.

At the next time step t+1, the Cause Analyst takes the historical conversation message set and creates a new message:

M_{t+1} ← M(H_t),

and the Effect Analyst responds with:

A_{t+1} ← A(H_t, M_{t+1}).

We focus on the Observations obtained by the two reasoners during their interaction, which encompass fine-grained knowledge about the task. These Observations can aid the Explainer in alleviating the spurious causal associations. Moreover, the reasons why we use an LLM as the Knowledge Master instead of searching a knowledge base like Wikipedia are that: 1) Wikipedia cannot accept free-form text queries from the Analysts, and the error analysis of ReAct (Yao et al., 2022) indicates that 23% of its errors came from the search returning empty or useless information; 2) Yu et al. (2022) recently demonstrated that generated contextual documents more frequently contain the correct answer compared to retrieved documents. Therefore, utilizing an LLM to query a large amount of fine-grained knowledge is more efficient than traditional search methods. We further analyze this in the experiment section.

Inception Prompting Following Li et al. (2023), we utilize inception prompts to declare roles to each LLM before the conversation begins. After the inception prompt is delivered to the LLM as a system message, the agent automatically assumes the corresponding role and interacts in the conversation by thinking first and then acting. Our inception prompt consists of the Cause Analyst prompt P_C and the Effect Analyst prompt P_E, which encompass role definitions, action spaces, and guidelines. We present part of the Cause Analyst prompt P_C in Figure 4. The details of the inception prompts are in Appendix E.
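The alternating reasoning loop described above can be sketched as follows. The three callables are placeholders for the role-assigned LLMs (this is an illustrative sketch under those assumptions, not the paper's code); only the collected Observations are returned, mirroring how they are later handed to the Explainer.

```python
# Sketch of the bidirectional reasoning loop: the Cause Analyst leads, the
# Effect Analyst follows, and the Knowledge Master answers each question.

def run_rounds(cause_analyst, effect_analyst, knowledge_master, task, rounds=1):
    """Run `rounds` of Analyst interaction; return the collected Observations.

    `history` is a flat list of (speaker, text) pairs, mirroring the paper's
    conversation set H_t; each Analyst sees the full history so far.
    """
    history = []
    observations = []
    for _ in range(rounds):
        m_t = cause_analyst(task, history)          # M_{t+1} from H_t
        o_m = knowledge_master(m_t)                 # Observation O_{m_t}
        history += [("cause_analyst", m_t), ("observation", o_m)]
        a_t = effect_analyst(task, history)         # A_{t+1} from H_t and M_{t+1}
        o_a = knowledge_master(a_t)                 # Observation O_{a_t}
        history += [("effect_analyst", a_t), ("observation", o_a)]
        observations += [o_m, o_a]
    return observations
```

With toy stand-in functions, `run_rounds(lambda t, h: "mq", lambda t, h: "aq", lambda q: "obs:" + q, "demo")` yields the two Observations for one round.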

Iterative Feedback and Refinement
In this module, one LLM serves as the Explainer to generate the output, and iteratively receives multi-aspect feedback from a Critic LLM to refine its explanation.
Multi-aspect Feedback Although LLMs can generate coherent outputs with in-context learning, they often fall short in addressing more intricate requirements. According to our error analysis of GPT-3.5 on the WIKIWHY dataset (Ho et al., 2023), the most prominent errors are the lack of commonsense knowledge and repetition of the causal relationship (accounting for a combined 54%), and many studies (Bai et al., 2022b; Yang et al., 2022) have demonstrated the success of multi-aspect feedback. Therefore, we decide to break critic feedback into observation and commonsense. Specifically, we utilize an LLM to act as a Critic that provides multi-aspect feedback on the explanation. The Critic receives the explanation y_i and the Observations from the previous stage, then provides Observation and Commonsense feedback to improve the explanation. The Observation feedback covers two aspects: 1) a report on whether the explanation merely repeats the cause-effect relation; 2) supplementary information based on the Observations. The Commonsense feedback aims to present the commonsense knowledge required to explain the causality of the task.

Current explanation (y_i): The fleet was sailing under favorable conditions and running before the southwest monsoon on their homeward voyage. This means that they were trying to make good time to get back home and did not want to make any unnecessary stops that would slow them down.

Observation feedback: The explanation is not a simple concatenation of Cause and Effect, but ignores that the southwest monsoon typically arrives in Ceylon around May or June and lasts until September.

Commonsense feedback: Fleets need to pay close attention to the weather forecast and marine meteorological information, and try to avoid being involved in a storm.

Refined explanation (y_{i+1}): The monsoon season typically arrives in Ceylon around May or June and lasts until September, which could have created risky sailing conditions for the fleet. Therefore, the fleet did not stop at these places to avoid being involved in the storm and continued their journey to get back.

Figure 5: The explanation refinement process.
Iterative Refinement The Explainer improves its output based on the received feedback and its previously generated output. The loop of Critic provides feedback ⇒ Explainer refines explanation ⇒ Critic provides feedback can be applied multiple times. We set the number of iterations to a fixed value due to budget. One key aspect of the Critic is the retention of a history of past experiences, achieved by continuously appending the previous outputs to the prompt. This allows the Explainer to learn from past mistakes and avoid repeating them. Figure 5 depicts the explanation refinement process. The current explanation exhibits the spurious causal association "under favorable conditions" ⇒ "they were trying to make good time". The Critic, based on the Observations obtained by the knowledge integration module, provides Observation feedback that the current explanation does not repeat the cause-effect relation but overlooks the key information that "the southwest monsoon typically arrives in Ceylon around May or June and lasts until September." Furthermore, it raises the Commonsense feedback that "Fleets need to pay close attention to the weather forecast and marine meteorological information ..." After receiving the feedback, the Explainer becomes aware that the reason the fleet did not stop was the monsoon striking southern India at the time, and that the fleet needed to avoid getting caught in the storm. It can be observed that the Explainer corrects the error of spurious causal association and generates a valid explanation by incorporating the mentioned commonsense.
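A minimal sketch of this feedback-refine loop, under the assumption that `explainer` and `critic` wrap the two role-assigned LLMs (both are hypothetical callables, not the paper's implementation). Note how the list of past outputs is threaded back into every Explainer call so it can avoid repeating earlier mistakes, and how the iteration count is a fixed budget.

```python
# Sketch of the Critic -> Explainer refinement loop with a fixed budget.

def refine(explainer, critic, cause_effect, observations, n_iters=2):
    """Iteratively refine an explanation, keeping past attempts in the prompt.

    explainer(cause_effect, observations, feedback, past) -> new explanation
    critic(explanation, observations) -> observation + commonsense feedback
    """
    explanation = explainer(cause_effect, observations, feedback=None, past=[])
    past = [explanation]
    for _ in range(n_iters):
        feedback = critic(explanation, observations)
        explanation = explainer(cause_effect, observations,
                                feedback=feedback, past=past)
        past.append(explanation)   # retained history discourages repeats
    return explanation
```

With toy functions that just count past attempts, two refinement iterations produce the third version of the explanation.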

Datasets
We conduct extensive experiments on two datasets without training. WIKIWHY (Ho et al., 2023) is a large-scale QA dataset built around explaining why an answer is true in natural language; it contains over 9,000 "why" question-answer-rationale triples, grounded in Wikipedia facts across a diverse set of topics. e-CARE (Du et al., 2022) is a large human-annotated explainable causal reasoning dataset, which contains over 21K causal reasoning questions, together with natural-language explanations of the causal questions. Since the test set of e-CARE is not public and our method needs to call the costly OpenAI API, we conduct experiments on the published validation set of e-CARE. We present data examples and other details in Appendix A.

Automatic Evaluation We provide numerous examples from both datasets in Table 6.
In contrast, explanations in WikiWhy exhibit two structures, multi-hop step sequences and rationale sets, rendering them more instantiated. According to statistics, the average length of explanations in this dataset is 1.5 sentences (Table 4), with some extending up to 6 sentences. Additionally, there is a fixed order among the sentences. Consequently, the WikiWhy paper introduces both ordered and unordered evaluation to compare the ideas contained in the predictions and references. Specifically, we follow Ho et al. (2023) in using unordered and ordered evaluation. The former compares the ideas contained in the predictions and references, while the latter penalizes incorrectly ordered explanations, respecting the structure of multi-hop explanations. Details of the unordered and ordered evaluation can be found in Appendix B.

Human Evaluation Following prior works (Ho et al., 2023), we present a panel of three graduate students with a random sample of 50 entries from each setting and the following binary True/False criteria guidelines: 1) Correctness: Is the explanation both complete and satisfying? 2) Concision: Does the explanation say everything it needs to say and nothing more? 3) Fluency: Is the explanation written fluently? 4) Validity: Does the explanation make logical sense? 5) Win/Tie/Lose: Compare the generated explanation against the provided reference. The three annotators work independently on randomly selected samples. The human evaluations were conducted via spreadsheets. We randomly shuffled the gold reference and the generated candidates, and did not reveal which column is the gold reference and which is the generated candidate in the evaluation setup. Details of the human evaluation can be found in Appendix B.

Main Results
Our main results on the WIKIWHY dataset are provided in Table 1. We can observe that our framework significantly improves the quality of explanations. For the InstructGPT-based setting, there is a 3.9% improvement in unordered F1 score using our framework; similarly, for the GPT-3.5-based setting, there is a 5.1% improvement. This establishes that our method is effective for different language models. The corresponding human evaluation experiments are presented in Table 2: 75.3% of the explanations generated by our framework with GPT-3.5 are judged to be satisfactory, significantly higher than the other baselines. It should be noted that the results in Table 2 only count the number of generated explanations that meet each metric (e.g., "Correctness"), which cannot intuitively reflect the advantages of our model. Thus, by using the explanations of LEGO (GPT-3.5) as references, we compared the different baselines against LEGO (GPT-3.5) separately. This was done to visually demonstrate the superiority of LEGO (GPT-3.5), as shown in Figures 6 and 7. The quality of the explanations we generate is significantly better than that of the baselines under different metrics. More details can be found in Table 8 in the Appendix.

Figure 9: Performance of our framework with the knowledge integration module (one round of interaction).

Discussion and Analysis
Impact of Knowledge Integration Module Our ablation experiments with GPT-3.5 as the base LM are presented in Table 9. 1) Without the knowledge integration module: we plot the performance of LEGO without the knowledge integration module in Figure 8. We can observe that the precision of the model declines continuously during the iterative process. This indicates that without fine-grained knowledge support, relying solely on iterative self-feedback is insufficient to effectively improve the generated explanations. 2) With the knowledge integration module (one round of interaction): as shown in Figure 9, the precision of LEGO gradually increases, demonstrating that this module provides the necessary fine-grained knowledge. 3) With the knowledge integration module (multiple rounds of interaction): as illustrated in Table 9, in this setting the performance of the model does not further improve. This could be because the fine-grained knowledge stored in the Observations is already sufficient after one round of interaction; the more times the two Analysts interact, the noisier the collected knowledge may become, resulting in inefficient information in the feedback. Furthermore, based on our analysis, the knowledge required for explanations is often fine-grained, such as "The fleet of the seventh voyage of the Ming Treasure Voyages." Firstly, this type of question cannot be directly answered through an API; it needs to be broken down into components like "Ming Treasure Voyages" and "the seventh voyage." Additionally, the information related to "the seventh voyage" is located on line 370 of the search page for "Ming Treasure Voyages," making knowledge localization a challenge. Moreover, as depicted in Figure 2, the thinking process can accumulate a significant amount of information. Relying solely on the memory of a single agent can lead to the omission of crucial information during reasoning. Our approach of using multiple agents effectively mitigates this issue.
Impact of Iterative Feedback Module This module aims to provide multi-aspect feedback during the iterative process, allowing the Explainer to supplement fine-grained knowledge and task-specific commonsense knowledge. Taking Figure 9 as an example, we observe a significant improvement in Recall (8.8%) after one round of refinement, which we attribute to the incorporation of commonsense knowledge into the explanations. However, after one iteration, the improvement in Recall slows down or even reverses, which may be because the modification of commonsense knowledge in subsequent iterations is not obvious, and multiple iterations increase the length of the explanation, leading to a decrease in precision.
Refinement Ability of LLMs By comparing the recall under the ordered evaluation of our approach based on different underlying models (InstructGPT and GPT-3.5), we observed that InstructGPT demonstrated a growth of 3.5%, while GPT-3.5 exhibited an increase of 9.4% after one round of iteration. We attribute this noticeable difference to the stronger task comprehension and commonsense induction abilities of GPT-3.5. Furthermore, we noticed that InstructGPT is more "conservative" during the iteration process, as presented in Figure 10. Due to space limitations, we present the explanations generated by GPT-3.5 in Figure 11 in the Appendix. We find that the explanations of InstructGPT remained almost identical, with only a few words replaced by synonyms. In contrast, the explanations generated by GPT-3.5 exhibited greater richness and diversity.

Related Work
Causality Explanation Generation Understanding causality is one of the most central cognitive abilities of human beings. Different from causality identification (Caselli and Vossen, 2017; Zuo et al., 2021; Cao et al., 2021; Tran Phu and Nguyen, 2021), which can only distinguish whether there is a causal relationship, causality explanation generation is especially worth exploring since it tests not only whether a model "knows" a causal relation but also whether it "understands" the underlying mechanics of why that is the case. Du et al. (2022) proposed a human-annotated explainable CAusal REasoning dataset (e-CARE), which contains over 21K causal reasoning questions, together with natural-language explanations of the causal questions. As large language models (LLMs) grow larger and more sophisticated, Ho et al.
(2023) introduced WIKIWHY, built around a novel auxiliary task: explaining why an answer is true in natural language; InstructGPT baselines achieve only low human-evaluated correctness in the end-to-end answer-and-explain condition.

Communicative Agents Large language models have exhibited remarkable multi-dimensional capabilities in tackling complex tasks. However, their effectiveness greatly hinges on human guidance to steer the conversation, a process that can present challenges and consume significant time (Brown et al., 2020; Yao et al., 2022; Yang et al., 2023). It is important to consider how to autonomously guide communicative agents toward task completion while maintaining consistency with human intentions. Communication between AI agents can occur in a competitive or a cooperative setting (Hadfield-Menell et al., 2016; Silver et al., 2017; Bard et al., 2020; Dafoe et al., 2020; Meng et al., 2023; Kwan et al., 2023). Cooperative AI systems consider the requirements and capacities of other agents within the system, actively striving to collaborate and synchronize their actions. This approach offers numerous advantages, such as enhanced efficiency, improved decision-making, and the ability to address intricate problems that surpass the capabilities of individual agents. Li et al.
(2023) enable two communicative agents to engage in a conversation and cooperate with each other to solve assigned tasks. However, designing effective cooperative AI systems is still an active area of research, as it requires addressing a range of technical, ethical, and social challenges.

Learning from Feedback The utilization of natural language feedback, generated by both humans and machines, has proven effective across various tasks. Reinforcement learning (RL) approaches have been employed to optimize for human preferences or task accuracy, resulting in the generation of valuable feedback (Bai et al., 2022a; Lu et al., 2022; Le et al., 2022). Recently, LLMs have been used to generate feedback for general-domain solutions (Yang et al., 2022; Fu et al., 2023; Peng et al., 2023). To make better use of feedback, pairs of feedback and revision have been employed to learn supervised refiners and correctors (Schick et al., 2022; Bai et al., 2022b; Yasunaga and Liang, 2020). However, gathering supervised data from humans is costly; to overcome this, Welleck et al. (2022) selected the best output by relying on knowing the ground truth at test time. Madaan et al. (2023) propose soliciting feedback from an LLM on its own output, refining the output with the feedback, and repeating this feedback-refine process.

Conclusion
We introduce LEGO, a multi-agent collaborative framework with role-playing and iterative feedback for causality explanation generation. Specifically, we treat LLMs as character-malleable LEGO blocks and utilize role-playing to assign specific roles to five LLMs, i.e., the Cause Analyst, Effect Analyst, Knowledge Master, Critic and Explainer. We devise a Fine-grained World Knowledge Integration Module to augment information about tasks, alleviating the phenomenon of spurious causal associations. Then, we leverage an Iterative Feedback and Refinement Module to improve the generated explanation through multi-aspect feedback. Extensive experiments on WIKIWHY and e-CARE show the superiority of our multi-agent framework for causality explanation generation.

Limitations
Our method aims to explore the cooperation of multiple agents on causality explanation. In the experiments, we found that GPT-3.5 is more powerful than InstructGPT in terms of task comprehension and commonsense induction, but due to the high cost of API calls, we did not further test the performance of GPT-4 on this complex task. In addition, explanations come in various structures, as presented in the typology defined by Neves Ribeiro et al. (2022), and multiple explanations may be valid; the datasets we build on cover a large proportion of explanations to simple "why" questions, which minimizes the variability of explanations to some extent. The experimental results show that the explanations generated by LLMs are not ideal under the ordered evaluation. Moreover, although the automatic evaluation results show the effectiveness of our method, 44.8% of the explanations are still judged by humans to be worse than the gold reference. In future work, we intend to delve deeper into the underlying structure of causal explanations within LLMs.

A Dataset Details
A.1 WIKIWHY

WIKIWHY (Ho et al., 2023) is a large-scale QA dataset built around explaining why an answer is true in natural language; it contains over 9,000 "why" question-answer-rationale triples, grounded in Wikipedia facts across a diverse set of topics. Each entry contains a cause-effect pair and a rationale explaining the pair's causal relation. On average, each rationale contains 1.5137 elements. The statistics for the reasoning component are shown in Table 4.

A.3 Examples from Datasets
Table 6 contains examples from WikiWhy and e-CARE. Here c denotes the cause, e the effect, and x the explanation of the cause-effect pair.

B.1 Metrics for WIKIWHY
While the still-developing area of text generation has measures and proxies for similarity that help with simple sequences, comparing reasoning sequences or rationale sets requires more involved measures. Ho et al. (2023) proposed two related metrics, unordered and ordered, to handle sets and sequences, respectively.

Unordered Evaluation This first approach compares the ideas contained in the predictions and references. First, split the predicted and reference explanations into "ideas" or "steps" by sentence. Then compute a matrix of pairwise similarity scores before using a threshold to classify "matches".

WIKIWHY Example c
The thermal stress at dawn and dusk. e The boulders on Ceres are brittle and degrade rapidly. x The thermal temperatures change so drastically the rocks expand and contract.This process weakens the structural integrity of the rocks. c The duration of Hotel California was longer than songs generally played by radio stations.
e Don Felder had doubts about the 1997 Eagles song Hotel California.
x Most songs are only 3-4 minutes long.Hotel California is over 6 minutes.People would not want to listen to same song on radio for that long.
c Seeing the Castle of Cagliostro entrenched in Yamazaki that Japan can make high-quality films.
e Director Takashi Yamazaki modeled his 2019 film Lupin III: The First after The Castle of Cagliostro.
x Viewing The Castle of Cagliostro inspired Takashi Yamazaki.Out of national pride, Takashi Yamazaki followed a model that he believed would produce quality films. c The geographic isolation of the Hupa homeland. e The Hupa had few interactions with early European explorers up to the 19th century.
x The Hupa's homeland was separated by bodies of water or mountains.Not many people could get to the Hupa's homeland. c The use of coal power in Turkey.
e Burning coal leads to air pollution.Air pollution causes sickness and early death.Sick and dead people cannot work.
x 1.4 million working days were lost across the population of Turkey in 2019.

e-CARE Example c
He was infected with gram-positive bacteria. e The doctor raised the lysozyme in his body.
x Lysozyme destroys cell wall of bacteria. c The researcher investigated the premature death in these pet birds.
e He found they all died of Malnourishment.
x Malnourishment is a leading cause of premature death in pet birds.
c It is quite cold here in winter, and the temperature can reach as low as minus 30 degrees. e In winter here, people wear clothes with very good thermal insulation when they go out.
x Clothing provides protection from the elements by increasing the insulating capacity of the body.
c Mary sent a emoticon cryingẗo her boyfriend on her cell phone.
e Her boyfriend immediately called to comfort her.
x Emoticons are combinations of characters used to represent various emotions.Ordered Evaluation To respect the structure of multi-hop explanations, the method penalizes incorrectly ordered explanations.Here, use the previously generated pairwise score matrix and its alignments to generate all possible assignments of prediction sequence elements to reference elements.Then compute the length of the longest common subsequence (LCS) between a prediction alignment against the reference labels for each candidate assignment.This length becomes the count of correctly incorporated structural elements-true positives.Note that under this scheme, the repeated ideas in the prediction are discounted by the LCSstyle alignment process.
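The LCS-based counting can be sketched in a few lines of Python. This is a simplified illustration rather than the authors' implementation: it assumes each prediction step has already been aligned to at most one reference step, in which case the LCS against the reference order reduces to a longest strictly increasing subsequence of matched reference indices.

```python
from bisect import bisect_left

def ordered_true_positives(matched_ref_idx):
    """Count correctly ordered prediction steps.

    `matched_ref_idx` holds, for each prediction step in order, the index of
    the reference step it matched (unmatched steps already dropped). The LCS
    against the reference order is the longest strictly increasing
    subsequence of these indices; repeated indices are discounted.
    """
    tails = []  # tails[k] = smallest tail of an increasing subsequence of length k+1
    for idx in matched_ref_idx:
        pos = bisect_left(tails, idx)
        if pos == len(tails):
            tails.append(idx)
        else:
            tails[pos] = idx
    return len(tails)

# Prediction steps matched reference steps 0, 2, 1, 3; one step is out of order.
print(ordered_true_positives([0, 2, 1, 3]))  # → 3
```

Under this simplification, a fully reversed prediction earns only one true positive, which is exactly the penalty for incorrect ordering that the metric intends.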
We employ the BERTScore metric to measure pairwise sentence similarity (the full explanations are not evaluated directly with BERTScore). Taking the unordered evaluation algorithm proposed by Ho et al. (2023) as an example, precision and recall are obtained using the following formulas:

Precision = precise / predictions
Recall = covered / relevant

where "predictions" is the number of sentences in the generated explanation, "relevant" is the number of sentences in the reference explanation, "precise" is the number of generated sentences that correspond to sentences in the reference explanation, and "covered" is the number of reference sentences that were successfully predicted. For example, consider the following reference and generated explanations:

Reference explanation: Opening the highway brought in an influx of unsafe people. With the higher traffic from a highway it would be hard to police the unsafe people. With them being harder to police it would become safer for them and they could drug deal and prostitute in the open.

Generated explanation: When Interstate 5 was opened, it diverted traffic away from these areas and caused a decline in economic activity. With less commerce and people around, it became easier for illegal activities and establishments to operate without attracting attention from law enforcement or regular citizens.

The proposed algorithm keeps separate counts of precise prediction steps and covered reference steps, which effectively addresses the situation where a single prediction sentence may contain multiple reference ideas. For further details of the algorithms, please refer to the evaluation methods provided by Ho et al. (2023).
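The unordered counting can be sketched as follows. This is a hedged toy version, not the official evaluation code: the pairwise similarity matrix is assumed to be given (BERTScore in the paper, but any similarity proxy works for illustration).

```python
def unordered_scores(sim, threshold=0.64):
    """Unordered evaluation over a pairwise similarity matrix.

    `sim[i][j]` is the similarity between prediction sentence i and
    reference sentence j. A prediction sentence is "precise" if it matches
    at least one reference sentence; a reference sentence is "covered" if
    at least one prediction sentence matches it.
    """
    n_pred, n_ref = len(sim), len(sim[0])
    match = [[s >= threshold for s in row] for row in sim]
    precise = sum(any(row) for row in match)
    covered = sum(any(match[i][j] for i in range(n_pred))
                  for j in range(n_ref))
    precision = precise / n_pred
    recall = covered / n_ref
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Two prediction sentences; the first alone covers both reference ideas.
sim = [[0.9, 0.7],
       [0.3, 0.2]]
p, r, f = unordered_scores(sim)  # precision 0.5, recall 1.0, F1 ≈ 0.667
```

Because precise and covered are counted separately, a single prediction sentence that states two reference ideas raises recall without inflating precision.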
For a fair comparison, we follow Ho et al. (2023) in computing BERTScore with a large DeBERTa (He et al., 2020) model (microsoft/deberta-xlarge-mnli) at a threshold of 0.64.

B.2 Metric for e-CARE
Causal Explanation Quality (CEQ). Du et al. (2022) proposed a novel causal explanation quality evaluation metric (namely, the CEQ score) as a step towards directly measuring the quality of generated explanations. Specifically, let C, E and X denote the cause, the effect and the generated explanation, respectively. Formally, the CEQ score is defined as:

CEQ = cs(C, E|X) − cs(C, E)

where cs(C, E) is the original causal strength between C and E, and cs(C, E|X) is the causal strength after the additional explanation information is involved. The explanation-enhanced causal strength cs(C, E|X) is defined as the average of cs(C + X, E) and cs(C, X + E), where "+" denotes the string concatenation operation. The CEQ score is therefore positively related to the increase in causal strength between C and E after the involvement of the explanation X.
We employ the widely adopted model-agnostic method proposed by Luo et al. (2016) to calculate the causal strength. Its model-agnostic nature enables us to avoid reliance on particular models and keeps the evaluation fair. Specifically, the phrase-level causal strength is derived by synthesizing word-level causality:

cs(C_A, E_B) = (1 / (N_{C_A} + N_{E_B})) · Σ_{w_i ∈ C_A, w_j ∈ E_B} cs(w_i, w_j)

where (C_A, E_B) is an arbitrary causal fact; N_{C_A} and N_{E_B} are the numbers of words within C_A and E_B, respectively; and cs(w_i, w_j) is the causal strength between words w_i and w_j, which is estimated from a large corpus as:

cs(w_i, w_j) = count(w_i, w_j) / (count(w_i) · count(w_j)^α)

where α is a penalty coefficient, empirically set to α = 0.66 as in Luo et al. (2016). We use the average CEQ score as the final score.
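The pipeline above can be sketched end to end. This is a minimal illustration under stated assumptions: the corpus counts are hypothetical toy values, and cs(C, E|X) is taken as the mean of the two concatenation directions cs(C + X, E) and cs(C, X + E).

```python
def make_word_cs(count_pair, count, alpha=0.66):
    """Word-level causal strength cs(wi, wj) from corpus co-occurrence
    counts, with penalty exponent alpha = 0.66 as in Luo et al. (2016)."""
    def cs(wi, wj):
        return count_pair.get((wi, wj), 0) / (count[wi] * count[wj] ** alpha)
    return cs

def phrase_cs(cause, effect, word_cs):
    """Phrase-level causal strength: sum of word-level cs over all word
    pairs, normalized by the total word count of both phrases."""
    wc, we = cause.split(), effect.split()
    total = sum(word_cs(wi, wj) for wi in wc for wj in we)
    return total / (len(wc) + len(we))

def ceq(cause, effect, explanation, word_cs):
    """CEQ = cs(C, E | X) - cs(C, E); cs(C, E | X) averages the two
    concatenation directions ("+" is string concatenation) -- an
    assumption of this sketch."""
    cs = lambda c, e: phrase_cs(c, e, word_cs)
    enhanced = (cs(cause + " " + explanation, effect)
                + cs(cause, explanation + " " + effect)) / 2
    return enhanced - cs(cause, effect)

# Hypothetical toy corpus statistics, for illustration only.
counts = {"rain": 10, "wet": 10, "water": 10}
pairs = {("rain", "wet"): 1, ("water", "wet"): 5, ("rain", "water"): 5}
wcs = make_word_cs(pairs, counts)
```

In this toy setting, ceq("rain", "wet", "water", wcs) comes out positive, because the explanation "water" has stronger word-level ties to both cause and effect than the cause-effect pair has on its own.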

B.3 Human Evaluation
Automatic metrics may not reliably evaluate results produced by models with few-shot capabilities such as GPT-3. In light of this, we select the highest-scoring explanations from each set of experiments for additional fine-grained evaluation and measure the agreement among the annotators.
(1) For each human evaluation task, we present a panel of three undergraduate students with a random sample of 50 entries from each setting and the following binary True/False criteria guidelines:
• Correctness: Mark true if and only if the explanation is both complete and satisfying.
• Concision: Mark true if the explanation says everything it needs to say and nothing more. Mark false if extra information is included.
• Fluency: Is the explanation writing fluent? Mark false if there are any mechanical mistakes.
• Validity: Does the explanation make logical sense? Ignore whether or not the explanation successfully explains the cause/effect relation. Mark false if the explanation contains any illogical or untrue conclusions.
• Win/Tie/Lose: Compare the generated explanation against the provided reference.Mark Win if you prefer the generated explanation, Tie if you have no preference, and Lose if you prefer the reference explanation.
(2) Using the explanations of LEGO (GPT-3.5) as references, we compared different baselines against LEGO (GPT-3.5) separately. This was done to visually demonstrate the superiority of LEGO (GPT-3.5), as shown in Table 8.
(3) We measured the agreement among three annotators, and the table below displays our Fleiss' Kappa scores for the annotation agreements of Win/Tie/Lose (Compare the generated explanation against the provided reference).
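The agreement computation is the standard Fleiss' kappa over a matrix of per-item category counts. A minimal sketch (the annotator counts below are hypothetical, purely for illustration):

```python
def fleiss_kappa(ratings):
    """Fleiss' kappa for `ratings[i][k]` = number of annotators who assigned
    item i to category k (here: 3 annotators, Win/Tie/Lose categories)."""
    n_items = len(ratings)
    n_raters = sum(ratings[0])
    n_cats = len(ratings[0])
    # Overall proportion of assignments falling into each category.
    p_cat = [sum(row[k] for row in ratings) / (n_items * n_raters)
             for k in range(n_cats)]
    # Per-item observed agreement among the raters.
    p_item = [(sum(c * c for c in row) - n_raters)
              / (n_raters * (n_raters - 1))
              for row in ratings]
    p_bar = sum(p_item) / n_items          # mean observed agreement
    p_e = sum(p * p for p in p_cat)        # chance agreement
    return (p_bar - p_e) / (1 - p_e)

# Three annotators rating 4 items into Win/Tie/Lose (hypothetical counts).
ratings = [[3, 0, 0], [0, 3, 0], [2, 1, 0], [0, 0, 3]]
k = fleiss_kappa(ratings)  # ≈ 0.745
```

Values above roughly 0.6 are conventionally read as substantial agreement, which is the scale on which the Win/Tie/Lose annotations are judged.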

C Details of Ablation Experiments
As shown in Table 9, we conduct extensive ablation experiments on the WIKIWHY dataset.

D Case Analysis
The cases we show are in Figures 12 and 13.

E Prompts
We include all the prompts used in the experiments in Figures 14-17.

Few-shot example of the information-augmentation trace:

Thought: I need to analyze from the Cause. So I need to ask about the fleet of the seventh voyage of the Ming Treasure Voyages.
Ask: [The fleet of the seventh voyage of the Ming Treasure Voyages]
Observation: The fleet of the seventh Ming Treasure Voyage set sail in the early summer of 1430, … The fleet set sail from China and traveled to Southeast Asia, southern India, the Persian Gulf, and the east coast of Africa …
Thought: I need to analyze from the Effect. So I need to ask about Ceylon, southern India and the southwest monsoon.
Ask: [Ceylon, southern India and southwest monsoon]
Observation: Ceylon, also known as Sri Lanka, is an island country located in the Indian Ocean, off the southern coast of India. It experiences a tropical climate and is greatly influenced by the southwest monsoon …
Thought 2: I need to analyze from the Cause. I need to ask about the homeward voyage of the seventh voyage of the Ming Treasure Voyages.
Ask: [the homeward voyage of the seventh voyage of the Ming Treasure Voyages]
Observation: … During the Seventh Voyage, Zheng He and his fleet sailed back from the east coast of Africa, where they had made port in present-day Kenya and Tanzania. The fleet sailed across the Indian Ocean …
Cause: The fleet of the seventh voyage of the Ming Treasure Voyages was sailing under favorable conditions and running before the southwest monsoon on their homeward voyage.
Effect: The fleet of the seventh voyage of the Ming Treasure Voyages did not stop at Ceylon or southern India.

Figure 3: Overview of our multi-agent framework. It first augments information about the current task through the interaction of three agents (left), then uses one LLM as the Explainer to generate an initial output and iteratively refine its explanation by following the Observation and Commonsense feedback from the Critic LLM (right).

Figure 4: Part of the Cause Analyst inception prompt.

InstructGPT-based explanation
Iteration 0: Lynton & Barnstaple signed a twinning agreement with the Walhalla Goldfields Railway in order to foster cooperation and volunteer exchanges due to the similar nature of the railways.
Iteration 1: Lynton & Barnstaple signed a twinning agreement with the Walhalla Goldfields Railway to promote cooperation and exchange of ideas between the two similar railways.
Iteration 2: Lynton & Barnstaple signed a twinning agreement with the Walhalla Goldfields Railway to foster cooperation and exchange of ideas between the two similar railways.

Figure 10: The explanations generated by our method with InstructGPT as the base model.

The agreement among three annotators

Cause: similar nature of the railways and to foster cooperation and volunteer exchanges
Effect: Lynton & Barnstaple signing a twinning agreement with the Walhalla Goldfields Railway
Explanation: Railways with similar goals improve their chance of success by creating a joint venture.

GPT-3.5-based explanation
Iteration 0: Lynton & Barnstaple signed a twinning agreement with the Walhalla Goldfields Railway in order to foster cooperation and volunteer exchanges due to the similar nature of the railways.
Iteration 1: The similar nature of the railways, as well as the desire to foster cooperation and volunteer exchanges, led Lynton & Barnstaple to sign a twinning agreement with the Walhalla Goldfields Railway.
Iteration 2: Organizations often form partnerships or twinning agreements with similar groups or organizations in order to enhance collaboration and mutual exchange. The Lynton & Barnstaple and Walhalla Goldfields railways signed a twinning agreement due to their similar nature and desire to foster cooperation and volunteer exchanges.

Figure 11: The explanations generated by our method with GPT-3.5 as the base model.

Table 1: Performance on the WIKIWHY dataset. We conduct experiments with InstructGPT (text-davinci-003) and GPT-3.5 as baselines and underlying language models, respectively.

Table 2: Human evaluation. We show the results after applying our framework. Overall correctness is marked on a binary scale to indicate whether an explanation is complete and satisfying.

Table 3: Performance on the e-CARE dataset. We used a binary scale (correct/incorrect) for human evaluation and report the proportion of correct evaluations. For comparison, the human-generated CEQ score is 0.038.
e-CARE (Du et al., 2022) is a large human-annotated explainable causal reasoning dataset.

Table 5: Corpus-level statistics of the e-CARE dataset. Uniq. Explanations refers to explanations that correspond to only a single causal fact.

Table 6: Examples from datasets.