How to Enhance Causal Discrimination of Utterances: A Case on Affective Reasoning

Our investigation into the Affective Reasoning in Conversation (ARC) task highlights the challenge of causal discrimination. Almost all existing models, including large language models (LLMs), excel at capturing semantic correlations within utterance embeddings but fall short in determining specific causal relationships. To overcome this limitation, we propose incorporating i.i.d. noise terms into the conversation process, thereby constructing a structural causal model (SCM). This model explores how the distinct causal relationships of fitted embeddings can be discerned through independence conditions. To facilitate the implementation in deep learning, we introduce the cogn frameworks to handle unstructured conversation data, and employ an autoencoder architecture to treat the unobservable noise as learnable "implicit causes." Moreover, we curate a synthetic dataset that includes i.i.d. noise. Through comprehensive experiments, we validate the effectiveness and interpretability of our approach. Our code is available at https://github.com/Zodiark-ch/mater-of-our-EMNLP2023-paper.


Introduction
Nowadays, numerous conversation recognition tasks (such as the Emotion Recognition in Conversation (ERC) task (Pereira et al., 2023; Thakur et al., 2023), the Intent Recognition (IR) task (Ye et al., 2023; Ni, 2023), and the Dialogue Act Recognition (DAR) task (Arora et al., 2023)) have shown promising performance with specialized supervised and unsupervised methods. Taking the RoBERTa pretrained model (Liu et al., 2019) as an example, "My eyelids are fighting" and "I want to sleep," which have similar semantics but different tokens, can be fitted well within embeddings (i.e., the two embeddings exhibit a strong resemblance under metrics such as cosine similarity). However, when it comes to the relationship between two utterances, denoted as A and B, whose embeddings can be fitted, various possible relationships exist: A acts as the cause of B (A → B), A acts as the outcome of B (A ← B), or, more complex, A and B are both influenced by a common cause (A ← C → B), and so on. Particularly in reasoning tasks (Uymaz and Metin, 2022; Feng et al., 2022), it is crucial for these methods to transcend the mere fitting of embeddings and possess the capacity to discriminate diverse causal relationships (i.e., the ability of causal discrimination) (Bao et al., 2022; Shirai et al., 2023).
To specifically investigate the causal discrimination capability of existing methods in conversation, we narrow down our research to a particular task: Affective Reasoning in Conversation (ARC), which includes Emotion-Cause Pair Extraction (ECPE) (Xia and Ding, 2019) and Emotion-Cause Span Recognition (ECSR) (Poria et al., 2021).
We begin by conducting tests to evaluate the causal discrimination of existing methods, including large language models (LLMs) (Kasneci et al., 2023). One typical evaluation involves the causal reversal test: for emotion-cause utterance pairs with true labels (A, B) representing a causal relationship B → A, we scrutinize the predictions generated by the existing methods for both positive pairs (A, B) and negative pairs (B, A). The results reveal that all the examined methods performed similarly across the two sample types; that is, they lacked causal discriminability. (Details are shown in Section 2.3.) In order to discriminate different causal relationships between two similar embeddings, we construct the dialogue process as a Structural Causal Model (SCM). Many endeavors (Cheng et al., 2022; Nogueira et al., 2022) support the view that the i.i.d. noise of an SCM can facilitate the discrimination of causal relationships when fitting two variables. Under the presence of noise, each utterance is not only explicitly influenced by the other utterances but also implicitly influenced by i.i.d. exogenous noise. Consequently, this framework ensures that two fitted embeddings result in diverse causal relationships, which are determined by the corresponding independence conditions between the residual terms and the embeddings. For simplicity, we refer to other utterances as explicit causes and to the exogenous noise as implicit causes.
Furthermore, to make such causal discrimination learnable within embeddings, we propose a common skeleton, named the centering-one-graph-node (cogn) skeleton, for each utterance, derived from several broadly accepted prior hypotheses. It addresses the challenges arising from variable-length and unstructured dialogue samples. Subsequently, we develop an autoencoder architecture to learn the unobservable implicit causes. Specifically, we treat the implicit causes as latent variables and utilize a graph attention network (GAT) (Veličković et al., 2017) to encode their representation. Additionally, the decoder leverages the inverse matrix of the causal strength, ensuring accurate retrieval of the causal relationships.
Finally, we conduct extensive experimental evaluations: 1) our approach significantly outperforms existing methods, including prominent LLMs (GPT-3.5 and GPT-4), in two affective reasoning tasks (ECPE and ECSR) and one emotion recognition task (ERC), demonstrating its effectiveness in affective reasoning; 2) our method exhibits a significant reduction in false predictions for negative samples across three causal discrimination scenarios; 3) we curate a synthetic dataset with implicit causes to visualize the latent variable in our implementation.
Our contribution is four-fold:
• We formulated the dialogue process as an SCM and analyzed the causal relationships represented by different independent conditions.
• We devised the cogn skeleton to address the problems of variable-length and unstructured dialogue samples.
• We adopted an autoencoder architecture to overcome the unobservability of implicit causes and make it learnable.
• We constructed a synthetic dataset with implicit causes and conducted extensive evaluations of our proposed method.
2 Related Works and Challenges

Task Definition
For notational consistency, we use the following terminology. The target utterance U_t is the t-th utterance of a conversation D = (U_1, U_2, U_3, ..., U_N), where N is the number of utterances in this conversation and 0 < t ⩽ N. The emotion label Emo_t denotes the emotion type of U_t.
The emotion-cause pair (ECP) is a pair (U_t, U_i), where U_i is the i-th utterance of this conversation.
In the ECP, U_t represents the emotion utterance and U_i the corresponding cause utterance. Moreover, the cause label C_{t,i} denotes the cause span type of the ECP (U_t, U_i).
Thus, in a given text, ERC is the task of identifying all Emo_t, ECPE aims to extract the set of ECPs, and ECSR aims to identify all C_{t,i}.

Affective Reasoning in Conversation
Chen et al. (2018) introduced the pioneering work on ERC owing to the growing availability of public conversational data. Building upon this, Xia and Ding (2019) further advanced the field by proposing ECPE, which jointly identifies both emotions and their corresponding causes. Moreover, Poria et al. (2021) extended ECPE to conversations and proposed the novel ECSR task, specifically designed to identify ECP spans within conversational contexts. More recently, an increasing number of works have indicated the crucial role played by accurate inference models in facilitating complex reasoning within these tasks, such as assumptions about interlocutors (Zhang et al., 2019; Lian et al., 2021; Shen et al., 2021) and context (Ghosal et al., 2019; Shen et al., 2022; Chen et al., 2023).

Challenge of Affective Reasoning
We examined the performance of a range of methods for addressing affective reasoning in conversations, including both unsupervised approaches (large language models (LLMs), BERT-based pretrained models) and supervised approaches (task-related approaches).
Overall, all the methods demonstrated a lack of discriminability on two types of challenges:
• Samples where emotional utterances and causal utterances are interchanged. For a dialogue instance, if the ECP is (U_1, U_2) (U_2 is the cause of U_1), the prediction results obtained by the existing methods tend to include both (U_1, U_2) and (U_2, U_1).
• Samples with indirect connections. For example, if the ECPs in a conversation are (U_1, U_2) and (U_2, U_3), the prediction results obtained by the methods often include an additional pair (U_1, U_3).
Table 1: For the two challenges mentioned in Section 2.3, we conducted tests on a subset of 200 samples from the RECCON dataset. We recorded the number of samples identified by the above methods. In the second row of Challenge 1, we show the count of samples where these methods extracted the negative pairs in reverse cause order. Similarly, in the third row of Challenge 2, we show the count of samples where these methods extracted negative indirect pairs.
We evaluated the performance of existing methods on these two challenges, and the detailed results are shown in Table 1. All evaluated methods extracted a nearly equal number of negative samples as positive samples. Considering their performance in broad research domains, both unsupervised and supervised methods demonstrate a desirable fitting ability to capture the semantic similarity between two utterances, which often results in satisfactory performance on most tasks. However, when it comes to more specific causal relationships within semantically similar sentences (as in reasoning tasks), they do not exhibit the same level of "intelligence" and output some "pseudo-correlations".
In the area of causal discovery, the Causal Markov and Faithfulness Assumptions (Spirtes et al., 2000; Colombo et al., 2012; Ogarrio et al., 2016) provide insights into capturing more specific causal relationships in the situations described above. Considering two similar variables A and B that can be fitted to each other, independence tests enable the determination of specific causal relationships, such as "A → B," "B → A," or "A → L → B". More recently, the Structural Causal Model (SCM) (Shimizu et al., 2006; Shimizu and Bollen, 2014; Sanchez-Romero et al., 2019), built upon independent-noise assumptions, has emerged as an effective approach to the limitation of Markov equivalence classes in distinguishing causal relationships.
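The role of independent non-Gaussian noise in orienting a fitted pair can be sketched numerically. The snippet below is a minimal illustration, not the paper's implementation: it simulates X → Y with uniform noise and checks which regression direction leaves a residual independent of the regressor. Since least-squares residuals are always *uncorrelated* with the regressor, we use a squared-correlation proxy (our assumption) to detect the remaining higher-order dependence.

```python
import numpy as np

def dep(a, b):
    """Nonlinear dependence proxy: |corr(a^2, b^2)|.
    OLS residuals are always uncorrelated with the regressor,
    so we test dependence of the squared values instead."""
    return abs(np.corrcoef(a**2, b**2)[0, 1])

def residual(y, x):
    """Residual of the least-squares fit y ~ x (both zero-mean)."""
    lam = np.dot(x, y) / np.dot(x, x)
    return y - lam * x

rng = np.random.default_rng(0)
n = 50_000
# Ground truth: X -> Y with non-Gaussian (uniform) implicit causes.
e_x = rng.uniform(-1, 1, n)
e_y = rng.uniform(-1, 1, n)
x = e_x
y = 2.0 * x + e_y

# Correct direction: residual of (Y ~ X) is just E_Y, independent of X.
forward = dep(residual(y, x), x)
# Wrong direction: residual of (X ~ Y) mixes E_X and E_Y, dependent on Y.
backward = dep(residual(x, y), y)

assert forward < 0.05 < backward
```

With Gaussian noise both directions would look identical, which is exactly the Markov-equivalence limitation the independent-noise assumption overcomes.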
Figure 1: A conversation case with five utterances; e.g., one speaker wanted his rules to be respected, but they are broken now. In the SCM, we assume each utterance U_i has a corresponding implicit cause E_i and several explicit causes; e.g., U_4 has an implicit cause E_4 and two explicit causes U_3 and U_2. In the lower part of the figure, the SCM adopts U_t = α_{i,t} U_i + E_t to denote these relationships and formalizes the conversation as a DAG.
The noise terms (also called exogenous variables) for each variable enable methods such as Independent Component Analysis (ICA) to identify more comprehensive causal relationships between two fitted variables.

Methodology
In this section, we begin by outlining how i.i.d. noise terms are incorporated into a dialogue model to construct an SCM in Section 3.1, demonstrating that independent residuals allow for the identification of more specific causal relations within pairs of fitted utterances. Next, to mitigate conflicts between SCM models and dialogue data, we design cogn skeletons with six instantiations in Section 3.2. Finally, we propose a deep learning implementation to tackle the issue of the noise being unknown in dialogue data in Section 3.3.

Structural Causal Model
In order to imbue causal discriminability into the fitting process of two relevant utterances, we algebraically construct the conversation model as a Structural Causal Model (SCM).
Definition 1: An SCM of a dialogue is a 3-tuple ⟨U, E, F⟩, where U = {U_i}_{i=1}^N is the set of utterances, E = {E_i}_{i=1}^N is the set of exogenous noise terms, and N is the number of utterances. Note that each E_i is independent in the SCM. The structural equations F = {f_i}_{i=1}^N are functions that determine U with U_i = f_i(rel_{U_i}) + E_i, where rel_{U_t} denotes the set of utterances that point to U_t.
Definition 1 establishes a novel computational model for the dialogue process, as exemplified in Figure 1. In such a model, each utterance is endogenous and influenced by an independent exogenous variable. For simplicity, we refer to the variables U as the explicit causes and the variables E as the implicit causes. The independence of the implicit causes makes the residual terms meaningful when fitting any two utterances.
Definition 2: The relationship of two utterances X and Y in a dialogue is causally discriminable from the independence conditions between the residual terms and the fitted variables (e.g., whether Σ_X ⊥⊥ Y and Σ_Y ⊥⊥ X hold), where Σ represents the residual terms in the fitting process. (The proof is shown in Appendix A.)
Example 1: In Example 1, it is observed that any two utterances can be fitted together as they are mutually dependent. However, causal discriminability can be employed to differentiate their distinct causal structures. For instance, the residual term Σ_{U_2} is not independent of U_3, implying the presence of a common cause (U_1) between U_2 and U_3.
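The common-cause case of Definition 2 can likewise be illustrated numerically. In this hedged sketch (the uniform noise and the squared-correlation dependence proxy are our assumptions, not the paper's implementation), a direct cause U_2 → U_3 leaves a residual independent of U_2, while a fork U_2 ← U_1 → U_3 does not:

```python
import numpy as np

def sq_dep(a, b):
    """Squared-correlation proxy for dependence beyond uncorrelatedness."""
    return abs(np.corrcoef(a**2, b**2)[0, 1])

def fit_residual(y, x):
    """Residual of the least-squares fit y ~ x (both zero-mean)."""
    lam = np.dot(x, y) / np.dot(x, x)
    return y - lam * x

rng = np.random.default_rng(1)
n = 50_000

# Case 1: direct cause U2 -> U3; residual is the implicit cause E3 alone.
u2 = rng.uniform(-1, 1, n)
u3 = u2 + rng.uniform(-1, 1, n)
direct = sq_dep(fit_residual(u3, u2), u2)

# Case 2: fork U2 <- U1 -> U3; the residual still carries the common
# cause U1, so it stays dependent on the regressor.
u1 = rng.uniform(-1, 1, n)
u2f = u1 + rng.uniform(-1, 1, n)
u3f = u1 + rng.uniform(-1, 1, n)
fork = sq_dep(fit_residual(u3f, u2f), u2f)

assert direct < 0.05 < fork
```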

Causal Skeleton Estimation
Establishing a skeleton is the first step in causal discovery, as different skeletons provide distinct learning strategies for recovering the relationships between variables. However, utterances differ from the variables that causal discovery typically handles. Specifically, each conversation has a different number (N) of utterances and different inter-utterance relationships related to the context. Hence, it is intractable to build a general causal skeleton with fixed nodes and edges that describes all conversation samples.
Fortunately, several published GNN-based approaches (Shen et al., 2021; Ishiwatari et al., 2021; Ghosal et al., 2019; Lian et al., 2021; Zhang et al., 2019) in ERC have proposed and verified common hypotheses that settle this issue. The hypotheses are elaborated in Appendix B. Figure 2 showcases the design of the six cogn skeletons, derived from the latest works that employ one or more of these hypotheses. The statistics and specific algorithms are also shown in Appendix B. Note that we only conduct experiments on II−VI because our structure is hard to apply to the recurrent-based skeleton.

Approach Implementation
From a given causal skeleton, a linear SCM can be equivalently represented as:

U_t = Σ_{i∈rel_t} α_{i,t} U_i + E_t,

where rel_t denotes the set of utterances that point to U_t (the 7th utterance in Figure 2) in the SCM, and E_t denotes the implicit cause of the utterance U_t in the conversation.

Figure 3: Processing of our approach, with a six-utterance conversation case as the input. The causal skeleton indicates which utterances (nodes) are used for aggregation. For each layer ℓ, we collect representations H^ℓ for all utterances, where each row represents one utterance. The causal encoder yields the implicit causes E^ℓ, the input for the decoder learning the causal representation. In all matrices, light gray nodes represent the masked part.

Furthermore, we denote the word embeddings of U by H = (h_1, h_2, ..., h_N), and the relationships between utterances can also be written row-wise as:

H = A^T H + E.

Thus we can define the graph G = (V, E) with adjacency matrix A, where A_{i,i} = 0 for all i. However, in this equation, only H is known. The unknown variable A is typically the target of inference, while the unknown variable E represents exogenous variables that implicitly influence each utterance, such as the speaker's memory, experiences, or desires (Krych-Appelbaum et al., 2007; Sidera et al., 2018). These factors are typically missing in existing conversation resources. Therefore, determining A based on a completely unknown E is another problem we aim to address.
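The matrix form H = A^T H + E (our notation for the relation described above) can be sanity-checked with a small numpy sketch using illustrative values: given the true causal strengths A, the implicit causes E are exactly recoverable from H.

```python
import numpy as np

N = 6
rng = np.random.default_rng(2)

# A[i, t] = causal strength alpha_{i,t} of edge U_i -> U_t; strictly
# upper-triangular here so earlier utterances cause later ones (a DAG).
A = np.triu(rng.uniform(0.2, 0.8, (N, N)), k=1)

E = rng.normal(0.0, 1.0, N)            # implicit causes, one per utterance
# H = A^T H + E  <=>  H = (I - A^T)^{-1} E
H = np.linalg.solve(np.eye(N) - A.T, E)

# Given the true causal strengths, the implicit causes are recoverable:
E_rec = (np.eye(N) - A.T) @ H
assert np.allclose(E_rec, E)
```

This inversion of (I − A^T) is the same quantity the decoder relies on, and what makes a completely unknown E a meaningful learning target once A is inferred.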
Hence, we treat A^T as an autoregression matrix of G, and then E can be yielded by an autoencoder model. The whole process reads:

E = f(H), Ĥ = g(E),

where f(·) and g(·) denote the encoder and decoder, respectively. The details of this process are shown in Figure 3.

Encoder. We use the graph attention mechanism to learn the adjacency matrix A and construct a hierarchical GNN to instantiate f(·), where ℓ = 1, 2, ..., L−1 indexes the layers of the GNN. Thus, for each utterance at the ℓ-th layer, A^ℓ_{i,t}, computed by the attention mechanism, is a weighted combination over h^ℓ_t and each directly related utterance U_i (i ∈ rel_t):

A^ℓ_{i,t} = softmax_{i∈rel_t}(h^ℓ_i W^ℓ_row + h^ℓ_t W^ℓ_col),

where W^ℓ_row ∈ R^{N×1} and W^ℓ_col ∈ R^{N×1} are the learnable parameters of the graph attention. Moreover, the GNN aggregates the information from the neighboring utterances as follows:

h^{ℓ+1}_t = σ(Σ_{i∈rel_t} A^ℓ_{i,t} W^ℓ h^ℓ_i),

where W^ℓ stands for the parameters of the corresponding layer. From the final layer of the evaluation process, by extracting A^{L−1} computed in Equation 4, the marginal or conditional "distribution" of H is obtained, showing how to discover the causal graph G from D. Besides, by extracting H^L in Equation 6, we can obtain the independent embedding of the implicit causes, E = MLP(H^L).
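A minimal numpy sketch of this masked graph-attention step is given below. The scoring function (a sum of two learned projections followed by a softmax restricted to rel_t) is our reading of the attention described above, not a verbatim reimplementation; the projection vectors stand in for W_row and W_col.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def attention_adjacency(H, rel, w_row, w_col):
    """Hedged sketch of the masked graph attention producing A.
    H: (N, d) utterance embeddings; rel[t]: indices of utterances with
    an edge into U_t in the cogn skeleton; w_row, w_col: (d,) learnable
    projection vectors standing in for W_row, W_col."""
    N = H.shape[0]
    A = np.zeros((N, N))
    for t in range(N):
        if not rel[t]:
            continue  # no incoming edges for this utterance
        scores = np.array([H[i] @ w_row + H[t] @ w_col for i in rel[t]])
        A[rel[t], t] = softmax(scores)  # weights over allowed edges only
    return A

rng = np.random.default_rng(3)
N, d = 5, 8
H = rng.normal(size=(N, d))
rel = {t: list(range(t)) for t in range(N)}  # e.g. all past utterances
A = attention_adjacency(H, rel, rng.normal(size=d), rng.normal(size=d))

# Each column with incoming edges is a proper weighting; masked entries stay 0.
assert np.allclose(A[:, 1:].sum(axis=0), 1.0)
assert np.allclose(A[:, 0], 0.0) and np.allclose(np.diag(A), 0.0)
```

The skeleton mask (rel_t) is what lets one architecture serve conversations of any length: the attention is only ever computed over the edges the cogn skeleton permits.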
Decoder. We utilize the A and E computed by the encoder to generate the causal representation Ĥ. With a fixed adjacency matrix A, the GNN aggregates the information of the implicit causes from neighboring nodes as follows:

E^{ℓ+1}_t = σ(Σ_{i∈rel_t} A_{i,t} M^ℓ E^ℓ_i),

where M^ℓ denotes the parameters of the corresponding layer. With the same architecture as the encoder, Ĥ = MLP(E^L). Additionally, a plug-in RNN is integrated with the GNN to satisfy Hypothesis 6, where p^ℓ, the state of the GRU model, is computed by the self-attention mechanism proposed by Thost and Chen (2021).

Optimization
For the ERC task, we adopt a cross-entropy objective over emotion types, where e is any emotion type in Emo_t and p_e denotes the predicted probability of emotion e. In the whole process of the ARC tasks, we followed Wei et al. (2020) and Poria et al. (2021) to add the respective losses of ECPE and ECSR. Furthermore, we would like to explain the difference between our approach and the Variational Auto-Encoder (VAE) (Kingma and Welling, 2014). The output of the encoder in a VAE is q_ϕ(Z). With this estimation of q_ϕ(Z), we can measure the variation ξ(q_ϕ(Z)) (also called ∇_ϕ ELBO(q_ϕ(Z))) to obtain an approximate estimation of the ELBO. In contrast, our output E is a fixed matrix rather than a distribution. In other words, the VAE depends on a prior distribution over the latent variables, whereas our approach depends on the consistency of H and Ĥ, which is non-sampling and non-distributive.
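A hedged sketch of the two ingredients named above, the emotion cross-entropy and the non-distributional consistency between H and Ĥ, might look as follows; the function names and the MSE form of the consistency term are illustrative assumptions, not the paper's exact losses.

```python
import numpy as np

def consistency_loss(H, H_hat):
    """Reconstruction term tying the decoder output H_hat back to H
    (the non-distributional analogue of a VAE's ELBO, per the text).
    Sketched here as mean squared error."""
    return float(np.mean((H - H_hat) ** 2))

def emotion_ce(p, label):
    """Cross-entropy over emotion types; p[e] is the predicted
    probability of emotion e (indices here are illustrative)."""
    return float(-np.log(p[label] + 1e-12))

# A confident correct prediction costs less than a confident wrong one.
p = np.array([0.7, 0.2, 0.1])          # e.g. (happy, sad, neutral)
assert emotion_ce(p, 0) < emotion_ce(p, 2)

# Identical H and H_hat give zero consistency loss.
H = np.ones((4, 3))
assert consistency_loss(H, H.copy()) == 0.0
```

Note that nothing here is sampled: both terms are deterministic functions of fixed tensors, which is exactly the contrast with the VAE drawn above.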

Experiments
In this section, we conduct extensive experiments to answer three research questions. RQ1: How effective is our method in affective reasoning tasks?
RQ2: How do we justify the causal discriminability of our method?
RQ3: How do we gauge the difference between the latent variable E and the designed implicit causes?
According to the hypotheses of these baselines, for each cogn skeleton we choose one recent SOTA work: II: DialogXL (Shen et al., 2022)

Overall Performance (RQ1)
Table 3 reports the results of ECPE and ECSR, with p < 0.01 in the t-test; the best improvement and best performance both concentrate on VI. Based on the visualization in Appendix F, we infer that the upper-triangular adjacency matrix of DAG-ERC, not restricted by backpropagation, benefits from Hypothesis 6. Moreover, II lags farthest behind in ECPE while achieving the second best in ECSR, showing that the reliance on a hypothesis is not equal across tasks. Furthermore, without Hypotheses 1 and 6, III, IV, and V are far from the best performance, since Hypothesis 1 has the maximum identifying space and Hypothesis 6 supplies the highest amount of information passing. Finally, it is worth noting that the three skeleton-agnostic baselines and the unsupervised methods perform poorly on the RECCON-IE dataset, indicating that our models have stronger representation learning capabilities and highlighting the continued research value of affective reasoning tasks.
We further conducted six sets of ablation experiments to study the effects of different modules. Furthermore, Appendix E reports the results of the ERC task and sensitivity experiments analyzing how our model performs with different L and k.

Relationship analysis (RQ2)
We are also concerned with the causal discriminability for similar utterances. Table 5 demonstrates that, in all three different causal models, none of the baseline methods were able to distinguish between negative and positive samples, because both can be fitted within these three causal models when viewed solely from an embedding-similarity perspective. However, our method significantly decreases the percentage of negative samples, indicating the effectiveness of incorporating implicit-cause noise to enhance causal discriminative ability.
Additionally, we show the adjacency matrices of our model and current SOTA methods in Appendix F, which indicates that our model can more freely explore the relationships between different utterances via adjacency-matrix shifting rather than being limited to a fixed structure (e.g., an attention module).

Implicit Causes Study (RQ3)
The latent variable E is intended to represent the aforementioned implicit causes. Therefore, the global distribution of the latent variable E should be approximately equal to that of the implicit causes. Although human evaluation labels would be better for proving reasonable performance, it is intractable to annotate implicit causes due to their unobservability. We thus trained our model on a synthetic dataset with a set of fixed i.i.d. implicit causes to observe how similar E is to the ground-truth implicit cause distributions. Figure 4 (a-b) shows the projections of E and the implicit causes, respectively, using t-SNE (Knyazev et al., 2019). We observe that E and the implicit causes are both similarly clustered into three parts according to their distribution properties. E is consistent with the implicit causes in the samples with or without noise, indicating that E successfully learns the implicit causes.
Moreover, in Appendix G, we first prove the approximate emotion consistency between utterance U t and its implicit causes when U t and U i in the emotion-cause pair (U t , U i ) do not belong to the same emotion category.Then, we demonstrate through the ERC task that by replacing Ĥ with E, the emotion consistency provided by implicit causes is preserved.

Limitations
Our method can distinguish between U_i → U_j and U_i ← U_k → U_j. However, it is unable to distinguish between U_i → U_j and U_i ← L → U_j, where L represents an unobserved variable, called a common cause or confounder. In Tables 3, 7, and 8, skeletons II, III, and IV generally lag far behind V and VI. This unsatisfactory performance indicates that excessive edge addition introduces serious confounding.
Therefore, we propose a theoretical design for testing the existence of latent confounders.
Confounding between non-adjacent nodes: Consider two non-adjacent utterances U_i and U_j. Let Pa be the union of the parents of U_i and U_j: Pa = Pa(U_i) ∪ Pa(U_j). If we perform an intervention on Pa (i.e., do(Pa = pa)), then U_i and U_j should be independent under this intervention whenever no latent confounder exists between them.
Confounding between adjacent nodes: Consider two adjacent utterances U_i and U_j with U_i → U_j. If there are no latent confounders, the interventional distribution coincides with the conditional one: P(U_j | do(U_i = u_i)) = P(U_j | u_i).
Indeed, implementing intervention operations on conversation data poses a significant challenge. Therefore, in our new work, we have proposed a general intervention writing: do(X) := Pa(X) = ∅, where Pa(X) denotes the parent set of X. Moreover, the most significant obstacle to further research is the lack of a high-quality dataset with complete causal relationship labels. Hence, we have constructed a simulated dialogue dataset via GPT-4 and plan to make it open soon.
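The general intervention do(X) := Pa(X) = ∅ can be simulated directly on a linear SCM: cutting every edge into a node and re-sampling removes its dependence on non-adjacent ancestors. This is a minimal sketch under our own toy adjacency, not the paper's testing procedure.

```python
import numpy as np

def sample_scm(A, n, rng):
    """Draw n samples (rows) of H = A^T H + E for DAG adjacency A,
    i.e. H = E (I - A)^{-1} in row-vector form."""
    N = A.shape[0]
    E = rng.uniform(-1, 1, (n, N))
    return E @ np.linalg.inv(np.eye(N) - A)

def do_empty_parents(A, t):
    """The general intervention do(U_t) := Pa(U_t) = empty set:
    cut every edge pointing into U_t."""
    A2 = A.copy()
    A2[:, t] = 0.0
    return A2

rng = np.random.default_rng(4)
N, n = 3, 50_000
A = np.zeros((N, N))
A[0, 1] = A[1, 2] = 0.8          # chain U1 -> U2 -> U3 (0-indexed)

H_obs = sample_scm(A, n, rng)
H_do = sample_scm(do_empty_parents(A, 2), n, rng)

obs = abs(np.corrcoef(H_obs[:, 0], H_obs[:, 2])[0, 1])
post = abs(np.corrcoef(H_do[:, 0], H_do[:, 2])[0, 1])
assert post < 0.05 < obs   # dependence vanishes once Pa(U_3) is cut
```

With a latent confounder between the two nodes, the post-intervention dependence would not vanish, which is the signature the tests above look for.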

Conclusion
The results of testing prevalent approaches on the ARC task have demonstrated that almost all approaches are unable to determine the specific causal relationship that leads to the association of two well-fitted embeddings. In order to enhance the causal discrimination of existing methods, we constructed an SCM with i.i.d. noise terms and analyzed the independence conditions that can identify the causal relationships between two fitted utterances. Moreover, we proposed the cogn framework to address the unstructured nature of conversation data, designed an autoencoder implementation to make the implicit causes learnable, and created a synthetic dataset with noise labels for comprehensive experimental evaluation. While our method still has some limitations, such as confounders and the inability to scale to all methods, we hope that our theory, design, and model can provide valuable insights for the broader exploration of this problem and demonstrate that identifying causal relationships is a de facto need.

A Proof of Definition 2
Let X and Y be two variables in an SCM, with their respective noise terms denoted as E_X and E_Y (where E_X and E_Y are mutually independent). Let X̂ and Ŷ represent the fitted values of X and Y w.r.t. each other: X̂ = λY and Ŷ = (1/λ)X. The residual terms between the fitted values and the true values are denoted as Σ_X = X − X̂ and Σ_Y = Y − Ŷ. Hence, if the SCM only contains two variables and the true causal relationship is from Y to X, it can be written as:

X = kY + E_X, Y = E_Y.

The residual terms can then be written as:

Σ_X = (k − λ)Y + E_X, Σ_Y = (1 − k/λ)E_Y − (1/λ)E_X.

Then, if the true causal relationship is from Y to X, λ = k, and Σ_X does not contain the term E_Y while Σ_Y contains the term E_X. We therefore obtain the independence relations of the residual terms:

Σ_X ⊥⊥ Y, Σ_Y ̸⊥⊥ X,

and vice versa. Therefore, we obtain the independence condition for this orientation. Furthermore, there may exist the set of dependencies Σ_X ̸⊥⊥ Y and Σ_Y ̸⊥⊥ X. For this situation, we assume that there is a latent variable L constructing the two relationships L → X and L → Y.
Then we obtain Σ_L ̸⊥⊥ X and Σ_L ̸⊥⊥ Y. By utilizing the transitivity of conditional independence, we can establish X ̸⊥⊥ Y, and finally reach the situation Σ_X ̸⊥⊥ Y, Σ_Y ̸⊥⊥ X. We likewise assume a latent variable L establishing X → L and Y → L for the opposite situation, where Σ_X ⊥⊥ Y, Σ_Y ⊥⊥ X, and X and Y are two isolated variables in the SCM. From these independence conditions, together with the graph structure of the SCM and the residual terms, we obtain two additional independence conditions for this case. Based on the independence conditions of the two-variable SCM, we can extend the analysis to a general SCM with more than two variables: given any two variables in an SCM, we test the independence conditions and finally orient the edge via the whole SCM.

B Hypotheses and Algorithms for Skeletons
Hypothesis 0. ∀U i ∈ D, it has the same causal skeleton as other utterances.
By regarding Hypothesis 0 as prior knowledge, a common causal skeleton containing a target variable and a fixed number of related variables can reason about the relations between the target utterance and the other considered utterances. We denote this skeleton of U_t by S(U_t). Additionally, there are some other empirical hypotheses from the above approaches. These hypotheses can be divided into two categories: one concerns the "order" of utterances (Hypotheses 1, 2, 3), and the other concerns the intermingling dynamics among the interlocutors (Hypotheses 4, 5, 6).
Hypothesis 1. (Majumder et al., 2019) Under the sequential order, the target utterance receives information only from the previous utterance.
Table 6: Statistics of the 6 cogn skeletons. We detail the hypotheses each cogn skeleton adopts and the original works from which we designed them.
Hypothesis 2. (Wei et al., 2020) Under the graph order, the target utterance receives information from all other utterances.
Hypothesis 6. (Shen et al., 2021) Between two utterances both related to the target utterance, there is also information passing, often dubbed a partial order.
A cogn skeleton is denoted by H = (V, E, M). Here V = {U_1, U_2, U_3, ..., U_N} represents the set of utterances in a conversation, and an edge (i, j, m_{i,j}) ∈ E denotes the influence from U_i to U_j, where m_{i,j} ∈ M is the type of the edge, depending on whether U_i and U_j belong to one and the same speaker. Thus M = {0, 1}, where 1 means the two utterances share the same speaker and 0 means they do not. We denote the speaker type of U_i by a function p(U_i). Finally, we show the process of building the 6 cogn skeletons in Algorithms 1−6.
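As an illustration of the H = (V, E, M) notation, the sketch below builds two cogn skeletons from a list of speaker turns. Restricting Hypothesis 2 to earlier utterances is our assumption for keeping the example acyclic; the algorithms in this appendix are the authoritative constructions.

```python
from itertools import combinations

def cogn_skeleton_h1(speakers):
    """Hypothesis 1 (sequential order): each utterance receives
    information only from the previous one. Edges are (i, j, m_ij)
    with m_ij = 1 iff U_i and U_j share a speaker."""
    return [(i, i + 1, int(speakers[i] == speakers[i + 1]))
            for i in range(len(speakers) - 1)]

def cogn_skeleton_h2(speakers):
    """Hypothesis 2 (graph order): the target utterance receives
    information from all other (here: earlier) utterances."""
    return [(i, j, int(speakers[i] == speakers[j]))
            for i, j in combinations(range(len(speakers)), 2)]

# A binary alternating conversation with 6 utterances, as in Figure 5.
speakers = ["A", "B", "A", "B", "A", "B"]
E1 = cogn_skeleton_h1(speakers)
E2 = cogn_skeleton_h2(speakers)
assert len(E1) == 5 and len(E2) == 15
assert all(m == 0 for _, _, m in E1)  # alternating speakers never repeat
```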
Finally, in Figure 5, we show the adjacency matrix of each cogn skeleton by inputting a binary alternating conversation case with 6 utterances. Note, however, that the adjacency matrices cannot indicate all the differences among these skeletons; for example, Hypothesis 6 takes effect when the model learns the relationships based on the VI skeleton.
C Datasets
We follow Shen et al. (2021) to regard utterance turns as speaker turns.
MELD (Poria et al., 2019): A multimodal ERC dataset with 7 emotion labels, the same as DailyDialog.
IEMOCAP (Busso et al., 2008): A multimodal ERC dataset with 9 emotion labels (neutral, happy, sad, angry, frustrated, excited, surprised, disappointed, and fear). However, models in the ERC field are often evaluated on samples with the first six emotions because the latter three have too few samples. Following Shen et al. (2021), 20 dialogues are used for the validation set.
Synthetic dataset: We create a synthetic dataset following the benchmarks of the causal discovery field (Agrawal et al., 2021; Squires et al., 2022). To minimize sample bias, we did not randomly draw causal graphs as samples. Instead, the number of samples in the synthetic dataset and the number of utterances and labels per sample are restricted to be consistent with RECCON. We use Causal Additive Models (CAMs), specifically an SCM structure, for our datasets. As shown in Algorithm 7, first, we assume each i.i.d. implicit cause E ∼ N(1, 1) (in 50 dimensions) if it corresponds to an emotion utterance in the original dataset, and E ∼ N(−1, 1) if it does not. Then, we update each utterance via speaker turns S: if there is an emotion-cause pair (U_i, U_j) ∈ L, then U_i = α_{j,i} U_j + E_i with α_{j,i} ∼ Uniform([0.7, 1]), and for pairs without an emotion-cause label, α_{j,i} ∼ Uniform([0, 0.3]). Finally, we randomly select a noise ξ_i ∼ Uniform([−0.25, 0.25]) for each utterance: U_i = U_i + ξ_i.
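The generation procedure of Algorithm 7 can be sketched as follows. The 50-dimensional implicit causes and the summing of contributions from multiple earlier causes are our assumptions about the abbreviated notation; the function and argument names are illustrative.

```python
import numpy as np

def synth_conversation(n_utt, emo_idx, pairs, dim=50, rng=None):
    """Hedged sketch of the synthetic generator described above.
    emo_idx: indices of emotion utterances; pairs: set of (i, j)
    meaning U_j is a labeled cause of U_i."""
    if rng is None:
        rng = np.random.default_rng(0)
    # i.i.d. implicit causes: N(1,1) for emotion utterances, N(-1,1) otherwise.
    E = np.stack([rng.normal(1.0 if i in emo_idx else -1.0, 1.0, dim)
                  for i in range(n_utt)])
    U = np.zeros_like(E)
    for i in range(n_utt):
        U[i] = E[i]
        for j in range(i):  # earlier utterances may cause later ones
            # strong weight for labeled emotion-cause pairs, weak otherwise
            lo, hi = (0.7, 1.0) if (i, j) in pairs else (0.0, 0.3)
            U[i] = U[i] + rng.uniform(lo, hi) * U[j]
    # per-utterance noise xi ~ Uniform([-0.25, 0.25])
    return U + rng.uniform(-0.25, 0.25, U.shape)

U = synth_conversation(4, emo_idx={3}, pairs={(3, 1)})
assert U.shape == (4, 50)
```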

D Implementation Details
For the word embeddings, we adopt the affect-based pre-trained features proposed by Shen et al. (2021) for all baselines and models.
Although the skeleton baselines use different pre-trained models, the SOTA works DAG-ERC and EGAT have investigated their performance under a consistent pre-trained model. Therefore, for a fair and direct comparison, we continue this benchmark using the pre-trained embeddings published by DAG-ERC for the three tasks.
For the hyper-parameters, we follow the setting of Shen et al. (2021) in the ERC task. Moreover, for ECPE and ECSR, the learning rate is set to 3e-5, the batch size to 32, and the number of epochs to 60. Further, in our approach, we set L to 1, the implicit cause size to 192, the hidden size of the GNN to 300, and the dropout rate to 0.3. Meanwhile, because there is only one training dataset for ECPE and ECSR, we evaluated our method ten times with different data splits, following Chen et al. (2023), and then performed a paired-sample t-test on the experimental results.
Finally, we adopted downstream task modules consistent with the SOTA baselines: Wei et al. (2020) for ECPE and ECSR, and Shen et al. (2021) for the ERC task.
For evaluation metrics, we follow

E Other Experiments in Affective Reasoning
In Table 7, our approach performs better than the corresponding baseline under all skeletons on the four datasets. Hence, using a causal autoencoder to find the implicit causes benefits this task. Besides, our approach improves significantly under skeletons II, III, and IV. As shown in Figure 2, these three skeletons have more relevant nodes than the others, so there are more redundant edges to be corrected by our approach, which is demonstrated again in Appendix E. In contrast, V and VI achieve the best results on the MELD, EmoryNLP, and IEMOCAP datasets, which indicates that Hypothesis 5 is more probably a strong inductive bias that conversation enjoys.
Then, we investigate how the number of layers and the variants of causal skeletons affect the performance of our approach. We further conducted several comparisons with k up to 5 and L up to 6, as shown in Figure 6. One observation is that the best performance occurs at k = 1, 2, or 3, which indicates that k ⩾ 4 offers no advantage and even leads to confounding. Moreover, L = 1 achieves the best performance under all k values; in other words, one layer is sufficient to yield the most effective implicit causes.

F Visualization of Causal Graph
In Figures 7 to 11, we show the visualization of the adjacency matrix (I − A^T)^{-1}. When the auxiliary loss Loss_KL achieves its lower bound, (I − A^T)^{-1} represents the relationship matrix between utterances and implicit causes.
In the ECPE task, we extracted 10 samples from the test sets in different folds. To facilitate comparison, we selected five 7-utterance cases and five 8-utterance cases; the sample IDs are given in the corresponding figures. To obtain non-negative values, we adopted T = sigmoid(·) − 0.05 to process the original tensors (I − A^T)^{-1} outputted from the encoder. We follow the common practice of setting the threshold to 0.05 to delete unimportant edges. Then, to highlight which implicit cause contributes most to each utterance, we applied softmax(·) to the columns and labeled the blocks with value > 0.
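The post-processing described above (sigmoid shift, 0.05 pruning threshold, column-wise softmax) can be sketched as follows. The input matrix here is an illustrative placeholder, not a tensor from the paper, and the function name is ours.

```python
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def postprocess(M, threshold=0.05):
    """Post-process a raw (I - A^T)^{-1} matrix for visualization:
    T = sigmoid(.) - 0.05, prune entries below the threshold, then
    apply a column-wise softmax to highlight the dominant implicit
    cause for each utterance."""
    n, m = len(M), len(M[0])
    T = [[sigmoid(M[i][j]) - 0.05 for j in range(m)] for i in range(n)]
    T = [[v if v >= threshold else 0.0 for v in row] for row in T]
    S = [[0.0] * m for _ in range(n)]
    for j in range(m):
        col = [T[i][j] for i in range(n)]
        mx = max(col)
        exps = [math.exp(v - mx) for v in col]
        z = sum(exps)
        for i in range(n):
            S[i][j] = exps[i] / z
    return T, S

# Toy 2x2 encoder output: one strongly negative entry gets pruned.
M = [[2.0, -3.0], [0.5, 1.5]]
T, S = postprocess(M)
```

Subtracting 0.05 inside the sigmoid shift and then pruning at 0.05 means weakly negative raw entries (such as M[0][1] above) are removed before the column normalization.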
It is expected that: (i) when skeletons construct excess edges, our model is able to degrade the influence of negligible utterances by deleting the corresponding edges from their implicit causes; (ii) when skeletons construct insufficient edges, our model can add edges to obtain more information.

G Proof of Emotion Consistency of Implicit Causes and Utterances
We would like to explain, from both theory and equations, why implicit causes and utterances are consistent in emotion, in the condition where the emotional utterance and the cause utterance possess different emotion types.
We define the implicit causes as the unobservable emotional desire and the utterances as the observable emotional expression. This definition is proposed in Ong et al. (2015, 2019), which also argues that emotional expression is affected by desires and event outcomes. Moreover, for emotional utterances that are not influenced by explicit cause factors, the source of their emotions should originate from implicit causes. The desire and the expression generally belong to the same emotion because the outcomes often have little effect on emotional expression. Our paper can also deduce this conclusion from the SCM (Equation 1). Consider a linear map f(·) from the representation space to the emotion space. Then we can obtain f(E) = f((I − A)U) = f(WU). Note that W = (I − A) and A_{i,i} = 0, so in W the value of the elements on the diagonal is constant at 1 and is the constant maximum of each column. Naturally, f(E) is an approximate estimate of f(U), especially when U_t and U_i in the ECP (U_t, U_i) do not belong to the same emotion category, which is why we think the implicit causes are reasonable when the F1 score of Table 6 is high. Therefore, we test the F1 scores in the ERC task by replacing H with E, from the consensus that implicit causes should be aligned with utterances in the emotion types.
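The diagonal-dominance argument above can be checked numerically: under the linear SCM, E = (I − A)U = WU, with unit diagonal in W, so E stays close to U when the off-diagonal entries of A are small. The matrix A and utterance values below are toy placeholders, not values from the paper.

```python
def matvec(W, u):
    """Plain matrix-vector product."""
    return [sum(W[i][j] * u[j] for j in range(len(u))) for i in range(len(W))]

n = 3
# A has zero diagonal (A[i][i] = 0) and weak off-diagonal causal edges.
A = [[0.0,  0.0, 0.0],
     [0.1,  0.0, 0.0],
     [0.05, 0.1, 0.0]]
# W = I - A: unit diagonal, the maximum of each column in magnitude.
W = [[(1.0 if i == j else 0.0) - A[i][j] for j in range(n)] for i in range(n)]

U = [1.0, 2.0, 3.0]   # toy 1-d utterance representations
E = matvec(W, U)      # implicit causes E = (I - A) U
# Each E[i] equals U[i] minus small contributions from U's predecessors,
# so any linear map f gives f(E) ~ f(U).
```

This is only a scalar-feature illustration; in the model the same identity holds per representation dimension.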
In Table 8, we report the overall results of E in the ERC task. Note that we only examine the samples of ECPs with different emotion types. Among the five skeletons and four datasets, almost all results achieve 90% of the corresponding performance of H, which indicates that E is practically aligned with H in the affective dimension.

Figure 2: Six cogn skeletons from a conversation case with 12 utterances. We adopted the 7-th utterance as the target utterance (red). Orange nodes denote the utterances of the same speaker as the target utterance, and blue ones denote those belonging to other speakers. An arrow represents the information propagated from one utterance to another, and a bi-way arrow represents an influence-agnostic relationship. The black dashed box represents the sliding windows.

Figure 4: Visualization of E (upper subfigures) and implicit causes (lower subfigures) with colors in the simulated datasets. The gray cluster denotes padding utterances in each dialogue, the blue cluster corresponds to non-emotion utterances, and the red cluster corresponds to emotion utterances.
Hypothesis 3. (Ghosal et al., 2019) Under the local graph order, the target utterance receives local information from the k surrounding utterances.
Hypothesis 4. (Zhang et al., 2019) The influence between two utterances can be discriminated by whether the two utterances belong to the same speaker identity.
Hypothesis 5. (Lian et al., 2021) The target utterance only receives information from the predecessor utterances.
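Hypotheses 3–5 together determine which edges a skeleton admits, and can be sketched as an adjacency-mask construction: utterance i receives from utterance j only when j is a predecessor (Hypothesis 5) within a window of k (Hypothesis 3), with edges typed by speaker identity (Hypothesis 4). The function and edge labels below are illustrative, not taken from the released code.

```python
def build_adjacency(speakers, k):
    """Build a typed adjacency mask: adj[i][j] is the edge type of
    j -> i, or None when no edge exists under the hypotheses."""
    n = len(speakers)
    adj = [[None] * n for _ in range(n)]
    for i in range(n):
        # Hypothesis 5: predecessors only; Hypothesis 3: window of k.
        for j in range(max(0, i - k), i):
            # Hypothesis 4: discriminate edges by speaker identity.
            adj[i][j] = "same" if speakers[i] == speakers[j] else "other"
    return adj

# Toy two-speaker conversation with four utterances.
adj = build_adjacency(["A", "B", "A", "B"], k=2)
```

Different skeletons then correspond to relaxing or tightening these constraints (e.g., dropping the window for fully connected predecessors, or collapsing the speaker types).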

Figure 5: Adjacency matrices for the 6 cogn skeletons when k = 2. (i, j) ≠ None represents that U_i is influenced by U_j.

Table 7: Overall performance in the ERC task. † denotes the results implemented in this paper. The better scores in the same skeleton are in bold, and the best of all skeletons is in red.

Figure 6: Further layers L and related node number k with the VI skeleton model in the ECPE task.
our approach, H and Ĥ act identically under the linear SCM model. Similarly, H should be aligned with Ĥ in the emotion dimensions under the non-linear SCM model. In short, we adopt an auxiliary loss measuring the Kullback-Leibler (KL) divergence (Joyce, 2011) of H and Ĥ mapped into the same emotion dimensions. Moreover, the implicit causes E are one of the crucial influence factors on H, so this loss imposes the constraint that H is the embedding we need in order to ensure generating the correct E.
Loss_KL = Σ_t Σ_{e ∈ Emot} p_e(Û_t) log ( p_e(Û_t) / p_e(U_t) )
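The auxiliary KL loss above sums, over utterances t and emotion classes e, the divergence between the emotion distribution of the reconstructed utterance and that of the original utterance. A minimal sketch follows; the distributions are illustrative softmax outputs, not values from the paper, and the function name is ours.

```python
import math

def kl_auxiliary_loss(p_hat, p):
    """Sum over utterances of KL(p_hat_t || p_t), where each row is a
    probability distribution over emotion classes."""
    loss = 0.0
    for ph_t, p_t in zip(p_hat, p):
        loss += sum(a * math.log(a / b) for a, b in zip(ph_t, p_t))
    return loss

# Toy emotion distributions for two utterances over three classes.
p_hat = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]]
p     = [[0.6, 0.3, 0.1], [0.2, 0.7, 0.1]]
loss = kl_auxiliary_loss(p_hat, p)
```

The loss is zero exactly when the two distributions match for every utterance, which is the lower bound referred to in Appendix F.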

Table 2: The statistics of the seven datasets.

4.1 Datasets, Implementation and Baselines

We use six real datasets for the three affective reasoning tasks and one synthetic dataset for justifying E in our model. Their statistics are shown in Table 2, and Appendix C gives a detailed introduction to each dataset. We adopt benchmarks consistent with the SOTA methods in all three tasks, including the pre-trained language model, hyper-parameters, t-tests, and metrics. The implementation details are given in Appendix D.

Table 4: Results under the following cases: replacing Loss_KL with the BCE loss function (BCE); removing Loss_KL (w/o Loss_KL); replacing the Decoder module with a Linear layer (w/o Decoder); removing the RNN module (w/o Hypo 6); adding the edges from successors to predecessors (w/o Hypo 5); reducing the speaker types to one (w/o Hypo 4).

As shown in Table 4, BCE loss performs similarly to Loss_KL; thus, we empirically demonstrate that our auxiliary loss is essentially different from the KL loss in VAE. The F1 score decreases heavily without the auxiliary loss or the decoder; these two are necessary ingredients for building the complete process of learning the causal representation via E. Besides, Hypotheses 4, 5, and 6 are all critical, but removing Hypothesis 4 leads to the largest degradation in three skeletons. This result corroborates the theory of Lian et al. (2021) and Shen et al. (2021), who state that speaker identity is a strong inductive bias in conversation. Finally, it is expected that the skeleton with Hypotheses 4, 5, and 6 should be the closest to perfection, and DAG-ERC+Ours indeed achieves the SOTA.

Table 8: Overall performance of the implicit causes E in the ERC task.