Multi-Granularity Semantic Aware Graph Model for Reducing Position Bias in Emotion-Cause Pair Extraction

The Emotion-Cause Pair Extraction (ECPE) task aims to extract emotions and causes as pairs from documents. We observe that the distribution of relative distances between emotions and causes is extremely imbalanced in the typical ECPE dataset. Existing methods set a fixed-size window to capture relations between neighboring clauses. However, they neglect the effective semantic connections between distant clauses, leading to poor generalization on position-insensitive data. To alleviate this problem, we propose a novel Multi-Granularity Semantic Aware Graph model (MGSAG) that jointly incorporates fine-grained and coarse-grained semantic features without any distance limitation. In particular, we first explore semantic dependencies between clauses and keywords extracted from the document, which convey fine-grained semantic features, obtaining keyword-enhanced clause representations. In addition, a clause graph is established to model coarse-grained semantic relations between clauses. Experimental results indicate that MGSAG surpasses the existing state-of-the-art ECPE models. In particular, MGSAG significantly outperforms other models on position-insensitive data.


Introduction
Emotion Cause Analysis (ECA) has attracted increasing research interest in recent years (Wei et al., 2020; Sun et al., 2021; Singh et al., 2021; Yu et al., 2021) because of its great potential for applications such as consumer review mining, public opinion monitoring, and empathetic chatbot building. Its goal is to detect the causes or stimuli of a certain emotion expressed in text.
Emotion Cause Pair Extraction (ECPE) (Xia and Ding, 2019) is a new task related to ECA that is concerned with the causal relationships between emotions and causes. It is a much more challenging task, because it requires a comprehensive understanding of document content and structure to perform emotion-cause co-extraction and to discriminate true emotion-cause clause pairs from negative ones (Wei et al., 2020). As shown in the following example, the emotion clause c_7 and the cause clause c_2 form an emotion-cause pair (c_7, c_2), which an ECPE model needs to extract.

Figure 1: The distribution of the relative distance between the emotion clause and the cause clause of a pair in the ECPE dataset (Xia and Ding, 2019). Dist0, Dist1, and Dist2 mean the relative distance between the two clauses is 0, 1, and 2, respectively. Dist>2 means the relative distance is larger than 2.

Example.
When the driver was about to start the bus to leave the station (c_1), an old lady ran quickly to the front of the bus and sat down on the ground (c_2). Passengers standing at the front of the bus could see this scene clearly (c_3). Seeing this scene (c_4), the passengers in the bus immediately became restless (c_5) and had a heated debate (c_6). Some of the passengers were angry (c_7) and told the driver he shouldn't be meddlesome (c_8).
In general, the number of candidate emotion-cause pairs is the square of the number of clauses in a document, yet most documents contain only one emotion-cause pair. Due to this tremendous search space, most existing methods fully exploit relative position features to decrease the number of candidate pairs. For instance, ECPE-MLL (Ding et al., 2020b) and SLSN (Cheng et al., 2020) set a fixed-size window around a central clause, and the central clause and the other clauses inside the window form candidate pairs. However, models that rely heavily on relative position features ignore distant semantic cues, resulting in poor generalization on position-insensitive data, in which the cause clause is not in proximity to the emotion clause.
According to Figure 1, we can observe a position bias problem in ECPE. For about 85% of emotion-cause pairs, the relative distance between the emotion clause and its corresponding cause clause is less than 2, meaning that most cause clauses either appear immediately preceding/following their corresponding emotion clauses or are the emotion clauses themselves. Existing methods mainly focus on the position-sensitive data (the majority) and neglect the position-insensitive data (the minority). How to improve performance on both parts of the data, instead of focusing on only one of them, has become an intractable challenge.
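The distance statistics behind Figure 1 can be reproduced with a few lines of Python. This is a minimal sketch; the input format (a list of emotion/cause clause-index pairs) and the bucket names are our own illustrative choices.

```python
# Sketch: computing the relative-distance distribution of emotion-cause
# pairs, as visualized in Figure 1. Input format is a toy assumption.
from collections import Counter

def distance_distribution(pairs):
    """pairs: list of (emotion_clause_index, cause_clause_index) tuples.
    Returns the fraction of pairs falling into each distance bucket."""
    buckets = Counter()
    for emo_idx, cau_idx in pairs:
        dist = abs(emo_idx - cau_idx)
        buckets[f"Dist{dist}" if dist <= 2 else "Dist>2"] += 1
    total = sum(buckets.values())
    return {k: v / total for k, v in buckets.items()}
```

Running this over a corpus of annotated pairs directly yields the imbalance the paper describes.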
Some proposed methods (Xia and Ding, 2019; Chen et al., 2020a) that do not use relative position information seem position-insensitive, but they overlook the effective semantic connections between distant clauses that convey causal cues. Thus, they cannot alleviate the position bias problem.
To alleviate this problem, we propose a Multi-Granularity Semantic Aware Graph model (MGSAG). We assume that fine-grained semantic features conveyed by global keywords in a document are conducive to exploring causal cues, especially cues implied in distant clauses. Besides, coarse-grained semantics between clauses is also important for finding causal relations implied in the context. From these two perspectives, we realize multi-granularity semantic enhanced clause-relationship modeling based on two graphs, a clause-keyword bipartite graph and a fully connected clause graph, utilizing fine-grained and coarse-grained semantic features jointly. Experimental results show that MGSAG outperforms all state-of-the-art baselines; in particular, it achieves a significant improvement on position-insensitive test data. In summary, our contributions are three-fold:

• To alleviate the position bias problem in ECPE, we propose MGSAG to achieve fine-grained and coarse-grained semantic enhanced clause representation learning.
• To assess model performance on emotion-cause clause pairs consisting of distant clauses, we split the original test set into two parts according to the relative distance between emotion clauses and cause clauses, and evaluate models on both.
• Experimental results show that our model achieves remarkable improvement over the best-performing approaches on the original test set. In particular, it outperforms other methods on position-insensitive data.

Related Work
According to whether relative position information is used explicitly, existing ECPE works can be divided into two categories: position-sensitive approaches and position-insensitive approaches.
Position-Sensitive Approaches. Most methods (Ding et al., 2020a; Cheng et al., 2020; Ding et al., 2020b) set a fixed-size window to reduce the number of candidate pairs, exploiting the inherent position bias in the dataset that arises from the sparsity of true emotion-cause pairs among candidate pairs. Besides, Chen et al. (2020b) leveraged relative position information explicitly in the process of pair representation learning. The ECPE-MLL model proposed by Ding et al. (2020b) is the state-of-the-art method on the ECPE task. An over-reliance on relative position information gives these methods poor generalization ability on position-insensitive data.

Position-Insensitive Approaches. Some sequence-based methods without relative position information (Xia and Ding, 2019; Chen et al., 2020a; Fan et al., 2020) seem position-insensitive. Xia and Ding (2019) proposed an RNN-based framework and generated candidate pairs by applying the Cartesian product. Chen et al. (2020a) reformulated the ECPE task as a unified sequence labeling problem. Fan et al. (2020) modeled the extraction of emotion-cause pairs as a sequence of transitions and actions. However, these methods show poor performance on position-insensitive data due to their neglect of effective semantic connections between distant clauses. Different from the above methods, our model incorporates fine-grained and coarse-grained semantic features jointly, which alleviates the position bias problem well.

Problem Formulation
Given a document D = {c_1, c_2, ..., c_{|D|}}, where |D| is the number of clauses, the clauses are formed into |D| × |D| candidate emotion-cause pairs using the Cartesian product:

P = {(c_i^e, c_j^c) | 1 ≤ i, j ≤ |D|},

where c_i^e is clause c_i serving as a candidate emotion clause and c_j^c is clause c_j serving as a candidate cause clause. The ECPE task is to assign a binary label to each candidate pair (c_i^e, c_j^c), where "1" means that clause c_i is an emotion clause and clause c_j provides its cause, and "0" otherwise.
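As a toy illustration, candidate-pair generation via the Cartesian product can be sketched as follows; the function name and the 1-based clause indexing are our own choices, not from the paper.

```python
# Sketch: enumerating all |D| x |D| candidate (emotion, cause) pairs.
from itertools import product

def candidate_pairs(num_clauses):
    """Return every (emotion_index, cause_index) candidate pair
    for a document with num_clauses clauses (1-based indices)."""
    return list(product(range(1, num_clauses + 1), repeat=2))
```

For a typical document this set is quadratic in |D|, while usually only one pair is a true emotion-cause pair, which is exactly the search-space problem the paper discusses.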

Methodology
We propose a multi-granularity semantic aware graph model to alleviate the position bias problem in ECPE. More concretely, we obtain fine-grained semantic aware clause representations based on a clause-keyword bipartite graph. Simultaneously, coarse-grained semantic aware clause representations are generated based on a fully connected clause graph. As shown in Figure 2, the model consists of four components: 1) document encoding, 2) fine-grained semantic aware graph (FGSAG), 3) coarse-grained semantic aware graph (CGSAG), 4) pair classification.

Document Encoding
Given a document D = {c_1, c_2, ..., c_{|D|}} consisting of |D| clauses, we adopt a hierarchical recurrent neural network to encode context information and generate emotion-specific and cause-specific representations for each clause in the document.
Word-Level Encoder. For each clause c_i = {w_1^i, w_2^i, ..., w_{|c_i|}^i}, we first adopt a word-level BiLSTM network to encode the context by passing word information along the clause forwards and backwards, obtaining the hidden states (r_1^i, r_2^i, ..., r_{|c_i|}^i). An attention layer is then adopted to combine them and return a clause state vector:

h_i = Σ_{j=1}^{|c_i|} α_j^i r_j^i,  α_j^i = softmax_j(W_a r_j^i),

where α_j^i is the attention weight of the j-th word in clause c_i and W_a is a trainable weight matrix for attention score calculation.

Clause-Level Encoder. To extract the emotion features and the cause features separately, the clause-level encoder consists of two BiLSTM networks. The document's clause state sequence (h_1, h_2, ..., h_{|D|}) is fed into the two clause-level BiLSTMs to produce emotion-specific and cause-specific clause representations, respectively:

u_i^e = BiLSTM_e(h_i),  u_i^c = BiLSTM_c(h_i),

where BiLSTM_e and BiLSTM_c generate the emotion-specific and cause-specific representations u_i^e, u_i^c ∈ R^{2d_h×1} of clause c_i, and d_h is the number of hidden units in each BiLSTM.
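The word-level attention pooling can be sketched in a few lines of NumPy. This is a minimal illustration: the scoring form (a single learned weight vector producing one score per word) is our reading of the description, and the variable names are our own.

```python
# Sketch of attention pooling over word hidden states to form a clause
# state vector. Parameter shapes are illustrative assumptions.
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attention_pool(R, W_a):
    """R: (num_words, d) word hidden states from the word-level BiLSTM.
    W_a: (d,) attention scoring weights. Returns the clause state h."""
    alpha = softmax(R @ W_a)   # one attention weight per word, summing to 1
    return alpha @ R           # attention-weighted sum of hidden states
```

The returned vector h_i is then consumed by the two clause-level BiLSTMs.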
Figure 3: The influence of the two types of keywords from an intuitive perspective. It shows the proportion of emotion clauses, cause clauses, emotion-cause pairs, and clauses that are covered by the extracted key phrases, the emotion words, or both. "w/ EW", "w/ TW", and "w/ CW" mean using emotion words, key phrases obtained by TextRank, or both of them, respectively.

Afterwards, we use a gate mechanism to fuse the emotion feature u_i^e and the cause feature u_i^c into the clause representation v_i ∈ R^{2d_h×1}:

g_i = σ(W_g [u_i^e; u_i^c] + b_g),  v_i = g_i u_i^e + (1 − g_i) u_i^c,
where W_g ∈ R^{1×2d_h} and b_g are trainable parameters, and σ is the sigmoid function.
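A minimal NumPy sketch of the gate fusion described above; the scalar-gate form is our reading of the gate mechanism, and all names are illustrative.

```python
# Sketch: fusing emotion- and cause-specific clause features with a
# scalar sigmoid gate: g = sigmoid(W_g [u_e; u_c] + b_g),
#                      v = g * u_e + (1 - g) * u_c.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_fusion(u_e, u_c, W_g, b_g):
    """u_e, u_c: (d,) feature vectors; W_g: (2d,) gate weights."""
    g = sigmoid(W_g @ np.concatenate([u_e, u_c]) + b_g)
    return g * u_e + (1.0 - g) * u_c
```

With the gate near 0.5 the fusion is a plain average; training pushes it towards whichever feature is more informative for the clause.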
In the training process, we leverage the emotion labels and cause labels as auxiliary supervision signals to facilitate clause representation learning in the clause-level encoder:

ŷ_i^e = σ(W_e u_i^e + b_e),  ŷ_i^c = σ(W_c u_i^c + b_c),

where W_e, W_c ∈ R^{1×2d_h} are trainable parameters and b_e, b_c are bias terms.

Fine-Grained Semantic Aware Graph
To obtain fine-grained semantic enhanced clause representations, we first leverage external knowledge to extract keywords from the document. Then, we build a clause-keyword bipartite graph to model the relations between clauses. In this way, the keywords, which convey fine-grained semantic features, help highlight the potential causal features contained in the clause representations.

Keywords Acquisition. We use the TextRank algorithm (Mihalcea and Tarau, 2004) to extract key phrases and a sentiment lexicon (Xu et al., 2008) to obtain emotion words in a document. We take the union of the two sets as the final keyword set.
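The keyword-acquisition step can be sketched as a simple set union; the inputs here are toy stand-ins for the real TextRank output and sentiment lexicon.

```python
# Sketch: building the keyword set as the union of TextRank key phrases
# and lexicon emotion words that occur in the document.
def build_keyword_set(textrank_phrases, emotion_lexicon, document_words):
    """Keep lexicon entries that actually occur in the document,
    then take the union with the extracted key phrases."""
    emotion_words = {w for w in document_words if w in emotion_lexicon}
    return set(textrank_phrases) | emotion_words
```

Each resulting keyword becomes one node on the keyword side of the bipartite graph.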
To measure the influence of the two types of keywords intuitively, we count the proportions of emotion clauses, cause clauses, emotion-cause pairs, and clauses that are covered by the emotion words, the key phrases, or both. Note that if the emotion clause and the cause clause comprising a pair both contain a keyword, we consider the pair covered by the keywords. (The sentiment lexicon is available at https://github.com/ZaneMuir/DLUT-Emotionontology.)
From Figure 3 we observe that using the key phrases extracted by TextRank alone covers only about 69% of emotion clauses, and using the emotion words alone identifies only about 54% of cause clauses. With emotion words or key phrases alone, only about 50% or 63% of emotion-cause pairs, respectively, can be covered. Consequently, we take the union of the two sets as the final keyword set. However, given the complete keyword set, clauses that contain keywords account for a large proportion (79%), which means the imported keywords may also introduce noise. It is therefore necessary to measure the importance of different keywords when modeling the interaction between clauses and keywords.

Clause-Keyword Bipartite Graph Construction. Given a document D, we denote the clause-keyword bipartite graph as G_b = (V_b, E_b), where V_b = V_c ∪ V_k is the node set composed of clause nodes and keyword nodes, and E_b denotes the edges between nodes. V_k = {k_1, k_2, ..., k_m} and V_c = {c_1, c_2, ..., c_{|D|}} mean there are m keywords and |D| clauses in document D. We establish an edge between every node in V_c and every node in V_k, i.e., every element e_ij of E_b ∈ R^{|D|×m} is 1. This is because the average clause is very short and many keywords appear only once in a single clause, so an adjacency matrix based on keyword-clause co-occurrence would be extremely sparse.
For keywords in V_k, the node feature vectors are initialized with the word embedding vectors released by Xia and Ding (2019). Clause nodes c_i ∈ V_c are initialized with the corresponding context-aware clause representation v_i generated by the clause-level encoder. We denote the feature matrices of keyword and clause nodes as X_k ∈ R^{m×d_w} and X_c ∈ R^{|D|×d_w}, where d_w is the dimension of the word embedding and equals 2d_h in our setting.

Attention Guided Clause Representation Update. We propose a graph attention module to model the semantic interaction between clauses and keywords, aiming to utilize the fine-grained semantic features implied in keywords to facilitate clause representation learning.
Intuitively, the clause-keyword bipartite graph realizes fine-grained semantic connections between distant clauses, which helps extract emotion-cause pairs composed of distant clauses. Nevertheless, for a specific clause, different keywords have different importance. Therefore, we use the graph attention mechanism (Velickovic et al., 2018) to measure the document-level keyword preference degree of each clause, where the attention weight is computed as the edge weight between clause node c_i and keyword node k_j:

α_ij = softmax_j(w^T [W_1 v_i; W_2 k_j]),

where v_i and k_j are the features of clause c_i and keyword k_j, respectively; [·;·] is the concatenation operation; and W_1, W_2 ∈ R^{d_w×d_w} and w ∈ R^{2d_w×1} are trainable parameters. Then, clause c_i is encoded as the fine-grained semantic enhanced representation v_i^b:

v_i^b = Σ_{j=1}^{m} α_ij ( Σ_{t=1}^{|D|} α_tj W_3 v_t ) + b,

where Σ_{t=1}^{|D|} α_tj W_3 v_t is the representation of keyword k_j, and the outer sum is the weighted sum of keyword representations that generates the fine-grained semantic enhanced clause representation. W_3 ∈ R^{d_w×d_w} is a trainable parameter and b is a bias term.
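Under our reading of this update rule, the bipartite-graph attention step might look like the following minimal NumPy sketch (loop-based for clarity; all parameter shapes, and the absence of an activation, are assumptions, and a real implementation would use a GNN library):

```python
# Sketch: fine-grained semantic aware update over the clause-keyword
# bipartite graph. V: clause features, K: keyword features.
import numpy as np

def row_softmax(S):
    e = np.exp(S - S.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def fgsag_update(V, K, W1, W2, w, W3, b):
    """V: (n, d) clauses, K: (m, d) keywords. Returns (n, d) updates."""
    n, m = V.shape[0], K.shape[0]
    # score_ij = w^T [W1 v_i ; W2 k_j] for every clause-keyword edge
    scores = np.empty((n, m))
    for i in range(n):
        for j in range(m):
            scores[i, j] = np.concatenate([W1 @ V[i], W2 @ K[j]]) @ w
    alpha = row_softmax(scores)      # softmax over keywords per clause
    k_repr = alpha.T @ (V @ W3.T)    # keyword j repr: sum_t alpha_tj W3 v_t
    return alpha @ k_repr + b        # weighted sum of keyword representations
```

Because every clause attends to every keyword, information can flow between arbitrarily distant clauses through shared keywords.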

Coarse-Grained Semantic Aware Graph
Coarse-grained semantic relationships between clauses are useful for finding causal cues implied in the context. We establish a fully connected clause graph and leverage a graph attention mechanism to model the coarse-grained semantic relationships between clauses.
Given a document D, we define the clause graph as G_c = (V_c, E_c), where V_c is the node set and E_c the edge set. Each node in the fully connected graph is a clause of D, and every two nodes are connected by an edge. A self-loop edge is added to every node because a clause can be an emotion clause and a cause clause simultaneously. We use the clause representation v_i generated by the clause-level encoder for node feature initialization. Based on the self-attention mechanism (Vaswani et al., 2017), which aggregates information from neighboring clauses, the graph attention network propagates information among clauses by stacking multiple graph attention layers. The representation of clause c_i in the t-th layer is updated as follows:

v_i^(t) = σ( Σ_{j∈N(i)} v_ij W_1^(t) v_j^(t−1) + b^(t) ),

where W_1^(t) ∈ R^{d_w×d_w} is a transform matrix and b^(t) is a bias term; N(i) denotes the neighboring clauses of c_i (including c_i itself); and the attention weight v_ij is learned as follows:

v_ij = softmax_j( w^T [W_1^(t) v_i^(t−1); W_1^(t) v_j^(t−1)] ).

We stack two graph attention layers and obtain v_i^c = v_i^(2) as the updated representation of c_i.
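A single layer of this clause-graph attention can be sketched similarly; the ReLU activation and the shared attention vector are assumptions of this illustration, not details taken from the paper.

```python
# Sketch: one graph-attention layer over the fully connected clause
# graph (self-loops included, since every node attends to every node).
import numpy as np

def row_softmax(S):
    e = np.exp(S - S.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def gat_layer(V, W, w_att, b):
    """V: (n, d) clause features; W: (d, d); w_att: (2d,). Returns (n, d)."""
    H = V @ W.T                                   # transformed features
    n = V.shape[0]
    scores = np.empty((n, n))
    for i in range(n):
        for j in range(n):
            scores[i, j] = np.concatenate([H[i], H[j]]) @ w_att
    alpha = row_softmax(scores)                   # attention over all clauses
    return np.maximum(alpha @ H + b, 0.0)         # ReLU aggregation
```

Stacking two such layers, as the paper does, lets each clause aggregate context from the whole document in two hops.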

Pair Classification
We concatenate the two types of clause representations and obtain v̂_i = [v_i^b; v_i^c] as the final representation of clause c_i.

Emotion-Cause Pair Extraction. For a candidate pair (c_i^e, c_j^c) ∈ P, we pass its representation v_ij^p = [v̂_i; v̂_j] to a fully connected layer with a softmax activation function to predict its label:

p̂_ij = softmax(W_p^T v_ij^p + b_p),

where W_p ∈ R^{4d_w×2} and b_p ∈ R^{2×1} are trainable parameters. We obtain the predicted label ÊC_ij for the candidate pair (c_i^e, c_j^c) according to the probability distribution p̂_ij.
During model training, we use two cross-entropy losses L_emo and L_cau to supervise clause representation learning in the clause-level encoder and a cross-entropy loss L_pair to supervise the final emotion-cause pair prediction. The overall loss L is formulated as:

L = L_pair + L_emo + L_cau.

Emotion Extraction and Cause Extraction. Following Chen et al. (2020b), we implement emotion extraction and cause extraction based on the predictions of all candidate pairs. For emotion extraction, the predicted label Ê_i of clause c_i is obtained as follows:

Ê_i = max_j ÊC_ij,

i.e., clause c_i is predicted as an emotion clause if it appears in at least one predicted emotion-cause pair. For cause extraction, the predicted label Ĉ_i of clause c_i is obtained similarly.
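Deriving the emotion and cause labels from the pair predictions can be sketched in pure Python; the row-max/column-max rule is our reading of the description, and the names are illustrative.

```python
# Sketch: a clause is an emotion clause if any pair prediction in its
# row is positive, and a cause clause if any prediction in its column is.
def extract_from_pairs(pair_labels):
    """pair_labels: (n x n) 0/1 matrix; entry (i, j) = 1 iff clause i is
    predicted as an emotion whose cause is clause j."""
    n = len(pair_labels)
    emotions = [int(any(pair_labels[i][j] for j in range(n))) for i in range(n)]
    causes = [int(any(pair_labels[i][j] for i in range(n))) for j in range(n)]
    return emotions, causes
```

This makes the two sub-task outputs a deterministic function of the main-task predictions, with no extra classifiers.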
Experiments

We conduct a series of experiments to verify the effectiveness of MGSAG.

Dataset and Evaluation Metrics
We use the benchmark dataset released by Xia and Ding (2019) for our experiments. This typical and widely used dataset is constructed from an emotion cause extraction corpus (Gui et al., 2016) that contains 1,945 Chinese documents from SINA city news (http://news.sina.com.cn/society/). To obtain statistically credible results, we adopt the same data-split setting (10-fold cross-validation) as Xia and Ding (2019), repeat the experiments 10 times, and report the average precision (P), recall (R), and F1-score (F1) on the main task, emotion-cause pair extraction (ECPE), and the two sub-tasks, emotion extraction (EE) and cause extraction (CE), following existing works (Xia and Ding, 2019; Ding et al., 2020b,a; Chen et al., 2020a,b; Cheng et al., 2020).

Redistricting of Original Test Set
As ECPE is a newly proposed task, there is only one typical and widely used dataset. Because of the inherent position bias in ECPE, improving performance on both position-sensitive (majority) and position-insensitive (minority) data is one of the key challenges. It is therefore essential to measure the reliance of existing methods on relative position information.
To this end, we split the original test set (Test_all) of each fold into two parts according to the relative distance between emotions and causes. The first part (Test_Bias) contains documents with only one pair whose relative distance between the two clauses is less than 2. The second part (Test_NoBias) is the complement of the first part, i.e., Test_all = Test_Bias ∪ Test_NoBias and Test_Bias ∩ Test_NoBias = ∅. We conduct experiments on the original test set first, and then use Test_Bias and Test_NoBias to evaluate the various methods. To ensure fairness, we use the same model parameters that produce the results on Test_all to obtain the results on the two subsets.
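The redistricting step can be sketched as follows; the document representation (a list of (emotion index, cause index) pairs per document) is a toy assumption.

```python
# Sketch: splitting a test fold into Test_Bias (single pair, relative
# distance < 2) and Test_NoBias (everything else).
def split_test_set(documents):
    """documents: list of documents, each a list of (emo_idx, cau_idx) pairs."""
    bias, no_bias = [], []
    for doc in documents:
        if len(doc) == 1 and abs(doc[0][0] - doc[0][1]) < 2:
            bias.append(doc)
        else:
            no_bias.append(doc)
    return bias, no_bias
```

By construction the two subsets partition the fold, matching the set identities stated above.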

Comparative Approaches
We compare MGSAG with the following methods, which can be divided into two types: position-insensitive and position-sensitive methods.

Position-Insensitive Methods. The following methods do not utilize relative position information explicitly. Indep / Inter-CE / Inter-EC (Xia and Ding, 2019): these two-step approaches first extract emotions and causes separately to form candidate emotion-cause pairs and then train a classifier to recognize true pairs. IE-CNN (Chen et al., 2020a) reformulates the ECPE task as a sequence labeling task and extracts pairs in an end-to-end fashion.

Position-Sensitive Methods. The following methods take relative position information as a crucial feature for recognizing pairs. PairGCN (Chen et al., 2020b) depends highly on position information when modeling relations between pairs. ECPE-2D (Ding et al., 2020a) extracts pairs through 2D representation, interaction, and prediction; its window-constrained 2D Transformer achieves the best performance. SLSN-U (Cheng et al., 2020) extracts pairs through a local search process defined by a local context window. RankCP (Wei et al., 2020) utilizes kernel-based relative position embeddings to enhance the clause representations obtained from its inter-clause modeling module. ECPE-MLL (Ding et al., 2020b) uses a multi-label learning method inside each manually defined sliding window.

Implementation Details
To conduct a fair comparison with the baselines, we utilize the same word embeddings as Xia and Ding (2019). The dimension of the word embeddings is 200. The numbers of hidden units of the BiLSTMs in the word-level and clause-level encoders are set to 200 and 100, respectively. We stack two graph attention layers to build the graph attention network and apply dropout (Srivastava et al., 2014) with a rate of 0.1 to each layer to reduce overfitting. During training, we use the Adam optimizer (Kingma and Ba, 2015) to update all parameters. We report the results with BERT (Devlin et al., 2019) in the appendix.

Results on the Original Test Set

Table 1 reports the comparative results on emotion-cause pair extraction and the two sub-tasks. We observe that position-sensitive models perform better than position-insensitive models on average, indicating the effectiveness of using relative position information. Our method MGSAG does not utilize relative position information, aiming instead to alleviate the position bias problem in ECPE. In spite of this, MGSAG still outperforms the existing state-of-the-art methods. In particular, MGSAG achieves the best F1 on the main task, emotion-cause pair extraction: its F1 score on ECPE is 1.06% higher than that of ECPE-MLL, which indicates the effectiveness of capturing multi-granularity semantic relations between clauses. Between the two sub-tasks, MGSAG's advantage over the baselines is larger on cause extraction than on emotion extraction. This indicates that the effective clause representation learning in MGSAG is beneficial for extracting cause clauses and further facilitates the extraction of emotion-cause pairs.

Results on Test_Bias and Test_NoBias
To evaluate whether MGSAG is vulnerable when the cause is not in proximity to the emotion, we evaluate it on the two subsets described above. Table 2 shows the results on Test_Bias and Test_NoBias. Note that we use the same parameters that yield the best results on the original test set (Table 1) to evaluate the models on the two subsets (Test_Bias and Test_NoBias).
From Table 2 we observe a significant gap (34-41%) between the results on Test_Bias and Test_NoBias for all methods. One reason is the data imbalance between Test_Bias and Test_NoBias: the proportion of position-insensitive data is very small. More importantly, most of the methods exploit relative position information explicitly or implicitly, leading to poor performance on Test_NoBias.
However, MGSAG outperforms the existing state-of-the-art baselines on both subsets (Test_Bias and Test_NoBias), demonstrating its generalization ability on both position-sensitive and position-insensitive data. Specifically, the F1 score of MGSAG on Test_NoBias is 3.13% higher than that of ECPE-MLL. These results verify the effectiveness of capturing causal relations between clauses via multi-granularity semantics encoding.

Discussions
We conduct ablation studies to analyze the effects of different components and settings in our method MGSAG.

Influence of Different Components
As shown in Table 3, we remove FGSAG, CGSAG, and both of them, respectively, to verify the effectiveness of the two proposed graphs with semantics of different granularity.

Effect of Fine-Grained Semantic Aware Graph. We remove FGSAG to verify the effect of fine-grained semantic enhanced relations. Table 3 shows that removing FGSAG results in significant performance degradation, indicating that it is indeed useful for pair prediction. In particular, the F1 on Test_NoBias decreases by 4.07% without FGSAG, demonstrating its effectiveness in alleviating position bias.

Effect of Coarse-Grained Semantic Aware Graph. We remove CGSAG, which models coarse-grained semantic enhanced relations, to verify its effect. Table 3 shows that the model without CGSAG suffers a clear drop (2.74%/3.17%) on Test_NoBias and Test_all, but a limited drop (0.76%) on Test_Bias. This shows that modeling the coarse-grained semantic relations between clauses can alleviate position bias as well.

Effect of Semantic Aware Graph Model. We further evaluate the effect of the dual graph-based modules by removing FGSAG and CGSAG simultaneously. As shown in Table 3, the model without the two graphs performs worse than the model without either one of them. The significant decline of the F1 score on all test sets verifies that the fine-grained and coarse-grained semantics are complementary to each other; it is therefore necessary to take both of them into account.

Influence of Two-Level Supervision
We use two-level supervision signals to train MGSAG. A low-level signal L_emo + L_cau supervises clause representation learning at the clause-level encoder, and a high-level signal L_pair supervises pair representation learning at the classification stage. To evaluate the effectiveness of the low-level supervision, we train the model with L_pair only; the results are shown in Table 4. Training with low-level supervision brings an improvement mainly in precision, which indicates that the low-level supervision helps learn more accurate emotion-specific and cause-specific features and eventually improves emotion-cause pair extraction.

Table 5: Comparative F1 results on Test_Bias, Test_NoBias, and Test_all of our variant models, focusing on EC Pair Ext. "w/ RW" means using random embeddings for keyword feature initialization. "w/o EW" and "w/o TW" mean removing emotion words and key phrases obtained by TextRank, respectively.

Influence of Different Keyword Settings
As shown in Table 5, we use different keyword settings to verify the effectiveness of our proposed keyword set, which is the union of emotion words obtained from a sentiment lexicon (Xu et al., 2008) and key phrases obtained by TextRank (Mihalcea and Tarau, 2004). Removing either of them results in a performance decline on all test sets, proving that it is necessary to take both into account. Moreover, replacing the keyword features with randomly initialized embeddings causes a significant drop on Test_NoBias, indicating that the fine-grained semantics implied in the keywords indeed helps alleviate the position bias problem.

Case Study
As shown in Figure 4, the distance between the emotion clause c_11 and the cause clause c_4 is 7.
Although the cause clause c_4 does not contain any keywords, the global keywords in the document convey crucial fine-grained semantics, helping MGSAG extract (c_11, c_4) correctly.

Conclusion and Future Work
In this paper, we propose MGSAG to alleviate the position bias problem in the ECPE task. Our approach implements clause representation learning via fine-grained semantics introduced by keywords and coarse-grained semantics among clauses. Experimental results show that MGSAG surpasses the state-of-the-art baselines and significantly outperforms other methods on position-insensitive data. In the future, we would like to tackle the problem of imbalanced data by reducing non-emotion-cause pairs, based on a position-insensitive approach.

A Experimental Results with BERT

We implement MGSAG with pre-trained BERT (Devlin et al., 2019) to explore the effect of a pre-trained language model, using the base Chinese model. We replace the word-level encoder with the [CLS] embedding of each clause obtained by BERT. During training, we use the Adam optimizer (Kingma and Ba, 2015) to update all parameters; the mini-batch size with BERT is set to 2 and the learning rate to 1e-5. Results on Test_Bias and Test_NoBias with and without BERT are shown in Table 6, and results on the original test set with and without BERT are shown in Table 7.

As shown in Table 7, the methods with BERT perform better than those without BERT on the original test set, which shows the effectiveness of utilizing pre-trained BERT. As shown in Table 6, the results of the models with BERT on Test_Bias and Test_NoBias indicate that using BERT as the encoder cannot make up for the deficiency caused by position bias. MGSAG still outperforms the other methods on Test_all and Test_NoBias. These results verify the effectiveness of capturing the causal semantic relations between clauses via fine-grained and coarse-grained semantics encoding.