Bidirectional Hierarchical Attention Networks based on Document-level Context for Emotion Cause Extraction

Emotion cause extraction (ECE) aims to extract the causes behind a certain emotion in text. The ECE task has attracted much attention in recent years, but existing methods neglect two major issues: 1) they pay little attention to the effect of document-level context information on ECE, and 2) they lack sufficient exploration of how to effectively use the annotated emotion clause. For the first issue, we propose a bidirectional hierarchical attention network (BHA) corresponding to the specified candidate cause clause to capture the document-level context in a structured and dynamic manner. For the second issue, we design an emotional filtering module (EF) for each layer of the graph attention network, which calculates a gate score based on the emotion clause to filter out irrelevant information. Combining the BHA and EF, the resulting EF-BHA dynamically aggregates contextual information from two directions while filtering out irrelevant information. Experimental results demonstrate that EF-BHA achieves competitive performance on two public datasets in different languages (Chinese and English). Moreover, we quantify the effect of context on emotion cause extraction and visualize the interactions between candidate cause clauses and their contexts.


Introduction
In recent years, emotion cause extraction (ECE), which aims to identify the causes of a certain emotion in text (Li et al., 2018b), has received increasing attention in academia and industry. Mining the causes of an emotion has a wide range of applications, such as public opinion monitoring and product review mining (Wang et al., 2016; Tang et al., 2016; Ma et al., 2017; Phan and Ogunbona, 2020). The goal of ECE is to find the cause clause (e.g., c_3) that contains the emotion cause for the given emotion clause (e.g., c_4), as presented in Example 1.

Example 1
(c_1) Wu was diagnosed with advanced liver cancer at the beginning of 2014
(c_2) since he began to update his health condition in Microblog
(c_3) If Wu didn't update his microblog for a long time
(c_4) people worried that he may have passed away

We divide the existing works related to ECE into rule-based methods, traditional machine learning algorithms, and deep learning methods. The rule-based methods depend on linguistic rules and common-sense knowledge, which require plenty of manual effort (Russo et al., 2011). The traditional machine learning algorithms generally rely on feature engineering to manually select the features fed to the model (Gui et al., 2016). Recently, a number of works adopted deep neural networks, such as self-attention, co-attention (Li et al., 2018b), and hierarchical attention (Gui et al., 2017), to capture the relations among clauses. Some works utilized multi-task learning (Hu et al., 2020a; Chen et al., 2018) to extract emotion cause clauses.
For Example 1, if the contextual clauses (c_1 and c_2) are ignored, there may be no direct causal relationship between the emotion clause (c_4) and the cause clause (c_3), since not updating one's social media account would not normally cause worry. In fact, context has been utilized in many causal relation tasks to provide semantic information and improve model performance (Kruengkrai et al., 2017; Li and Mao, 2019; Sridhar and Getoor, 2019; Kayesh et al., 2019). Chen et al. (2020) mentioned that the causal relationship between the emotion and cause clauses may only be valid in a specific context. However, few works on emotion cause extraction consider document-level context information. To determine whether a clause is the cause of a certain emotion clause, one actually needs to understand the entire document.

In this paper, we take each clause in the document as a candidate cause clause for the given emotion, and propose a bidirectional hierarchical attention network (BHA) to capture the document-level context of a specific candidate cause clause in a structured and dynamic manner. "Bidirectional" denotes that the document-level context is divided into forward and backward context based on the position of the current candidate cause clause in the document, and "hierarchical" denotes that we use hierarchical attention networks to selectively focus on the context information related to the current candidate cause clause at the word and clause levels. In contrast to hierarchical attention networks (Werlen et al., 2018) and the self-attention model (Vaswani et al., 2017), the BHA dynamically accesses the context from two directions and distinguishes the effects of the forward and backward contexts on the candidate cause clause. Moreover, previous works are generally limited to computing the relative position of the candidate cause clause to the emotion clause when leveraging the emotion clause, ignoring its role in filtering out irrelevant information. We therefore design an emotional filtering module (EF), which computes a gate score for each layer of the graph attention networks (GATs) based on the emotion clause to filter out irrelevant information. Combining the BHA and EF, we propose a bidirectional hierarchical attention network with emotional filtering (EF-BHA) that appropriately encodes contextual features into the clause representation for emotion cause extraction.
The main contributions of this paper are summarized as follows: 1. Different from hierarchical attention networks and the self-attention mechanism, the proposed bidirectional hierarchical attention (BHA) dynamically integrates the forward and backward contexts related to the specified candidate cause clause into the clause representation in a multi-granularity way.
2. We design an emotional filtering module (EF) for each layer of the graph attention networks, which calculates a gate score based on the emotion clause to filter out irrelevant information.
3. Experimental results on two public datasets in different languages (Chinese and English) demonstrate the effectiveness of EF-BHA, and we further visualize the interactions between candidate cause clauses and contexts.
Related Work
The emotion cause extraction (ECE) task was first defined in early work that manually constructed a corpus from the Academia Sinica Balanced Chinese Corpus. Based on this corpus, a multi-label approach using linguistic features as cues was proposed. Russo et al. (2011) proposed an approach that automatically identifies linguistic contexts. However, taking the word as the labeling granularity of ECE brings drawbacks, including incompleteness in meaning and difficulty of analysis. To overcome these shortcomings, Gui et al. (2016) released a clause-level Chinese emotion cause corpus and proposed an event-driven multi-kernel SVM model for this corpus.
Recently, a number of works have adopted deep neural networks to solve the ECE task. Based on the original memory network (Sukhbaatar et al., 2015), Gui et al. (2017) proposed to store the local context of each word in different memory slots to extract the emotion cause clause. Chen et al. (2018) captured the interactions between emotion classification and cause detection in an end-to-end fashion. Li et al. (2018b) built a co-attention relationship between the emotion clause and each candidate clause with emotional context. PAE-DGL viewed ECE as a reordered prediction problem and incorporated dynamic global labels into the model. RTHN first employed the Transformer (Vaswani et al., 2017) to encode global-level information over all the clauses rather than relying solely on the hidden state of one clause. Hu et al. (2020b) proposed a graph convolutional structure with fusion of semantic and structural constraints (FSS-GCN) to automatically learn how to selectively attend to the clauses relevant to emotion cause extraction.
However, we notice that these ECE methods ignore two major issues: 1) they pay little attention to the effects of document-level context on the ECE task, and 2) they lack sufficient exploration of how to effectively use the annotated emotion clause. In this paper, we try to incorporate the document-level context into the ECE task and simultaneously use the emotion clause to filter out irrelevant contextual information.

Task Definition
Given a document d = {c_1, c_2, ..., c_n}, each clause c_i = {w_{i1}, w_{i2}, ..., w_{im}} contains m words, where w_{ij} is the j-th word in clause c_i. Each document contains one emotion clause and one or more corresponding emotion cause clauses, and each clause is annotated with a label ∈ {0, 1}, where label "1" denotes that the clause is a cause clause. We formalize the ECE task as a binary classification problem: the goal is to determine which clauses contain the emotion cause given the emotion clause.
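As an illustration, the task structure can be summarized by the following minimal Python sketch; the class and field names are ours, not part of the datasets' format.

```python
# A minimal sketch of the ECE data structures described above; the
# class and field names are illustrative assumptions, not the authors' code.
from dataclasses import dataclass
from typing import List

@dataclass
class Clause:
    words: List[str]       # w_{i1} ... w_{im}
    is_emotion: bool       # True for the single annotated emotion clause
    label: int             # 1 if the clause is a cause clause, else 0

@dataclass
class Document:
    clauses: List[Clause]  # c_1 ... c_n

def candidate_instances(doc: Document):
    """Yield (emotion clause, candidate cause clause, gold label) triples,
    i.e. one binary classification instance per clause in the document."""
    emotion = next(c for c in doc.clauses if c.is_emotion)
    for cand in doc.clauses:
        yield emotion, cand, cand.label
```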

Overall Architecture
The overall architecture of EF-BHA is shown in Figure 1. It contains a BERT encoder (Bidirectional Encoder Representations from Transformers) (Devlin et al., 2019), a word-level context attention module (WCA), GATs with emotional filtering (EF-GATs), a clause-level context attention module (CCA), and a context aggregation module (CAM). BERT encodes the hidden states of words and clauses (see Section 3.3). The word-level context attention module extracts the words related to the candidate cause clause from the forward and backward contexts and integrates them into high-level context information (see Section 3.4). The GATs with emotional filtering are a modified version of graph attention networks that filter out the information irrelevant to the emotion in each layer while capturing inter-clause dependencies (see Section 3.5). Following the GATs, the clause-level context attention module captures the clauses related to the candidate cause clause and summarizes them into two vectors representing the forward and backward contexts, respectively (Section 3.6). After obtaining the context representations in both directions, the context aggregation module encodes the contextual information into the clause representation (Section 3.7).

BERT Encoder
Following RANKCP (Wei et al., 2020), each clause in the document is processed into a sequence that takes [CLS] as the start token and [SEP] as the end token, and we concatenate these sequences as the input of the BERT encoder, where [CLS] is a special token that aggregates the sequence features as the hidden state of the clause and [SEP] is a dummy token not used for this task. We take the hidden state h_{ij} in the BERT model as the representation of w_{ij} and the hidden state of [CLS] as the vector of the clause.
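A hedged sketch of this input construction with the HuggingFace transformers API is shown below; the per-clause [CLS]/[SEP] concatenation follows the description above, while the variable names and the absence of sequence-length handling are simplifications, not the authors' implementation.

```python
# Sketch: encode one document's clauses as "[CLS] tokens [SEP]" sequences,
# concatenated into a single BERT input; per-clause [CLS] states serve as
# clause vectors. Model/tokenizer names match the paper's BERT-Base setup.
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
bert = BertModel.from_pretrained("bert-base-chinese")

clauses = ["...", "..."]   # the clauses c_1 ... c_n of one document
ids, cls_positions = [], []
for clause in clauses:
    tokens = tokenizer.tokenize(clause)
    cls_positions.append(len(ids))   # index of this clause's [CLS] token
    ids += tokenizer.convert_tokens_to_ids(["[CLS]"] + tokens + ["[SEP]"])

input_ids = torch.tensor([ids])
with torch.no_grad():
    hidden = bert(input_ids).last_hidden_state   # (1, seq_len, d)
word_states = hidden[0]                          # h_{ij} for every token
clause_vecs = hidden[0, cls_positions]           # h_{icls}: one vector per clause
```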

Word-level Context Attention
For the candidate cause clause c_i, its document-level context can be divided into the forward context c_i^a = {c_1, c_2, ..., c_{i-1}} and the backward context c_i^b = {c_{i+1}, c_{i+2}, ..., c_n}. Neither context includes the current candidate cause clause, as the context aggregation module will take it as one of the features. The word-level context attention summarizes information for each clause of the bidirectional context, e.g., clauses c_j ∈ c_i^a and c_r ∈ c_i^b, using the multi-head attention (MultiAttention) proposed by Vaswani et al. (2017) to capture different types of relations between the words in {c_j, c_r} and c_i. The keys and values are the word-level hidden states of the BERT encoder, and the words are aggregated into hidden states x_j and x_r, respectively:

q_i^w = f_w(h_{icls}),
x_j = MultiAttention(q_i^w, {h_{jk}}, {h_{jk}}),
x_r = MultiAttention(q_i^w, {h_{rs}}, {h_{rs}}),

where q_i^w is the word-level query representation and f_w is a linear transformation with randomly initialized parameters. h_{jk} and h_{rs} denote the hidden states of w_{jk} and w_{rs}, respectively. In particular, h_{icls} is the hidden state of the i-th [CLS] token, which is the representation of c_i encoded by the BERT encoder. Similarly, we obtain the representation x_e of the emotion clause from the word-level context attention module.
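To make the module concrete, here is a minimal PyTorch sketch of the word-level context attention under the equations above; the class name, head count, and tensor layout are our assumptions, not the authors' implementation. The clause-level context attention (Section 3.6) follows the same pattern, with clause representations as keys and values.

```python
# Sketch of WCA: a learned query from the clause's [CLS] state attends over
# the word states of one context clause and pools them into a single vector.
import torch
import torch.nn as nn

class WordLevelContextAttention(nn.Module):
    def __init__(self, d_model: int, n_heads: int = 4):
        super().__init__()
        self.f_w = nn.Linear(d_model, d_model)  # q_i^w = f_w(h_{icls})
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, h_icls: torch.Tensor, word_states: torch.Tensor):
        """h_icls: (1, d) [CLS] state of the candidate clause c_i;
        word_states: (num_words, d) hidden states of one context clause.
        Returns x_j (or x_r): one summary vector for that clause."""
        q = self.f_w(h_icls).unsqueeze(0)   # (1, 1, d) query
        kv = word_states.unsqueeze(0)       # (1, num_words, d) keys = values
        x, _ = self.attn(q, kv, kv)
        return x.squeeze(0).squeeze(0)      # (d,)
```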

Graph Attention Network with Emotional Filtering
We employ graph attention networks (GATs) (Velickovic et al., 2018) to capture inter-clause dependencies, in which each clause is viewed as a node in the graph and every pair of nodes is connected by an edge representing their relation.
To retain self-information, each node adds a self-loop edge. A 1-layer GAT encodes only information about immediate neighbors, while a GAT stacking L graph attention layers aggregates L-order neighbor nodes. Let h_i^l denote the representation of clause c_i after l layers of GATs:

α_{ij}^l = f( (v^l)^T tanh( W_a^l h_i^{l-1} + W_b^l h_j^{l-1} ) ),
h_i^l = Σ_{j ∈ N_i} α_{ij}^l ( W^l h_j^{l-1} + b^l ),

where α_{ij}^l is the attention weight of clause c_j to c_i, computed by an MLP parameterized by {v^l, W_a^l, W_b^l}, and f is the Softmax normalization function. N_i denotes the neighbor clauses of c_i, and {W^l, b^l} are learnable parameters. Node c_j's initial hidden state is h_j^1 = x_j, encoded by the word-level context attention.
To ensure that the information transmitted between nodes is related to the annotated emotion, we modify the original GAT by adding an emotional filtering module to each layer to filter out irrelevant information. Let g^{l-1} ∈ R^d denote the gate score of the (l-1)-th layer of GATs, where d is the dimension of the node representation. We apply g^{l-1} to the hidden vector h_j^{l-1} via the element-wise multiplication operation ∘. The aggregation of the node representation h_i^l after emotional filtering is given by:

g^{l-1} = σ( W_g^{l-1} x_e ),
h_i^l = Σ_{j ∈ N_i} α_{ij}^l ( W^l ( g^{l-1} ∘ h_j^{l-1} ) + b^l ),

where x_e is the representation of the emotion clause encoded by the word-level context attention module, σ(·) is the sigmoid function, and W_g^{l-1} ∈ R^{d×d} is a learnable matrix.
Considering that each emotional filtering module may retain different aspects of the node representation, we concatenate the node representations generated by the L layers of GATs as m_i = [h_i^1; h_i^2; ...; h_i^L].
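The following is a hedged PyTorch sketch of a single EF-GAT layer as reconstructed above: an emotion-conditioned gate g^{l-1} = σ(W_g x_e) element-wise scales each neighbor state before aggregation. Class and variable names are illustrative, and the graph is assumed fully connected with self-loops as described in this section.

```python
# Sketch of one EF-GAT layer: emotional filtering gates the neighbor states,
# then an MLP-scored attention aggregates them, following the equations above.
import torch
import torch.nn as nn

class EFGATLayer(nn.Module):
    def __init__(self, d: int):
        super().__init__()
        self.W_g = nn.Linear(d, d)            # gate from the emotion clause x_e
        self.W_a = nn.Linear(d, d, bias=False)
        self.W_b = nn.Linear(d, d, bias=False)
        self.v = nn.Linear(d, 1, bias=False)  # {v^l, W_a^l, W_b^l} of the MLP
        self.W = nn.Linear(d, d)              # {W^l, b^l}

    def forward(self, h: torch.Tensor, x_e: torch.Tensor):
        """h: (n, d) clause nodes of a fully connected graph with self-loops;
        x_e: (d,) emotion clause representation."""
        g = torch.sigmoid(self.W_g(x_e))      # g^{l-1}, shape (d,)
        # alpha_ij = softmax_j( v^T tanh(W_a h_i + W_b h_j) )
        scores = self.v(torch.tanh(self.W_a(h).unsqueeze(1)
                                   + self.W_b(h).unsqueeze(0))).squeeze(-1)
        alpha = torch.softmax(scores, dim=-1)  # (n, n) attention weights
        # h_i^l = sum_j alpha_ij ( W (g ∘ h_j) + b )
        return alpha @ self.W(g * h)
```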

Clause-level Context Attention
Similar to the word-level context attention module, the clause-level context attention summarizes the forward and backward contexts into d_i^a and d_i^b based on the interactions between the candidate cause clause c_i and the contexts:

q_i^s = f_s(m_i),
d_i^a = MultiAttention(q_i^s, {s_j}, {s_j}),
d_i^b = MultiAttention(q_i^s, {s_r}, {s_r}),

where f_s is a linear transformation that produces the clause-level query representation q_i^s. The keys and values are the clause representations based on the WCA. In particular, s_j and s_r denote the representations of c_j ∈ c_i^a and c_r ∈ c_i^b, respectively.

Context Aggregation Module
d_i^a and d_i^b represent the aggregations of the forward and backward contexts, respectively. To selectively incorporate context information into the representation of the candidate cause clause c_i, we use two scores less than 1, λ_a and λ_b, to control how much information from the forward and backward contexts flows into the final clause representation:

s_i^g = m_i + λ_a d_i^a + λ_b d_i^b.

With this aggregation module, the model can weigh the forward and backward contexts to capture the specific contextual information required by the candidate cause clause. We take s_i^g as the final feature for emotion cause prediction:

ŷ_i = f( W_p s_i^g + b_p ),

where f is the Softmax function and {W_p, b_p} are learnable parameters. The model is trained with the standard gradient descent algorithm and the cross-entropy loss:

L = − Σ_{i=1}^{N} p_i log(ŷ_i),

where N is the number of training instances and p_i is the true distribution.
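A sketch of the aggregation and prediction step is given below. Since the text does not fully specify how λ_a and λ_b are computed, the sigmoid gates here are an assumption; only the overall flow (gated context vectors combined with the clause feature, followed by a Softmax classifier trained with cross-entropy) follows the description above. For simplicity the sketch assumes m_i, d_i^a, and d_i^b share the same dimension.

```python
# Sketch of the context aggregation module (CAM) and the prediction head.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ContextAggregation(nn.Module):
    def __init__(self, d: int, num_classes: int = 2):
        super().__init__()
        self.w_a = nn.Linear(2 * d, 1)  # assumed gate over [m_i; d_i^a]
        self.w_b = nn.Linear(2 * d, 1)  # assumed gate over [m_i; d_i^b]
        self.W_p = nn.Linear(d, num_classes)

    def forward(self, m_i, d_a, d_b):
        # lambda_a, lambda_b in (0, 1) control the context flow
        lam_a = torch.sigmoid(self.w_a(torch.cat([m_i, d_a], dim=-1)))
        lam_b = torch.sigmoid(self.w_b(torch.cat([m_i, d_b], dim=-1)))
        s_g = m_i + lam_a * d_a + lam_b * d_b  # final clause feature s_i^g
        return self.W_p(s_g)                   # logits; Softmax gives y-hat

# Training: standard cross-entropy over all clause instances, e.g.
#   loss = F.cross_entropy(logits, labels)
```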

Dataset and Metrics
The proposed model is evaluated on two public datasets: a Chinese benchmark dataset (Chi) (Gui et al., 2016) and an English dataset (Eng) (Gao et al., 2017). The Chi dataset is collected from SINA city news and the Eng dataset from an English novel. Note that each document in both datasets contains exactly one emotion clause and one or more corresponding emotion cause clauses. We adopt the same experimental setting as RTHN, i.e., 10-fold cross-validation with 9 folds as training data and the remaining fold as test data. Table 1 gives the details of the two datasets. We repeat each experiment 20 times, report the average result, and perform a one-sample t-test on the results. We adopt precision (P), recall (R), and F1 score (F1) as evaluation metrics:

P = n_cc / n_pc,  R = n_cc / n_gc,  F1 = 2 · P · R / (P + R),

where n_cc, n_pc, and n_gc denote the numbers of correctly predicted causes, predicted causes, and ground-truth causes, respectively.
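These metrics translate directly into code; the following helper is a straightforward sketch using the notation above.

```python
# Precision, recall, and F1 for ECE, following the definitions above.
def ece_metrics(n_cc: int, n_pc: int, n_gc: int):
    """n_cc: correctly predicted causes; n_pc: predicted causes;
    n_gc: ground-truth causes."""
    p = n_cc / n_pc if n_pc else 0.0
    r = n_cc / n_gc if n_gc else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1
```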

Implementation Details
For the Chi dataset, we implement EF-BHA on top of BERT initialized with BERT-Base, Chinese. We use the AdamW optimizer (Loshchilov and Hutter, 2019) and train for 20 epochs with early stopping. We set the batch size and learning rate to 4 and 1e-5, respectively, and apply a scheduler to adjust the learning rate: the first 10% of all training steps is a linear warm-up phase, followed by a linear decay phase. Furthermore, we set the weight decay for the BERT parameters and the downstream model to 0.01 and 2e-5, respectively. EF-BHA achieves its best performance when the GATs adopt 2 layers with emotional filtering, where the first layer has one attention head and the second layer has four attention heads. For the Eng dataset, we implement EF-BHA on top of BERT initialized with BERT-Base, English. The model achieves its best performance on the Eng dataset with a learning rate of 2e-5, 15 epochs, and a single GAT layer with emotional filtering.
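As an illustration, the optimization setup above can be expressed with the transformers scheduler utilities as follows; the parameter-grouping heuristic (matching submodule names prefixed with "bert") is our assumption, not the authors' code.

```python
# Sketch of the Chi-dataset training setup: AdamW with separate weight decay
# for BERT vs. downstream parameters, plus 10% linear warm-up then decay.
from torch.optim import AdamW
from transformers import get_linear_schedule_with_warmup

def build_optimizer(model, steps_per_epoch: int, num_epochs: int = 20):
    bert_params = [p for n, p in model.named_parameters() if n.startswith("bert")]
    rest_params = [p for n, p in model.named_parameters() if not n.startswith("bert")]
    optimizer = AdamW(
        [{"params": bert_params, "weight_decay": 0.01},
         {"params": rest_params, "weight_decay": 2e-5}],
        lr=1e-5,
    )
    total_steps = num_epochs * steps_per_epoch
    scheduler = get_linear_schedule_with_warmup(
        optimizer,
        num_warmup_steps=int(0.1 * total_steps),  # first 10% of steps warm up
        num_training_steps=total_steps,
    )
    return optimizer, scheduler  # call scheduler.step() after optimizer.step()
```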

Baseline Methods
We compare the proposed EF-BHA with existing methods, which we group as follows: • Rule-based and commonsense-based methods: RB is a traditional rule-based method based on linguistic rules. CB is a commonsense-based method proposed by Russo et al. (2011).
• Machine learning methods: SVM uses common facts together with unigram, bigram, and trigram features to train an SVM classifier (Cortes and Vapnik, 1995).
Multi-kernel uses multi-kernel convolution to learn the relations between the emotion cause clause and events (Gui et al., 2016). Word2vec trains an SVM classifier on word representations pre-trained by Word2vec (Mikolov et al., 2013).
• Deep learning methods: Memnet uses memory networks (Sukhbaatar et al., 2015) to capture the mutual impacts between the emotion word and emotion causes (Gui et al., 2017). ConvMS-Memnet stores relevant contexts in different memory slots with a convolution operation (Gui et al., 2017). CANN builds co-attention interactions between the emotion and each candidate clause (Li et al., 2018b). PAE-DGL treats ECE as a reordered prediction problem to extract emotion causes. RTHN uses the Transformer (Vaswani et al., 2017) to encode global-level information among the clauses. MANN uses a multi-attention mechanism and a CNN layer to extract critical features from the text (Li et al., 2019b). FSS-GCN learns how to selectively focus on the relevant clauses by fusing semantic and structural information (Hu et al., 2020b). RHNN adopts hierarchical attention and knowledge-based regularization to extract emotion causes (Fan et al., 2019).

Main Results
The experimental results on the Chi and Eng datasets are shown in Table 2 and Table 3, respectively. We first focus on Table 2. It can be observed that EF-BHA outperforms most competitive baselines. RB and CB have difficulty achieving a balance between precision and recall. Among the traditional machine learning methods, Word2vec and SVM cannot achieve good results in precision and recall. Recent works such as CANN, RTHN, and FSS-GCN model the relations among clauses to incorporate more information and obtain significant improvements; the F1 score of EF-BHA is at least 1.6% higher than theirs and close to that of the state-of-the-art method RHNN. The improvement is significant, with a p-value less than 0.01 in a one-sample t-test. The reason for the improvement is that EF-BHA dynamically incorporates the document-level context information according to the candidate cause clause. Next, we focus on Table 3. The performance of existing methods on the Eng dataset is generally low. The proposed EF-BHA achieves the best performance on the Eng dataset, outperforming the state-of-the-art method (RHNN) by 1.62% in F1. These results illustrate that EF-BHA better encodes the clause representation by effectively attending to the words and clauses related to the current candidate cause clause in the bidirectional hierarchical context.

Ablation Study
We conduct an ablation study by removing each module, or combinations of modules, to verify the effect of each component of EF-BHA. Experimental results are shown in Table 4. Removing the WCA degrades the F1 score by 1.67%. If we remove only the EF, i.e., the original GATs are used to model the inter-clause dependency, performance is impaired (over a 3% drop in F1 score), which illustrates that the emotional filtering module helps encode better clause representations. When both EF and GATs are removed, the F1 score drops to 73.14%, which may be caused by insufficient modeling of inter-clause relations. Additionally, removing the combination CCA+CAM makes the F1 score decline to 72.18%, and eliminating WCA+CCA+CAM leads to a poor F1 score of 71.44%. These results show that these modules and the features they mine are significant for emotion cause extraction.

Re-evaluating ECE Models
Ding and Kejriwal (2020) pointed out that some existing deep neural networks exploit a bias in the benchmark to achieve better performance, namely the imbalanced distribution of emotion cause locations: most cause clauses appear near the emotion clause. To verify the ability of EF-BHA to understand the actual context, we conduct experiments on the "de-bias" dataset (Ding and Kejriwal, 2020); the results are shown in Table 5.

Table 5: Comparison of the proposed method with existing results implemented in Ding and Kejriwal (2020) on the "de-bias" dataset.

Different from RTHN and PAE-DGL, EF-BHA is a position-insensitive model, which attempts to understand the actual context instead of depending on the dataset bias. Compared with these typical methods, the F1 score of EF-BHA is still higher than those of PAE, PAE-DGL, and RTHN, although the F1 scores on this dataset are relatively low. We speculate that due to the small size of the "de-bias" dataset, overfitting may occur in parametric deep networks based on the BERT model.

Performance against GAT Layers
We vary the number of graph attention layers (from 0 to 4) to verify the effect on EF-BHA and BHA; the results are shown in Figure 2. When the graph attention layers are removed, the performance of EF-BHA drops considerably (73.14% in F1 score). In this case, the F1 score of EF-BHA is the same as that of BHA, since the emotional filtering operates on each graph attention layer. With 2-layer graph attention networks, the proposed EF-BHA achieves the best performance, and more layers lead to deterioration. This finding confirms previous results that stacking more layers in a graph neural network (GNN) can lead to over-smoothing, with the features of graph vertices eventually converging to the same values (Li et al., 2018a, 2019a).

Table 6: Visualization results of word-level context attention module for the candidate cause clause c_4 (clauses c_1–c_12 of Example 2, colored by attention weight).

Case Study
Example 2 (translation):
(c_1) to save the woman as soon as possible
(c_2) the commander worked out a rescue plan
(c_3) the first group laid a life-saving air cushion
(c_4) and evacuated irrelevant people around
(c_5) another group climbed up to the sixth floor
(c_6) persuading the woman in the building
(c_7) in the process of persuasion
(c_8) fire officers and soldiers understood
(c_9) due to the other party's project arrears
(c_10) her family is badly in need of money
(c_11) she lives a stressful life
(c_12) she helplessly intends to jump off a building to commit suicide

To better understand the bidirectional context attention networks, we choose one document (Example 2) from the Chi dataset and visualize the attention weights extracted from the word-level attention module using a sequence labeling toolkit (Yang and Zhang, 2018). In Example 2, c_12 is the emotion clause containing the emotion "helplessly", and {c_9, c_10, c_11} are the corresponding cause clauses.
Here, the attention distribution refers to the importance of words to the current candidate cause clause. Note that the intensity of the color is proportional to the attention weight (dark color means large weight). Table 6 and Table 7 show the visualizations of the attention weights of the bidirectional context under the candidate cause clauses c_4 and c_9, respectively.

Table 7: Visualization results of word-level context attention module for the candidate cause clause c_9 (clauses c_1–c_12 of Example 2, colored by attention weight).

Comparing Table 6 and Table 7, it can be clearly observed that the larger attention weights (the red color blocks) focus on different parts of the text. Specifically, when the candidate cause clause is c_4, the words with larger attention weights are mainly concentrated in the content above c_4, i.e., clauses {c_1, c_2, c_3}, which indicates that the words in these clauses are more related to c_4. When the candidate cause clause is c_9, the words with larger attention weights are mainly concentrated in the content below c_9, i.e., clauses {c_10, c_11, c_12}, which indicates that the words below c_9 are more related to it. Moreover, the words related to clause c_9 are usually negative, such as "stressful", "badly", and "suicide". This finding may reflect that the emotion triggered by c_9 is negative when c_9 is the cause clause, which corresponds to the emotion "helplessly" in c_12. These results illustrate that the proposed BHA can dynamically capture contextual information from two directions and distinguish the effects of the bidirectional context on the current candidate cause clause. Through the visualization of word-level attention weights, we quantify the interaction between the clause and the contexts at the word level.

Conclusions and Future Work
In this work, we propose EF-BHA to model the relations between candidate cause clauses and their contexts. EF-BHA extracts the relevant contextual information according to the candidate cause clause and then summarizes contextual information at different granularities into the clause representation. Moreover, we add an emotional filtering module to each layer of the GATs to filter out irrelevant information. The experimental results on two public datasets demonstrate that the proposed EF-BHA achieves competitive performance compared with existing methods, validating its effectiveness. We further visualize the attention weights extracted by the bidirectional hierarchical context attention modules to show the interactions between candidate cause clauses and contexts. EF-BHA may still introduce some irrelevant information when integrating context. In future work, we will use only the previous |w| clauses and the following |w| clauses of the candidate cause clause as the context to alleviate this problem.