Position Bias Mitigation: A Knowledge-Aware Graph Model for Emotion Cause Extraction

The Emotion Cause Extraction (ECE) task aims to identify the clauses which contain emotion-evoking information for a particular emotion expressed in text. We observe that a widely-used ECE dataset exhibits a strong position bias: the majority of annotated cause clauses are either directly before their associated emotion clauses or are the emotion clauses themselves. Existing ECE models tend to exploit such relative position information and suffer from this dataset bias. To investigate the degree of reliance of existing ECE models on clause relative positions, we propose a novel strategy to generate adversarial examples in which the relative position information is no longer an indicative feature of cause clauses. We test the performance of existing models on such adversarial examples and observe a significant performance drop. To address the dataset bias, we propose a novel graph-based method which explicitly models emotion triggering paths, leveraging commonsense knowledge to enhance the semantic dependencies between a candidate clause and an emotion clause. Experimental results show that our proposed approach performs on par with existing state-of-the-art methods on the original ECE dataset, and is more robust against adversarial attacks than existing models.


Introduction
Beyond detecting sentiment polarity from text, recent years have seen a surge of research activity on identifying the causes of emotions expressed in text (Gui et al., 2017; Cheng et al., 2017a; Rashkin et al., 2018; Xia and Ding, 2019; Kim and Klinger, 2018; Oberländer and Klinger, 2020). In a typical dataset for Emotion Cause Extraction (ECE) (Gui et al., 2017), a document consists of multiple clauses, one of which is the emotion clause annotated with a pre-defined emotion class label. In addition, one or more clauses are annotated as the cause clause(s) expressing the triggering factors leading to the emotion in the emotion clause. An emotion cause extraction model trained on the dataset is expected to classify a given clause as a cause clause or not, given the emotion clause. In the ECE dataset (Gui et al., 2016), nearly 87% of cause clauses are located near the emotion clause (about 55% immediately precede the emotion clause, 24% are the emotion clauses themselves, and over 7% immediately follow it).
However, due to the difficulty of data collection, the ECE datasets were typically constructed by using emotion words as queries to retrieve relevant contexts as candidates for emotion cause annotation, which might lead to a strong positional bias (Ding and Kejriwal, 2020). Figure 1 depicts the distribution of positions of cause clauses relative to the emotion clause in the ECE dataset (Gui et al., 2016). Most cause clauses are either immediately preceding their corresponding emotion clauses or are the emotion clauses themselves. Existing ECE models tend to exploit such relative position information and have achieved good results on emotion cause detection. For example, the Relative Position Augmented with Dynamic Global Labels model (PAE-DGL) (Ding et al., 2019), the RNN-Transformer Hierarchical Network (RTHN) (Xia et al., 2019) and the Multi-Attention-based Neural Network (MANN) (Li et al., 2019) all concatenate relative position embeddings with clause semantic embeddings to form the clause representations.
We argue that models utilising clause relative positions inherently suffer from the dataset bias, and therefore may not generalise well to unseen data in which the cause clause is not in proximity to the emotion clause. For example, in a recently released emotion cause dataset, only 25-27% of cause clauses are located immediately before the emotion clause (Poria et al., 2020). To investigate the degree of reliance of existing ECE models on clause relative positions, we propose a novel strategy to generate adversarial examples in which the relative position information is no longer an indicative feature of cause clauses. We test the performance of existing models on such adversarial examples and observe a significant performance drop.
To alleviate the position bias problem, we propose to leverage commonsense knowledge to enhance the semantic dependencies between a candidate clause and the emotion clause. More concretely, we build a clause graph whose node features are initialised by the clause representations, and which has two types of edges, i.e., the Sequence-Edge (S-Edge) and the Knowledge-Edge (K-Edge). An S-Edge links two consecutive clauses to capture clause neighbourhood information, while a K-Edge links a candidate clause with the emotion clause if there exists a knowledge path between them extracted from ConceptNet (Speer et al., 2017). We extend Relation-GCNs (Schlichtkrull et al., 2018) to update the graph nodes by gathering information encoded in the two types of edges. Finally, the cause clause is detected by performing node (i.e., clause) classification on the clause graph. In summary, our contributions are three-fold: (1) we identify and quantify the positional bias in a widely-used ECE dataset; (2) we propose a novel strategy to generate adversarial examples in which relative position is no longer an indicative feature of cause clauses; and (3) we propose a knowledge-aware graph model that performs on par with state-of-the-art methods on the original dataset while being more robust against adversarial attacks.

Related Work

Early studies on emotion cause extraction employed rule-based methods (et al., 2010) or incorporated commonsense knowledge bases (Gao et al., 2015). Machine learning methods leveraged text features (Gui et al., 2017) and combined them with a multi-kernel Support Vector Machine (SVM) (Xu et al., 2017). More recent works developed neural architectures to generate effective semantic features (Cheng et al., 2017b; Ding et al., 2019; Xia et al., 2019; Li et al., 2019). The Relative Position Augmented with Dynamic Global Labels model (PAE-DGL) (Ding et al., 2019) reordered clauses based on their distances from the target emotion clause, and propagated the information of surrounding clauses to the others. Xu et al.
(2019) used emotion-dependent and emotion-independent features to rank clauses and identify the cause. The RNN-Transformer Hierarchical Network (RTHN) (Xia et al., 2019) encodes clauses hierarchically with an RNN and a Transformer, concatenating relative position embeddings with the clause semantic embeddings.

Methodology

We propose a Knowledge-Aware Graph (KAG) model, as shown in Figure 2, which incorporates knowledge paths extracted from ConceptNet for emotion cause extraction. More concretely, for each document, a graph is first constructed by representing each clause in the document as a node. An edge linking two nodes captures the sequential relation between neighbouring clauses (called the Sequence-Edge, or S-Edge). In addition, to better capture the semantic relation between a candidate clause and the emotion clause, we identify keywords in the candidate clause which can reach the annotated emotion class label by following knowledge paths in ConceptNet. The extracted knowledge paths are used to enrich the relationship between the candidate clause and the emotion clause and are inserted into the clause graph as Knowledge-Edges, or K-Edges. We argue that by adding the K-Edges, we can better model the semantic relations between a candidate clause and the emotion clause, regardless of their relative positional distance.
In what follows, we first describe how to extract knowledge paths from ConceptNet, then present how the knowledge paths are incorporated into context modelling, and finally discuss the use of a Graph Convolutional Network (GCN) for learning node (i.e., clause) representations and predicting the cause clause based on the learned node representations.

Knowledge Path Extraction from ConceptNet
ConceptNet is a commonsense knowledge graph which represents entities as nodes and the relationships between them as edges. To explore the causal relation between a candidate clause and the emotion clause, we propose to extract cause-related paths linking a word in the candidate clause with the annotated emotion word or the emotion class label, E_w, in the emotion clause. More concretely, for a candidate clause, we first perform word segmentation using the Chinese segmentation tool Jieba, and then extract the top three keywords ranked by TextRank. Based on the finding in (Fan et al., 2019) that sentiment descriptions can be relevant to the emotion cause, we also include adjectives in the keyword set.
We regard each keyword in a candidate clause as a head entity, e_h, and the emotion word or the emotion class label in the emotion clause as the tail entity, e_t. Similar to (Lin et al., 2019), we apply networkx to perform a depth-first search on ConceptNet to identify the paths which start from e_h and end at e_t, and only keep the paths which contain at most two intermediate entities. This is because shorter paths are more likely to offer reliable reasoning evidence (Xiong et al., 2017). Since not all relations in ConceptNet are related to or indicative of causal relations, we further remove the paths which contain any of these four relations: 'antonym', 'distinct from', 'not desires', and 'not capable of'. Finally, we order the paths by length in ascending order and choose the top K paths for each candidate-emotion clause pair.
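The search-and-filter procedure above can be sketched as follows, assuming ConceptNet has been loaded into a `networkx` directed graph with a `rel` attribute on each edge (the function name, the `rel` attribute, and the underscore-normalised relation names are our illustrative choices, not the authors' code):

```python
import networkx as nx

# relations we treat as non-causal (names normalised to underscore style)
EXCLUDED = {"antonym", "distinct_from", "not_desires", "not_capable_of"}

def extract_paths(graph, head, tail, max_intermediate=2, top_k=15):
    """Depth-first search for head->tail paths with at most
    `max_intermediate` intermediate entities, discarding any path that
    uses an excluded relation, then keeping the top_k shortest paths."""
    kept = []
    # cutoff counts edges: head -> (up to max_intermediate nodes) -> tail
    for nodes in nx.all_simple_paths(graph, head, tail, cutoff=max_intermediate + 1):
        rels = [graph[u][v]["rel"] for u, v in zip(nodes, nodes[1:])]
        if EXCLUDED.intersection(rels):
            continue
        kept.append((nodes, rels))
    kept.sort(key=lambda p: len(p[0]))  # shorter paths first
    return kept[:top_k]
```

With K set to 15 as in the appendix, `top_k=15` reproduces the described selection.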
An example is shown in Figure 3. The fifth clause is annotated as the emotion clause and the emotion class label is 'happiness'. For the keyword 'adopted' in the first clause, we show two example paths extracted from ConceptNet, each of which links the word 'adopted' with 'happiness'. One such path is "adopted −related to→ acceptance −has subevent→ make better world −causes→ happiness".

Knowledge-Aware Graph (KAG) Model
As shown in Figure 2, there are four components in our model: a document encoding module, a context-aware path representation learning module, a GCN-based graph representation updating module, and a softmax layer for cause clause classification.
Initial Clause/Document Representation Learning For each clause C_i, we derive its representation C_i by applying a Bi-LSTM over its constituent word vectors, where each word vector w_i ∈ R^d is obtained via an embedding layer. To capture the sequential relationship (S-Edges) between neighbouring clauses in a document, we feed the clause sequence into a Transformer architecture. Similar to the original Transformer, which combines position embeddings with word embeddings, we utilise the clause position information to enrich the clause representation. Here, the position embedding o_i of each clause is concatenated with its representation C_i generated by the Bi-LSTM.
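The pooling and concatenation step of this clause encoder (spelled out in Appendix A.1) can be sketched as follows; the Bi-LSTM itself is omitted, and `W` and `v` are our names for the two attention layers' weights:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def clause_repr(word_vecs, pos_emb, W, v):
    """Sketch of the clause encoder's pooling step (the Bi-LSTM is omitted):
    each word vector is scored by a small two-layer attention MLP, the words
    are aggregated by the resulting weights, and the clause position
    embedding is concatenated onto the pooled clause vector."""
    scores = np.array([v @ np.tanh(W @ w) for w in word_vecs])  # one scalar per word
    alpha = softmax(scores)                                     # attention weights
    c = (alpha[:, None] * word_vecs).sum(axis=0)                # weighted aggregation
    return np.concatenate([c, pos_emb])                         # [clause ; position]
```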
We consider different ways for encoding position embeddings using either relative or absolute clause positions and explore their differences in the experiments section.In addition, we will also show the results without using position embeddings at all.
Since the aim of our task is to identify the cause clause given an emotion clause, we capture the dependencies between each candidate clause and the emotion clause. Therefore, in the document context modelling, we use the emotion clause representation Ĉ_E, generated in the same way as Ĉ_i, as the query vector, and the candidate clause representations Ĉ_i as both the key and value vectors, in order to derive the document representation D ∈ R^d.
Context-Aware Path Representation In Section 3.1, we have chosen a maximum of K paths {p_t}, t = 1..K, linking each candidate clause C_i with the emotion clause. However, not every path correlates equally with the document context. Taking the document shown in Figure 3 as an example, the purple knowledge path is more closely related to the document context than the green path, so we should assign it a higher weight. We propose to use the document representation D obtained above as the query vector, and a knowledge path as both the key and value vectors, in order to calculate the similarity between the knowledge path and the document context. For each pair of a candidate clause C_i and the emotion clause, we then aggregate the K knowledge paths to derive the context-aware path representation s_i ∈ R^d:

s_i = Σ_{t=1}^{K} α_t p_t,   α_t = softmax_t(D · p_t)

where D is the document representation and p_t is the path representation obtained from a Bi-LSTM over the path expressed as an entity-relation word sequence.
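This aggregation admits a minimal sketch (the function name is ours, and any additional scaling inside the attention is omitted):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def context_aware_path_repr(D, paths):
    """Aggregate the K path vectors for one candidate-emotion pair into s_i,
    using the document representation D as the attention query: paths that
    match the document context receive higher weights."""
    P = np.stack(paths)      # (K, d) path representations from the Bi-LSTM
    alpha = softmax(P @ D)   # how well each path matches the document context
    return alpha @ P         # s_i in R^d
```

Paths closely aligned with D (like the purple path in Figure 3) dominate the weighted sum, while unrelated paths are suppressed.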

Update of Clause Representations by GCN
After constructing a clause graph such as the one shown in Figure 2(c), we update the clause (node) representations via the S-Edges and K-Edges. Only clauses with valid knowledge paths to the emotion clause are connected to the emotion clause node.
After initialising each node (clause) in the clause graph with Ĉ_i and the extracted knowledge path with s_i, we update the clause representations using an extended version of GCN, i.e., Relation-GCNs (aka R-GCNs) (Schlichtkrull et al., 2018), which are designed for information aggregation over multiple types of edges:

h_i^(ℓ+1) = σ( Σ_{r∈R} Σ_{j∈N_i^r} W_r^(ℓ) h_j^(ℓ) )

where W_r^(ℓ) h_j^(ℓ) is the linearly transformed information from the neighbouring node j with relation r at the ℓ-th layer, W_r ∈ R^{d×d} is relation-specific, N_i^r is the set of neighbouring nodes of the i-th node under relation r, and R is the set of distinct edge types linking the current node and its neighbouring nodes.
When aggregating neighbouring node information along the K-Edges, we leverage the path representation s_i to measure node importance. This idea is inspired by translation-based graph embedding methods (Bordes et al., 2013). Here, if a clause pair contains a plausible reasoning process described by the K-Edge, then h_E ≈ h_i + s_i should hold; otherwise, h_i + s_i should be far away from the emotion clause representation h_E. Therefore, we measure the importance of graph nodes according to the similarity between (h_i + s_i) and h_E. We use the scaled Dot-Attention to calculate the similarity e_iE and obtain the updated node representation z_i:

e_iE = ((h_i + s_i) · h_E) / √d,   α = softmax(e_E),   z_i = Σ_{j∈N_i^{r_k}} α_iE W_{r_k} h_j

where e_E is {e_iE} for i = 1, …, N−1, d is the dimension of the graph node representations, and N_i^{r_k} is the set of neighbours of node i connected by K-Edges.
Then, we combine the information encoded in the S-Edges with z_i, and perform a non-linear transformation to update the graph node representation:

h_i^(ℓ+1) = σ( z_i + Σ_{j∈N_i^{r_s}} W_{r_s} h_j^(ℓ) )

where N_i^{r_s} is the set of neighbours of node i connected by S-Edges.
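Putting the K-Edge attention and the S-Edge aggregation together, one graph layer can be sketched as below. This is an illustrative simplification, not the exact published equations: layer indices, normalisation, and the SeLU nonlinearity are collapsed into a single `tanh` update, and all names are ours.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def kag_layer(H, s, emo_idx, W_k, W_s):
    """One simplified graph-layer update. K-Edge messages from the emotion
    node are weighted by the scaled dot similarity between the translated
    candidate (h_i + s_i) and h_E; S-Edge messages come from the two
    adjacent clauses in the document."""
    n, d = H.shape
    h_E = H[emo_idx]
    # scaled Dot-Attention between translated candidates and the emotion node
    e = np.array([(H[i] + s[i]) @ h_E for i in range(n)]) / np.sqrt(d)
    alpha = softmax(e)
    H_new = np.zeros_like(H)
    for i in range(n):
        z_i = alpha[i] * (W_k @ h_E)  # K-Edge aggregation from the emotion node
        # S-Edge aggregation over the neighbouring clauses i-1 and i+1
        s_msg = sum((W_s @ H[j] for j in (i - 1, i + 1) if 0 <= j < n), np.zeros(d))
        H_new[i] = np.tanh(z_i + s_msg)
    return H_new
```

Stacking a few such layers (the paper uses 3) lets K-Edge evidence and sequential context propagate jointly.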
Cause Clause Detection Finally, we concatenate the candidate clause node representation h_i and the emotion node representation h_E generated by the graph, and apply a softmax function to yield the predicted class distribution ŷ_i.
Experiments

We conduct a thorough experimental assessment of the proposed approach against several state-of-the-art models.

Dataset and Evaluation Metrics
The evaluation dataset (Gui et al., 2016) consists of 2,105 documents from SINA city news. As the dataset size is not large, we perform 10-fold cross-validation and report results on three standard metrics, i.e., Precision (P), Recall (R), and F1, all evaluated at the clause level.
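Clause-level precision, recall, and F1 can be computed as in the following sketch, where `pred` and `gold` are our names for the sets of clauses predicted and annotated as causes:

```python
def clause_prf(pred, gold):
    """Clause-level precision/recall/F1: `pred` and `gold` are sets of
    (doc_id, clause_id) pairs marked as cause clauses."""
    tp = len(pred & gold)
    p = tp / len(pred) if pred else 0.0
    r = tp / len(gold) if gold else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1
```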
Baselines We compare our model with position-insensitive and position-aware baselines. Among the position-aware baselines, Xu et al. (2019) uses the relative position, word-embedding similarity and topic similarity as emotion-related features to extract causes.

Main Results
Table 1 shows the cause clause classification results on the ECE dataset. The two rule-based methods perform poorly, possibly due to their reliance on pre-defined rules. Multi-Kernel performs better than the vanilla SVM, as it is able to leverage more contextual information. Across the other three groups, the precision scores are higher than the recall scores, probably because of the unbalanced number of cause clauses (18.36%) and non-cause clauses (81.64%), which leads the models to predict a clause as non-cause more often. Models in the position-aware group perform better than those in the other groups, indicating the importance of position information. Our proposed model outperforms all the other models except RTHN, against which its recall score is slightly lower. We have also performed ablation studies by removing the K-Edges, the S-Edges, or both (w/o R-GCNs). The results show that removing the R-GCNs leads to a drop of nearly 4.3% in F1, and that both the K-Edges and S-Edges contribute to emotion cause extraction. As the contextual modelling already incorporates position information, removing the S-Edges leads to a smaller drop than removing the K-Edges.

Impact of Encoding Clause Position Information
In order to examine the impact of clause position information in different models, we replace the relative position information of the candidate clause with absolute positions; in the extreme case, we remove the position information from the models altogether. The results are shown in Figure 4. It can be observed that the best results are achieved using relative positions for all models. Replacing relative positions with either absolute positions or no position information at all results in a significant performance drop. In particular, MANN and PAE-DGL suffer a 50-54% drop in F1. The performance degradation is less severe for RTHN, partly due to its use of the Transformer architecture for context modelling; nevertheless, we observe a decrease in F1 in the range of 20-35%. Our proposed model is the least sensitive to the relative positions of candidate clauses. Its robust performance is partly attributable to (1) hierarchical contextual modelling via the Transformer structure, and (2) the K-Edges, which help explore causal links via commonsense knowledge regardless of a clause's relative position.

Performance under Adversarial Samples
In recent years, there has been growing interest in understanding the vulnerabilities of NLP systems (Goodfellow et al., 2015; Ebrahimi et al., 2017; Wallace et al., 2019; Jin et al., 2020). Adversarial examples expose regions where a model performs poorly, which can help in understanding and improving the model. Our purpose here is to evaluate whether KAG is as vulnerable as existing ECE models when the cause clauses are not in proximity to the emotion clause. We therefore propose a principled way to generate adversarial samples such that the relative position is no longer an indicative feature for the ECE task.

Generation of adversarial examples
We generate adversarial examples to trick ECE models by swapping two clauses, C_{r1} and C_{r2}, where r1 denotes the position of the most likely cause clause and r2 denotes the position of the least likely cause clause. We identify r1 by locating the most likely cause clause based on its relative position with respect to the emotion clause in a document. As illustrated in Figure 1, over half of the cause clauses are immediately before the emotion clause in the dataset. We assume that the relative position of a cause clause can be modelled by a Gaussian distribution and estimate the mean and variance directly from the data, giving {μ, σ²} = {−1, 0.5445}. The position index r1 can then be sampled from this Gaussian distribution; as the sampled value is continuous, we round it to the nearest integer. To locate the least likely cause clause, we choose the value of r2 according to the attention score between a candidate clause and the emotion clause. Our intuition is that if the emotion clause attends to a candidate clause with a low score, then that clause is less likely to be the cause clause. We use an existing emotion cause extraction model to generate contextual representations and use Dot-Attention (Luong et al., 2015) to measure the similarity between each candidate clause and the emotion clause. We then select the index i which gives the lowest attention score and assign it to r2:

r2 = argmin_{1≤i≤N} Attention(Ĉ_i, Ĉ_E)

where Ĉ_i is the representation of the i-th candidate clause, Ĉ_E is the representation of the emotion clause, and N denotes the total number of clauses in a document.
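The two steps can be sketched as follows, assuming the per-clause attention scores from a pretrained discriminator model are given (the function and parameter names are ours):

```python
import numpy as np

def gen_adversarial(clauses, emo_idx, attn_scores, rng, mu=-1.0, var=0.5445):
    """Sketch of the swap-based adversarial generation: r1 is sampled from a
    Gaussian over positions relative to the emotion clause and rounded to
    the nearest integer; r2 is the clause the emotion clause attends to
    least. The two clauses are then swapped."""
    n = len(clauses)
    # sample a relative position, round it, convert to a valid absolute index
    rel = int(round(rng.normal(mu, np.sqrt(var))))
    r1 = min(max(emo_idx + rel, 0), n - 1)
    # least-attended candidate clause (the emotion clause itself is excluded)
    scores = np.array(attn_scores, dtype=float)
    scores[emo_idx] = np.inf
    r2 = int(scores.argmin())
    adv = list(clauses)
    adv[r1], adv[r2] = adv[r2], adv[r1]
    return adv, r1, r2
```

After the swap, the clause at the statistically most likely cause position is usually a clause the discriminator considers irrelevant, so relative position stops being a reliable cue.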
Here, we use existing ECE models as different discriminators to generate different adversarial samples. A desirable adversarial sample will fool the discriminator into predicting the inverse label. We use a leave-one-model-out protocol to evaluate the performance of the ECE models: one model is used as the discriminator for generating adversarial samples, which are subsequently used to evaluate the performance of the other models.

Results
The results are shown in Table 2. The attacked ECE models are trained only on the original dataset; the generated adversarial examples are used as the test set only. We observe a significant performance drop of 23-32% for the existing ECE models, some of which even perform worse than the earlier rule-based methods, showing their sensitivity to the positional bias in the dataset. We also observe performance degradation for our proposed KAG, but its drop is less significant than that of the other models. The results verify the effectiveness of capturing the semantic dependencies between a candidate clause and the emotion clause via contextual and commonsense knowledge encoding.

Case Study and Error Analysis
To understand how KAG aggregates information based on different paths, we randomly choose two examples and visualise the attention distributions over the different graph nodes (i.e., clauses) in Figure 5. These attention weights reflect the 'distance' between a candidate clause and the emotion clause during the reasoning process. The cause clauses are underlined, and keywords are in bold.
The subscript in C−i indicates the clause position relative to the emotion clause (which is denoted C0).

Ex.1
The crime that ten people were killed shocked the whole country (C−4). This was due to personal grievances (C−3). Qiu had arguments with the management staff (C−2), and thought the Taoist temple host had molested his wife (C−1). He became angry (C0), and killed the host and destroyed the temple (C1).
In Ex.1, the emotion word is 'angry'. The knowledge path identified by our model from ConceptNet is "arguments → fight → angry" for clause C−2, and "molest → irritate → exasperate → angry" for clause C−1. Our model assigns the same attention weight to the clauses C−2, C−1 and the emotion clause, as shown in Figure 5, indicating that both paths are equally weighted. Thanks to the K-Edge attention weights, our model correctly identifies both C−2 and C−1 as cause clauses.

Ex.2
The LongBao Primary School is located between two villages (C−2). Some unemployed people always cut through the school to take a shortcut (C−1). Liu Yurong worried that this would affect the children's study (C0). When he did not have teaching duties (C1), he stood guard outside the school gate (C2).
In Ex.2, the path identified by our model from ConceptNet for clause C−1 is "unemployment → situation → trouble/danger → worried". It has been assigned the largest attention weight, as shown in Figure 5 (more cases can be found in the Appendix). Note that the identified path is spurious: in ConceptNet the emotion 'worried' is triggered by 'unemployment', while in the original text 'worried' is caused by the event 'unemployed people cut through the school'. This shows that simply using keywords or entities to search for knowledge paths in commonsense knowledge bases may introduce spurious knowledge. We leave the extraction of event-driven commonsense knowledge as future work.

Conclusion and Future Work
In this paper, we examine the positional bias in the annotated ECE dataset and investigate the degree to which existing ECE models rely on clause position information. We design a novel approach for generating adversarial samples. Moreover, we propose a graph-based model that enhances the semantic dependencies between a candidate clause and a given emotion clause by extracting relevant knowledge paths from ConceptNet. The experimental results show that our proposed method achieves performance comparable to the state-of-the-art methods, and is more robust against adversarial attacks. Our current model extracts knowledge paths linking two keywords identified in two separate clauses.
In the future, we will explore how to incorporate event-level commonsense knowledge to improve the performance of emotion cause extraction.

A Model Architecture
In this section, we describe the details of the four main components in our model: contextual modelling, knowledge path encoding, clause graph update and cause clause classification.
The dataset has 2,105 documents. The maximum number of clauses in a document is 75 and the maximum number of words per clause is 45, so we first pad the input documents into a matrix I of shape [2105, 75, 45].
A.1 Contextual Modelling

a. token → clause. We first apply a 1-layer Bi-LSTM with 100 hidden units to obtain word embeddings w ∈ R^200. We then use two linear transformation layers (with hidden units [200, 200] and [200, 1]) to map each w to a scalar attention score α, and perform a weighted aggregation to generate the clause representation Ĉ_i ∈ R^200.

b. clause → document. We feed the clause representations into a Transformer with 3 stacked blocks, with the number of attention heads set to 5 and the dimension of the key, value, and query all set to 200. The query vector is the emotion clause representation Ĉ_E ∈ R^200; the keys and values are the candidate clause representations, also with 200 dimensions. Finally, the updated clause representations are aggregated via Dot-Attention to generate the document representation D ∈ R^200.

A.2 Knowledge Path Encoding
For each candidate clause and the emotion clause, we extract knowledge paths from ConceptNet and select only K paths. The value of K is set to 15, since the median number of paths between a candidate clause and the emotion clause is 15 in our dataset.
We use the same Bi-LSTM described in Section A.1 to encode each knowledge path, generating K path representations {p_it}, t = 1..K, between the i-th clause and the emotion clause. Then, the document representation D is applied as the query to attend to each path in {p_it} to generate the final context-aware path representation s_i ∈ R^200.

A.3 Clause Graph Update
The graph nodes are initialised with the clause representations, with feature dimension 200. To calculate the attention weights e_iE in the R-GCNs, we use the non-linearly transformed h_i + s_i as the query and the non-linearly transformed h_E as the key and value. The non-linear functions are independent SeLU layers.

A.4 Cause Clause Classification
An MLP with hidden units [400, 1] takes the concatenation of each candidate node representation h_i^L, i = 1..N, and the emotion node representation h_E^L to predict a logit, after which a softmax layer is applied to predict the probability that the clause is a cause clause.

B Training Details for KAG
We randomly split the dataset 9:1 into train/test sets. For each split, we run 50 iterations to obtain the best model on the validation set, which takes on average around 23 minutes per split on an NVIDIA GTX 1080Ti. For each split, we test the model on the test set at the end of each iteration and keep the best resulting F1 for that split. The number of model parameters is 1,133,002.

Hyper-parameter Search
We use grid search to find the best hyper-parameters for our model on the validation data, and report below the values providing the best performance.
• The word embeddings used to initialise the Bi-LSTM are provided by NLPCC. They were pre-trained on 1.1 million Chinese Weibo posts following the Word2Vec algorithm.
The word embedding dimension is set to 200.
• The position embedding dimension is set to 50, randomly initialised with the uniform distribution (-0.1,0.1).
• The number of Transformer blocks is 2 and the number of graph layers is 3.
• To regularise against over-fitting, we employ dropout (0.5 in the encoder, 0.2 in the graph layer).
• The network is trained using the Adam optimiser with a mini-batch size of 64 and a learning rate η = 0.005. The parameters of our model are initialised with Glorot initialisation.

C Error Analysis
We perform error analysis to identify the limitations of the proposed model.In the following examples (Ex.1 and Ex.2), the cause clauses are in bold, our predictions are underlined.
Ex.1 Some kind people said (C−6) that if Wu Xiaoli could find available kidneys (C−5), they would like to donate for her surgery (C−4). A donation of 4,000 RMB had been sent to Xiaoli (C−3), Qiu Hua said (C−2). The child's desire to survive shocked us (C−1). The family's companionship was touching (C0). We wish kind people will be ready to give a helping hand (C1). Help the family in difficulty (C2).
In the first example (Ex.1), our model identifies the keyword 'survival' in C−1 and extracts several paths from 'survival' to 'touching'. However, the main event in clause C−1 concerns desire rather than survival. Our current model detects the emotion reasoning process from ConceptNet based on keywords identified in the text, and thus inevitably introduces spurious knowledge paths into model learning.
Ex.2 I have only one daughter (C0), and a granddaughter of 8 years old (C−10). I would like to convey these memories to her (C−9). Last Spring Festival (C−8), I gave the DVD to my granddaughter (C−7). I hope she can inherit my memories (C−6). Thus (C−5), I feel like my age becomes eternity (C−4). Sun Qing said (C−3). His father is sensitive and has great passion for his life (C−2). He did so (C−1). Making me feel touched (C0). His daughter said (C1).
In Ex.2, our model detected 'passion' as a keyword and extracted knowledge paths between clause C−2 and the emotion clause. However, it ignores the semantic dependency between clause C−1 and the emotion clause. It would therefore be more desirable to consider semantic dependencies or discourse relations between clauses/sentences when extracting emotion reasoning paths from external commonsense knowledge sources.

D Human Evaluation on the Generated Adversarial Samples
The way adversarial examples are generated changes the order of the clauses in the original document. Therefore, we would like to find out whether such clause reordering changes the original semantic meaning and whether these adversarial samples can be evaluated against the same emotion cause labels. We randomly selected 100 adversarial examples and asked two independent annotators to manually annotate the emotion cause clauses following the annotation scheme of the ECE dataset. Compared to the original annotations, Annotator 1 achieved 0.954 agreement with a Cohen's kappa of 0.79, while Annotator 2 achieved 0.938 agreement with a Cohen's kappa of 0.72. This aligns with our intuition that an emotion expressed in text is triggered by a certain event, rather than determined by relative clause positions. A good ECE model should be able to learn the correlation between an event and its associated emotion. This also motivates our proposal of a knowledge-aware model which leverages commonsense knowledge to explicitly capture event-emotion relationships.
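The reported agreement statistic is the standard two-rater Cohen's kappa; a minimal sketch for binary cause/non-cause labels (not the authors' evaluation script) is:

```python
def cohens_kappa(a, b):
    """Cohen's kappa for two binary annotation sequences:
    kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement
    and p_e the agreement expected by chance from the marginal rates."""
    assert len(a) == len(b)
    n = len(a)
    po = sum(x == y for x, y in zip(a, b)) / n       # observed agreement
    pa1 = sum(a) / n                                 # rater A's positive rate
    pb1 = sum(b) / n                                 # rater B's positive rate
    pe = pa1 * pb1 + (1 - pa1) * (1 - pb1)           # chance agreement
    return (po - pe) / (1 - pe) if pe != 1 else 1.0
```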

Figure 1: The distribution of positions of cause clauses relative to their corresponding emotion clauses in the ECE dataset (Gui et al., 2016). Nearly 87% of cause clauses are located near the emotion clause (about 55% immediately precede the emotion clause, 24% are the emotion clauses themselves, and over 7% immediately follow it).

Figure 3: A document consisting of 8 clauses in the ECE dataset with extracted knowledge paths from ConceptNet. Words in red are identified keywords. 'happiness' is the emotion label of the emotion clause C5. For better visualisation, we only display two extracted knowledge paths between 'adopted' and 'happiness' in ConceptNet.

Figure 4: Emotion cause extraction when using relative, absolute, or no clause positional information. Our model demonstrates the most stable performance without the relative position information.
Figure 2: Overview of the proposed model, where two knowledge paths, p1 and p2, are extracted between C1 and the emotion clause C5. (a) Document Encoding. Clauses are fed into a word-level Bi-LSTM and a clause-level Transformer to obtain the clause representations Ĉ_i. The document embedding D is generated by Dot-Attention between the emotion embedding Ĉ_E and the clause embeddings. (b) Path Representations. The extracted knowledge paths are fed into a Bi-LSTM to derive path representations. Multiple paths between a clause pair are aggregated into s_i based on their attention to the document representation D. (c) Clause Graph Update. A clause graph is built with the clause representations Ĉ_i used to initialise the graph nodes. The K-Edge weight e_iE between a candidate clause Ĉ_i and the emotion clause Ĉ_E is measured by their distance along the path representation s_i. (d) Classification. The node representation h_i of a candidate clause C_i is concatenated with the emotion node representation h_E, and then fed to a softmax layer to yield the clause classification result ŷ_i.

Table 1: Results of different models on the ECE dataset. Our model achieves the best Precision and F1 scores.

Table 2: F1 scores and relative drops (marked with ↓) of different ECE models on adversarial samples. The listed four ECE models are attacked by the adversarial samples generated from the respective discriminator. Our model shows the minimal drop rate compared to the other listed ECE models across all sets of adversarial samples.