Boundary Detection with BERT for Span-level Emotion Cause Analysis

Emotion cause analysis (ECA) is an emerging topic in natural language processing that aims to identify the reasons behind an emotion expressed in text. Most ECA methods identify the clause that contains the cause of a given emotion, but such clause-level ECA (CECA) can be ambiguous and imprecise. In this paper, we target span-level ECA (SECA), detecting the precise boundaries of the text spans that convey the emotion causes in a given context. We formulate this task as sequence labeling and position identification problems and design two neural methods to solve them. Experiments on two benchmark ECA datasets show that the proposed methods substantially outperform existing ECA models.


Introduction
The task of emotion cause analysis (ECA) (Lee et al., 2010a), which is to extract the causes of an emotion expression from a given context, has gained increasing attention recently.
Most existing studies (Gui et al., 2016, 2017; Li et al., 2018) formulate ECA as a clause-level task (dubbed CECA), which tries to extract the clauses that contain the emotion cause content. In CECA, a clause is typically a text segment separated by punctuation marks (e.g., ',', '.', '?', '!') in the given context. In the following example, the clause [x3] will be extracted as it contains the reason "the risk of infringement" that stimulates the emotion afraid. However, determining the clause containing the stimulus is sub-optimal and inaccurate for ECA. In Example 1, while [x3] is the best among the 5 clauses, its main content "We can't give up . . . assistance" is not the cause of afraid. This gap motivates a strong need for pinpointing more precise, finer-grained cause expressions that convey the specific reasons for an emotion.
Some studies (Gao et al., 2015a,b; Neviarouskaya and Aono, 2013) try to extract emotion cause triples (of the form (noun, verb, noun)) or emotion cause phrases. The extracted triples or phrases are usually not complete emotion cause expressions, since the words in them are typically not contiguous in the text. Bi et al. (2020) proposed a new task of emotion and emotion-cause span pair extraction and classification. Lee et al. (2010a) summarized seven groups of linguistic cues that can serve as indicators of cause events. Ghazi et al. (2015) built a Conditional Random Fields (CRF) learner to extract emotion cause spans with a set of manually engineered features. These two approaches are labor-intensive and prone to sub-optimal feature design.

We study span-level ECA (SECA) with a state-of-the-art neural approach, focusing on detecting the boundaries of cause spans. An emotion cause span is usually a sequence of consecutive tokens that conveys the exact reasons for the emotion and must be inferred from the whole context. We approach boundary detection in two different ways, via sequence tagging and via start/end position identification, both in an end-to-end fashion. More specifically, we obtain the emotion cause span by 1) mapping each token in the input context to a label indicating the range of a span; or 2) pointing directly to the start and end positions of the span in the given context.

The main contributions of our work are three-fold:

• We formulate the SECA task as sequence tagging and position identification problems for boundary detection of emotion cause spans, and propose neural models to tackle this task.
• Different from previous ECA approaches, we introduce a pointer network to identify the start and end positions of emotion causes. To the best of our knowledge, this is the first application of a pointer network to the ECA task.
• Experiments on two benchmark datasets show that our models substantially outperform the state-of-the-art ECA approaches at both span and clause levels.
Related Work

Lee et al. (2010b) first proposed the ECA task and constructed a small-scale Chinese emotion cause corpus; based on this corpus, a rule-based method was proposed for the task. Gui et al. (2014) extended the rule-based method to Chinese microblog text. Gao et al. (2015a,b) treated the emotion cause as a list of triples of the form (noun, verb, noun); they structured a microblog post as a set of such triples to determine which triple is the emotion cause. Yada et al. (2017) defined the emotion cause as the nearest clause describing the event causing the emotion. Gui et al. (2016) constructed a public ECA dataset based on a news corpus and proposed a multi-kernel-based method for CECA. All these methods rely on either manual rules or feature engineering. Gui et al. (2017) first proposed a deep neural model for CECA, and many subsequent CECA methods were developed with deep learning (Gui et al., 2017; Li et al., 2018). Chen et al. (2020a) proposed two variant tasks of ECA for extracting emotion and cause clause pairs, and many deep learning models (Chen et al., 2020b,c; Wei et al., 2020; Singh et al., 2021) were then designed to tackle the emotion cause pair extraction task.
Little work has been done on SECA. Lee et al. (2010a) summarized generalized rules for detecting the causes of emotions in Chinese. Based on manually crafted linguistic cues, Ghazi et al. (2015) built a CRF learner to identify emotion cause spans. However, these models are oversimplified and reliant on ad-hoc feature design, and thus do not generalize well.

Methodology
In this section, we first introduce two formulations of the SECA task. Then we describe the models we designed.

Problem Formulation
Given a context S consisting of a sequence of n tokens S = [w_1, w_2, ..., w_n], an emotion expression E = [e_1, e_2, ..., e_m], and at least one emotion cause span, the SECA task aims to detect the boundaries of the emotion cause spans in S that stimulate the emotion expression.
Firstly, the task can be formulated as a sequence labeling problem, whose goal is to obtain the correct sequence of labels

  Y = [l_1, l_2, ..., l_n],    (1)

where l_i ∈ {B, I, O}, and B, I and O denote the beginning of, inside of and outside of a cause span, respectively, indicating the ranges of spans. Secondly, the task can also be formulated as a position identification problem, whose goal is to obtain the correct list of start-end positions of the emotion cause spans

  P = [(s_1, t_1), (s_2, t_2), ...],    (2)

where s_i and t_i are the start and end token indices of the i-th cause span in the context, respectively.
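As a concrete illustration of how the two formulations relate, the mapping between start-end span positions and BIO label sequences can be sketched as follows (a minimal pure-Python sketch; the function names are ours, not from the paper):

```python
def spans_to_bio(n, spans):
    """Map cause spans [(s, t), ...] (inclusive token indices)
    to a BIO label sequence over n tokens."""
    labels = ["O"] * n
    for s, t in spans:
        labels[s] = "B"                      # span beginning
        for i in range(s + 1, t + 1):
            labels[i] = "I"                  # span interior
    return labels

def bio_to_spans(labels):
    """Recover (start, end) index pairs from a BIO sequence."""
    spans, start = [], None
    for i, l in enumerate(labels):
        if l == "B":                         # a new span starts here
            if start is not None:
                spans.append((start, i - 1))
            start = i
        elif l == "O" and start is not None:  # current span just ended
            spans.append((start, i - 1))
            start = None
    if start is not None:                    # span runs to the last token
        spans.append((start, len(labels) - 1))
    return spans
```

Either representation can be decoded into the other, which is why both formulations target the same span boundaries.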

Model Description
We use BERT (Devlin et al., 2019) as the backbone of our models due to its strong contextualized representation ability. The context S and the emotion expression E are concatenated into a combined sequence fed into BERT:

  [h_0, h_1, ..., h_n] = BERT([CLS] S [SEP] E [SEP]),    (3)

where h_0 (for [CLS]) and h_i ∈ R^{d_b}, and d_b is the vector dimension of the last layer of BERT. Based on the BERT encoding, we introduce the two types of span boundary detection models for SECA. An overview of our models is shown in Figure 1.

Sequence Labeling Models
We use three common ways to predict the labels of an input sequence. The first is direct tag prediction with a softmax function, which is straightforward to compute. The second is a CRF, a well-known statistical graphical model that has demonstrated state-of-the-art accuracy on many sequence labeling tasks (Jin and Yu, 2021). The third is a variant of RNN (Goller and Kuchler, 1996) (e.g., GRU, LSTM) that generates tags sequentially, predicting each label conditioned on its previously predicted labels.
Softmax. Based on the token representations obtained in Equation (3), the probability distribution over the label set is computed as

  p(l_i) = softmax(W_1 h_i + b_1),    (4)

where W_1 ∈ R^{d_b × d_l} and b_1 ∈ R^{d_l} are learnable parameters, p(l_i) ∈ R^{d_l} is the label probability distribution of the i-th token, and d_l is the size of the label set.
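The softmax head in Equation (4) can be sketched as follows. This is a shape-level illustration only: the random matrix H stands in for the BERT token vectors h_i, and the dimensions and initialization are illustrative, not the paper's settings.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)   # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

d_b, d_l, n = 768, 3, 10                    # BERT dim, label set {B, I, O}, #tokens
rng = np.random.default_rng(0)
H = rng.standard_normal((n, d_b))           # stand-in for BERT outputs h_1..h_n
W1 = rng.standard_normal((d_b, d_l)) * 0.01 # learnable projection
b1 = np.zeros(d_l)                          # learnable bias

P = softmax(H @ W1 + b1)                    # p(l_i) for every token, shape (n, d_l)
labels = P.argmax(axis=-1)                  # greedy tag per token
```

Each row of P is a distribution over {B, I, O}; training would minimize the cross-entropy between these rows and the gold labels.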
CRF. The CRF computes the probability of a whole label sequence:

  p(Y | S) = exp(score(S, Y)) / Σ_{Y' ∈ L} exp(score(S, Y')),    (5)

where L is the set of all candidate label sequences of the context. Here, we omit the details of this classical prediction model (see Lafferty et al. (2001)).

GRU. The GRU version of RNN is effective and easy to train (Ma et al., 2019). At each prediction step, we update the hidden state for token w_i using

  h̃_i = GRU(h̃_{i-1}, [h_i; e(l_{i-1})]),    (6)

where h̃_i ∈ R^{d_g}, d_g is the number of GRU units, e(l_{i-1}) ∈ R^{d} is the embedding vector of the previous label l_{i-1}, and d is the size of the label embedding. The probability distribution of l_i is then obtained by

  p(l_i) = softmax(W_2 h̃_i + b_2),    (7)

where W_2 ∈ R^{d_g × d_l} and b_2 ∈ R^{d_l} are learnable parameters.
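The CRF decoding step omitted above is standard Viterbi decoding; a generic textbook implementation (not the paper's code, with placeholder emission and transition scores) looks like this:

```python
import numpy as np

def viterbi(emissions, transitions):
    """Most likely label sequence under a linear-chain CRF.
    emissions: (n, d_l) per-token label scores from the encoder;
    transitions: (d_l, d_l) label-to-label transition scores."""
    n, d_l = emissions.shape
    score = emissions[0].copy()             # best score ending in each label
    back = np.zeros((n, d_l), dtype=int)    # backpointers
    for i in range(1, n):
        # total[k, j]: best path ending in label k at step i-1, then label j
        total = score[:, None] + transitions + emissions[i]
        back[i] = total.argmax(axis=0)
        score = total.max(axis=0)
    # backtrack from the best final label
    path = [int(score.argmax())]
    for i in range(n - 1, 0, -1):
        path.append(int(back[i][path[-1]]))
    return path[::-1]
```

In the SECA setting, the emissions would come from the BERT token representations and the transition matrix would be learned jointly, e.g. penalizing the invalid O→I transition.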

Position Identification Model
The pointer network was first proposed by Vinyals et al. (2015) and suits our position identification task, since it is able to select positions from the input. Here we adopt it to generate the start and end positions of spans in turn. Because the number of cause spans is not fixed, we set a parameter C to control how many spans to generate, and allow the two end points in (s_i, t_i) to take integer values in [0, n]. For the k-th span (s_k, t_k), we first get its start index s_k. An attention mechanism is used to obtain the attention weight α_kj of the j-th token (for all j ∈ [0, n]):

  α_kj = v^T tanh(W_3 h_j + W_4 r_k),    (8)

where W_3 ∈ R^{d_b × d_v}, W_4 ∈ R^{d_r × d_v} and v ∈ R^{d_v} are learnable parameters (d_v is the size of v), and r_k is the hidden vector obtained by a GRU:

  r_k = GRU(r_{k-1}, h_{t_{k-1}}),    (9)

where t_{k-1} is the index of the end position of the (k-1)-th span and r_{k-1} ∈ R^{d_r} (d_r is the number of GRU units) is the hidden vector obtained by the GRU for predicting t_{k-1} (see below). Then, we set s_k to the position with the highest attention weight in α_k = [α_k0, ..., α_kn].
To predict the end index t_k of the k-th span, the same attention mechanism is used to obtain the attention weight α̂_kj:

  α̂_kj = v^T tanh(W_3 h_j + W_4 r̂_k),    (10)

where W_3, W_4 and v are learnable parameters and r̂_k ∈ R^{d_r} is the corresponding GRU hidden vector for predicting t_k. Similarly, the end index t_k is identified by taking the position with the maximum value in α̂_k.
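One pointer step (Equation (8) followed by the argmax) can be sketched as follows. The dimensions and random vectors are purely illustrative stand-ins for the BERT token matrix and the GRU decoder state:

```python
import numpy as np

rng = np.random.default_rng(1)
n, d_b, d_r, d_v = 8, 768, 256, 128
H = rng.standard_normal((n + 1, d_b))       # stand-in for h_0..h_n (index 0 = padding/[CLS])
r_k = rng.standard_normal(d_r)              # stand-in for the GRU state for the k-th span
W3 = rng.standard_normal((d_b, d_v)) * 0.01 # learnable
W4 = rng.standard_normal((d_r, d_v)) * 0.01 # learnable
v = rng.standard_normal(d_v)                # learnable

# alpha_kj = v^T tanh(W3 h_j + W4 r_k) for all j in [0, n]
scores = np.tanh(H @ W3 + r_k @ W4) @ v
s_k = int(scores.argmax())                  # start index = highest-scoring position
```

Predicting the end index t_k repeats the same computation with the updated decoder state, and index 0 serves as the padding target when fewer than C spans exist.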

Experiments and Results
We evaluate the proposed methods on the Chinese ECA (CHI) dataset (Gui et al., 2016) and the English ECA (ENG) dataset (Ghazi et al., 2015). On both datasets, cause spans are manually annotated, and each context contains an emotion expression (or emotion category) and at least one cause span. The CHI dataset contains 2,105 instances and 2,147 cause spans. The ENG dataset contains 820 instances, each with exactly one cause span.
We use the pre-trained BERT-Base Chinese and BERT-Base Uncased models to encode the CHI and ENG datasets, respectively. We also use the pre-trained span-level SpanBERT-Base-cased (no Chinese version is available) as an encoder on the ENG dataset. We follow the default settings of BERT (Devlin et al., 2019) for fine-tuning. The Adam optimizer (Kingma and Ba, 2015) is used with learning rate 1e-5. The number of epochs is set to 10 and the size of the label embedding is 50. C is set to 3 on CHI and 1 on ENG, which are the maximum numbers of cause spans in the respective training sets. We follow the settings of previous works to split the datasets into train/dev/test (Ghazi et al., 2015; Gui et al., 2017), and hyper-parameters are tuned on the dev set. We use BERT (BERT_base) and SpanBERT (BERT_span) as pre-trained models, and Softmax, CRF, GRU and the pointer network (Pointer) as prediction models.

SECA Result
On the CHI dataset, we compare our models with the rule-based model (Lee et al., 2010a) and the word-level ECA model that outputs cause words (Gui et al., 2017), which is somewhat similar to our work. On the ENG dataset, the CRF-based model of Ghazi et al. (2015) is the only span-level baseline. For evaluation, we use span precision (P_s), recall (R_s) and F1 score (F1_s), where

  P_s = (# of correct cause spans) / (# of predicted cause spans)
  R_s = (# of correct cause spans) / (# of gold cause spans)

If the number of emotion cause spans in an instance is less than C, (0, 0) is used to pad the target list.

Table 1 shows that all our models strongly outperform the baselines, indicating that our method is effective. More specifically, on the CHI dataset our models BERT_base+Softmax, BERT_base+GRU, BERT_base+CRF and BERT_base+Pointer outperform the strong baseline model of Gui et al. (2017) by relative F1_s improvements of 34.1%, 33.6%, 38.2% and 36.3%, respectively. The improvement is significant with p-value less than 0.001 in a t-test. With the BERT encoder, Pointer performs best on ENG and second best on CHI, suggesting that the pointer network is generally effective. Pointer is slightly worse than CRF on CHI because the number of spans it outputs is fixed at three, while an instance in CHI may have fewer than three spans, a small disadvantage. But in general, using a pointer network for span boundary detection provides a strong alternative to the classic CRF. It is somewhat surprising that BERT_span shows no advantage over BERT_base here; we conjecture that the small ENG dataset does not allow fine-tuning strong enough to make a big difference. In addition, the performance on ENG is much higher than on CHI because its contexts are generally much shorter, making the task easier.
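The span-level metrics defined above can be computed by exact-match counting, sketched below (a minimal illustration of the definitions, not the authors' evaluation script):

```python
def span_prf(pred, gold):
    """Exact-match span precision/recall/F1.
    pred, gold: per-instance lists of (start, end) spans."""
    # a predicted span is correct only if it exactly matches a gold span
    correct = sum(len(set(p) & set(g)) for p, g in zip(pred, gold))
    n_pred = sum(len(p) for p in pred)
    n_gold = sum(len(g) for g in gold)
    precision = correct / n_pred if n_pred else 0.0
    recall = correct / n_gold if n_gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```

Because a span counts as correct only on an exact boundary match, these scores are strictly harder to obtain than the clause-level metrics used in the next section.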

CECA Result
We can directly output the clauses containing the predicted cause spans and compare with the rule-based SECA model (Lee et al., 2010a) and several strong CECA models (Gui et al., 2017; Li et al., 2018; Fan et al., 2019; Hu et al., 2021). We only use CHI here, since no clause labels are available for ENG and no previous CECA work was done on English data for us to compare with. Following previous works (Gui et al., 2017), we use clause precision (P_c), recall (R_c), and F1 score (F1_c) as evaluation metrics:

  P_c = (# of correct cause clauses) / (# of predicted cause clauses)
  R_c = (# of correct cause clauses) / (# of gold cause clauses)

Table 2 shows that our proposed models outperform all the state-of-the-art CECA baselines. We attribute this to BERT's contextualized representation capacity and to the fact that our SECA models are directly supervised by the finer-grained span-level annotations, which effectively guide the models to learn more precise cause-related information. This advantage is not easily available to the baseline approaches due to the nature of their clause-level supervision. Moreover, simply mapping the predicted spans to clauses for output makes Softmax comparable with CRF and puts Pointer last in terms of CECA performance. This is not surprising, because an inaccurate span may still yield the right clause, namely the clause that contains the predicted span. We also notice that the rule-based model performs worse than all the feature-learning models, because manual rules hardly adapt to different datasets while feature-learning models can learn effective features for each dataset.

Conclusion
In this paper, we aim at span-level emotion cause analysis and propose neural sequence labeling and position identification models to detect the boundaries of emotion cause spans. Experiments conducted on two benchmark datasets in different languages demonstrate the effectiveness of the proposed approach, which achieves new state-of-the-art performance on both span-level and clause-level ECA tasks.