CasEE: A Joint Learning Framework with Cascade Decoding for Overlapping Event Extraction

Event extraction (EE) is a crucial information extraction task that aims to extract event information from texts. Most existing methods assume that events appear in sentences without overlaps, an assumption that does not hold in complicated overlapping scenarios. This work systematically studies the realistic event overlapping problem, where a word may serve as triggers with several types or as arguments with different roles. To tackle this problem, we propose a novel joint learning framework with cascade decoding for overlapping event extraction, termed CasEE. In particular, CasEE sequentially performs type detection, trigger extraction and argument extraction, where the overlapped targets are extracted separately, conditioned on the specific former predictions. All subtasks are jointly learned in one framework to capture the dependencies among them. Evaluation on the public event extraction benchmark FewFC demonstrates that CasEE achieves significant improvements on overlapping event extraction over previous competitive methods.


Introduction
Event Extraction (EE) is an important yet challenging task in natural language understanding. Given a sentence, an event extraction system ought to identify event types, triggers and arguments appearing in the sentence. As an example, Figure 1(b) presents an event mention of type Share Reduction, triggered by "reduced". There are several arguments, such as "Fuda Industry" playing the subject role in the event.
However, events often appear in sentences in complicated ways, where triggers and arguments may overlap. This paper focuses on a challenging and realistic problem in EE: overlapping event extraction. (The source code is available at https://github.com/JiaweiSheng/CasEE.) Generally, we categorize all the overlapping cases into three patterns: 1) A word may serve as triggers with different event types across several events. Figure 1(a) shows that the token "acquired" triggers an Investment event and a Share Transfer event at the same time. 2) A word may serve as arguments with different roles across several events. Figure 1(a) shows that "Shengyue Network" plays an object role in the Investment event and a subject role in the Share Transfer event.
3) A word may serve as arguments playing different roles in one event. Figure 1(b) shows that "Fuda Industry" plays both a subject role and a target role in one event. For simplicity, we refer to pattern 1) as the overlapped trigger problem, and to patterns 2) and 3) as the overlapped argument problem in the following sections. About 13.5% / 21.7% of the sentences in the Chinese financial event extraction dataset FewFC (Zhou et al., 2021) exhibit the overlapped trigger / argument problem, respectively.
Most existing EE studies assume that events appear in sentences without overlaps, and are thus not applicable to the complicated overlapping scenarios. Typically, current EE studies can be roughly categorized into two groups: 1) Traditional joint methods (Nguyen et al., 2016; Nguyen and Nguyen, 2019), which simultaneously extract triggers and arguments with a unified decoder that labels the sentence only once. However, they fail to extract overlapped targets due to label conflicts, where a token may have several typed labels but only one label can be assigned. 2) Pipeline methods (Chen et al., 2015; Du and Cardie, 2020b), which sequentially extract triggers and arguments in separate stages. A recent pipeline attempt tackles the overlapped argument problem, but overlooks the overlapped trigger problem. Moreover, pipeline methods neglect the feature-level dependencies between triggers and arguments, and suffer from error propagation. To our knowledge, existing studies in EE either neglect overlapping problems or focus on only one of them; few simultaneously solve all three overlapping patterns mentioned above.
To address the above issues, we propose CasEE, a joint learning framework with Cascade decoding for overlapping Event Extraction. Specifically, CasEE realizes event extraction with a shared textual encoder and three decoders for type detection, trigger extraction and argument extraction. To extract overlapped targets across events, CasEE sequentially decodes the three subtasks, conducting trigger extraction and argument extraction according to the former predictions. Such a cascade decoding strategy extracts event elements according to the different conditions, so that the overlapped targets can be extracted in separate phases. A condition fusion function is designed to explicitly model the dependencies between adjacent subtasks. All the subtask decoders are jointly learned to further build connections among subtasks, which refines the shared textual encoder with feature-level interactions among downstream subtasks.
The contributions of this paper are three-fold: (1) We systematically investigate the overlapping problems in EE, and categorize them into three patterns. To the best of our knowledge, this paper is among the first to simultaneously tackle all the three overlapping patterns.
(2) We propose CasEE, a novel joint learning framework with cascade decoding, to simultaneously solve all the three overlapping patterns.
(3) We conduct experiments on a public Chinese financial event extraction benchmark, FewFC.
Experimental results reveal that CasEE achieves significant improvements on overlapping event extraction over existing competitive methods.

Related Work
Current EE research can be roughly categorized into two groups: 1) Traditional joint methods (Li et al., 2013; Nguyen et al., 2016; Nguyen and Nguyen, 2019; Sha et al., 2018) that perform trigger extraction and argument extraction simultaneously. They solve the task in a sequence labeling manner, extracting triggers and arguments by tagging the sentence only once. However, these methods fail at overlapping event extraction, since overlapping tokens cause label conflicts when forced to carry more than one label. 2) Pipeline methods (Chen et al., 2015; Wadden et al., 2019; Du and Cardie, 2020b) that perform trigger extraction and argument extraction in separate stages. Though pipeline methods have the potential capacity to solve overlapping EE, they usually lack explicit dependencies between triggers and arguments, and suffer from error propagation. Among these studies, some solve the overlapped argument problem but overlook the overlapped trigger problem, and thus cannot discern the correct triggers for argument extraction. None of the above methods can simultaneously solve all the overlapping patterns in event extraction. The overlapping problem has also been explored in information extraction tasks beyond event extraction. Luo and Zhao (2020) tackle nested named entity recognition with bipartite flat-graph networks. Zeng et al. (2018) tackle overlapped relational triple extraction by applying a sequence-to-sequence paradigm with a copy mechanism. Yu et al. (2020) extract overlapped relational triples with a novel cascade tagging strategy, which inspires us to solve overlapping event extraction in the cascade decoding paradigm; follow-up work further discusses the propagation error in cascade decoding. All these approaches are proposed for other tasks and cannot be directly transferred to overlapping event extraction due to its more complicated task definition.

Our Approach
Given an input sentence, the goal of EE is to identify triggers with their event types and arguments with their corresponding roles, where triggers and arguments may overlap on some tokens. To tackle this problem, we propose a training objective at the event level. Formally, according to the pre-defined event schema, we have an event type set C and an argument role set R. The overall goal is to predict all events in the gold set E_x of the sentence x. We aim to maximize the joint likelihood of the training data D:

∏_{(x, E_x) ∈ D} [ ∏_{c ∈ C_x} p(c | x) · ∏_{t ∈ T_{x,c}} p(t | x, c) · ∏_{a_r ∈ A_{x,c,t}} p(a_r | x, c, t) ],  (1)

where C_x denotes the set of types occurring in x, T_{x,c} denotes the trigger set of type c, and A_{x,c,t} denotes the argument set of type c and trigger t. Note that each c is a type in C, each t is a trigger word, and each a_r ∈ A_{x,c,t} is an argument word corresponding to its own role r ∈ R. Eq. (1) exploits the dependencies among the type, trigger and arguments. It motivates us to learn a type detection decoder p(c|x) to detect event types occurring in the sentence, a trigger extraction decoder p(t|x, c) to extract triggers of type c, and an argument extraction decoder p(a_r|x, c, t) to extract role-specific arguments given type c and trigger t.
Such a task decomposition solves all the event overlapping patterns described in the Introduction. Specifically, we first detect the event types occurring in the sentence. When extracting triggers, we only predict the triggers of a specific type, so triggers overlapped across several events are predicted in separate phases. Similarly, when extracting arguments, we predict the arguments of a specific type and trigger, so arguments overlapped across several events are also predicted in separate phases. Since we adopt role-specific taggers in argument extraction, overlapped arguments playing several roles in one event can be predicted separately by the corresponding taggers. The predictions of type detection, trigger extraction and argument extraction together form the final prediction. Figure 2 demonstrates the details of CasEE. CasEE adopts a shared BERT encoder to capture textual features, and three decoders for type detection, trigger extraction and argument extraction. Since all subtasks are jointly learned, in contrast to previous pipeline methods, CasEE can capture feature-level dependencies among subtasks. For prediction, CasEE sequentially predicts event types, triggers and arguments in the cascade decoding process.
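The cascade decoding order above can be sketched as a small control loop. This is a minimal illustration; the three decoder functions here are hypothetical stand-ins for the learned neural modules:

```python
def cascade_decode(sentence, detect_types, extract_triggers, extract_args):
    """Sequential decoding: types -> triggers (per type) -> arguments
    (per type and trigger), so overlapped targets fall in separate passes."""
    events = []
    for c in detect_types(sentence):              # p(c | x)
        for t in extract_triggers(sentence, c):   # p(t | x, c)
            args = extract_args(sentence, c, t)   # p(a_r | x, c, t)
            events.append({"type": c, "trigger": t, "arguments": args})
    return events
```

Because the inner calls are conditioned on the outer predictions, a token can surface as a trigger under two different types, or as an argument under two different (type, trigger) conditions, without any label conflict.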

BERT Encoder
To capture the feature-level dependencies among subtasks, we share the textual representations of each sentence across all decoders. As BERT has shown performance improvements across multiple NLP tasks, we adopt BERT (Devlin et al., 2019) as our textual encoder. BERT is a bi-directional language representation model based on the Transformer architecture (Vaswani et al., 2017), which generates textual representations conditioned on token context and retains rich textual information. Formally, a sentence with N tokens is denoted as x = w_1, w_2, ..., w_N. We input the tokens into BERT and obtain the hidden states H = h_1, h_2, ..., h_N as the token representations for the downstream subtasks.

Type Detection Decoder
Since we tackle the overlapped trigger problem by extracting triggers conditioned on the type predictions, we devise a type detection decoder to predict event types. Inspired by event detection without triggers (Liu et al., 2019), we adopt an attention mechanism to detect event types, capturing the most relevant context for each possible type. Specifically, we randomly initialize an embedding matrix C ∈ R^{|C|×d} as the type embeddings. We define a similarity function δ to measure the relevance between a candidate type c ∈ C and each token representation h_i. To fully capture the similarity information in different aspects, we realize δ with an expressive learnable function:

δ(h_i, c) = v⊤ tanh(W [h_i; c; |h_i − c|; h_i ⊙ c]),  (2)

where W ∈ R^{4d×4d} and v ∈ R^{4d×1} are learnable parameters, |·| is the absolute value operator, ⊙ is the element-wise product, and [·; ·] denotes the concatenation of representations. According to the relevance scores, we obtain the sentence representation s_c adaptive to the type via attention over tokens:

α_i = exp(δ(h_i, c)) / Σ_{j=1}^{N} exp(δ(h_j, c)),   s_c = Σ_{i=1}^{N} α_i h_i.
Finally, we predict event types by measuring the similarity between the adaptive sentence representation s_c and the type embedding c with the same similarity function δ. The predicted probability of each event type c occurring in the sentence is:

ĉ = σ(δ(s_c, c)),  (3)

where σ denotes the sigmoid function. We select the event types with ĉ > ξ_1 as results, where ξ_1 ∈ [0, 1] is a scalar threshold. All predicted types in sentence x form the event type set C_x. The learnable parameters of this decoder are θ_td = {W, v, C}.
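The attention step can be sketched in plain Python. This is a simplified illustration: a plain dot-product score stands in for the learnable similarity δ (the real δ uses W and v as above), so the numbers are not those of the trained model:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def type_adaptive_repr(token_reprs, type_emb):
    """Attention over tokens, weighted by relevance to the candidate type."""
    # relevance score of each token w.r.t. the type (dot product stand-in for delta)
    scores = [sum(h * c for h, c in zip(h_i, type_emb)) for h_i in token_reprs]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]          # numerically stable softmax
    total = sum(exps)
    alphas = [e / total for e in exps]
    d = len(type_emb)
    # s_c: attention-weighted sentence representation adaptive to the type
    return [sum(a * h_i[k] for a, h_i in zip(alphas, token_reprs)) for k in range(d)]

def type_prob(token_reprs, type_emb):
    """Probability that the type occurs, via similarity of s_c and c."""
    s_c = type_adaptive_repr(token_reprs, type_emb)
    return sigmoid(sum(a * b for a, b in zip(s_c, type_emb)))
```

Tokens relevant to a type receive larger attention weights, so s_c summarizes exactly the context that supports that type's occurrence.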

Trigger Extraction Decoder
To discern overlapped triggers with several types, we extract triggers conditioned on a specific type c ∈ C_x. This decoder contains a condition fusion function, a self-attention layer, and a pair of binary taggers for triggers. To model the conditional dependency between type detection and trigger extraction, we devise a condition fusion function φ that integrates condition information into the textual representation. Specifically, we obtain the conditional token representation g_i^c by integrating the type embedding c into the token representation h_i as:

g_i^c = φ(c, h_i).  (4)

In principle, φ can be achieved by concatenation, an addition operator or a gate mechanism. To fully generate conditional representations in the statistical aspect, we introduce an effective and general mechanism, conditional layer normalization (CLN) (Su, 2019; Yu et al., 2021), to achieve φ. CLN is based on the well-known layer normalization (Ba et al., 2016), but dynamically generates the gain γ and bias β from the condition information. Given a condition embedding c and a token representation h_i, CLN is formulated as:

CLN(c, h_i) = γ_c ⊙ ((h_i − µ) / σ) + β_c,   γ_c = W_γ c + b_γ,   β_c = W_β c + b_β,  (5)

where µ ∈ R and σ ∈ R are the mean and standard deviation taken across the elements of h_i, and γ_c ∈ R^d and β_c ∈ R^d are the conditional gain and bias, respectively. In this way, the condition representation is encoded into the gain and bias, and then integrated into the contextual representations.
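A minimal CLN sketch in plain Python, assuming the gain and bias are affine functions of the condition embedding (the weight matrices W_gamma/W_beta and the base gain/bias gamma0/beta0 are hypothetical parameter names):

```python
import math

def cln(h, cond, W_gamma, W_beta, gamma0, beta0, eps=1e-6):
    """Conditional layer normalization: normalize h across its elements,
    then apply a gain and bias generated from the condition embedding."""
    d = len(h)
    mu = sum(h) / d
    var = sum((x - mu) ** 2 for x in h) / d
    std = math.sqrt(var + eps)
    # condition-dependent gain and bias (affine in the condition embedding)
    gamma = [gamma0[k] + sum(w * x for w, x in zip(W_gamma[k], cond)) for k in range(d)]
    beta = [beta0[k] + sum(w * x for w, x in zip(W_beta[k], cond)) for k in range(d)]
    return [gamma[k] * (h[k] - mu) / std + beta[k] for k in range(d)]
```

With zero condition weights, unit base gain and zero base bias, this reduces to standard layer normalization, which is exactly the design intent: the condition only shifts and rescales an otherwise normalized representation.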
To further refine representations for trigger extraction, we adopt a self-attention layer over the conditional token representations. Formally, the refined token representations are derived as:

Z^c = SelfAttention(G^c),  (6)

where G^c is the representation matrix composed of g_i^c. For details of the self-attention layer, please refer to Vaswani et al. (2017).
To predict triggers, we devise a pair of binary taggers. For each token w_i, we predict whether it corresponds to a start or end position of a trigger as:

t̂_i^{sc} = σ(w_s⊤ z_i^c + b_s),   t̂_i^{ec} = σ(w_e⊤ z_i^c + b_e),  (7)

where σ denotes the sigmoid function, w_s, w_e, b_s, b_e are learnable parameters, and z_i^c denotes the i-th token representation in Z^c. We select tokens with t̂_i^{sc} > ξ_2 as start positions and those with t̂_i^{ec} > ξ_3 as end positions, where ξ_2, ξ_3 ∈ [0, 1] are scalar thresholds. To obtain a trigger word t, we enumerate all start positions and, for each, search for the nearest following end position in the sentence; the tokens between the start and end positions form an entire trigger. In this way, the triggers overlapped across events can be extracted separately according to their types, in separate phases. All predicted triggers t of type c in sentence x form the set T_{x,c}. The decoder parameters θ_te include all parameters in the condition fusion function, the self-attention layer and the trigger taggers.
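The start/end pairing heuristic described above can be sketched as follows (the thresholds ξ_2 and ξ_3 correspond to the xi_start/xi_end arguments; the same routine applies to argument spans):

```python
def decode_spans(start_probs, end_probs, xi_start=0.5, xi_end=0.5):
    """Pair each above-threshold start position with the nearest
    following above-threshold end position."""
    starts = [i for i, p in enumerate(start_probs) if p > xi_start]
    ends = [i for i, p in enumerate(end_probs) if p > xi_end]
    spans = []
    for s in starts:
        following = [e for e in ends if e >= s]
        if following:
            spans.append((s, min(following)))  # nearest following end
    return spans
```

For example, decode_spans([0.9, 0.1, 0.8, 0.1], [0.1, 0.95, 0.1, 0.7]) yields the two spans (0, 1) and (2, 3).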

Argument Extraction Decoder
To tackle the overlapped argument problem, we extract role-specific arguments conditioned on both a specific event type c ∈ C_x and an event trigger t ∈ T_{x,c}. This decoder also contains a condition fusion function, a self-attention layer, and a group of role-specific binary tagger pairs for arguments.
We further integrate the trigger information into the typed textual representation g_i^c from Eq. (4), again with the fusion function φ achieved by CLN; we take the average of the start- and end-position token representations of t as the trigger embedding. We then adopt a self-attention layer to derive refined textual representations. To make the model aware of the trigger position, we adopt the relative position embedding used in Chen et al. (2015), which encodes the relative distance from the current token to the trigger boundary tokens. Finally, the token representations Z^{ct} for argument extraction are derived as:

Z^{ct} = [SelfAttention(G^{ct}); P],  (8)

where G^{ct} is the representation matrix composed of g_i^{ct} = φ(t, g_i^c), P ∈ R^{N×d_p} is the matrix of relative position embeddings, d_p is their dimension, and [·; ·] denotes the concatenation of representations.
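The relative distance feature that P embeds might be computed as below. This is a sketch under an assumed offset scheme (signed distance to the trigger boundary, zero inside the trigger); the paper does not spell out the exact convention:

```python
def relative_positions(n_tokens, trig_start, trig_end):
    """Signed distance from each token to the trigger span boundaries."""
    pos = []
    for i in range(n_tokens):
        if i < trig_start:
            pos.append(i - trig_start)   # negative: before the trigger
        elif i > trig_end:
            pos.append(i - trig_end)     # positive: after the trigger
        else:
            pos.append(0)                # inside the trigger span
    return pos
```

Each integer offset then indexes a row of the learned embedding matrix P, which is concatenated to the token representation.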
To predict arguments of each role, we devise a group of role-specific tagger pairs. For each token w_i, we predict whether it corresponds to a start or end position of an argument of role r ∈ R as:

r̂_i^{sct} = σ(w_{r,s}⊤ z_i^{ct} + b_{r,s}) · I(r, c),   r̂_i^{ect} = σ(w_{r,e}⊤ z_i^{ct} + b_{r,e}) · I(r, c),  (9)

where σ denotes the sigmoid function, w_{r,s}, w_{r,e}, b_{r,s}, b_{r,e} are role-specific learnable parameters, and z_i^{ct} denotes the i-th token representation in Z^{ct}. Since not all roles belong to the specific type c, we adopt an indicator function I(r, c) to indicate whether role r belongs to type c according to the pre-defined event schema. To make the indicator function differentiable, we parameterize I(r, c) and learn it with the model parameters. Specifically, given the type embedding c ∈ C, we build the connection between the type and roles as:

I(r, c) = σ(w_r⊤ c + b_r),  (10)

where w_r and b_r are parameters associated with role r. For each role r, we select tokens with r̂_i^{sct} > ξ_4 as start positions and those with r̂_i^{ect} > ξ_5 as end positions, where ξ_4, ξ_5 ∈ [0, 1] are scalar thresholds. To obtain an argument word a with role r, we enumerate all start positions and, for each, search for the nearest following end position in the sentence; the tokens between the start and end positions form an entire argument. In this way, the overlapped arguments can be extracted separately according to the different types and triggers with role-specific taggers. All predicted arguments a_r with type c and trigger t in sentence x form the set A_{x,c,t}. The decoder parameters θ_ae include the type embedding matrix C and all parameters in the condition fusion function, the self-attention layer and the argument taggers.
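A sketch of one role's start tagger, assuming the learned indicator I(r, c) acts as a multiplicative gate on the tagger probabilities (all parameter names here, w_r, b_r, w_start, b_start, are hypothetical; exactly how the gate is applied is an assumption):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def role_start_probs(token_reprs, type_emb, w_r, b_r, w_start, b_start):
    """Start-position probabilities for one role, gated by I(r, c)."""
    # learnable indicator I(r, c): does role r belong to type c?
    gate = sigmoid(sum(a * b for a, b in zip(w_r, type_emb)) + b_r)
    probs = []
    for z in token_reprs:
        logit = sum(a * b for a, b in zip(w_start, z)) + b_start
        probs.append(gate * sigmoid(logit))  # gate scales the tagger output
    return probs
```

When the gate is near zero, every token's probability for that role is suppressed, so roles outside the type's schema are effectively never predicted.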

Experiments
In this section, we conduct experiments to evaluate the performance of CasEE.

Dataset and Evaluation Metric
We conduct experiments on the Chinese financial event extraction benchmark FewFC (Zhou et al., 2021). We split the data 8:1:1 for training/validation/testing. Table 1 shows more details. For evaluation, we follow the traditional evaluation metrics (Chen et al., 2015; Du and Cardie, 2020b): 1) Trigger Identification (TI): a trigger is correctly identified if the predicted trigger span matches a golden span; 2) Trigger Classification (TC): a trigger is correctly classified if it is correctly identified and assigned the correct type; 3) Argument Identification (AI): an argument is correctly identified if its event type is correctly recognized and the predicted argument span matches a golden span; 4) Argument Classification (AC): an argument is correctly classified if it is correctly identified and the predicted role matches a golden role. We report Precision (P), Recall (R) and F1 score (F1) for each of the four metrics.
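All four metrics reduce to micro precision/recall/F1 over sets of predicted versus gold tuples; only the tuple fields differ (e.g. TI compares trigger spans, AC compares (type, argument span, role) tuples). A sketch:

```python
def prf1(pred, gold):
    """Micro precision, recall and F1 over predicted vs. gold tuples."""
    pred, gold = set(pred), set(gold)
    tp = len(pred & gold)                     # exact tuple matches
    p = tp / len(pred) if pred else 0.0
    r = tp / len(gold) if gold else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1
```

For instance, one correct prediction against two gold tuples gives P = 1.0, R = 0.5 and F1 = 2/3.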

Comparison Methods
Though various models have recently been developed for EE, few studies investigate overlapping event extraction. We therefore develop the following baselines based on current solutions. For realistic consideration, no candidate entities are assumed to be known in advance.
Joint sequence labeling methods. This kind of method formulates event extraction as a sequence labeling task. BERT-softmax (Devlin et al., 2019) adopts BERT to learn textual representations and uses the hidden states to classify event triggers and arguments. BERT-CRF adopts a conditional random field (CRF) to capture label dependencies, as adopted in (Du and Cardie, 2020a) for document-level event extraction. BERT-CRF-joint borrows the idea of joint extraction of entities and relations (Zheng et al., 2017), adopting joint labels of type and role in the form B/I/O-type-role. None of the above methods can solve the overlapping problem, due to label conflicts.
Pipelined event extraction methods. This kind of method solves event extraction in a pipeline manner. PLMEE solves the overlapped argument problem by extracting role-specific arguments according to the trigger. Motivated by current Machine Reading Comprehension (MRC) based EE studies (Du and Cardie, 2020b), we train multiple MRC BERTs for overlapping event extraction. We extend MQAEE for multi-span extraction and re-assemble the following methods to consider conditions in EE: 1) The method first predicts types, and then predicts overlapped triggers/arguments according to the type, termed MQAEE-1.
2) The method first predicts overlapped triggers with types, and then predicts overlapped arguments according to the typed triggers, termed as MQAEE-2.
3) The method sequentially predicts types, predicts overlapped triggers according to the type, and predicts overlapped arguments according to the type and trigger, termed as MQAEE-3. All the above pipeline methods could solve (or partly solve) overlapping event extraction.

Implementation Details
We adopt the source code of PLMEE with the best hyper-parameters reported in the original literature. We implement the other baselines based on the Transformers library (Wolf et al., 2020). For all methods, we adopt the Chinese BERT-Base model as the textual encoder, which has 12 layers, 768 hidden units and 12 attention heads. We use the same values for the common hyper-parameters across methods, including the optimizer, learning rate, batch size and number of epochs. To avoid overfitting, we apply dropout to the BERT hidden states, with the rate tuned in [0, 1]. Besides, the prediction thresholds ξ_1, ξ_2, ξ_3, ξ_4, ξ_5 are tuned in [0, 1]. We select the model achieving the highest performance on the validation data. The optimal hyper-parameter settings, tuned by grid search, are listed in Appendix A.

Main Results
The performance of all methods on the FewFC dataset is shown in Table 2. The table reveals that: (1) Compared to the joint sequence labeling methods, CasEE achieves better performance on the F1 score. Specifically, CasEE achieves improvements of 4.5% over BERT-CRF and 4.3% over BERT-CRF-joint on the F1 score of AC. Besides, CasEE produces higher recall across the evaluation metrics, since the sequence labeling methods suffer from label conflicts, where only one label can be predicted for multi-label tokens. The results demonstrate the effectiveness of CasEE on overlapping event extraction.
(2) Compared to the pipeline methods, our method also outperforms them on the F1 score. The results show that CasEE achieves 3.1% and 2.6% improvements on the F1 score of TC and AC over PLMEE, indicating the importance of solving the overlapped trigger problem in EE. Though the MRC based baselines can extract the overlapped triggers and arguments, CasEE still achieves better performance. Specifically, CasEE improves by a relative margin of 4.1% over the strong baseline MQAEE-2. The reason may be that CasEE jointly learns textual representations for all subtasks, building helpful interactions and connections among them. The results demonstrate the superiority of CasEE over the above pipeline baselines.

Analysis on Overlap/Normal Data
To further understand the performance at test time, we divide the original test data into two groups: sentences with overlapped elements and sentences without overlapped elements.

[Table 7: Results of argument extraction decoder variants. The evaluation metric is precision, recall and F1 score on AC, given oracle results of type detection and trigger extraction.]

Discussion for Model Variants
To investigate the effectiveness of each module, we conduct variant experiments for CasEE.
Detection Module Variants. Table 5 shows the performance of type detection variants. Specifically, MaxP/MeanP aggregates textual representations by applying max/mean pooling over the BERT hidden states; CLS utilizes the hidden state of the special token [CLS] as the sentence representation. The results show that our method outperforms all the above variants on F1 score, indicating that learning a sentence representation adaptive to the event type produces better representations for type detection.
Extraction Module Variants. Table 6 and Table 7 show the performance of decoder variants for trigger extraction and argument extraction, respectively. We remove the self-attention layer in both extraction decoders, and remove the relative position embeddings and the indicator function in the argument extraction decoder. The results demonstrate the effectiveness of each module. Furthermore, we conduct experiments to explore the impact of the condition fusion function φ: 1) we simply remove the condition fusion function; 2) we achieve φ by concatenating the condition and token representations; 3) we achieve φ by simply adding the condition embedding to the token representations; 4) we achieve φ by a gate mechanism, which adds the condition embedding to the token representations according to a learnable trade-off factor. The results show that the performance without the condition fusion function declines significantly on F1 score in both decoders, since the model cannot discern the different targets to extract in the sentence. Besides, the empirical results show that CLN performs better than the other fusion functions on F1 scores in both decoders, indicating that CLN generates better conditional token representations for the downstream subtasks.
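For reference, the three simpler fusion variants compared against CLN might look like the following sketch (the gate factor g is a hypothetical stand-in for the learnable trade-off factor, fixed here for illustration):

```python
def fuse_concat(h, c):
    return h + c                              # [h; c]: doubles the width

def fuse_add(h, c):
    return [a + b for a, b in zip(h, c)]      # element-wise sum

def fuse_gate(h, c, g=0.5):
    # gated sum with trade-off factor g (learned in the real model)
    return [g * a + (1 - g) * b for a, b in zip(h, c)]
```

Unlike CLN, none of these variants normalizes the token representation before injecting the condition, which may explain the gap observed in the ablations.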

Conclusion
This paper proposes a joint learning framework with cascade decoding for overlapping event extraction, termed CasEE. Previous studies usually assume that events appear in sentences without overlaps, an assumption that does not hold in complicated overlapping scenarios. CasEE sequentially performs type detection, trigger extraction and argument extraction, where the overlapped targets are extracted separately, conditioned on the former predictions. All subtasks are jointly learned to capture the dependencies among them. Experiments on a public dataset demonstrate that our model outperforms previous competitive methods on overlapping event extraction. Future work may further tackle the potential error propagation problem in the cascade decoding paradigm, and improve performance on general event extraction.

A Hyper-parameter Settings
Our implementation is based on PyTorch (https://pytorch.org/). We train our models on an NVIDIA TESLA T4 GPU. For re-implementation, we report our hyper-parameter settings in Table 8. Note that the hyper-parameter settings are tuned on the validation data by grid search with 3 trials.

B Details of MRC Based Baselines
Here we describe the details of the extended MRC baselines. Since the MRC paradigm could place condition information in the questions, we extend it to solve the overlapping event extraction.
MQAEE-1 contains two models: 1) a BERT classifier to detect event types; 2) an MRC BERT to extract triggers and arguments. The question template is <type> to predict triggers of type type, and <role> to predict arguments of role role. Though this method neglects associations between the trigger and arguments, it tackles both the overlapped trigger problem and the overlapped argument problem, since the overlapped targets are extracted separately according to different questions.
MQAEE-2 contains two models: 1) an MRC BERT to extract all triggers with their types, where the question template is the single word trigger to predict all typed triggers; 2) an MRC BERT to extract arguments in different roles, where the question template is <type> and <trigger> to predict all arguments associated with the type type and the trigger trigger. This method tackles the overlapped trigger problem with multiple taggers, and the overlapped argument problem by extracting arguments separately according to both the type and trigger.

MQAEE-3 contains three models: 1) a BERT classifier to detect event types; 2) an MRC BERT to extract triggers of different types, where the question template is <type> to predict triggers of type type; 3) an MRC BERT to extract arguments in different roles, where the question template is <type> and <trigger> to predict all arguments associated with the type type and the trigger trigger. This method tackles the overlapped trigger problem by extracting triggers according to the type, and the overlapped argument problem according to both the type and trigger.