Extracting Trigger-sharing Events via an Event Matrix



Introduction
Event Extraction (EE) is a structured prediction task that aims to recognize and extract the events in a text. An event of a particular event type typically contains an event trigger and several relevant arguments. The EE task has long been challenging in the information extraction field as it involves various sub-tasks, e.g., named entity recognition (Li et al., 2021b,a) and relation extraction (Ahmad et al., 2021). Early studies of EE (Sha et al., 2018; Chen et al., 2015; Du and Cardie, 2020; Xu and Sun, 2022; Wang et al., 2019) focus on extracting a single event by simply detecting and labeling spans. Several improved methods were then proposed for extracting multiple events (Li et al., 2020a; Chen et al., 2020; Du et al., 2021; Sheng et al., 2021; Nguyen et al., 2021; Lin et al., 2020; Wang et al., 2019). These models typically formulate the extraction process as a sequence of conditional predictions, i.e., first recognize the event trigger, categorize the event type, and then extract the corresponding arguments conditioned on the trigger and event type. However, they still cannot extract multiple events jointly; they merely mitigate the multiple-event problem with multiple fine-grained conditional predictions. They make an implicit assumption that under a fine-grained condition (trigger and event type) there can be only one event, but such an assumption is generally invalid: even with the trigger and event type specified, there can be multiple events. In fact, around 16% of the samples in the FewFC (Zhou et al., 2021) dataset violate this assumption. An example of such multiple events is illustrated in Figure 1: there is a common trigger "increased" for four events, where two events have the same event type. Therefore, the capability of jointly extracting such multiple events with the same trigger is indispensable.
Hillhouse increased its holdings of Sofia by 10 million shares and Smart Energy by 3 million shares in Q2.

Figure 1: An example of multiple events (best viewed in color). In this case, even when the trigger and event type are given, i.e., "increased" and "equity investment", there are still two different events.
To this end, we propose a unified framework, MatEE, to predict trigger-sharing events in a single stage using a novel prediction formalism. Instead of predicting the argument spans independently, we model the co-event relationship between arguments as token-token classification, so that trigger-sharing events can be represented by a single event matrix. An example of an event matrix is illustrated in Figure 2. A grid in blue indicates an argument boundary. A grid in any other color except white expresses two kinds of information: (1) two arguments coexist in the same event, and (2) the argument role of the entity, denoted by the color of the grid. By this definition, trigger-sharing events can be represented by several cliques in the event matrix. Therefore, we can recover the events by extracting the maximal cliques and identify the arguments by retrieving the values of the grids.
Figure 2: An example of an event matrix of two events (best viewed in color), i.e., {sub: "Tian Yin", obj: "Wei Ke", stk: "20%"} and {sub: "Tian Yin", obj: "Tian Long", stk: "30%"}. The sentence is the translation of a Chinese example "天音收购唯科20%,天珑30%". The blue color indicates the argument spans, i.e., a grid (i, j) in blue denotes that there is an argument with span (j, i). A grid (i, j) in any other color except white carries two meanings: (1) two arguments co-exist in the same event, where the former ends with the i-th token and the latter starts with the j-th token; (2) the role of the former argument is denoted by the color. The dashed lines indicate the cliques of the two events.
With this novel formalism, we present a neural framework for extracting trigger-sharing events jointly (cf. the overview in Figure 3). First, BERT (Devlin et al., 2019) provides contextualized word representations, based on which we construct a trigger extractor for trigger extraction. Then an argument extractor derives the event matrix by capturing the interactions between argument spans and the co-event relationships. Both the trigger extractor and the argument extractor are trained to maximize the likelihood of the labeled data, along with a particular contrastive learning objective for the trigger extractor.
Overall, our main contributions are two-fold:
• We present MatEE, a unified framework that represents events with an event matrix for extracting multiple trigger-sharing events.
• Experimental results on three widely-used datasets reveal that our model achieves significant improvements over competitive methods.

Trigger-sharing Events Representation
Previous methods for multiple events extraction typically predict the arguments conditioned on the trigger and event type, which are predicted ahead of the arguments. For instance, MQAEE (Li et al., 2020a) predicts the arguments by constructing questions from the trigger and event type. CasEE (Sheng et al., 2021) similarly predicts the spans of arguments with the conditional inputs. Although the predicted arguments may belong to multiple events, these methods are unable to distinguish the various events from these arguments. Therefore, the capability of jointly extracting multiple trigger-sharing events is critical.
Event Matrix. To this end, we propose a novel prediction formalism to represent multiple trigger-sharing events in a single event matrix, motivated by recent advances in pair representation (Li et al., 2021b; Wang et al., 2020). The event matrix effectively represents the co-event relationship between arguments. Specifically, for a text with L tokens x = {w_1, w_2, ..., w_L}, the event matrix M = {m_{i,j}} is of size R^{L×L}. With N_a types of argument roles, the grid m_{i,j} can take a value in {0, 1, ..., N_a + 1}, where 0 denotes the blank tag, 1 denotes the span tag, and {2, ..., N_a + 1} correspond to the N_a argument roles such as "sub", "obj" and "stk". The span tag is used to identify the argument spans in the events: m_{i,j} is 1 if and only if there is an argument that starts with the j-th token and ends with the i-th token. An m_{i,j} in {2, ..., N_a + 1} indicates the co-event relationship of arguments. First, it means that two arguments (a_1, a_2) co-exist in the same event, where a_1 ends with the i-th token and a_2 starts with the j-th token. We assume that every argument has a unique boundary so that the spans of a_1 and a_2 can be directly inferred from (i, j). Second, its value indicates the argument role of a_1, so that we can align the spans with the argument roles. By this definition, the co-existence of argument pairs can determine all the arguments of a single event. This is because, if the arguments are in the same event, every two arguments are connected, and thus the arguments form a clique in the graph. Therefore, we can use a maximal clique algorithm to recover the events from the event matrix.
Figure 2 shows an event matrix where the value of m_{i,j} is denoted by color. Let purple denote the argument role "sub". The color of the grid ("Yin", "Wei") is purple since the argument ending with "Yin" and the argument starting with "Wei" are in the same event, and the former argument is a "sub". To derive the corresponding arguments "Tian Yin" and "Wei Ke" from "Yin" and "Wei", we can refer to the blue grids ("Yin", "Tian") and ("Ke", "Wei"), which denote the spans of the arguments. Taking the whole event matrix into consideration, we can see that "Tian Yin", "Wei Ke" and "20%" are in the same event and that their argument roles are "sub", "obj" and "stk", as given by the colors of the grids. Consequently, the two events {sub: "Tian Yin", obj: "Wei Ke", stk: "20%"} and {sub: "Tian Yin", obj: "Tian Long", stk: "30%"} can be represented by this single event matrix.
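To make the formalism concrete, the worked example above can be encoded programmatically. The sketch below is our illustration rather than the paper's code; the token indices, role-tag values, and the helper name `build_event_matrix` are hypothetical choices consistent with the definition (0 = blank, 1 = span, 2..Na+1 = roles).

```python
# Illustrative sketch: encoding two trigger-sharing events as an event matrix.
BLANK, SPAN = 0, 1
SUB, OBJ, STK = 2, 3, 4  # role tags occupy {2, ..., Na + 1}

def build_event_matrix(n, events):
    """events: list of {(start, end): role_tag} dicts, one dict per event."""
    M = [[BLANK] * n for _ in range(n)]
    for ev in events:
        for (s, e) in ev:
            M[e][s] = SPAN               # span tag: argument starts at s, ends at e
        for a1, role in ev.items():      # co-event tags: for every ordered pair,
            for a2 in ev:                # grid (end of former, start of latter)
                if a1 != a2:             # carries the role of the former argument
                    M[a1[1]][a2[0]] = role
    return M

# "Tian Yin acquires 20% of Wei Ke and 30% of Tian Long" (toy tokenization;
# each argument is a (start, end) token span)
tian_yin, wei_ke, pct20 = (1, 2), (4, 5), (6, 6)
tian_long, pct30 = (8, 9), (10, 10)
M = build_event_matrix(11, [
    {tian_yin: SUB, wei_ke: OBJ, pct20: STK},
    {tian_yin: SUB, tian_long: OBJ, pct30: STK},
])
```

As the definition requires, the grid at ("Yin", "Wei") carries the "sub" tag, while arguments of different events (e.g., "Wei Ke" and "Tian Long") leave their grid blank.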
Representing a Single Argument. There is a special case in this formalism: when an event has only one argument, we cannot construct an argument pair to indicate the role of this argument. To mitigate this problem, we leverage the CLS token as a proxy to pair with the individual argument. Therefore, for convenience, we let the first token w_1 in the text be the CLS token.

The MatEE Framework
The architecture of our framework is illustrated in Figure 3 and mainly consists of three components. First, the widely-used pre-trained language model BERT is used as the encoder to yield word representations. Then a trigger extractor extracts the triggers based on sequence labeling. Afterward, an argument extractor containing a multi-head biaffine network predicts the event matrix, from which the events are recovered by maximal clique decoding.

Encoder Layer
We leverage BERT (Devlin et al., 2019) to process the text due to its effectiveness in event extraction (Sheng et al., 2021; Yang et al., 2021). The text with L tokens is encoded by BERT to derive the vector representations H = {h_1, h_2, ..., h_L} ∈ R^{L×D}, where D is the dimension of the embedding.

Trigger Extraction
We formalize trigger extraction as a sequence labeling task. Since previous studies (Li et al., 2020b, 2021a) have demonstrated that span-based sequence labeling can solve the problem of overlapping spans in entity recognition, we apply it to extract the triggers.
Trigger Extractor. First, we extract the trigger spans for each event type from the text under the span-based framework. This module is mainly divided into two parts, i.e., start index prediction and end index prediction. In start index prediction, with the text representation H from the BERT encoder, the module predicts the probability of each token being a start index of the k-th event type:

T_s^k = σ(MLP(H)),  (1)

where T_s^k ∈ R^L is the start index probability distribution of the k-th event type and MLP denotes a linear transformation layer. With the start index, in end index prediction, the module concatenates the text representation H with the start index prediction T_s and calculates the probability of each token being an end index by:

T_e^k = σ(MLP([H; T_s])),  (2)

where T_e^k ∈ R^L is the end index probability distribution of the k-th event type. When selecting the trigger spans during prediction, for each event type and each trigger start index, we pick all end positions within the maximum trigger length to obtain the candidate triggers for the following argument extraction. In this way, our model can extract multiple triggers. The trigger extraction objective maximizes the likelihood of the labeled spans, i.e., minimizes:

L_trig = Σ_{k=1}^{N_e} [CE(T_s^k, T_s^{k*}) + CE(T_e^k, T_e^{k*})],  (3)

where N_e denotes the number of event types and T_s^{k*}, T_e^{k*} ∈ R^L are the target start and end labels of the k-th event type. CE is the cross-entropy loss.
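The start/end prediction described above can be sketched in NumPy as follows. The weight shapes, the sigmoid activation, and the thresholding scheme are our assumptions for illustration, not the paper's exact implementation.

```python
# Hedged sketch of span-based trigger extraction: start probabilities from
# the token representations, end probabilities from [H; T_s].
import numpy as np

rng = np.random.default_rng(0)
L, D, Ne = 6, 8, 3          # tokens, hidden size, number of event types

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

H = rng.normal(size=(L, D))                  # contextual token representations
W_s = rng.normal(size=(D, Ne)) * 0.1         # start-index head, one column per type
W_e = rng.normal(size=(D + Ne, Ne)) * 0.1    # end-index head over [H; T_s]

T_s = sigmoid(H @ W_s)                                 # (L, Ne) start probabilities
T_e = sigmoid(np.concatenate([H, T_s], axis=1) @ W_e)  # (L, Ne) end probabilities

# Candidate triggers: each confident start paired with every confident end
# within the maximum trigger length, per event type (yields multiple triggers).
max_len, thr = 12, 0.5
candidates = [(k, s, e)
              for k in range(Ne)
              for s in range(L) if T_s[s, k] > thr
              for e in range(s, min(L, s + max_len)) if T_e[e, k] > thr]
```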
Discounted Contrastive Learning. In practice, we found considerable confusion among event types, especially "收购" (acquisition), "股权股份转让" (equity transfer), and "投资" (investment) in FewFC. Therefore, we design a discounted contrastive learning objective for the confusing event types:
L_dcl = −Σ_{j∈T+} log ( exp(f(T, E_e^j)/τ) / Σ_{j'∈T+∪T−} exp(f(T, E_e^{j'})/τ) ),  (4)

where E_e ∈ R^{N_e×N_e} is the prototype embedding of the event types and E_e^j represents the prototype embedding of event type j. T ∈ R^{L×N_e} is the start or end index prediction, τ is a temperature hyper-parameter, and T+ (T−) is the set of positive (negative) event types. The score function f is expressed as follows:

f(T, E_e^j) = cosine_similarity(T, E_e^j) / rank_j,  (5)

where cosine_similarity is the cosine similarity measure and rank_j ∈ [1, N_e] represents the similarity ranking of the j-th event type with T.
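A possible realization of the discounted score and the contrastive objective is sketched below. The rank-based discount and the InfoNCE-style normalization reflect our reading of the description, and the names `discounted_scores` and `dcl_loss` are hypothetical.

```python
import numpy as np

def discounted_scores(t, E):
    """Cosine similarity of a prediction t (D,) to each prototype row of
    E (Ne, D), discounted by the similarity rank (rank 1 = most similar)."""
    cos = E @ t / (np.linalg.norm(E, axis=1) * np.linalg.norm(t) + 1e-9)
    order = np.argsort(-cos)
    rank = np.empty(len(cos), dtype=int)
    rank[order] = np.arange(1, len(cos) + 1)
    return cos / rank

def dcl_loss(t, E, positives, tau=0.1):
    """InfoNCE-style loss: positive event types against all event types."""
    f = discounted_scores(t, E) / tau
    w = np.exp(f - f.max())              # numerically stable softmax weights
    return float(-np.log(w[positives].sum() / w.sum()))
```

With orthogonal prototypes, a prediction aligned with its positive prototype yields a near-zero loss, while a misaligned one is penalized heavily.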

Argument Extraction
The argument extractor is responsible for extracting multiple events jointly, conditioned on the event trigger and event type. As mentioned above, previous methods simply extract the arguments w.r.t. the trigger and event type and cannot distinguish the events for the extracted arguments. In contrast, we leverage the event matrix to represent the multiple trigger-sharing events in a single stage. The right side of Figure 3 illustrates the framework of the argument extractor. It receives the trigger and event type from the trigger extractor, performs a series of computations to aggregate the information between the tokens, and then predicts an event matrix.
Conditional Representation. To integrate the trigger and event type, we adopt three components, i.e., conditional layer normalization, event type embedding, and trigger relative position embedding. Conditional Layer Normalization (CLN) (Yu et al., 2021) is a mechanism to effectively fuse the identified trigger with the textual representation h ∈ R^D. The textual representation Ĉ = {ĉ_1, ..., ĉ_L} ∈ R^{L×D} used to extract the arguments is calculated as follows:

ĉ_i = CLN(t, h_i),

where t ∈ R^D is the vector representing the trigger, obtained by average pooling over the trigger tokens. Refer to Yu et al. (2021) for more details on CLN.
To encode the event type, we vectorize the event type as e ∈ R^D by retrieving an embedding matrix. Besides, since the relative position has been shown to be significant in relevant tasks (Yang et al., 2016), we also use a relative position vector, as shown in Figure 4, to indicate the position of the trigger. Denoting by P_t ∈ R^{L×D} the relative position embedding of trigger t, the fused representation is obtained as follows:

C = MLP([Ĉ; E; P_t]),

where E ∈ R^{L×D} repeats the event type embedding e L times so that we can directly concatenate these representations. The output of the MLP is C ∈ R^{L×D}.
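The conditional representation can be sketched as follows. Generating the layer-norm gain and bias from the trigger vector follows the common CLN recipe and is an assumption here, as are all weight shapes.

```python
import numpy as np

rng = np.random.default_rng(1)
L, D = 5, 8

def cln(H, t, Wg, Wb, eps=1e-5):
    """Conditional LayerNorm: per-token normalization with gain and bias
    generated from the condition vector t (here, the pooled trigger)."""
    mu = H.mean(axis=-1, keepdims=True)
    sd = H.std(axis=-1, keepdims=True)
    gain, bias = Wg @ t, Wb @ t          # condition-dependent parameters
    return gain * (H - mu) / (sd + eps) + bias

H = rng.normal(size=(L, D))
trig = H[2:4].mean(axis=0)               # average-pool the trigger tokens
Wg, Wb = rng.normal(size=(D, D)) * 0.1, rng.normal(size=(D, D)) * 0.1
C_hat = cln(H, trig, Wg, Wb)             # (L, D) trigger-conditioned tokens

E = np.tile(rng.normal(size=D), (L, 1))  # event type embedding, repeated L times
P_t = rng.normal(size=(L, D))            # trigger relative position embedding
W_f = rng.normal(size=(3 * D, D)) * 0.1
C = np.concatenate([C_hat, E, P_t], axis=1) @ W_f  # fused representation (L, D)
```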
Multi-Head Biaffine Network. The representation C integrates the text with the information from the event trigger and the event type. We can then use it to predict the event matrix. To fully aggregate the information of arguments and their relationships, inspired by Li et al. (2021b), we propose two feature extractors, i.e., a span extractor and a relation extractor, to extract the argument spans and the relationships between arguments, respectively. Specifically, the span extractor obtains the representations of spans Ŝ ∈ R^{L×D} as follows:

A_e = MLP(C),  Ŝ = MLP([C; A_e]),

where A_e ∈ R^{L×2} is used to identify the end index of arguments. The two MLPs denote two different dense layers, but we omit the subscripts for clarity.
To capture the relationships between the arguments, the relation extractor derives the vector representations of the tokens as:

A_r = σ(C W_r C^⊤),  R = A_r C,

where W_r ∈ R^{D×D} is a learnable matrix for the bilinear transformation, A_r ∈ R^{L×L} represents the relationships between arguments, C is the input of the argument relation extractor, and R ∈ R^{L×D} is the output.
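Read as a bilinear attention step, the relation extractor might look like the following sketch. This is our reconstruction; the row-wise softmax normalization is an assumption.

```python
import numpy as np

rng = np.random.default_rng(2)
L, D = 5, 8
C = rng.normal(size=(L, D))      # fused token representations

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

W_r = rng.normal(size=(D, D)) * 0.1
A_r = softmax(C @ W_r @ C.T)     # (L, L) pairwise relation weights
R = A_r @ C                      # (L, D) relation-aware token representations
```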
To jointly extract the argument spans and relations, we apply a biaffine network (Yu et al., 2020) to fuse the outputs of the span extractor and the relation extractor. However, the biaffine model is insensitive to entity length and boundaries, often rendering the predictions of entity boundaries incorrect. To mitigate this problem, we add the relative distances between the tokens (cf. Appendix A) as input and vectorize them with an embedding matrix. With the relative position embedding of the arguments P_a ∈ R^{L×L}, we concatenate it with the argument span representation Ŝ and the argument relation representation R to gather the position information. To maintain the dimension, we use an MLP to map the concatenated vectors to D-dimensional vectors. Then, the biaffine network fuses the representations of the two extractors:

v_{i,j} = s_i^⊤ U r_j + W_b [s_i; r_j] + b,

where U ∈ R^{D×F×D} and W_b ∈ R^{F×2D} are the trainable parameters (F is the dimension of v_{i,j}), b ∈ R^F is the bias, and s_i and r_j denote the i-th and j-th elements of S and R, respectively. Motivated by the Transformer (Vaswani et al., 2017), we use multiple biaffine heads to capture the interactions between argument spans and argument relations from different perspectives. Finally, the output of the multi-head biaffine layer is:

M = softmax(MLP([V^1; ...; V^{N_h}])),

where N_h is the number of heads and V^k = {v_{i,j}^k} denotes the output of the k-th biaffine head. M ∈ R^{L×L×(N_a+2)} is the predicted probability of the event matrix after aggregating the multiple biaffine heads, where N_a + 2 is the number of categories in the event matrix (N_a argument roles and two extra categories, i.e., span and blank). The overall objective of argument extraction for a data sample is:

L_arg = Σ_{k=1}^{N_e} Σ_{t∈T_k} CE(M, M*),

where T_k is the set of triggers of the k-th event type and M* ∈ R^{L×L} is the target event matrix.
Maximal Clique Decoding. As shown in Algorithm 1, to decode multiple events from the prediction M, we construct an argument role matrix O to identify the argument role in each event, and an undirected relation graph G to represent the co-event relationship between arguments. The graph construction procedure, given in Algorithm 1, finds all argument spans and relations needed to reconstruct the events.
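The per-head biaffine scoring v_{i,j} = s_i^T U r_j + W_b [s_i; r_j] + b can be sketched as follows. The einsum formulation and the consistency check are ours; the shapes follow the definitions in the text.

```python
import numpy as np

rng = np.random.default_rng(3)
L, D, F = 4, 6, 3
S = rng.normal(size=(L, D))      # span-extractor outputs
R = rng.normal(size=(L, D))      # relation-extractor outputs

def biaffine(S, R, U, W, b):
    """v[i, j, f] = S[i] @ U[:, f, :] @ R[j] + W[f] @ [S[i]; R[j]] + b[f]."""
    bilinear = np.einsum('id,dfe,je->ijf', S, U, R)
    pair = np.concatenate([np.broadcast_to(S[:, None, :], (L, L, D)),
                           np.broadcast_to(R[None, :, :], (L, L, D))], axis=-1)
    return bilinear + pair @ W.T + b     # (L, L, F)

U = rng.normal(size=(D, F, D))
W = rng.normal(size=(F, 2 * D))
b = rng.normal(size=F)
V = biaffine(S, R, U, W, b)

# Cross-check one entry against the direct formula.
i, j, f = 1, 2, 0
v = S[i] @ U[:, f, :] @ R[j] + W[f] @ np.concatenate([S[i], R[j]]) + b[f]
```

Multiple heads would each hold their own (U, W, b) and the resulting V tensors would be concatenated before the final classification MLP.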
An example of a constructed graph is shown in Figure 5. With the graph G, we can decode all maximal cliques based on the Bron-Kerbosch algorithm (cf. Appendix B) to find the corresponding events.

Algorithm 1: Graph Construction
Input: the predicted event matrix M; the maximal number of arguments Q; the size L of M.
Output: the undirected graph G and the argument role matrix O.
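A compact, pivot-free Bron-Kerbosch implementation suffices to recover the events as maximal cliques of G. The toy graph below encodes two events sharing the argument A; the node names are hypothetical.

```python
def bron_kerbosch(r, p, x, adj, out):
    """Enumerate all maximal cliques of the undirected graph given by adj."""
    if not p and not x:
        out.append(sorted(r))        # r is maximal: no vertex can extend it
        return
    for v in list(p):
        bron_kerbosch(r | {v}, p & adj[v], x & adj[v], adj, out)
        p.remove(v)
        x.add(v)

# Two trigger-sharing events, {A, B, C} and {A, D, E}; within each event
# every argument pair is connected, so each event is a clique.
adj = {'A': {'B', 'C', 'D', 'E'}, 'B': {'A', 'C'}, 'C': {'A', 'B'},
       'D': {'A', 'E'}, 'E': {'A', 'D'}}
cliques = []
bron_kerbosch(set(), set(adj), set(), adj, cliques)
```

Production code would typically add pivoting (or use a library implementation) to tame the exponential worst case, which is why the number of arguments Q is capped in Algorithm 1.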

Model Training
The overall training objective to be minimized is:

L = Σ_{x∈D} (λ_1 L_trig + λ_2 L_arg + λ_3 L_dcl),

where λ_1, λ_2, λ_3 are hyper-parameters weighting the trigger extraction, argument extraction, and discounted contrastive learning objectives, and D denotes the training samples. The three losses are optimized jointly in an end-to-end fashion.
Experimental Settings

Dataset. Our experiments are conducted on three widely-used datasets.
Evaluation Metric. The evaluation metrics follow previous studies (Sheng et al., 2021; Yang et al., 2021). TI, TC, AI, and AC denote trigger identification, trigger classification, argument identification, and argument classification, respectively. For more details, please refer to Appendix C.
Implementation Details. The number of heads of the biaffine layer is 3, with head dimension F = 20. We set the maximum length of triggers to 12 and the maximum length of arguments to 25. The worst-case time complexity of maximal clique decoding is exponential; therefore, we set the maximum number of identified arguments Q in Algorithm 1 to 30. The loss weights are λ_1 = 1.0, λ_2 = 5.0, λ_3 = 0.5, and the temperature τ in the trigger contrastive learning is 0.1. For other hyper-parameters and details, please refer to Appendix D.

Overall Results
We evaluate our framework on three widely-used event extraction datasets. Our model outperforms the best baseline by 2.06% on FewFC, 1.45% on DuEE, and 2.10% on iFLYTEK. Two factors account for this significant improvement. Firstly, the previous work CasEE produces event types with multi-label classification and then performs trigger identification according to the event types; this two-stage model suffers from cumulative error. Secondly, we apply contrastive learning to learn an effective representation of sentences and triggers, thus reducing false positives.

Results For Trigger-sharing Events
We divide the test data into three parts: 1) Trigger-sharing: multiple events share a trigger; 2) Element-sharing: multiple events share an argument or a trigger; 3) Normal: elements are not shared between events. Take the FewFC dataset as an example. As shown in Table 3, BERT performs significantly worse on element-sharing events; the key reason is that it cannot extract element-sharing events. PLMEE, MQAEE, and CasEE can partially extract element-sharing events, but they cannot deal with multiple events sharing a trigger within the same event type. As can be seen from the table, the AC F1 score of our model on trigger-sharing events improves by 11.84%. On element-sharing events, the AC F1 score of our model is greater than the AI F1 score. The major contribution is that our event matrix can assign the same argument to multiple events.

Model Ablation Studies
We ablate each part of our model on FewFC, as shown in Table 4. First, without discounted contrastive learning (DCL), we observe performance drops of 2.29% on TC and 3.08% on AC, which verifies the usefulness of trigger contrastive learning. When removing CLN, the event type embedding, or the trigger relative position embedding, the performance drops slightly. Furthermore, after removing the argument relative position encoding, the performance decrease is the most significant. The main reason is that the argument span extractor and argument relation extractor are not position-sensitive, resulting in boundary errors. When the two extractors are stacked directly instead of being fused by the biaffine network, the performance also drops markedly, which shows that the biaffine module strengthens the interaction between the two extractors and thereby improves AC. Finally, when the number of heads is reduced to 1 or 2, we observe slight performance drops on this dataset.

Case Study
In addition to the quantitative results, we visualize several event matrices for a better understanding of the model's behavior. Figure 6 shows an example of a predicted event matrix (left) together with the target event matrix (right). We use black to denote the blank tag and white to denote the span tag to make the contrast clear. The predicted matrix correctly represents two events as two cliques, and the argument roles are also correct. This example qualitatively verifies the ability of the model to handle trigger-sharing events. We also show an incorrect case where the model fails to aggregate the arguments of a single event in Figure 7: several co-event relationships between the arguments are missed, so the resulting events are incomplete. Compared to previous models, ours not only predicts the arguments and their corresponding roles but also predicts whether they exist in the same event. This additional prediction task may make the learning objective more difficult, but the experimental results show that including such information is quite helpful for representing the events and thus benefits the extraction process. We attach more visualization examples in Appendix H.

Related Work
Single Event Extraction. Early studies typically considered the task a sequence labeling problem, assigning each token a tag from a predefined scheme (e.g., BIO). Mainstream studies combine the CRF (Chen et al., 2015) with neural architectures such as CNNs (Chen et al., 2015), bi-directional LSTMs (Sha et al., 2018), and Transformers (Du and Cardie, 2020). However, these methods fail to address the problem of multiple events.
Multiple Events Extraction. Several studies (Li et al., 2020a; Chen et al., 2020; Ahmad et al., 2021; Liu et al., 2018; Hsu et al., 2022; Ma et al., 2022; Hwang et al., 2022; Feng et al., 2022) cast multiple events extraction as a sequence of conditional predictions, i.e., recognizing the trigger with its event type first and then extracting the corresponding arguments. Conditions are used in a variety of ways, such as graph aggregation (Liu et al., 2018), graph representation (Xu et al., 2018), Seq2Seq (Du et al., 2021), MRC (Zhou et al., 2021), and cascade decoding (Sheng et al., 2021). However, the above methods make an implicit assumption that, under a fine-grained condition (trigger and event type), there can be only one event. Inspired by the extraction of overlapped NER (Li et al., 2021b) and SPO (Wang et al., 2020), we propose an event matrix to deal with trigger-sharing events.

Conclusion
We have presented a unified framework, MatEE, based on the event matrix, a novel formalism for representing multiple events jointly. Our framework is useful for various event extraction scenarios, especially trigger-sharing events. The empirical comparisons and the results of the analytical experiments verify its effectiveness. Beyond EE, our work may shed light on other complicated structured prediction tasks where the components are hard to predict sequentially. In the future, we will focus on generalizing MatEE to the case of multi-role arguments and on incorporating inherent prior constraints.

Limitations
Nonetheless, these results must be interpreted with caution, and several limitations should be borne in mind. First, limited by the definition of the event matrix, an argument in an event can play at most one role, since each grid can take only one value. Second, the event matrix is unable to represent two events when one event is a subset of another, although we did not find such cases in the datasets. Third, the worst-case time complexity of the maximal clique decoding algorithm is O(3^{n/3}) for an n-vertex graph; therefore, it is not suitable for document-level event extraction. Finally, we also notice that the predicted matrix may violate the definition of the event matrix. For instance, in Figure 6 there is a dark yellow grid in the top-right corner but no bright grid in the bottom-left corner. The two grids form a pair and should both be either blank or colored with argument roles. Taking such constraints into consideration could improve the confidence of the predictions, and we leave this to future work.

A Argument Relative Position Encoding

The argument relative position encoding P_a ∈ R^{L×L} is computed with ω_{v_ij} = 1/10000^{v_ij/L}, where v_ij is the value in the i-th row and j-th column of Figure 8.

For the first case, we apply the Bron-Kerbosch algorithm to split the arguments into different maximal cliques, and then sort and decode them to generate multiple events. For the second case, in the trigger extraction module, we extract the triggers of different event types and then extract the arguments of these triggers, respectively.

G Argument-sharing Events
As can be seen from Figure 10, there are two types of argument-sharing events. First, multiple events share an argument with the same role. Second, an event has only one argument. For the left case, we apply the Bron-Kerbosch algorithm to generate the two maximal cliques directly. For the right case, some events have only one argument; here we first use the CLS token as a proxy, and then the rows and columns of CLS are filled with the only argument's role. In practice, <CLS, A> is filled with "sub" and <A, CLS> is filled with "sub".

H More Visualizations
We also show several predicted event matrices here. Figure 11 illustrates an example of an event with only one argument, paired with the CLS token, which is the special case mentioned in Section 2. Figure 12 shows a case where the prediction is very close to the target event matrix. Figure 13 is an example of an incorrect prediction where an argument is missed.

Figure 4: The trigger relative position is based on the trigger position and relative distance.

Figure 5: Left: the argument role matrix O is used to identify the argument role in each event. Right: the undirected relation graph G is used to represent the co-event relationship between arguments.
The relative positions used by the argument span extractor and argument relation extractor are shown in Figure 8.

Figure 8: The argument relative position is based on all word positions and relative distance.

Figure 9: Left: multiple events share a trigger with the same event type. Right: multiple events in other situations.

Figure 10: Left: multiple events share an argument with the same role. Right: a special case used for an event with only one argument.

Figure 11: A correct prediction with only one argument.

Figure 13: An incorrect prediction where an argument is missed.

Table 3: Results on the trigger-sharing, element-sharing, and normal subsets of the FewFC test set. F1 scores are reported for each evaluation metric.