Enhancing Document-level Event Argument Extraction with Contextual Clues and Role Relevance

Document-level event argument extraction poses new challenges of long input and cross-sentence inference compared to its sentence-level counterpart. However, most prior works focus on capturing the relations between candidate arguments and the event trigger in each event, ignoring two crucial points: (a) non-argument contextual clue information; (b) the relevance among argument roles. In this paper, we propose the SCPRG (Span-trigger-based Contextual Pooling and latent Role Guidance) model, which contains two novel and effective modules for the above problems. Span-Trigger-based Contextual Pooling (STCP) adaptively selects and aggregates the information of non-argument clue words based on the context attention weights of specific argument-trigger pairs from the pre-trained model. The Role-based Latent Information Guidance (RLIG) module constructs latent role representations, makes them interact through role-interactive encoding to capture semantic relevance, and merges them into candidate arguments. Both STCP and RLIG introduce no more than 1% new parameters compared with the base model and can be easily applied to other event extraction models, making them compact and transplantable. Experiments on two public datasets show that SCPRG outperforms previous state-of-the-art methods, with 1.13 F1 and 2.64 F1 improvements on RAMS and WikiEvents respectively. Further analyses illustrate the interpretability of our model.


Introduction
Event argument extraction (EAE) aims to identify the arguments of events, formed as entities in text, and predict their roles in the related event. As the key step of event extraction (EE), EAE is an important NLP task with widespread applications, such as recommendation systems (Li et al., 2020) and dialogue systems (Zhang et al., 2020a), for presenting unstructured text containing event information in a structured form. Compared with previous works (Liu et al., 2018; Wadden et al., 2019; Tong et al., 2020) focusing on sentence-level EAE, more and more recent works explore document-level EAE (Wang et al., 2022; Yang et al., 2021; Xu et al., 2022), which needs to solve the long-distance dependency (Ebner et al., 2020) and cross-sentence inference (Li et al., 2021) problems. Therefore, many works (Zhang et al., 2020b; Pouran Ben Veyseh et al., 2022) construct graphs based on heuristic rules (Xu et al., 2021) or syntactic structures (Xu et al., 2022) and model logical reasoning with Graph Neural Networks (Kipf and Welling, 2016; Zeng et al., 2023). However, all state-of-the-art works ignore two crucial points: (a) non-argument clue information; (b) the relevance among argument roles.

Figure 1: A document from the RAMS (Ebner et al., 2020) dataset. The event Conflict and Attack is triggered by wounded, with four arguments of different roles scattered across the document. Words in red are non-argument clue words meaningful for argument extraction.
Non-argument clues are contextual text, other than the target arguments, that can provide important guiding information for predicting many complex argument roles. For example, in Figure 1, for the event Conflict and Attack, the non-argument clues detonated, claim responsibility and terrorist attack provide significant clue information for identifying the arguments explosive belts and Islamic State. However, many previous works (Li et al., 2021; Xu et al., 2022) only utilize a pre-trained transformer-based encoder to obtain global context information implicitly, ignoring that different arguments appearing in events should attend to context information highly relevant to the entity (Zhou et al., 2021) and the target event (Ebner et al., 2020). Therefore, in this paper we design a Span-Trigger-based Contextual Pooling (STCP) module, which merges the information of non-argument clues for each argument-trigger pair based on their contextual attention product from the pre-trained model, enhancing the candidate argument representation with additional relevant context information.

Some argument roles have close semantic relevance that is beneficial for argument extraction. For example, in Figure 1, there is close semantic relevance between the roles injurer and victim, which can provide significant information guidance for extracting arguments of these two roles in the target event Conflict and Attack. Moreover, many roles co-occur in multiple events (Ebner et al., 2020; Li et al., 2021) and may therefore be semantically related. Specifically, we count and visualize the co-occurrence frequency of the 15 most frequent roles in the RAMS dataset in Figure 2. For example, the roles attacker, target and instrument co-occur frequently, demonstrating that they are more semantically relevant to each other than to other roles. In this paper, we propose a Role-based Latent Information Guidance (RLIG) module, consisting of role-interactive encoding and role information fusion. Specifically, we design a role-interactive encoder with roles added into the input sequence, where role embeddings can not only learn latent semantic information of roles but also capture semantic relevance among roles. The latent role embeddings are then merged into candidate arguments through pooling and concatenation operations, providing information guidance for document-level EAE.
In this paper, we propose an effective document-level EAE model named SCPRG (Span-trigger-based Contextual Pooling and Role-based latent information Guidance), containing the STCP module and the RLIG module for the two aforementioned problems respectively. Notably, these two modules leverage the well-learned attention weights from the pre-trained language model, introduce no more than 1% new parameters, and can easily be applied to other event extraction models, making them compact and transplantable. Moreover, we eliminate noise information by excluding argument-impossible spans. Our contributions are summarized as follows:

• We propose a span-trigger-based contextual pooling module, which adaptively selects and aggregates the information of non-argument clues, enhancing the candidate argument representation with relevant context information.
• We propose a role-based latent information guidance module, which provides latent role information guidance containing semantic relevance among roles.
• Extensive experiments show that SCPRG outperforms previous state-of-the-art models, with 1.13 F1 and 2.64 F1 improvements on the public RAMS and WikiEvents (Li et al., 2021) datasets. We further analyse the attention weights and latent role representations, which shows the interpretability of our model.

Method
We formulate document-level event argument extraction as a multi-class classification problem. Given a document D consisting of N words, i.e., D = {w_1, w_2, ..., w_N}, a set of pre-defined event types E, and the corresponding role set R_e and trigger t ∈ D for each event e ∈ E, the task aims at predicting all (r, s) pairs for each event in document D, where r ∈ R_e is an argument role for event e ∈ E and s ⊆ D is a contiguous text span in D. Following (Ebner et al., 2020; Xu et al., 2022), we extract event arguments for each event in a document independently. Figure 3 shows the overall architecture of our SCPRG.

Figure 3: The main architecture of SCPRG. The input sequence with roles is fed into the role-interactive encoder, which outputs context representations, role representations and attention heads. STCP adaptively fuses non-argument contextual clues into a context vector based on the attention product between the trigger and arguments. RLIG constructs latent role embeddings through role-interactive encoding and fuses them into a latent role vector by a pooling operation. The context vector and latent role vector are merged into the final span representation, and the classification module predicts argument roles for all candidate spans.
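As a concrete illustration of the task's candidate spans (contiguous spans s ⊆ D), the following minimal sketch enumerates spans up to a maximum length, as in span-based methods (Ebner et al., 2020). The max_len value and the comma-exclusion heuristic (mirroring the argument-impossible span exclusion described later) are illustrative assumptions, not the paper's exact settings.

```python
# Illustrative sketch: enumerate candidate argument spans for a document.
# max_len and the comma-exclusion heuristic are assumptions for illustration.

def enumerate_candidate_spans(words, max_len=5, exclude_commas=True):
    """Return all contiguous spans (i, j), inclusive, of up to max_len words.

    Spans with a comma strictly in the middle are treated as
    argument-impossible and skipped when exclude_commas is True.
    """
    spans = []
    n = len(words)
    for i in range(n):
        for j in range(i, min(i + max_len, n)):
            if exclude_commas and "," in words[i + 1:j]:
                continue  # comma in the interior: argument-impossible
            spans.append((i, j))
    return spans

words = ["Dozens", "were", "wounded", ",", "officials", "said"]
spans = enumerate_candidate_spans(words)
```

Each resulting (i, j) pair is a candidate for role classification against every role r in R_e.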

Role-interactive Encoder
Role Type Representation In order to capture semantic relevance among roles, we add role type information into the input sequence and let context and roles interact through multi-head attention, which yields context and role representations in a shared knowledge space. Specifically, we construct latent embeddings of roles with different special tokens in the pre-trained model, where each role type has a specific latent representation. Since role names also contain valuable semantic information (Wang et al., 2022), we wrap role names with special role type tokens and take the embedding of the starting special token as the role embedding. Taking the role Place as an example, we finally represent it as [R_0] Place [R_0], where [R_0] is the special role type token of Place.

Role-interactive Encoding For the input document D = {w_1, w_2, ..., w_N}, the target event e and the corresponding role set R_e = {r_1, r_2, r_3, ...}, we concatenate them into a sequence as follows:

S = [CLS] w_1 w_2 ... w_N [SEP] [E_e] [R_1] r_1 [R_1] [R_2] r_2 [R_2] ... [SEP],

where [E_e] is the special event token of event e.
[R_1] and [R_2] are the special role type tokens of r_1 and r_2. We use the last [SEP] to represent the none category. Next, we leverage a pre-trained language model as the encoder to obtain the embedding of each token:

H_s = Encoder(S). (1)

Then we obtain the event representation H_e ∈ R^{1×d} of the start token [E_e], the context representation H_w ∈ R^{l_w×d}, and the role representation H_r ∈ R^{l_r×d} from H_s, where l_w is the length of the word-piece list and l_r is the length of the role list. For input sequences longer than 512 tokens, we leverage a dynamic window to encode the whole sequence and average the overlapping token embeddings across different windows to obtain the final representations. Significantly, through role-interactive encoding, the role embeddings can capture semantic relevance and adapt to the target event and context, which better guides argument extraction.
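The dynamic-window strategy for sequences beyond the encoder limit can be sketched as follows. The window size, stride, and the stand-in encode_fn are illustrative assumptions; in the actual model, each window would be encoded by the pre-trained transformer.

```python
import numpy as np

# Sketch of dynamic-window encoding with overlap averaging.
# encode_fn stands in for the pre-trained encoder; window/stride
# values here are illustrative, not the paper's configuration.

def encode_long_sequence(token_ids, encode_fn, window=512, stride=256):
    """Encode a long token sequence with overlapping windows and
    average the embeddings of tokens covered by multiple windows."""
    n = len(token_ids)
    d = encode_fn(token_ids[:1]).shape[-1]
    total = np.zeros((n, d))
    counts = np.zeros((n, 1))
    start = 0
    while True:
        end = min(start + window, n)
        emb = encode_fn(token_ids[start:end])  # (end - start, d)
        total[start:end] += emb
        counts[start:end] += 1
        if end == n:
            break
        start += stride
    return total / counts  # overlap-averaged token embeddings, (n, d)
```

Tokens covered by two windows receive the mean of the two window-specific embeddings, giving a single representation per token for the rest of the pipeline.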

Span-Trigger-based Contextual Pooling
Argument-impossible Spans Exclusion In order to eliminate the noise of useless spans, we reduce the number of candidate spans by excluding argument-impossible spans, e.g., spans with a comma in the middle. With this improvement, we remove a quarter of candidate spans on average and make our model focus on candidate spans with useful information.

Span-Trigger-based Contextual Pooling For a candidate span ranging from w_i to w_j, most previous span-based methods (Zhang et al., 2020b; Xu et al., 2022) represent it through the average pooling of the hidden states of the tokens within the span:

h_{i:j} = (1 / (j − i + 1)) Σ_{k=i}^{j} H_w^k.

However, the average pooling representation ignores the significant clue information in other non-argument words. Although the self-attention mechanism of the pre-trained encoder can model token-level interaction, such global interaction is not specific to the event and candidate arguments. Therefore, we propose to select and fuse useful contextual information highly related to each tuple consisting of a candidate span and the event trigger word, i.e., (s_{i:j}, t). We directly utilize the attention heads of the pre-trained transformer-based encoder for span-trigger-based contextual pooling, which transfers the well-learned dependencies from the pre-trained language model without learning new attention layers from scratch (Zhou et al., 2021).
Specifically, we use the token-level attention heads A_w ∈ R^{H×l_w×l_w} over the context from the last transformer layer of the pre-trained language model. We then obtain the context attention A^C_{i:j} ∈ R^{l_w} of each candidate span ranging from w_i to w_j with average pooling:

A^C_{i:j} = (1 / (H(j − i + 1))) Σ_{h=1}^{H} Σ_{k=i}^{j} A_w^{h,k}. (2)

Then, for the span-trigger pair (s_{i:j}, t), we obtain the contextual clue information c_{s_{i:j}} ∈ R^d that is important to the candidate span by multiplying the attentions followed by normalization:

p^c_{i:j} = (A^C_{i:j} ∘ A^C_t) / (A^C_{i:j}^T A^C_t), c_{s_{i:j}} = H_w^T p^c_{i:j}, (3)

where A^C_t ∈ R^{l_w} is the contextual attention of the trigger t, p^c_{i:j} ∈ R^{l_w} is the computed attention weight vector over the context, ∘ denotes element-wise multiplication, and T is the transpose symbol.
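The span-trigger contextual pooling of Eq. 2 and Eq. 3 can be sketched with array operations as follows. This is a simplified sketch: in practice the attention tensor and token representations come from the last layer of the pre-trained encoder.

```python
import numpy as np

def stcp_context_vector(A, H_w, i, j, t):
    """Span-trigger-based contextual pooling (a sketch of Eq. 2-3).

    A   : (H, L, L) attention heads from the encoder's last layer
    H_w : (L, d)    token representations of the context
    i, j: inclusive boundaries of the candidate span
    t   : token index of the event trigger
    """
    # Eq. 2: average attention over heads and over the span's tokens.
    a_span = A[:, i:j + 1, :].mean(axis=(0, 1))  # (L,)
    a_trig = A[:, t, :].mean(axis=0)             # (L,)
    # Eq. 3: elementwise product, normalized to a distribution over tokens.
    q = a_span * a_trig
    p = q / q.sum()                              # attention weights p^c_{i:j}
    return H_w.T @ p                             # context vector c, shape (d,)
```

Because the product is taken before normalization, only tokens attended to by both the span and the trigger receive substantial weight, which is what localizes the pooling to the specific span-trigger pair.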

Role-based Latent Information Guidance
The RLIG module constructs latent role embeddings through role-interactive encoding (Sec. 2.1) and performs role information fusion through a pooling operation, which provides valuable latent role information guidance.

Role Information Fusion In order to give each candidate argument useful role information guidance, we modify our span-trigger-based contextual pooling method to select role information adaptively. We obtain the latent role information r_{s_{i:j}} ∈ R^d for s_{i:j} through contextual pooling, by modifying the operations in Eq. 2 and Eq. 3:

p^r_{i:j} = (A^R_{i:j} ∘ A^R_t) / (A^R_{i:j}^T A^R_t), r_{s_{i:j}} = H_r^T p^r_{i:j}, (4)

where A^r ∈ R^{H×l_w×l_r} are the attention heads from context tokens to roles in the last transformer layer of the pre-trained language model, A^R_{i:j} ∈ R^{l_r} is the role attention of each candidate span, A^R_t ∈ R^{l_r} is the role attention of the trigger t, and p^r_{i:j} ∈ R^{l_r} is the computed attention weight vector over roles.
For a candidate span s_{i:j}, we fuse the average pooling representation h_{i:j}, the contextual clue information c_{s_{i:j}} and the latent role information r_{s_{i:j}} as follows:

g_{i:j} = W_1 [h_{i:j}; c_{s_{i:j}}; r_{s_{i:j}}], (5)

where W_1 ∈ R^{3d×d} is a learnable parameter and [;] denotes concatenation.
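The fusion in Eq. 5 is a single learned projection over the concatenated features; a minimal sketch, with a randomly initialized stand-in for the learnable W_1:

```python
import numpy as np

def fuse_span_features(h_avg, c_ctx, r_role, W1):
    """Eq. 5 sketch: project concatenated span features back to d dims.

    h_avg : (d,) average-pooled span representation
    c_ctx : (d,) contextual clue vector from STCP
    r_role: (d,) latent role vector from RLIG
    W1    : (3d, d) learnable fusion matrix (random stand-in here)
    """
    g = np.concatenate([h_avg, c_ctx, r_role])  # (3d,)
    return g @ W1                               # (d,)

d = 4
rng = np.random.default_rng(0)
W1 = rng.standard_normal((3 * d, d))
g = fuse_span_features(np.ones(d), np.zeros(d), np.zeros(d), W1)
```

This keeps the fused span representation at dimension d, so the downstream classifier is unchanged regardless of how many feature sources are concatenated.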

Classification Module
Boundary Loss Since we extract arguments at the span level, whose boundaries may be ambiguous, we construct start and end representations with fully connected neural networks to enhance the representation of candidate spans:

H^{start} = FFN_{start}(H_s), H^{end} = FFN_{end}(H_s),

where H_s is the hidden representation of the input sequence S. On this basis, we enhance the start and end representations by integrating context and role information with span-trigger-based contextual pooling:

h̃^{start}_{i:j} = W_2 [h^{start}_i; [H_w; H_r]^T p_{i:j}], h̃^{end}_{i:j} = W_3 [h^{end}_j; [H_w; H_r]^T p_{i:j}],

where h^{start}_i and h^{end}_j are the i-th and j-th vectors of H^{start} and H^{end}, p_{i:j} is the attention weight vector over both context and roles, computed similarly to Eq. 3 and Eq. 4, and W_2, W_3 ∈ R^{2d×d} are learnable parameters. Then we obtain the final representation s_{i:j} of a candidate span:

s_{i:j} = W_s [g_{i:j}; h̃^{start}_{i:j}; h̃^{end}_{i:j}],

where W_s ∈ R^{3d×d} is a learnable model parameter.
Finally, following (Xu et al., 2022), the boundary loss is defined to detect the start and end positions:

L_b = − Σ_{i=1}^{l_w} [ y^s_i log P^s_i + (1 − y^s_i) log(1 − P^s_i) + y^e_i log P^e_i + (1 − y^e_i) log(1 − P^e_i) ],

where y^s_i and y^e_i denote the golden labels, and P^s_i = sigmoid(W_4 h^{start}_i) and P^e_i = sigmoid(W_5 h^{end}_i) are the probabilities of the word w_i being the first or last word of a golden argument span.

Classification Loss For a candidate span s_{i:j} in event e, we concatenate the span representation s_{i:j}, the trigger representation h_t, their absolute difference |h_t − s_{i:j}|, their element-wise multiplication h_t ⊙ s_{i:j}, the event type embedding H_e and the span length embedding E_len, and obtain the prediction P(r_{i:j}) of the candidate span via a feed-forward network:

P(r_{i:j}) = FFN([s_{i:j}; h_t; |h_t − s_{i:j}|; h_t ⊙ s_{i:j}; H_e; E_len]).

Considering that most candidate arguments are negative samples and that the role distribution is imbalanced, we adopt the focal loss (Lin et al., 2017) to make training focus more on useful positive samples, where α and γ are hyperparameters.
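The focal loss over candidate spans can be sketched as follows. The per-class weights and γ correspond to the hyperparameters described in the setup; the toy probabilities in the test are purely illustrative.

```python
import numpy as np

def focal_loss(probs, labels, alpha, gamma=2.0):
    """Multi-class focal loss sketch (Lin et al., 2017).

    probs : (N, C) predicted probabilities for N candidate spans
    labels: (N,)   gold role indices (one class reserved for "no role")
    alpha : (C,)   per-class weights balancing the empty class
                   against the role classes
    """
    p_t = probs[np.arange(len(labels)), labels]  # probability of the gold class
    a_t = alpha[labels]                          # per-sample class weight
    # The (1 - p_t)^gamma factor down-weights well-classified samples.
    return float(np.sum(-a_t * (1.0 - p_t) ** gamma * np.log(p_t)))
```

With γ = 0 and uniform α this reduces to the ordinary cross-entropy; increasing γ shifts the training signal toward hard, misclassified spans.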
Finally, we obtain the training loss consisting of L_c and L_b with hyperparameter λ:

L = L_c + λ L_b. (10)

We adopt BERT_base (Devlin et al., 2019) and RoBERTa_large (Liu et al., 2019) as the pre-trained transformer-based encoders.
Hyperparameters Setting We set the dropout rate to 0.1 and the batch size to 8, and train SCPRG using Adam (Kingma and Ba, 2014) with a learning rate of 3e-5. The hidden dimension d is 768 for SCPRG_base and 1024 for SCPRG_large.
To mitigate the imbalanced role distribution, we set the weight ratio α between the empty class and the other classes to 10:1. We set the hyperparameter γ to 2 and the boundary loss weight λ to 0.1 for both datasets. We train SCPRG for 50 epochs on the RAMS dataset and 100 epochs on the WikiEvents dataset.

Main Results
Table 2 shows the main results on RAMS. Moreover, we further validate SCPRG on WikiEvents and achieve new state-of-the-art performance on both tasks with base and large pre-trained models, as shown in Table 3. Our SCPRG outperforms previous competitive methods such as TSAR and EA2E. Compared with TSAR_large, SCPRG improves argument identification by up to +0.64/+0.58 Head/Coref F1 and argument classification by +1.22/+1.29 Head/Coref F1 on the test set. Besides, SCPRG also outperforms the recent competitive generation-based method EA2E_large on argument identification (+2.64/+0.33 Head/Coref F1) and argument classification (+2.31/+0.38 Head/Coref F1). These improvements demonstrate the advantage of our framework, which fuses argument-event specific context information with the helpful guidance of latent role information.

Ablation Study
To better illustrate the capabilities of our components, we conduct an ablation study on the RAMS dataset, as shown in Table 4. We also provide ablation results on the WikiEvents dataset in Appendix A.
First, when we remove the span-trigger-based contextual pooling (STCP) module, the Span F1 and Head F1 scores of SCPRG_base / SCPRG_large drop by 1.61/1.43 and 1.42/2.09 respectively on the test set, which indicates that STCP plays a vital role in capturing the non-argument contextual clue information that is crucial for document-level EAE.
Additionally, when removing the role-based latent information guidance (RLIG) module, the performance of SCPRG_base / SCPRG_large drops sharply, by 1.03/1.04 Span F1 and 1.58/1.2 Head F1 on the RAMS test set. This suggests that RLIG effectively guides argument extraction with meaningful latent role representations containing semantic relevance among roles. When removing both STCP and RLIG, the performance decay exceeds that of removing either single module, which shows that the two modules work together to improve performance.
Moreover, when removing the argument-impossible span exclusion (ASE) operation, both SCPRG_base and SCPRG_large suffer a performance drop, which indicates that excluding argument-impossible candidate spans eliminates noise information and contributes to argument extraction. Focal loss helps to balance positive and negative samples and facilitates smooth convergence during training, although it does not by itself improve the final performance of the model.

Analysis of Context Attention Weights
To assess the effectiveness of STCP in capturing useful contextual information for candidate arguments, we visualize the contextual weights p^c_{i:j} in Eq. 3 for the example in Figure 1. As shown in Figure 5, our STCP gives high weights to non-argument words such as attack, responsibility and terrorist attack, which are most relevant to the span-trigger pair (Islamic State, wounded). Interestingly, STCP also gives relatively high attention weights to words in other arguments, such as explosive, Dozens and Kabul, which means that these argument words provide important information for the role prediction of Islamic State. The visualization demonstrates that STCP can not only capture non-argument clue information related to candidate spans, but also model the information interaction among related arguments in an event.
Additionally, we also explore the attention weights based on different span-trigger pairs in an event.In Figure 4, we randomly select 30 candidate spans in an event and draw the heat map based on their attention weights to the context.The heat map shows that different candidate arguments focus on different context information, indicating that our STCP can adaptively select contextual information according to candidate argument spans.

Analysis of Role Information Guidance
To verify that our model can capture semantic relevance among roles, we visualize the cosine similarity between latent role representations from two events in the RAMS dataset in Figure 6. As the figure shows, the roles origin and destination, and attacker and target, have similar representations, which agrees with their semantics and demonstrates that our model can capture the semantic relevance among roles.
Moreover, to verify the beneficial guidance of role representations, we display a t-SNE (van der Maaten and Hinton, 2008) visualization of arguments belonging to two different roles that co-occur in 5 different documents, along with the corresponding latent role embeddings. As Figure 7a shows, arguments belonging to the same role in different documents are scattered over the whole embedding space due to their different target events and contexts. Notably, once fused with latent role embeddings (Figure 7b), the representations of arguments belonging to victim or place become more adjacent, which illustrates that our RLIG provides beneficial latent role information guidance.

Analysis of Complexity and Compatibility
SCPRG is a simple but effective framework for document-level EAE, where both STCP and RLIG introduce few parameters. Specifically, STCP leverages the well-learned attention heads from the pre-trained encoder and only applies multiplication and normalization operations, introducing about 0.28% new parameters, as shown in Table 4. RLIG only introduces about 0.3% new parameters in the role embedding layer and the feature fusion layer. This makes the parameter count of our model approximately that of the transformer-based encoder plus an MLP classifier.
Additionally, the two proposed techniques, STCP and RLIG, are highly transferable and can be easily applied to other event extraction models that leverage the attention heads of a pre-trained transformer encoder such as BERT.

Sentence-level Event Extraction
Previous approaches focus on extracting the event trigger and its arguments from a single sentence. (Chen et al., 2015) first propose a neural pipeline model for event extraction, and (Nguyen et al., 2016; Nguyen and Grishman, 2015; Liu et al., 2017; Zhou et al., 2020) further extend the pipeline model to recurrent and convolutional neural networks. To model the dependency of words in a sentence, (Liu et al., 2018; Yan et al., 2019; Fernandez Astudillo et al., 2020) leverage dependency trees to model semantic and syntactic relations. (Wadden et al., 2019) enumerate all possible spans and construct span graphs with graph neural networks to propagate information. Methods using transformer-based pre-trained models (Wadden et al., 2019; Wang et al., 2019; Tong et al., 2020; Lu et al., 2021; Liu et al., 2022) also achieve remarkable performance.

Document-level Event Extraction
In real-world scenarios, a large number of event elements are expressed across sentences, and therefore recent works have begun to explore document-level event extraction (DEE). DEE focuses on extracting event arguments from an entire document and faces the challenge of long-distance dependency (Wang et al., 2022; Xu, 2022).
For document-level EAE, the key step of DEE, most previous works fall into three categories: (1) tagging-based methods; (2) span-based methods; (3) generation-based methods. (Wang et al., 2021; Du and Cardie, 2020a) utilize the sequence labeling model BiLSTM-CRF (Zhang et al., 2015) for DEE. (Zheng et al., 2019) propose a transformer-based architecture and model DEE as a serial prediction paradigm, where arguments are predicted in a predefined role order. Based on their architecture, (Xu et al., 2021) construct a heterogeneous graph and a tracker module to capture the interdependency among events. However, tagging-based methods are inefficient because they are restricted to extracting individual arguments, and earlier extractions do not take later extraction results into account. (Yang et al., 2021) propose an encoder-decoder framework that extracts structured events in a parallel manner. Besides, (Ren et al., 2022) integrate argument roles into document encoding to make tokens aware of multiple role information for the nested-argument problem. Other span-based methods (Ebner et al., 2020; Zhang et al., 2020b) predict argument roles for candidate text spans with a maximum length limitation. Moreover, (Xu et al., 2022) propose a two-stream encoder with an AMR-guided graph to solve the long-distance dependency problem. From another perspective, (Li et al., 2021) formulate the problem as conditional generation, (Du et al., 2021) regard it as a sequence-to-sequence task, and (Wei et al., 2021) reformulate it as a reading comprehension task.

Conclusion
In this paper, we propose a novel SCPRG framework for document-level EAE that mainly consists of two compact, effective and transplantable modules. Specifically, our STCP adaptively aggregates the information of non-argument clue words, and RLIG provides latent role information guidance containing semantic relevance among roles. Experimental results show that SCPRG outperforms existing state-of-the-art EAE models, and further analyses demonstrate that our method is both effective and explainable. In future work, we hope to apply SCPRG to more information extraction tasks where contextual information plays a significant role, such as relation extraction and multilingual extraction.

Limitations
Although our experiments prove the superiority of our SCPRG model, it is only applicable to document-level EAE tasks with known event triggers, because both STCP and RLIG compute the attention product of the trigger and candidate spans. However, in real-life scenarios, event triggers are not always available. In view of this problem, we have a preliminary solution and plan to improve our model in future work. The core idea of our method is to select and integrate context and role information based on candidate arguments and target events. Based on this idea, we briefly suggest two solutions to the above limitation. First, the model could predict the best candidate trigger words. Second, trigger words could be replaced with special event tokens. In future work, we plan to extend our model to document-level EAE tasks without trigger words and evaluate it through extensive experiments.

Figure 2: Visualization of the co-occurrence frequency among the 15 most frequent roles on the RAMS test set. The co-occurrence count of each role with itself is set to zero. The full figure is included in Appendix B.

Figure 4: Visualization of attention weights over the context for different candidate spans in an event.
Figure 5: Context weights of an example from RAMS. We visualize the weights of context tokens based on the span-trigger pair (Islamic State, wounded), using different shades of color to represent attention weights.

Figure 6: Visualization of the cosine similarity between role representations from two examples in the RAMS dataset.
(a) Without latent role guidance. (b) With latent role guidance.

Figure 7: A t-SNE visualization example from RAMS, where the embeddings of arguments and roles come from 5 different documents. We use average pooling representations encoded by BERT for arguments in (a) and representations fused with latent role embeddings in (b).

Table 1: Detailed statistics of the two datasets.

Table 2: Main results on RAMS.

Table 3: Main results on WikiEvents.