Contextualized Soft Prompts for Extraction of Event Arguments

Event argument extraction (EAE) is a sub-task of event extraction where the goal is to identify the roles of entity mentions for events in text. Current state-of-the-art approaches for this problem explore prompt-based methods to prompt pre-trained language models for arguments over input context. However, existing prompt-based methods mainly rely on discrete and manually designed prompts that cannot exploit the specific context of each example to improve customization for optimal performance. In addition, the discrete nature of current prompts prevents the incorporation of relevant context from multiple external documents to enrich prompts for EAE. To this end, we propose a novel prompt-based method for EAE that introduces soft prompts to facilitate the encoding of individual example context and multiple relevant documents to boost EAE. We extensively evaluate the proposed method on benchmark datasets for EAE to demonstrate its benefits with state-of-the-art performance.


Introduction
As an important task in Information Extraction (IE), Event Argument Extraction (EAE) aims to recognize event arguments and their roles for given event mentions in text. For example, in the text "On the morning of 1 March 2019, Taliban gunmen and suicide bombers attacked Camp Shorabak." with the event trigger "attacked" of type Conflict.Attack, the goal of EAE systems is to identify "gunmen" and "bombers" as the Attacker arguments, and "Camp Shorabak" as the Target. Along with event detection, EAE has important applications for different natural language processing (NLP) tasks.
EAE research progress has been accelerated by deep learning architectures that significantly boost extraction performance. Early deep learning models for EAE followed traditional approaches that formulate EAE as a classification or sequence labeling problem (Chen et al., 2015; Nguyen et al., 2016; Sha et al., 2018; Liu et al., 2018; Nguyen and Nguyen, 2018; Pouran Ben Veyseh et al., 2022a). Recently, there has been growing interest in solving EAE in question answering or text generation frameworks to better exploit task-specific information (e.g., labels/descriptions of argument roles) via prompts for pre-trained language models (PLMs). As such, question answering methods for EAE create a question for each argument role to perform span extraction over input context (Du and Cardie, 2020; Liu et al., 2020; Li et al., 2020; Liu et al., 2021a), while text generation models directly consume an input text and an argument-specified prompt/template to generate arguments for each event mention (Li et al., 2021; Zeng et al., 2022). However, a common issue in current prompt-based methods for EAE involves the use of discrete and manually designed prompts to present task information to the models, e.g., event types and argument roles. As such, these prompts often follow pre-defined templates (Li et al., 2021; Ma et al., 2022) that are applied to extract arguments for all events in text. While convenient for human understanding, discrete and pre-defined prompts might not be ideal for all examples, causing sub-optimal performance (Liu et al., 2021b). The discrete nature also makes it challenging to customize prompts for each example in EAE models. Further, due to the employment of PLMs, it has been observed that model performance can be very sensitive to the specific formulations of discrete prompts (Zhao et al., 2021; Ma et al., 2022), leading to instability and reduced reliability when adapting to different datasets.
Another issue of hard prompts for EAE models concerns other relevant examples from the training data that can provide helpful information to support argument prediction for the current input text and event type. As such, a few recent works have retrieved related examples for an input text to combine with hard prompts to improve EAE (Du et al., 2022; Du and Ji, 2022). However, due to the input length limit of PLMs, the number of relevant examples in the prompts is also constrained, making it impossible to fully leverage their advantages to boost performance.
To this end, our work proposes a novel prompt-based method for EAE where learnable soft prompts are explicitly introduced to enable prompt customization for examples, stability improvement, and incorporation of relevant example context. In particular, based on the architecture of generative PLMs, our model directly utilizes input example representations to compute soft prompts for EAE, thus allowing the prompts to be specifically designed for each example for better customization. In addition, soft prompts facilitate the accumulation of representations of relevant examples for an input event type to consume more examples for richer prompts. To exploit this flexibility of soft prompts, our model extensively considers relevant examples as the texts in the training data that contain event types similar to those of an input text, leading to comprehensive external event context to aid EAE. Accordingly, we introduce a graph structure to capture mentioning relations between documents and event types. This graph is then fed into graph neural networks to facilitate representation aggregation of relevant documents with similar events for soft prompt computation. We evaluate the proposed model for EAE on the benchmark datasets RAMS and WIKIEVENTS. The results demonstrate the benefits of the proposed method, leading to state-of-the-art performance for EAE.

Model
In EAE, given an event trigger/mention in a document, we need to identify the argument spans and their roles for the event. For convenience, let $D = \{D_1, \ldots, D_{|D|}\}$ be the set of documents in the training data and $e_k \in D_i$ be the current event trigger in document $D_i$ with event type $et$ for EAE.
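To make the notation concrete, a single training instance can be pictured as follows; the dictionary format below is purely illustrative, not the datasets' actual schema:

```python
# One hypothetical EAE instance: document D_i, trigger e_k with event
# type et, and the gold (role, span) pairs the model should extract.
instance = {
    "document": "On the morning of 1 March 2019, Taliban gunmen and "
                "suicide bombers attacked Camp Shorabak.",
    "trigger": "attacked",            # e_k
    "event_type": "Conflict.Attack",  # et
    "arguments": {
        "Attacker": ["gunmen", "bombers"],
        "Target": ["Camp Shorabak"],
    },
}
```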
Relevant Context Aggregation: Our EAE model follows the prompt-based framework of (Ma et al., 2022) where a prompt is created for each event type and fed into the pre-trained language model BART (Lewis et al., 2020) to perform span extraction for the argument roles. In contrast to the hard and manually designed prompts in previous work, our model introduces soft prompts with example customization and relevant example aggregation to boost performance. In particular, given the current event trigger $e_k \in D_i$, our model first aims to aggregate context representations from relevant documents in $D$ for the event type $et$ of $e_k$ to enrich soft prompt computation. Motivated by finding relevant documents via similar event types, we first construct an event-type mentioning graph $G$ between the documents in $D$ and the event types $T = \{t_1, \ldots, t_{|T|}\}$ to facilitate representation aggregation. In particular, the node set $V$ for our graph involves both documents and event types, i.e., $V = D \cup T$. We connect a document $D_u \in D$ and an event type $t_v$ in $G$ only if there exists an event mention of type $t_v$ in $D_u$. In this way, we can link the documents in $D$ via shared event type mentions for convenient representation aggregation with graph neural networks.
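To make the graph construction concrete, here is a minimal sketch in Python; the input format (a mapping from document ids to the event types they mention) and the function name are illustrative assumptions rather than the actual implementation:

```python
def build_mention_graph(doc_event_types):
    """Build the bipartite document/event-type mentioning graph G.

    `doc_event_types` maps a document id D_u to the set of event types
    mentioned in it (an assumed input format). An edge links D_u and t_v
    iff D_u contains an event mention of type t_v.
    """
    edges = []
    for doc_id, event_types in doc_event_types.items():
        for et in sorted(event_types):
            edges.append((doc_id, et))  # undirected edge D_u -- t_v
    return edges

# Two documents that both mention Conflict.Attack become two-hop
# neighbors through the shared event-type node.
docs = {"D1": {"Conflict.Attack", "Life.Die"}, "D2": {"Conflict.Attack"}}
print(build_mention_graph(docs))
```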
To obtain representations for each document $D_u \in D$, we introduce the markers <ET> and </ET> before and after each event trigger word in $D_u$ to generate the marker-augmented document $\tilde{D}_u$. The augmented document is then fed into the encoder of BART to produce a representation for each word in $\tilde{D}_u$ (using the averages of the hidden vectors in the last layer for sub-words). Afterward, the representation $D^0_u$ for $D_u \in D$ is computed by mean pooling over the representations of the <ET> markers of event triggers, aiming to retain event-focused context in the representation. For the event types $t_v \in T$, we initialize their representations $t^0_v$ randomly. Afterward, the graph $G$ and the representations $D^0_u$ and $t^0_v$ for documents and event types are consumed by a graph attention network (Veličković et al., 2017) to aggregate the representations via the connections in $G$, producing richer representations $D^L_u$ and $t^L_v$ for $D_u$ and $t_v$ after $L$ layers of transformation. Consequently, we treat the induced representation $t^L_{et}$ for the current event type $et$ as the aggregation of context information of relevant documents for prompt computation for $e_k \in D_i$ in the next steps.
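The aggregation step can be sketched as follows with a graph attention network; this uses torch-geometric's GATConv as a stand-in for the paper's GAT, and the layer configuration and class name are assumptions:

```python
import torch
from torch_geometric.nn import GATConv  # stand-in GAT implementation

class MentionGraphAggregator(torch.nn.Module):
    """Sketch of the L-layer graph attention aggregation over G.

    Document nodes start from mean-pooled <ET> marker representations
    (D^0_u); event-type nodes start from random embeddings (t^0_v).
    Dimensions and layer count are illustrative assumptions.
    """
    def __init__(self, d, num_event_types, num_layers=2):
        super().__init__()
        self.type_embed = torch.nn.Embedding(num_event_types, d)  # t^0_v
        self.layers = torch.nn.ModuleList(
            [GATConv(d, d) for _ in range(num_layers)]
        )

    def forward(self, doc_reps, edge_index):
        # doc_reps: (num_docs, d); edge_index: (2, E) with both edge
        # directions included for the undirected graph G.
        x = torch.cat([doc_reps, self.type_embed.weight], dim=0)
        for conv in self.layers:
            x = torch.relu(conv(x, edge_index))
        num_docs = doc_reps.size(0)
        return x[:num_docs], x[num_docs:]  # D^L_u, t^L_v
```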
Soft Prompt Computation: For convenience, let $e_k$ also denote the representation for the event trigger $e_k$ after the marker-augmented document $\tilde{D}_i$ for $D_i$ is encoded by the BART encoder in the previous step. Our soft prompt for EAE for the trigger $e_k \in D_i$ will be a matrix $P_{soft}$ of size $M_s \times d$, where $M_s$ is a hyper-parameter and $d$ is the dimension of the hidden vectors in our core model BART, thus allowing $P_{soft}$ to be integrated into the computation of BART. To achieve a customized soft prompt $P_{soft}$ for $e_k$ in our model, the contextual representation $e_k$ for $e_k$ will be utilized to compute $P_{soft}$. In addition, as discussed above, $P_{soft}$ will also be conditioned on the aggregation $t^L_{et}$ of relevant context representations for the event type $et$ of $e_k$ to enrich the prompt and facilitate argument extraction. To this end, we utilize a learnable feed-forward network $FF$ to transform the concatenation of $e_k$ and $t^L_{et}$ into a vector of size $M_s \times d$. This vector will then be reshaped to form our soft prompt $P_{soft}$.

Prompt-based EAE: While soft prompts enable example customization and relevant context incorporation for the models, we also retain hard prompts to explicitly specify the expected argument roles for span extraction. In particular, to achieve a fair comparison, we utilize the same hard prompt for each event type as in previous work (Li et al., 2021; Ma et al., 2022), which connects all argument roles with natural language. For example, for the event type Life.Consume.Unspecified, the hard prompt to indicate argument roles is: <ConsumingEntity> consumed <ConsumedThing> at <Place> place.
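A minimal sketch of the soft prompt computation described above; the hidden layer sizes of $FF$ and the module names are our assumptions:

```python
import torch

class SoftPromptGenerator(torch.nn.Module):
    """Map (e_k, t^L_et) to the soft prompt P_soft of shape (M_s, d)."""

    def __init__(self, d, m_s):
        super().__init__()
        self.m_s, self.d = m_s, d
        # FF: concatenated (2d) -> M_s * d; the intermediate size is assumed.
        self.ff = torch.nn.Sequential(
            torch.nn.Linear(2 * d, d),
            torch.nn.ReLU(),
            torch.nn.Linear(d, m_s * d),
        )

    def forward(self, trigger_rep, type_rep):
        # trigger_rep = e_k, type_rep = t^L_et, both of shape (d,).
        flat = self.ff(torch.cat([trigger_rep, type_rep], dim=-1))
        return flat.view(self.m_s, self.d)  # P_soft
```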
For each event type $t$, we send its hard prompt into the embedding layer of BART to obtain a representation for each word (i.e., using the averages of the embeddings for sub-tokens), leading to the hard prompt representation $P^t_{hard}$ of size $M^t_h \times d$ ($M^t_h$ is the length of the hard prompt for $t$). We then concatenate the soft and hard prompt representations to create a single prompt for EAE with BART, i.e., $Pr_t = [P^t_{hard}, P_{soft}]$ of size $(M_s + M^t_h) \times d$. Next, we follow the prompt-for-extraction framework in (Ma et al., 2022), using the BART encoder to encode the input context for $D_i$ while the prompt $Pr_t$ is employed to prompt the BART decoder for span extraction. Given the current trigger $e_k \in D_i$, we first inject the trigger markers <ETi> and </ETi> before and after $e_k$ in $D_i$ to create an input context $D'_i$, which is then encoded by the BART encoder to return a sequence of representations $D^{enc}_i$ for the words in $D'_i$. In the next step, the BART decoder is employed to learn richer representations for the context and prompt using their interactions via cross-attention over multiple layers, returning the representations $D^{dec}_i = \text{BART-Decoder}(D^{enc}_i; D^{enc}_i)$ and $\hat{Pr}_t = \text{BART-Decoder}(Pr_t; D^{enc}_i)$ for the context and prompt, respectively.
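The prompting step can be sketched with the Hugging Face BART interface as below; the marker handling, the hard prompt text, and the stand-in for $P_{soft}$ are simplified assumptions (e.g., the markers would normally be registered as special tokens):

```python
import torch
from transformers import BartModel, BartTokenizer

model = BartModel.from_pretrained("facebook/bart-base")
tok = BartTokenizer.from_pretrained("facebook/bart-base")
d = model.config.d_model

# Marker-augmented input context D'_i (markers shown as plain text here).
context = "Taliban gunmen and suicide bombers <ET> attacked </ET> Camp Shorabak."
d_enc = model.get_encoder()(**tok(context, return_tensors="pt")).last_hidden_state

# Hard prompt embeddings P^t_hard from BART's embedding layer
# (an illustrative prompt for an attack-style event type).
hard_ids = tok("<Attacker> attacked <Target> at <Place> place",
               return_tensors="pt").input_ids
p_hard = model.get_input_embeddings()(hard_ids)

# Stand-in for the computed soft prompt P_soft with M_s = 10.
p_soft = torch.randn(1, 10, d)
prompt = torch.cat([p_hard, p_soft], dim=1)  # Pr_t

# The decoder cross-attends from the prompt to the encoded context.
prompt_rep = model.get_decoder()(
    inputs_embeds=prompt, encoder_hidden_states=d_enc
).last_hidden_state  # decoded prompt representation
```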
Afterward, for the $j$-th argument role of event type $t$, we obtain a role representation $\phi^t_j$ by mean-pooling its corresponding sub-token representations in the prompt representation $\hat{Pr}_t$. Similar to (Ma et al., 2022), we employ two selection heads $s_{start}$ and $s_{end}$ (of $d$ dimensions) to compute the start and end span selectors $\phi^{t,start}_j = \phi^t_j \odot s_{start}$ and $\phi^{t,end}_j = \phi^t_j \odot s_{end}$ ($\odot$ is element-wise multiplication). Each span selector tuple $\theta^t_j = (\phi^{t,start}_j, \phi^{t,end}_j)$ then aims to select at most one argument span for the $j$-th role of $t$. Here, the gold span for this role is denoted by $(a^{t,start}_j, a^{t,end}_j)$; it is set to $(0, 0)$ if the event $e_k$ does not have an argument of this role in $D_i$. As such, the extractive prompt framework is utilized to estimate distributions over the token positions in $D_i$ for how likely each token would serve as the start/end position of the argument span for the role: $P^{t,start}_j = \text{softmax}(D^{dec}_i \phi^{t,start}_j)$ and $P^{t,end}_j = \text{softmax}(D^{dec}_i \phi^{t,end}_j)$.

Finally, to train our model, we optimize the loss $L = -\sum_{e_k \in D} \sum_j (\log P^{t,start}_j(a^{t,start}_j) + \log P^{t,end}_j(a^{t,end}_j))$ (i.e., over all events in $D$).

Inference: At inference time, given an input text, event type, and argument role, we consider all possible argument spans for the role, ensuring that the start indexes are smaller than the end indexes (including $(0, 0)$) and that the span lengths do not exceed a maximum value computed over the training data. A score for each span is obtained as $\log P^{t,start}_j(a^{t,start}_j) + \log P^{t,end}_j(a^{t,end}_j)$.
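The selector computation, training loss, and inference-time span scoring can be summarized with the following sketch; tensor shapes and helper names are illustrative assumptions:

```python
import torch

def role_distributions(prompt_rep, context_rep, role_slice, s_start, s_end):
    """Start/end distributions for the j-th role of event type t.

    prompt_rep: decoded prompt representations, shape (M, d);
    context_rep: decoded context D^dec_i, shape (n, d);
    role_slice: positions of the role's sub-tokens in the prompt.
    """
    phi = prompt_rep[role_slice].mean(dim=0)       # role representation
    p_start = torch.softmax(context_rep @ (phi * s_start), dim=-1)
    p_end = torch.softmax(context_rep @ (phi * s_end), dim=-1)
    return p_start, p_end

def role_loss(p_start, p_end, gold_start, gold_end):
    # Negative log-likelihood of the gold (start, end); (0, 0) if no argument.
    return -(torch.log(p_start[gold_start]) + torch.log(p_end[gold_end]))

def best_span(p_start, p_end, max_len):
    # Enumerate candidate spans with start < end and bounded length,
    # plus the empty span (0, 0), and keep the highest-scoring one.
    n = p_start.size(0)
    cands = [(0, 0)] + [(s, e) for s in range(1, n)
                        for e in range(s + 1, min(s + 1 + max_len, n))]
    score = lambda se: (torch.log(p_start[se[0]]) +
                        torch.log(p_end[se[1]])).item()
    return max(cands, key=score)
```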
The span with the highest score is chosen for prediction. Finally, the aggregations $t^L_{et}$ of relevant context representations for event types, which are learned during training, are reused at test time.

Experiments
Datasets and Hyper-parameters: Following previous work (Li et al., 2021; Ma et al., 2022), we employ two recent datasets for EAE to evaluate our model, i.e., RAMS (Ebner et al., 2020) and WIKIEVENTS (Li et al., 2021). Both datasets involve multiple events per document, where arguments can be distributed over sentences different from those of the event triggers. We utilize the same train/dev/test splits, data pre-processing, and evaluation metrics for the datasets as in previous work (Ma et al., 2022) for fair comparison. In particular, our metrics include the Argument Identification F1 score (Arg-I) and the Argument Classification F1 score (Arg-C). For WIKIEVENTS, we also use the Argument Head F1 score (Head-C), which only considers headword matching for arguments. Finally, we fine-tune the hyper-parameters for our model on the development data.
Comparison: We compare our method (called SPEAE for Soft Prompts for EAE) with state-of-the-art models for EAE. In particular, we consider two groups of baselines: (i) text generation-based models: BART-Gen (Li et al., 2021), and (ii) question answering-based models: FEAE (Wei et al., 2021), DocMRC (Liu et al., 2021a), EEQA (Du and Cardie, 2020), EEQA-BART (Du and Cardie, 2020), EA2E (Zeng et al., 2022), and PAIE (Ma et al., 2022). The performance of EA2E is obtained by running the provided code over our pre-processed data using the same evaluation metrics for a fair comparison. The performance of the other baselines is inherited from (Ma et al., 2022), which presents the model PAIE with the current best-reported results for our datasets.

Table 1 shows the performance of the methods over the test datasets along with their corresponding PLM versions. The most important observation from the table is that the proposed method SPEAE significantly outperforms the baseline methods (with p < 0.01) for both the base and large versions of the PLMs (i.e., BERT and BART). This clearly demonstrates the benefits of the proposed method for EAE with contextualized soft prompts for instances and relevant context.

Ablation Study: To reveal the contribution of the designed components in SPEAE, we perform an ablation study over the WIKIEVENTS test data. Table 2 presents the performance of the ablated models.
In particular, for the soft prompts, we first exclude either the example-specific context representation $e_k$ or the relevant context aggregation $t^L_{et}$ from the computation of the soft prompt $P_{soft}$. As the performance of the resulting models is significantly worse than that of SPEAE, this clearly testifies to the importance of these components for prompt-based models for EAE. The performance is further degraded when the soft prompt $P_{soft}$ is completely eliminated from the prompt, thus suggesting the effectiveness of soft prompts for EAE. Additionally, instead of computing the relevant context aggregations $t^L_{et}$ for event types with a graph neural network, we explore a variant that directly obtains $t^L_{et}$ from the average representation of the documents in $D$ that contain an event mention of type $et$. The worse performance of this ablated model clearly confirms the benefits of the graph neural network for representation aggregation of relevant documents/event types for soft prompt computation for EAE.
Low-resource Learning: To better understand the operation of the proposed model SPEAE under low-resource training settings, we evaluate the models when different ratios of the training data are employed for training. In particular, we compare SPEAE with the previous state-of-the-art models, i.e., EEQA (Du and Cardie, 2020), EEQA-BART (Du and Cardie, 2020), and PAIE (Ma et al., 2022), in this low-resource learning experiment. Table 3 reports the performance of the models (based on Arg-C) on the development data of WIKIEVENTS. As can be seen from the table, the proposed model SPEAE is significantly better than the baseline methods over different ratios of training data, ranging from 1% to 50%. This clearly highlights the advantages of our proposed method for low-resource learning settings.
We attribute these advantages to the introduction of context information from the current example and relevant documents to enrich the soft prompts, allowing SPEAE to better utilize the available training data to boost performance.
Stability Study: One of the major issues with the discrete prompts in previous EAE models is that model performance can be sensitive to the specific formats of the hand-designed prompts (Zhao et al., 2021; Ma et al., 2022). This raises an important concern for the application of EAE models to different datasets and problems, as the optimal formats of the prompts might be unclear for new datasets, necessitating further laborious efforts for prompt development and selection. To understand the sensitivity/stability of EAE models over different formats of discrete prompts, this experiment explores three variants of discrete prompts for EAE as discussed in (Ma et al., 2022), i.e., Manual Template, Uncontextualized Soft Prompt, and Concatenate Template. In particular, Manual Template (MA) involves the discrete prompts utilized in our work and in (Li et al., 2021) and (Ma et al., 2022); it concatenates all argument roles for an event type using natural language. For Uncontextualized Soft Prompt (USP), the prompts link argument roles with role-specific special tokens (Qin and Eisner, 2021; Liu et al., 2021b). These tokens are associated with learnable embedding vectors that help transform the discrete prompts into representation vectors for further computation. Here, a key difference between these embeddings for argument role-specific tokens and our soft prompts is that our soft prompts are contextualized over the current example context and relevant documents. In contrast, the learnable embeddings for role-specific tokens in USP are only initialized randomly, and are thus unable to contextualize over example context for better customization and richer prompts as our soft prompts do. Finally, in Concatenate Template (CA), all argument role names for an event type are simply concatenated to form prompts (Ma et al., 2022).
Using these three variants of discrete prompts, we compare the performance (based on Arg-C) of our proposed model SPEAE and the current state-of-the-art discrete-prompt model PAIE for EAE. Table 4 presents model performance on the RAMS development data. It is clear from the table that the proposed model SPEAE performs significantly better than PAIE over the different variants of discrete prompts, thus further demonstrating the benefits of SPEAE. Importantly, while PAIE exhibits diverse performance gaps across different discrete prompts, SPEAE maintains more stable performance. This suggests an important advantage of SPEAE: it is less sensitive to specific discrete prompt formats, enabling convenient extension to new applications with less development effort for prompt design.

Related Work
Multiple methods have been introduced to solve EAE, including early feature-based methods (Li et al., 2013; Yang and Mitchell, 2016) and recent deep learning models (Chen et al., 2015; Nguyen et al., 2016; Liu et al., 2018; Lin et al., 2020; Nguyen et al., 2022). While most previous methods have focused on classification frameworks (Pouran Ben Veyseh et al., 2020; Nguyen et al., 2021; Pouran Ben Veyseh et al., 2022b), PLMs have enabled recent formulations of EAE via question answering (Du and Cardie, 2020; Liu et al., 2020; Li et al., 2020; Liu et al., 2021a; Wei et al., 2021) or text generation (Paolini et al., 2021; Li et al., 2021; Lu et al., 2021; Zeng et al., 2022) paradigms. At the core of such models are questions/prompts that specify argument roles to prompt PLMs. However, the questions/prompts in previous methods are mainly discrete and hand-designed, making it hard to customize them for each example and to incorporate various relevant contexts.

Conclusion
We introduce a new prompt-based method for EAE that features soft prompts to achieve example customization and relevant context augmentation to enrich prompts. Extensive experiments demonstrate the advantages of the proposed method for EAE. In the future, we will explore soft prompts for other problems in IE to better understand their benefits.

Limitations
In this work, we propose a novel method for EAE that introduces learnable soft prompts to capture example-specific context and relevant documents for prompt customization and enrichment. Although the experimental results have demonstrated the benefits of the proposed model, there are several limitations that can be addressed for further improvement in future work. First, similar to previous EAE studies (Du and Cardie, 2020; Li et al., 2021; Ma et al., 2022), our EAE model assumes golden event triggers for event types, which might not be available in real-world applications. As such, future work can develop more comprehensive research and models to accommodate predicted event triggers while still maintaining competitive performance for EAE. Second, to aggregate relevant document representations for soft prompt computation, our EAE method leverages an event-type mentioning graph that captures documents, event types, and their occurrences in the training data. On the one hand, the graph does not involve argument roles, which are directly related to EAE and might provide richer information/context for representation aggregation to augment soft prompts. On the other hand, our method only explores graph attention networks to perform representation aggregation, while many other variants of graph neural networks have not been considered, e.g., deep graph convolutional networks (Chen et al., 2020). Future work can explore richer graphs and graph neural networks to learn better representations for soft prompts for EAE. Third, despite the introduction of soft prompts with important benefits, our method still needs to rely on discrete prompts to explicitly specify event types and argument roles. Although our experiments demonstrate better stability of the proposed method across different discrete prompt variants, adapting our method to new languages will still require some prompt development effort to achieve optimal performance. Finally, in contrast to the interpretability of discrete prompts, soft prompts are less explainable, which can be addressed in future work to make the proposed method more accessible to various users.

Table 1: Model performance on test data.

Table 2: Ablation study on test data.

Table 3: Low-resource learning performance (Arg-C) of the models on the development data of WIKIEVENTS. Models are trained on different ratios of the training data.

Table 4: Model performance over the development data of RAMS using different variants of discrete prompts: MA (Manual Template), USP (Uncontextualized Soft Prompt), and CA (Concatenate Template). Performance for PAIE is obtained by running the provided code from the original paper for a fair comparison.