Prompt for Extraction? PAIE: Prompting Argument Interaction for Event Argument Extraction

In this paper, we propose PAIE, an effective yet efficient model for both sentence-level and document-level Event Argument Extraction (EAE), which also generalizes well when training data is scarce. On the one hand, PAIE utilizes prompt tuning for extractive objectives to take the best advantage of Pre-trained Language Models (PLMs). It introduces two span selectors based on the prompt to select start/end tokens among the input text for each role. On the other hand, it captures argument interactions via multi-role prompts and conducts joint optimization with optimal span assignments via a bipartite matching loss. Moreover, with a flexible prompt design, PAIE can extract multiple arguments of the same role without conventional heuristic threshold tuning. We have conducted extensive experiments on three benchmarks, covering both sentence- and document-level EAE. The results show promising improvements from PAIE (3.5% and 2.3% F1 gains on average over three benchmarks for PAIE-base and PAIE-large, respectively). Further analysis demonstrates the efficiency, the generalization to few-shot settings, and the effectiveness of different extractive prompt tuning strategies. Our code is available at https://github.com/mayubo2333/PAIE.


Introduction
Understanding text by identifying the event and its arguments has been a long-standing goal in Natural Language Processing (NLP) (Sundheim, 1992). As shown in Fig. 1, we can quickly understand that the document is talking about a Sell event with four involved arguments, i.e., Vivendi (Seller), Universal Studios (Artifact), parks (Artifact), and company (Artifact), where the argument role is given in brackets. Since event detection has achieved great success in recent years (Wang et al., 2021), the main challenge lies in Event Argument Extraction (EAE).

Figure 1: Examples of (top) sentence-level and (bottom) document-level event argument extraction. Trigger words are enclosed in special tokens <t> and </t>. Underlined words denote arguments and arcs denote roles.
Typical efforts in EAE can be roughly classified into two groups. The first group of methods formulates it as a semantic role labeling problem (Wei et al., 2021). There are generally two steps: first identifying candidate spans, then classifying their roles. Although joint models have been proposed to optimize both steps together, the high dependence on candidates may still cause error propagation (Li et al., 2013). In the second group, recent studies follow the success of Pre-trained Language Models (PLMs) and solve EAE by Question Answering (QA)/Machine Reading Comprehension (MRC) (Liu et al., 2021a; Wei et al., 2021; Du and Cardie, 2020; Liu et al., 2020; Li et al., 2020) or Text Generation (Lu et al., 2021; Li et al., 2021). QA/MRC-based models can effectively recognize argument boundaries with role-specific questions, but they must predict arguments one by one. Generation-based methods efficiently generate all arguments, but sequential prediction degrades performance on long-distance and numerous arguments. Besides, the state-of-the-art performance is still unsatisfactory (around 68% F1 on the widely used ACE05 dataset (Doddington et al., 2004)). This raises an interesting question: is there any way to combine the merits of the above methods and boost performance?
This paper targets real scenarios, which require an EAE model to be effective yet efficient at both the sentence and document level, and even under few-shot settings without sufficient training data. To this end, we highlight the following questions:
• How can we extract all arguments simultaneously for efficiency?
• How can we effectively capture argument interactions in long text, without knowing the arguments in advance?
• How can we elicit more knowledge from PLMs to lower the need for annotation?
In this paper, we investigate prompt tuning under an extractive setting and propose a novel method, PAIE (Prompting Argument Interaction for EAE). It extends QA-based models to extract multiple arguments at once while taking the best advantage of PLMs. The basic idea is to design suitable templates that prompt PLMs with all argument roles and obtain role-specific queries for jointly selecting optimal spans from the text. Thus, instead of the (unavailable) gold arguments, each role in the template serves as a slot for interaction, and during learning, PLMs learn to fill these slots with the exact arguments via a matching loss. By predicting all arguments together, PAIE enjoys an efficient and effective learning procedure. Besides, inter-event knowledge transfer between similar role prompts alleviates the heavy annotation cost.
Specifically, for prompting extraction, we design two span selectors based on role prompts, which select start/end tokens among the input text. We explore three types of prompts: manual template, concatenation template, and soft prompt. They perform well at both sentence-level EAE (S-EAE) and document-level EAE (D-EAE) and ease the requirement of exhaustive prompt design. For joint span selection, we design a bipartite matching loss that makes the least-cost match between predictions and ground truth, so that each argument finds its optimal role prompt. It can also handle multiple arguments of the same role via flexible role prompts instead of heuristic threshold tuning. We summarize our contributions as follows:
• We propose a novel model, PAIE, that is effective and efficient for S-EAE and D-EAE, and robust in the few-shot setting.
• We formulate and investigate prompt tuning under extractive settings, with a joint selection scheme for optimal span assignments.
• We have conducted extensive experiments on three benchmarks. The results show promising improvements with PAIE (1.1% and 3.8% absolute F1 gains on average in S-EAE and D-EAE). Further ablation studies demonstrate the efficiency and few-shot generalization of our proposed model, as well as the effectiveness of prompt tuning for extraction.

Related Works
Event Argument Extraction: Event Argument Extraction is a challenging sub-task of event extraction (EE). There have been a great number of studies on EAE since an early stage (Chen et al., 2015; Nguyen et al., 2016; Huang et al., 2018; Yang et al., 2018; Sha et al., 2018; Zheng et al., 2019; Huang and Peng, 2021). A recent trend formulates EAE as an extractive question answering (QA) problem (Du and Cardie, 2020; Liu et al., 2020). This paradigm naturally induces language knowledge from pre-trained language models by converting EAE into well-explored reading comprehension tasks via question templates. Wei et al. (2021) consider the implicit interaction among roles by adding mutual constraints to the template, while Liu et al. (2021a) leverage data augmentation to improve performance. However, these models can only predict roles one by one, which is inefficient and often leads to sub-optimal performance.
To extract all arguments in a single pass, Lu et al. (2021) cast EAE as a sequential generation problem with the help of pre-trained encoder-decoder Transformer architectures such as BART (Lewis et al., 2020) and T5 (Raffel et al., 2020). Li et al. (2021) target generation models by designing specific templates for each event type. In comparison, we prompt argument interactions to guide PLMs and optimize multiple-argument detection with a bipartite matching loss. This not only improves the understanding of long-distance argument dependencies but also enjoys an efficient procedure via prompt-based learning.
Prompt-based Learning: Prompt-based learning is a new paradigm emerging in the field of pre-trained language models (Liu et al., 2021b). Unlike the pre-training and fine-tuning paradigm, which usually requires an additional classifier, prompt-based methods convert downstream tasks into a form more consistent with the model's pre-training tasks. By finding a mapping from particular words to categories, a classification task can be treated as a masked LM task (Schick and Schütze, 2021). Recent work found that small differences in prompt templates may lead to a huge performance gap. Thus, many works explore automatic template generation (Shin et al., 2020), discrete and continuous representations of prompts (Liu et al., 2021c), multiple prompt slots (Qin and Eisner, 2021), etc. Different from the above prompt tuning methods, our proposed method focuses on extraction tasks and prompts PLMs for better span selectors.

Methodology
PAIE considers multiple arguments and their interactions to prompt PLMs for joint extraction. Our model, as illustrated in Fig. 2, contains three core components: prompt creation, span selector decoding, and span prediction. In the following sections, we first formulate prompting for extraction and then describe each component in turn.

Formulating Prompt for Extraction
Existing prompt-based methods mainly focus on classification and generation tasks; conventional extraction objectives are converted into generation. This brings an inefficiency issue: the model has to enumerate all extraction candidates. For example, Cui et al. (2021) design the following prompt for named entity recognition: "[candidate span] is [entity type/not a] entity". The model needs to fill the first slot with candidate entities and check the LM output for the second slot to perform extraction. Can prompt-based methods be applied to extraction directly? The basic idea is similar to classification/generation: comparing slot embeddings with the label vocabulary/input tokens. Here, we give a formulation of a general extractive prompting method and then apply it to EAE as a case study.
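The inefficiency of generation-style extraction can be made concrete: to label one sentence, the filled template must be scored once per candidate span. Below is a minimal sketch (the template string follows Cui et al. (2021); the function name and the span-length cap are our own illustration):

```python
def enumerate_candidates(tokens, max_len=4):
    """Build every filled template a generation-style NER model
    would have to score -- O(L * max_len) LM calls per sentence."""
    filled = []
    for i in range(len(tokens)):
        for j in range(i, min(i + max_len, len(tokens))):
            span = " ".join(tokens[i:j + 1])
            filled.append(f"{span} is [entity type/not a] entity")
    return filled

sentence = "Vivendi plans to sell Universal Studios".split()
templates = enumerate_candidates(sentence)
print(len(templates))  # 18 filled templates for a 6-token sentence
```

An extractive selector, by contrast, scores all tokens of the context in a single forward pass.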
(1) Prompt Creation. Given context X and a series of queries Q = {q_1, q_2, ..., q_K}, we create a joint prompt containing all these queries:

Pt = f_prompt(Q)    (1)

where f_prompt is the prompt creator.

(2) Prompted Selector Decoding. Given a PLM L, context X, and prompt Pt, we decode a query-specific (answering) span selector as follows:

θ_{q_k} = h_L(X, Pt)    (2)

where q_k is the k-th query in the prompt and h_L denotes the outputs of the PLM.

(3) Prompted Span Selection. To find the optimal span, we design two selectors picking the start and end tokens from the context:

(s, e)_{q_k} = g_L(X; θ_{q_k})    (3)

where (s, e)_{q_k} is the span for the k-th query and g_L is the span selector. Such a formulation suits extraction better than generative approaches, mainly because it considers the adjacency constraint of a span.
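The three steps can be sketched end to end with toy vectors (all names and the prompt format are illustrative, not the paper's implementation; step (2) is stubbed with random selectors instead of PLM outputs):

```python
import numpy as np

def f_prompt(queries):
    # Step (1): a single joint prompt containing every query.
    return " ".join(f"<{q}>" for q in queries)

def select_span(h_context, theta_start, theta_end):
    # Step (3): two selectors pick start and end tokens; the adjacency
    # constraint start <= end is enforced by construction.
    s = int(np.argmax(h_context @ theta_start))
    e = s + int(np.argmax(h_context[s:] @ theta_end))
    return s, e

rng = np.random.default_rng(0)
h_ctx = rng.normal(size=(10, 8))   # toy context encoding: L=10 tokens, h=8
queries = ["Seller", "Artifact", "Buyer"]
prompt = f_prompt(queries)
# Step (2) would decode one selector pair per query from the PLM's
# representation of the prompt; random stand-ins are used here.
selectors = {q: (rng.normal(size=8), rng.normal(size=8)) for q in queries}
spans = {q: select_span(h_ctx, ts, te) for q, (ts, te) in selectors.items()}
```

All queries share one forward pass over the context; only the cheap selector products differ per role.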
Task Definition. We formulate EAE as a prompt-based span extraction problem on a dataset D. Given an instance (X, t, e, R^(e)) ∈ D, where X denotes the context, t ⊆ X the trigger word, e the event type, and R^(e) the set of event-specific role types, we aim to extract a set of spans A. Each a^(r) ∈ A is a segment of X and represents an argument of role r ∈ R^(e).

Prompt Creation for EAE
We create a set of prompts for each event type e in dataset D. Each prompt contains all roles r ∈ R^(e). For example in Fig. 2, given event type e as negotiate and R^(e) as {Participant, Topic, Place}, the prompt Pt^(e) may be defined as follows: "Participant communicated with Participant about Topic at Place".
We call the mentions of roles in the prompt slots; there are four slots underlined in this example (and colored in Fig. 2). This design allows our model to capture the implicit interactions among different roles.
To avoid threshold tuning for multiple arguments with the same role, the prompt can flexibly use multiple slots for one role, such as the role Participant in the above example. The number of slots per role is heuristically set to the maximum number of arguments that role takes in the training dataset. We design three different prompt creators f_prompt, mapping a set of roles to a prompt: the manual template, the concatenation template, and the soft prompt. We give one example of each type in Table 1 and list more examples in Appendix A.6. Further analysis can be found in Section 5.2.
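The concatenation-style creator, for instance, reduces to repeating each role's slot up to its maximum training-set count. A minimal sketch (function name and the per-role counts are hypothetical):

```python
def make_concat_prompt(role_counts):
    # Concatenation template: list each role's slot as many times as
    # that role's maximum argument count in the training data.
    slots = []
    for role, n_max in role_counts.items():
        slots.extend([role] * n_max)
    return " ".join(slots)

# For a "negotiate" event, assuming Participant takes at most two
# arguments per event in training, Topic and Place at most one:
print(make_concat_prompt({"Participant": 2, "Topic": 1, "Place": 1}))
# Participant Participant Topic Place
```

A manual template would interleave connective words by hand ("Participant communicated with Participant about Topic at Place"), while a soft prompt replaces the connective words with learnable embeddings.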

Span Selector Decoding
Given context X and prompt Pt, this module generates a role-specific span selector θ_k for each slot k of the prompt. We choose BART (Lewis et al., 2020) as the pre-trained language model L. We first define text markers 〈t〉/〈/t〉 and insert them into context X before and after the trigger word, respectively.
Instead of concatenating the processed context X and prompt P t directly, we feed the context into BART-Encoder and the prompt into BART-Decoder separately, as illustrated in Fig. 2. The prompt and context would interact with each other at the cross-attention layers in decoder module.
H_X = BART-Encoder(X)
H_pt = BART-Decoder(Pt; H_X)

where H_X denotes the event-oriented context representation and H_pt denotes the context-oriented prompt representation. For the k-th slot in the joint prompt, we mean-pool its corresponding representations from H_pt and obtain the role feature ψ_k ∈ R^h, where h denotes the hidden dimension of BART. Note that a role may have multiple slots and, correspondingly, multiple role features and span selectors.
We adopt a simple but effective modification of previous methods by deriving a role-specific span selector θ_k from every role feature in the prompt. Given role feature ψ_k, we have:

θ_k^(start) = ψ_k ∘ w^(start) ∈ R^h
θ_k^(end) = ψ_k ∘ w^(end) ∈ R^h

where θ = [w^(start); w^(end)] ∈ R^{h×2} are learnable parameters shared among all roles, and ∘ represents element-wise multiplication.
θ_k = [θ_k^(start); θ_k^(end)] is exactly the span selector for the k-th slot in the prompt. With only one meta-head θ and simple operations, our method can generate an arbitrary number of role-specific span selectors to extract the related arguments from the context. Recalling the generation process of role feature ψ_k from the prompt representation H_pt, it is clear that both the interaction among different roles and the information aggregation between context and roles are captured under this paradigm.
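A numpy sketch of the meta-head mechanism (shapes follow the text; all values are random stand-ins for the BART encoder/decoder outputs):

```python
import numpy as np

rng = np.random.default_rng(0)
L_ctx, h = 12, 16
H_X = rng.normal(size=(L_ctx, h))   # event-oriented context representation
H_pt = rng.normal(size=(6, h))      # context-oriented prompt representation

# One shared meta-head for all roles: a start vector and an end vector.
w_start, w_end = rng.normal(size=h), rng.normal(size=h)

def span_selector(slot_positions):
    # Mean-pool the slot's prompt vectors into role feature psi_k, then
    # specialize the shared meta-head by element-wise multiplication.
    psi = H_pt[slot_positions].mean(axis=0)
    return psi * w_start, psi * w_end   # theta_k^(start), theta_k^(end)

theta_s, theta_e = span_selector([2, 3])  # a slot covering prompt tokens 2-3
logit_start, logit_end = H_X @ theta_s, H_X @ theta_e  # (L_ctx,) each
```

Because only ψ_k varies per slot, any number of role-specific selectors can be produced from the same two shared parameter vectors.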

Learning with Bipartite Matching
Bipartite matching aims to find the optimal span assignments for all arguments with the least-cost match. It considers two aspects: argument-role matching and same-role argument matching.

Span Prediction. This module handles argument-role matching and detects multiple argument spans for every role simultaneously.
Given the context representation H_X and all role-specific selectors {θ_k}, we follow the extractive prompt formulation in Section 3.1 and calculate, for each role feature, the distribution of each token being selected as the start/end of an argument:

logit_k^(start) = H_X θ_k^(start) ∈ R^L
logit_k^(end) = H_X θ_k^(end) ∈ R^L    (4)

where logit_k^(start) and logit_k^(end) represent the start and end position distributions over the context tokens for slot k, and L denotes the context length.
Then we apply a greedy search over the predicted start and end position distributions to select the locally optimal span for each role-specific selector.
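One plausible reading of this greedy search, with (0, 0) reserved as the "no answer" span (the maximum span length cap is our own assumption):

```python
import numpy as np

def greedy_span(logit_start, logit_end, max_len=8):
    # Score every candidate (s, e) with s <= e < s + max_len and keep
    # the best; (0, 0) doubles as the "no answer" prediction.
    best, best_score = (0, 0), logit_start[0] + logit_end[0]
    for s in range(len(logit_start)):
        for e in range(s, min(s + max_len, len(logit_start))):
            if logit_start[s] + logit_end[e] > best_score:
                best, best_score = (s, e), logit_start[s] + logit_end[e]
    return best

ls = np.zeros(20); ls[5] = 3.0   # start logits peak at position 5
le = np.zeros(20); le[6] = 2.0   # end logits peak at position 6
print(greedy_span(ls, le))  # (5, 6)
```

Each slot runs this search independently, so the whole prompt's arguments are decoded in one pass over the logits.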
Same-role Argument Assignment. For multiple arguments with the same role, we insert multiple slots for this role, and each slot generates one prediction. Matching predictions against the ground truth is then a canonical bipartite matching problem, which we solve following Carion et al. (2020). After finding the optimal assignment σ̂, we compute the probabilities of the start/end positions:

P_k^(start) = Softmax(logit_k^(start))
P_k^(end) = Softmax(logit_k^(end))

Then we define the loss function of slot k as:

L_k = −(log P_k^(start)(s_k) + log P_k^(end)(e_k))

where s_k and e_k represent the ground-truth start/end positions of the argument assigned to slot k.
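A numpy sketch of the per-slot loss, i.e., the cross-entropy of the gold start/end positions under softmax-normalized logits (`log_softmax` is a local helper, not a library call):

```python
import numpy as np

def slot_loss(logit_start, logit_end, s_gold, e_gold):
    # L_k = -(log P_k^(start)(s_k) + log P_k^(end)(e_k))
    def log_softmax(x):
        x = x - x.max()                      # numerically stable softmax
        return x - np.log(np.exp(x).sum())
    return -(log_softmax(logit_start)[s_gold] + log_softmax(logit_end)[e_gold])

ls = np.array([0.0, 4.0, 0.0, 0.0])
le = np.array([0.0, 0.0, 4.0, 0.0])
loss = slot_loss(ls, le, s_gold=1, e_gold=2)  # small: gold positions score high
assert loss < slot_loss(ls, le, 0, 0)         # wrong gold -> larger loss
```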
For inference, our model efficiently follows Eq. 4 to extract the argument of each slot. Since each slot in the prompt predicts at most one span, exhaustive threshold tuning is avoided.

Experiments
In this section, we explore the following questions:
• Can PAIE better utilize PLMs for joint extraction to boost the performance of S-EAE and D-EAE?
• How do different prompt training strategies affect the results?
• How does PAIE perform in various practical settings, including efficiency and generalization to few-shot, long-distance, and multiple arguments?

Experimental Setup
Datasets. We conduct experiments on three common Event Argument Extraction datasets: RAMS (Ebner et al., 2020), WIKIEVENTS (Li et al., 2021), and ACE05 (Doddington et al., 2004). RAMS and WIKIEVENTS are recent document-level EAE benchmarks, while ACE05 is a classical dataset commonly used for sentence-level EAE. Dataset details are given in Appendix A.1.

Evaluation Metric
We adopt two evaluation metrics.
(1) Argument Identification F1 score (Arg-I): an event argument is correctly identified if its offsets and event type match those of any of the argument mentions.
(2) Argument Classification F1 score (Arg-C): an event argument is correctly classified if its role type is also correct. For the WIKIEVENTS dataset, we follow Li et al. (2021) and additionally evaluate the Argument Head F1 score (Head-C), which only concerns the matching of the head word of an argument.

Overall Performance

Table 2 compares our approach with all baselines. We observe that PAIE performs best on all datasets. For S-EAE, our base model achieves an absolute Arg-C improvement of 3.6%. For D-EAE, our base model obtains 2.4% and 6.3% Arg-C gains on RAMS and WIKIEVENTS, respectively. Similarly, our large-version model achieves 4.3% and 3.3% gains. This demonstrates the good generalization ability of our method across varying context lengths. We also find that QA-based models sometimes perform well even on document-level EAE tasks. The EEQA-BART model shows almost the same Arg-C as BART-Gen (Li et al., 2021) on the RAMS dataset. Other QA-based models (especially those considering interactions among arguments, like FEAE (Wei et al., 2021)) are also competitive. On WIKIEVENTS, however, QA-based models are significantly inferior to sequential-generation models.

Next, we conduct further analysis with the strongest baseline, EEQA-BART, and our PAIE. We use the base-version BART for a fair comparison.

Ablation Study
In this section, we investigate the effectiveness of our main components by removing each module in turn. (1) Bipartite matching: we drop the bipartite matching loss and ignore the globally optimal span assignment. (2) Multi-arg prompt: we additionally replace the prompt containing multiple roles with several single templates, each of which includes only one role. (3) Role-specific selector: the selector is no longer role-specific but shared among all roles. This variant degrades to EEQA-BART.
We summarize the results of the ablation studies in Table 3. (1) EEQA-BART outperforms EEQA significantly, which demonstrates that even conventional QA-based methods have substantial room for improvement with a better PLM and span selection strategy.
(2) The role-specific selector further improves Arg-C scores on RAMS and WIKIEVENTS, while having a slightly negative effect on ACE05.
Since the former two datasets are document-level and have more role types (65 in RAMS, 59 in WIKIEVENTS, and 36 in ACE05), we speculate that role-specific selector plays a critical role when identifying and disambiguating roles with complicated ontology structures in long documents.
(3) The joint multi-argument prompt achieves consistent improvements on all three datasets, especially on ACE05 and RAMS. This indicates that the joint prompt can capture implicit interactions among arguments. (4) The bipartite matching loss yields an average improvement of 0.4% and shows stable optimization thanks to its permutation-invariance property, which is further discussed in Appendix A.5.

Table 4 reports average Arg-C scores over 4 random seeds. We can see that concatenating context and prompt slightly impairs model performance, seemingly indicating that over-interaction between context and prompt is not beneficial. Furthermore, when concatenated with the document, the prompt squeezes the limited encoder input length available for the document. These experiments support our strategy of feeding context and prompt to PAIE separately, without concatenation.

Prompt Variants
We investigate how different types of prompts affect performance, as shown in Fig. 3. We find that (1) all three joint prompts outperform the single template (except for the soft prompt on the RAMS dataset), which validates the effectiveness of the joint prompt. (2) The manual template has the most stable performance and usually the best results. (3) The concatenation template achieves results comparable to the manual template. We find this observation inspiring because creating manual templates is laborious, while a simple concatenation prompt largely avoids such a handcrafted process. (4) Somewhat frustratingly, the soft prompt performs relatively poorly and unstably, though still slightly better than the single template on the WIKIEVENTS dataset. This contradicts the current trend of creating continuous prompts, which usually perform better than manual ones. We leave it to future work to explore whether competitive continuous prompts exist for the EAE task.
Analysis on Real Scenarios

Long-range Dependencies
In the D-EAE task, arguments can span multiple sentences, so the model is required to capture long-range dependencies. To evaluate PAIE against other models, we list the performance breakdown by sentence distance between arguments and the given trigger word in Table 5. PAIE significantly improves the extraction of long-distance arguments, especially those occurring after the given trigger word. We conclude that PAIE leverages the implicit interaction among roles: conditioning roles on each other lowers the difficulty of extracting long-distance arguments.

Table 6: Arg-C F1 on WIKIEVENTS broken down by the number of arguments n of one role (differences from PAIE in parentheses).

              n=1        n=2        n=3        n≥4
EEQA-BART     58.0 (−6)  59.7 (−2)  28.6 (−9)  10.0 (−18)
PAIE (Ours)   64.7       61.4       38.1       28.6

Same-role Argument Assignment
Multiple arguments may share the same role in the same event. To handle this, QA-based methods usually adopt a thresholding strategy that compares the score of each text span with a manually tuned threshold. We perform a coarse grid search for the span threshold on the WIKIEVENTS dataset using the EEQA model, as shown in Fig. 4. Clearly, the choice of threshold strongly affects model performance. In addition, models with the same architecture but different PLMs have entirely different optimal thresholds even on the same dataset, let alone on distinct datasets. Finding a good threshold therefore consumes considerable time and computational resources and usually ends with sub-optimal results.
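The thresholding strategy being grid-searched here can be sketched as follows (the span scores are hypothetical; the point is that the set of kept spans changes with the cutoff, which is why the threshold must be tuned per model and dataset):

```python
def spans_above_threshold(span_scores, threshold):
    # QA-style multi-answer extraction: keep every candidate span whose
    # score clears a manually tuned global threshold.
    return sorted(s for s, score in span_scores.items() if score > threshold)

scores = {(3, 5): 2.1, (8, 8): 0.4, (12, 14): -0.3}
print(len(spans_above_threshold(scores, 0.0)))  # 2 spans kept
print(len(spans_above_threshold(scores, 1.0)))  # 1 span kept
```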
In PAIE, no threshold tuning is required, since each slot in the prompt predicts a unique argument span, guaranteed by bipartite matching. We evaluate on WIKIEVENTS, which contains diverse multi-argument cases. Table 6 shows that PAIE significantly outperforms the QA-based method, especially when dealing with multiple arguments of one role. For roles with three and with four or more arguments, PAIE gains absolute improvements of 9.5% and 18.6%, respectively.

Few-shot Setting
We analyze how PAIE performs without sufficient annotations on the large-scale RAMS. We also compare with DocMRC, which introduces additional data via data augmentation. Fig. 5 compares PAIE with EEQA-BART and DocMRC. As the amount of training data decreases, PAIE's gains over EEQA-BART grow larger, indicating that PAIE better utilizes PLMs in few-shot settings.

Inference Speed
The previous sections emphasized PAIE's superiority in terms of accuracy; PAIE also offers much better extraction efficiency than other approaches. In Table 7, we report the overall inference time of different models on a single NVIDIA 1080Ti GPU. PAIE runs 3-4 times faster than EEQA, since PAIE predicts multiple roles simultaneously while EEQA predicts roles one by one. Other QA-based models are likely to have similar speeds to EEQA given their sequential prediction structure. PAIE is even more advantageous in practical applications since it avoids heavy threshold tuning.

Conclusion
We propose a novel model, PAIE, that effectively and efficiently extracts arguments at both the sentence and document level. It prompts PLMs with multi-role knowledge for extractive objectives, and a bipartite matching loss guarantees the optimal assignment when jointly identifying all arguments of the same role. Extensive experiments on three common benchmarks demonstrate our model's effectiveness and generalization ability in both sentence- and document-level EAE. We have also conducted ablation studies on the main components, the extractive prompting strategy, and several real scenarios. In the future, we are interested in investigating co-reference as an auxiliary task of EAE and introducing entity information to better determine argument boundaries.

A.1 Dataset details

We conduct experiments on RAMS (Ebner et al., 2020), WIKIEVENTS (Li et al., 2021), and ACE 2005 (Doddington et al., 2004). RAMS is a document-level dataset annotated with 139 event types and 65 semantic roles. Each sample is a 5-sentence document, with a trigger word indicating a pre-defined event type and its arguments scattered across the whole document. WIKIEVENTS is another document-level dataset providing 246 documents, with 50 event types and 59 argument roles. The documents are collected from English Wikipedia articles describing real-world events, following reference links to crawl related news articles. Coreference links of arguments are also annotated, but we only use the conventional argument annotations in this task.

ACE 2005 is a joint information extraction dataset providing entity, value, time, relation, and event annotations for English, Chinese, and Arabic. We use its event annotation, with 33 event types and 35 argument roles, for sentence-level EAE. We follow the pre-processing procedure of Lin et al. (2020) and collect 4859 arguments in the training set, and 605 and 576 arguments in the development and test sets, respectively. Table 8 shows detailed statistics.

A.2 Details of baseline models
We compare our model with the following representative models.

OneIE (Lin et al., 2020): We use their code (http://blender.cs.illinois.edu/software/oneie/) and re-train the model to obtain its performance on event argument extraction (with gold triggers). We do not report its performance on RAMS and WIKIEVENTS because OneIE achieves abnormally low performance on them: OneIE is a joint model extracting entities, relations, and events, but there are no entity or relation annotations in RAMS and no relation annotations in WIKIEVENTS, so comparing OneIE with other models there would be unfair to some extent.
For all the re-trained models mentioned above, we keep all other hyper-parameters the same with default settings in their original papers and search the learning rate in [1e-5, 2e-5, 3e-5, 5e-5]. We report test set performance for the model that performs the best on the development set.

A.3 PAIE model implementation and training setup
PAIE is an extended version of a BART-style encoder-decoder Transformer. The optimization procedure for one datum is shown in the pseudo-code of Algorithm 1. We use pre-trained BART models to initialize the encoder-decoder weights of PAIE. We train large models on NVIDIA V100 GPUs and base models on NVIDIA 1080Ti GPUs. In each experiment, we train the model with 5 fixed seeds (13, 21, 42, 88, 100) and 4 learning rates (1e-5, 2e-5, 3e-5, 5e-5), and pick the best learning rate for each seed by the best dev-set Arg-C performance. We report the averaged test-set Arg-C performance of the selected checkpoints. For the model variations mentioned in Section 5.1, we only change the input strategy and keep all other parts constant. Other important hyper-parameters are listed in Table 9.

A.4 Details of Bipartite Matching loss
We formulate the details of the bipartite matching loss as follows. Denote y_r^k = [(s_0, e_0), ..., (s_n, e_n)] as the ground-truth spans of argument role r for datum k, and ŷ_r^k = [(ŝ_0, ê_0), ..., (ŝ_m, ê_m)] as the predicted spans, where m is the number of occurrences of argument role r in the corresponding prompt.
Given the candidate spans for each argument role, we define bipartite matching between the candidates and the ground-truth annotations as finding the lowest-cost permutation σ over N elements:

σ̂ = argmin_{σ ∈ Γ(N)} Σ_{i=1}^{N} Cost(y_i, ŷ_{σ(i)})    (7)

We use the classical Hungarian algorithm (Kuhn, 1955) for efficient optimal assignment. In Eq. 7, N is the minimum of m and n. If the number of ground-truth spans n is larger than the number of candidates m, only the optimally matched gold spans are used for loss calculation. Conversely, we insert (0, 0) into the gold answer set to represent the "no answer" case; the (0, 0) span penalizes over-confident span predictions.
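A minimal sketch of this assignment using scipy's Hungarian solver (the cost function here is a simple L1 span distance for illustration, whereas PAIE's actual cost is the span loss itself; we also assume m ≥ n so that only gold-side padding is needed):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_spans(gold, pred):
    # Pad gold with (0, 0) "no answer" spans up to the number of slot
    # predictions, then find the least-cost one-to-one assignment.
    gold = gold + [(0, 0)] * (len(pred) - len(gold))
    cost = np.array([[abs(g[0] - p[0]) + abs(g[1] - p[1]) for p in pred]
                     for g in gold])
    gold_idx, pred_idx = linear_sum_assignment(cost)
    return [(int(g), int(p)) for g, p in zip(gold_idx, pred_idx)]

# One gold span, two slots for the role: the close prediction (5, 6) is
# matched to gold (4, 6); the other slot is matched to the padded (0, 0).
pairs = match_spans([(4, 6)], [(0, 1), (5, 6)])
print(pairs)  # [(0, 1), (1, 0)]
```

Only the matched pairs feed the span loss, so slot order in the prompt never penalizes a correct prediction.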
After finding the optimal assignment, we calculate the span loss over all matched pairs. The cross-entropy between the start/end logits and the ground-truth spans is:

L = Σ_k −(log P_k^(start)(s_k) + log P_k^(end)(e_k))

The full model is optimized in an end-to-end manner. Bipartite matching is only applied during training; at inference, the model outputs all non-zero spans with their corresponding argument roles as predictions.

A.5 Further analysis of Bipartite Matching
To assess the effectiveness of bipartite matching, we further examine the annotations in the datasets. Although the task does not fix the order of annotations under the same argument role, we find that annotators prefer to annotate them in ascending position order. This introduces an annotation bias that the model can exploit: a later extraction slot in our joint prompt only needs to extract a later-annotated span, so modeling the subordinate relation between same-role arguments in the prompt degrades to extracting them by position order. Thus, for a complete analysis, besides the ablation study on the standard training set, we also train on a set in which the order of same-role argument annotations is shuffled. This shuffling is performed once before training starts, with a single seed, for a fair evaluation.
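The pre-shuffling step can be sketched as follows (the function and data layout are our own illustration; the shuffle happens once with a fixed seed, exactly as described above):

```python
import random

def shuffle_same_role(gold, seed=42):
    # Shuffle the order of same-role argument annotations once, before
    # training starts, to remove the annotators' ascending-order bias.
    rng = random.Random(seed)
    out = {}
    for role, spans in gold.items():
        spans = list(spans)
        rng.shuffle(spans)
        out[role] = spans
    return out

gold = {"Artifact": [(3, 4), (10, 11), (15, 15)], "Seller": [(0, 0)]}
shuffled = shuffle_same_role(gold)
assert sorted(shuffled["Artifact"]) == sorted(gold["Artifact"])  # same spans
```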
As shown in Table 11, on the standard training set there is an average Arg-C drop of 0.4% without bipartite matching. With the pre-shuffled training set, the average drop extends to 1%, while the model with the bipartite loss (second row) drops only minimally: PAIE is robust to permutation. Looking through the error cases in Table 10, weak multi-argument extraction is a typical failure of the no-bipartite model.
We expected to observe violations of unique matching to target spans (homogeneous predictions) from the no-bipartite model, especially when role orders are permuted. However, we did not observe such cases, which surprised us: positional embeddings appear effective at distinguishing tokens even with identical ids and an unstable loss.
In addition, existing datasets are not designed for evaluating N-to-1 problems; such cases appear in only 8.9% of ACE05, 6.1% of RAMS, and 10.9% of WIKIEVENTS. Annotation ambiguity could further reduce the amount of effective training data. The importance of bipartite matching for argument extraction therefore cannot be sufficiently verified here (no significant performance gap observed). Moreover, the multi-argument cases in these datasets are almost all coordinated, so data for evaluating the modeling of subordinate relationships between same-role arguments is lacking. We hope for a large-scale dataset for this purpose in the future.

A.6 Prompt Examples
We first compare our prompts with others used for EAE in Table 12. The first row gives a standard QA-based prompt, which models the extraction task as question answering and expects a question template that can elicit knowledge from a PLM, especially one pre-trained on question answering. The second row shows a standard description template for event extraction; it is usually defined in a meta file, and works such as Li et al. (2021) modify the definition with exchangeable placeholders to augment the prompts. Rows 3-5 show our three types of prompts, respectively. We also show 10 manual template examples for each dataset in Table 13; the complete set of prompts will be published with the code.

B Error Analysis
In this section, we analyze the remaining error types. We manually check 100 wrong predictions on the RAMS dataset and show the distribution in Fig. 6. We only discuss the main categories, with examples, here.
Annotation Error and Ambiguity. We find that about 27% of the errors are caused by annotation problems in the RAMS dataset. These issues include wrong labels, missing annotations, and conceptual ambiguity. The third issue usually stems from the interchangeability of concepts. For instance, "Washington" and "the United States" can represent the same political concept in a contact.collaborate.meet or contact.discussion.meet event. In RAMS, only one of them is annotated, and the model may predict the other.
Uncertain Definition. This refers to uncertainty about whether to include a potential entity of an event. For example, in the following sentence: "The Syrian government stressed ... and 'preventing these organisations from strengthening their capabilities or changing their positions', in order to avoid wrecking the agreement.", an artifactexistence.damagedestroy.n/a event has not happened but has the potential to do so. The potential role damagerdestroyer is mentioned and the model tends to extract it, but it is not in the gold annotations.
Co-reference. In multi-sentence argument extraction, pronouns are often used for co-reference. For example, in: "... Patients don't feel great, but they're not sick enough to stay home in bed or to be hospitalized ...", our model predicts "they" as the answer for the argument role victim. Although "they" refers to the gold annotation "Patients", this is counted as an error under the current evaluation protocol.
Span Partially Match. We find that 21% of the error cases are partial matches of a span. A large percentage of them match the concept but mismatch the span, such as "the center" versus "center". We also include cases with multiple correct answers in this category: the correct text can appear at multiple positions in the sentence, and the model outputs an alternative rather than the exact gold span. This issue can be mitigated by evaluating on normalized text (Rajpurkar et al., 2018), which can boost performance by at least 3% on these benchmarks. Other partial matches stem from incorrect concept understanding: for long person names (containing "-" or ","), the model only extracts the part before the ",". This could be mitigated by introducing a stronger entity-recognition prior.

Wrong Prediction
The wrong predictions of our model mainly fall into three categories:
• Over-Detection: no answer is actually contained in the sentences, but our model outputs a specious result. For example, for a "place" query, our model finds words representing places that appear in the sentences, even if they do not refer to the place of the current event.
• Under-Detection: an answer is contained in the sentences, but our model fails to extract the related arguments and outputs "No Answer". Some rare arguments appear in only a small amount of training data, leading the model to output "No Answer".
• Cross-Sentence Extraction: the argument extraction requires cross-sentence reasoning, e.g., some entities are mentioned in multiple sentences, and the interactions between them need to be modeled further.