Global Constraints with Prompting for Zero-Shot Event Argument Classification

Determining the role of event arguments is a crucial subtask of event extraction. Most previous supervised models leverage costly annotations, which is not practical for open-domain applications. In this work, we propose to use global constraints with prompting to effectively tackle event argument classification without any annotation or task-specific training. Specifically, given an event and its associated passage, the model first creates several new passages by prefix prompts and cloze prompts, where prefix prompts indicate event type and trigger span, and cloze prompts connect each candidate role with the target argument span. Then, a pre-trained language model scores the new passages, making the initial prediction. Our novel prompt templates can easily adapt to all events and argument types without manual effort. Next, the model regularizes the prediction by global constraints exploiting cross-task, cross-argument, and cross-event relations. Extensive experiments demonstrate our model's effectiveness: it outperforms the best zero-shot baselines by 12.5% and 10.9% F1 on ACE and ERE with given argument spans and by 4.3% and 3.3% F1, respectively, without given argument spans. We have made our code publicly available.


Introduction
Event Argument Classification (EAC), finding the roles of event arguments, is an important and challenging event extraction sub-task. As shown in Figure 1, a "Transfer-Money" event whose trigger is "acquiring" has several argument spans (e.g., "Daily Planet"). By determining the roles of these arguments (e.g., "Daily Planet" as "Beneficiary"), we can obtain a better understanding of the event, thus benefiting related applications like stock price prediction (Ding et al., 2015) and biomedical research (Zhao et al., 2021).
Many previous EAC works require numerous annotations to train their models (Lin et al., 2020; Hsu et al., 2022; Liu et al., 2022), which is not only costly, as the annotations are labor-intensive, but also difficult to generalize to datasets from novel domains. Accordingly, some EAC models adopt a few-shot learning paradigm (Ma et al., 2022; Hsu et al., 2022). However, they are sensitive to the few-shot example selection and still require costly task-specific training, which hinders their real-life deployment. There have also been some zero-shot EAC models based on transfer learning (Huang et al., 2018), label semantics (Zhang et al., 2021; Wang et al., 2022), or prompt learning (Liu et al., 2020; Lyu et al., 2021; Huang et al., 2022; Mehta et al., 2022). However, each of these approaches has limitations that impede its real-life deployment. Models based on transfer learning can be ineffective when new event types are very different from the observed ones. Models using label semantics require a laborious preparation process and have unsatisfactory performance. Models adopting prompt learning need tedious prompt design customized to every new type of events and arguments, and their performance is also limited.
To address the aforementioned issues, we propose an approach using global constraints with prompting to tackle zero-shot EAC. Global constraints can be viewed as a type of supervision signal from domain knowledge, which is crucial for zero-shot EAC since supervision from annotations is inaccessible. Moreover, our model's constraints module provides abundant global insights across tasks, arguments, and events. Prompting can also be regarded as a supervision signal, as it induces abundant knowledge from Pre-Trained Language Models (PTLMs). Unlike previous zero-shot EAC works, which need a tedious prompt design for every new type of events and arguments, the novel prompt templates of our model's prompting module can be easily adapted to all possible types of events and arguments in a fully automatic way. Specifically, given an event and its passage, our model first adds a prefix prompt, a cloze prompt, and candidate roles to the passage, which creates a set of new passages. The prefix prompt describes the event type and trigger span; the cloze prompt connects each candidate role to the target argument span. Afterwards, our model adopts a PTLM to compute the language modeling loss for each of the new passages, whose negative value is the respective prompting score. The role with the highest prompting score is the initial prediction. Then, our model uses global constraints to regularize the initial prediction. The global constraints are based on domain knowledge of the following relations: (1) cross-task relations, where our model additionally performs one or more other classification tasks on the target argument span, and its predictions on EAC and the other task(s) should be consistent; (2) cross-argument relations, where the arguments of one event should collectively abide by certain constraint(s); (3) cross-event relations, where an argument playing a certain role in one event should play a typical role in another related event.
We conduct comprehensive experiments to demonstrate the effectiveness of our model. In particular, our approach surpasses all zero-shot baselines by at least 12.5% and 10.9% F1 on ACE and ERE, respectively. When argument spans are not given, our model outperforms the best zero-shot baseline by 4.3% and 3.3% F1 on ACE and ERE, respectively. Besides that, we also conduct experiments to show that both the prompting and constraints modules contribute to the final success.

Methodology
We first present an overview of our approach. Then we introduce the details by describing its prompting module and global constraints regularization module. Following Liu et al. (2021), we call a prompt inserted before the input text a prefix prompt, and a prompt with slot(s) to fill in, inserted in the middle of the input text, a cloze prompt.

Overview
As shown in Figure 2, given a passage with a target argument span, our model infers the target's role without annotation or task-specific training. Our model has two modules. The first is the prompting module, which creates and scores several new passages. During creation, the model adds a prefix prompt, a cloze prompt, and candidate roles to the passage, where the prefix prompt contains information about the event type and trigger, and the cloze prompt joins each candidate role with the target argument span. Afterwards, the model uses a PTLM to score the new passages. Our novel prompt templates can easily adapt to all possible events and arguments without manual work. The initial prediction is the role with the highest prompting score. The second module is the global constraints regularization module, where the model regularizes the prediction by three types of global constraints: cross-task, cross-argument, and cross-event constraints. All global constraints are based on event-related domain knowledge about inter-task, inter-argument, and inter-event relations.

Prompting Module
In this section, we describe the prompting module in detail. Given a passage, we first add to its beginning a prefix prompt containing information about the event type and trigger span. Such a prompt can guide a PTLM to: (1) accurately capture the aspects of the input text related to the event; (2) have a clear awareness of the trigger. Based on the definitions of events and triggers (Grishman et al., 2005), we create the following prefix prompt: "This is a [] event whose occurrence is most clearly expressed by []." where the first and second pairs of square brackets are the placeholders for the event type and trigger span, respectively. We also conducted experiments comparing different prefix prompts in Section A, and the results showed that the prefix above is the most effective.
Second, for each candidate role, the module inserts the cloze prompt after the target argument span, and the role fills the prompt's slot. The cloze prompt adopts the hypernym extraction pattern "M and any other []" (Dai et al., 2021), where "M" denotes the argument span and the square brackets are the placeholder for the candidate role. We did not try other hypernym extraction patterns, as Dai et al. (2021) had shown that ours is the most effective. The motivation for adopting the hypernym extraction pattern for the cloze prompt is that, to some extent, a role can be regarded as a context-specific hypernym of the respective argument span of the associated event (e.g., "Beneficiary" can be seen as a context-specific hypernym of "Daily Planet" in the Transfer-Money event described by the example in Figure 1). Hence, such a prompt induces the linguistic and commonsense knowledge stored in the PTLM to help identify which candidate role is the most reasonable.
After adding the previous two types of prompts, we get several new passages. For instance, suppose the passage is "In Baghdad, a bomb was fired at 17 people.", whose event type is "Conflict:Attack", trigger is "fired", target argument span is "bomb", and candidate roles are {"Attacker", "Instrument", "Place", "Time", "Target"}. The created passages would be: (1) "This is a Attack event whose occurrence is most clearly expressed by "fired." In Baghdad, a bomb and any other attacker was fired at 17 people."; (2) "This is a Attack ... "fired." ... bomb and any other instrument was ..."; and similar text for the other roles. For each new passage, we apply a PTLM to compute the language modeling loss. The negative value of the loss is the prompting score of the respective passage, where a higher value indicates higher plausibility according to the PTLM. Since our model's prompt templates are independent of event type and argument role, their adaptation to any new type of events and arguments is trivial and fully automatic. Hence, our prompting method is more scalable and generalizable than those of previous zero-shot EAC models, which need to design a customized prompt for every new type of events and arguments. For instance, for every type of events/arguments, Lyu et al. (2021) manually design a unique prompt as a text entailment/question answering template. The initial prediction is the role with the highest prompting score. Since the steps for obtaining the score of each candidate role are independent of the other candidate roles, we run the steps for different candidate roles in parallel, which significantly improves our model's efficiency.
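To make the prompting procedure concrete, the following sketch builds the prompted passages and ranks candidate roles with a pluggable scoring function. This is an illustrative sketch, not the paper's actual implementation: the function names are ours, and the scorer is left abstract because in the real system it would be minus the causal language-modeling loss computed by GPT-J.

```python
from typing import Callable, Dict, List

def build_passage(passage: str, event_type: str, trigger: str,
                  arg_span: str, role: str) -> str:
    """Insert the prefix prompt and the cloze prompt for one candidate role."""
    prefix = (f"This is a {event_type} event whose occurrence is most "
              f"clearly expressed by {trigger}. ")
    # Cloze prompt: hypernym pattern "M and any other [role]", placed right
    # after the first occurrence of the target argument span.
    cloze = f"{arg_span} and any other {role.lower()}"
    return prefix + passage.replace(arg_span, cloze, 1)

def classify_argument(passage: str, event_type: str, trigger: str,
                      arg_span: str, roles: List[str],
                      score_fn: Callable[[str], float]) -> List[str]:
    """Rank candidate roles by prompting score (higher = more plausible)."""
    scored: Dict[str, float] = {
        role: score_fn(build_passage(passage, event_type, trigger, arg_span, role))
        for role in roles
    }
    return sorted(scored, key=scored.get, reverse=True)
```

In the full system, `score_fn` would feed each passage through the PTLM and return the negative language-modeling loss; because each candidate's score is independent of the others, those calls can run in parallel, as described above.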

Global Constraints Regularization Module
This module regularizes the prediction by the following three types of global constraints.

Cross-task constraint exploits the label dependency between EAC and auxiliary task(s) so that our model can get global information about event arguments from the auxiliary task(s). We use Event Argument Entity Typing (EAET) as the auxiliary task, which aims to classify an argument into its context-dependent entity type (e.g., PER). As specified in the ACE2005 ontology, an argument of a certain role in an event can only be one of several respective entity types (e.g., an argument of the "Attacker" role in a Conflict:Attack event can only be "ORG," "PER," or "GPE"). Based on this domain knowledge, we design the cross-task constraint as follows: (1) for each input passage, our model performs prompting for EAET, where the prompting is the same as in Section 2.2 except that candidate entity types replace the candidate roles in the cloze prompt; (2) after obtaining the scores and prediction for EAET, the model checks the consistency between the predictions of EAC and EAET; (3) if consistency is violated and the score of EAC's predicted role is lower, the model discards the current role, uses the role with the highest score among the remaining ones, and checks consistency again; (4) the constraint ends when the labels of the two tasks are consistent. An example illustrating this type of constraint is shown in Figure 3.
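Steps (1)-(4) above can be sketched as a small consistency-checking loop. This is a minimal sketch under our own assumptions: the score dictionaries come from the prompting module, the role-to-entity-type table is an illustrative ontology fragment rather than the full ACE specification, and the behavior when the EAC score is higher than the EAET score (keep the role) is our reading of the procedure.

```python
def cross_task_regularize(eac_scores, eaet_scores, compatible):
    """Drop the EAC role until it is consistent with the predicted entity type.

    eac_scores / eaet_scores: {label: prompting score}.
    compatible: {role: set of permissible entity types for that role}.
    """
    entity_type = max(eaet_scores, key=eaet_scores.get)  # EAET prediction
    roles = dict(eac_scores)
    while roles:
        role = max(roles, key=roles.get)                 # current EAC prediction
        if entity_type in compatible.get(role, set()):
            return role                                  # consistent: accept
        if roles[role] < eaet_scores[entity_type]:
            del roles[role]                              # discard, retry next-best role
        else:
            return role   # EAC score is higher: keep the role (sketch's assumption)
    return max(eac_scores, key=eac_scores.get)           # fallback: initial prediction
```

For instance, with `compatible = {"Attacker": {"PER", "ORG", "GPE"}, "Instrument": {"WEA", "VEH"}}`, an initial "Attacker" prediction scored below a "WEA" entity-type prediction would be corrected to "Instrument".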
Cross-argument constraint is based on domain knowledge about relationships between the arguments of an event. Specifically, our model constrains the number of particular arguments for some or all events. For instance, it is very unlikely that a mentioned event is associated with multiple "Time" arguments. Such constraints offer our model a global understanding of event arguments. The cross-argument constraint we adopt is "A Personnel:End-Position event has at most one Position argument." Given a Personnel:End-Position event, our model first checks the number of "Position" arguments. If there is more than one, our model collects the arguments whose roles are "Position", keeps the one with the highest score, and changes the role of each remaining collected argument to its candidate with the second-highest score. An example illustrating this type of constraint is shown in Figure 4.
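The procedure above can be sketched as follows, assuming each argument span comes with a ranked candidate list of (role, score) pairs from the prompting module; the function name, data layout, and example roles are illustrative, not taken from the paper's implementation.

```python
def enforce_at_most_one(predictions, role="Position"):
    """predictions: {arg_span: [(role, score), ...]} with candidates best-first.

    Returns {arg_span: final role} after enforcing that at most one argument
    keeps the constrained role.
    """
    offenders = {arg: cands for arg, cands in predictions.items()
                 if cands[0][0] == role}
    if len(offenders) <= 1:
        # Constraint already satisfied: keep every top-ranked role.
        return {arg: cands[0][0] for arg, cands in predictions.items()}
    # Keep the highest-scoring offending argument; demote the rest to their
    # second-best candidate role.
    keep = max(offenders, key=lambda a: offenders[a][0][1])
    final = {}
    for arg, cands in predictions.items():
        if arg in offenders and arg != keep:
            final[arg] = cands[1][0]
        else:
            final[arg] = cands[0][0]
    return final
```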
Figure 3: An example of cross-task constraint. The text in bold face is the trigger, the underlined text is the target argument span, and a tuple denotes a predicted label with its prompting score (e.g., "(Target, -3.5)" denotes the predicted label "Target" with prompting score -3.5). Similar notations are adopted in all remaining figures.

Cross-event constraint regularizes the predicted roles of arguments shared by related events. A model with such a constraint can have global insights into event arguments because, while making inferences for the arguments of one event, it is aware of the information of other related event(s) and of cross-event relations. The cross-event constraint we adopt is "If a Life:Injure event and a Conflict:Attack event share arguments, then Injure.Place is the same as Attack.Place, Injure.Victim is the same as Attack.Target, Injure.Instrument is the same as Attack.Instrument, Injure.Time is the same as Attack.Time, and Injure.Agent is the same as Attack.Attacker". Given a passage containing an Injure and an Attack event sharing arguments, the model imposes the constraint by checking the consistency between the respective roles of each shared argument as specified in the constraint. Any inconsistency is fixed by changing the role with the lower prompting score to the one satisfying consistency. An example illustrating this type of constraint is shown in Figure 5.
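A sketch of this reconciliation step, assuming each shared argument carries one (role, score) prediction per event. The role mapping follows the constraint quoted above; the function name and data structures are illustrative, and keeping the old prompting score when a role is overwritten is a simplification of ours.

```python
# Role mapping from the Life:Injure side to the Conflict:Attack side.
ROLE_MAP = {"Place": "Place", "Victim": "Target",
            "Instrument": "Instrument", "Time": "Time", "Agent": "Attacker"}

def reconcile(injure_pred, attack_pred):
    """injure_pred / attack_pred: {arg_span: (role, score)} for shared arguments.

    Fixes each inconsistency by overwriting the lower-scored side's role.
    """
    inv = {v: k for k, v in ROLE_MAP.items()}  # Attack role -> Injure role
    for arg in injure_pred.keys() & attack_pred.keys():
        i_role, i_score = injure_pred[arg]
        a_role, a_score = attack_pred[arg]
        if ROLE_MAP.get(i_role) == a_role:
            continue                            # already consistent
        if i_score >= a_score:                  # trust the Injure-side prediction
            attack_pred[arg] = (ROLE_MAP.get(i_role, a_role), a_score)
        else:                                   # trust the Attack-side prediction
            injure_pred[arg] = (inv.get(a_role, i_role), i_score)
    return injure_pred, attack_pred
```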
Our constraint modeling method can be easily generalized to other datasets/ontologies by simply using knowledge about the corresponding cross-task, cross-argument, and cross-event relations to design new constraints. The design process is not costly, as such knowledge is easy to find.

Experiments
We first present the experimental settings, the baselines used for comparison, and some implementation details. Next, we show and analyze the experiment results. Then we present a detailed analysis of the prompting module and the global constraints regularization module. Finally, we conduct an error analysis.

Settings
We use ACE (2005-E+) (Doddington et al., 2004; Lin et al., 2020) and ERE(-EN) (Song et al., 2015) as datasets. In total, ACE has 33 event types and 22 roles, whereas ERE has 38 event types and 21 roles. We pre-process all events to keep only the event subtypes whenever applicable, as done in Lin et al. (2020). Following the pre-processing in Zhang et al. (2021), for each dataset we merge all splits into one test set, since our approach is zero-shot. When argument spans are not given, we pipeline our model with an argument identification module adapted from Lyu et al. (2021). Specifically, we replace the QA model in Lyu et al. (2021) with a more powerful PTLM with a span classification head on top, fine-tuned for extractive QA tasks. Then, for a passage, we prompt each role using the new QA model as in Lyu et al. (2021). We collect the prompt results for all roles (ignoring the "None" results) as candidate spans for the passage. We use the F1 score for evaluation following Ji and Grishman (2008), where argument spans are evaluated at the head level when not given. Regarding PTLMs, we use GPT-J (6B) (Wang and Komatsuzaki, 2021) instances from Huggingface (Wolf et al., 2020), where an instance for causal language modeling is used for prompting and an instance for QA is used for argument identification. In all the following sections except Section 3.2, we conduct experiments on ACE, assuming that argument spans are given.

Main Results
We report the main results comparing our model with three previous powerful zero-shot models (Liu et al., 2020; Lyu et al., 2021; Zhang et al., 2021). Moreover, we also report the results of a SOTA supervised model (Hsu et al., 2022). We obtain the results of all compared methods from our own experiments to ensure a fair comparison on the same datasets and settings. From Table 1, we have the following observations:
• Our model achieves superior performance on both datasets under both settings compared with all zero-shot baselines. Specifically, our model surpasses the best zero-shot baselines (Zhang et al., 2021) by 12.5% and 10.9% on ACE and ERE, respectively, and, without argument spans, outperforms the respective best zero-shot baselines (Lyu et al., 2021) by 4.3% and 3.3% F1, as shown in Table 1 and Figure 6.
• Compared with the supervised SOTA model (Hsu et al., 2022), there is still a significant gap between our model's performance and theirs. Specifically, Hsu et al. (2022) outperform our model by 13.2% and 17.0% on ACE and ERE, respectively. When argument spans are not provided, Hsu et al. (2022) outperform our model by 40.6% and 42.9% on ACE and ERE, respectively. The advantage of the supervised SOTA over our zero-shot method is much more distinct when argument spans are not given in advance. This is probably because our zero-shot argument identification module described in Section 3.1 is not powerful enough, which causes severe error propagation to our EAC model.

Analysis of Prompting Module
We conduct experiments to examine the effects of different configurations of the prefix prompt template. Specifically, we compare our model's complete prefix prompt with the following configurations: (1) removing event type information from the prefix; (2) removing trigger information from the prefix; (3) removing the whole prefix. For instance, for the passage "In Baghdad, a bomb was fired at 17 people." mentioned in Section 2.2, the prefix in configuration (1) would be "This event's occurrence is most clearly expressed by 'fired'.", the prefix in configuration (2) would be "This is a Attack event.", and in configuration (3) there would be no prefix. The corresponding results are shown in Table 2, from which we have the following observations. First, removing either the event type or the trigger from the prefix prompt causes a performance drop, indicating that both kinds of information contribute to the prompting process. Second, the event type plays a more significant role than the trigger in the prefix prompt, and their joint effect is greater than the sum of their respective effects.
In addition, we examine the effects of using different PTLMs in the prompting module. We compare the following PTLMs with GPT-J (6B): BERT (large, uncased) (Devlin et al., 2019), RoBERTa (large) (Liu et al., 2019), BART (large) (Lewis et al., 2020), GPT-2 (xl) (Radford et al., 2019), and T5 (11B) (Raffel et al., 2020). The results are shown in Figure 6, from which we have the following observations. First, the instance using GPT-J has the best performance, surpassing the other instances by 4.2% to 7.9%. This shows that GPT-J has a better ability to understand events and their associated arguments than the other PTLMs. Second, as the PTLMs are listed in ascending order of their numbers of parameters, we can see that for the first five models, performance increases with model size, consistent with the widely accepted notion that larger models have better capabilities for solving language tasks. However, the instance using the largest PTLM, T5 (11B), performs worse than GPT-2 and GPT-J. This is probably because autoregressive language modeling is more suitable for capturing information related to event arguments than masked language modeling is.

Analysis of Global Constraints Regularization Module
We conduct experiments to study the individual effect of each global constraint on the overall performance. The results are shown in Table 3, from which we have the following observations. First, every global constraint used by our model is beneficial to the overall performance, which demonstrates that exploiting domain knowledge about cross-task, cross-argument, and cross-event relations indeed provides our model with a global understanding of event arguments. Second, the contribution of the cross-task constraint is the most significant, which suggests that the global insights from the entity typing task are the most effective in improving our model's reasoning about event arguments. Third, the cross-argument constraint is less effective than the other constraints, which shows that the global insights it provides are less informative than those provided by the other constraints.

Apart from the three global constraints described above, we have designed another 11 global constraints, which rely on cross-argument or cross-event relations. We add each of them to our model to check its respective effect on the overall performance. The results for three of them are in Table 4, and the results for all of them are in Section B. From the results, we find that each of these constraints either brings minor improvement or even has a negative influence on the overall performance. Hence, we do not incorporate these constraints in our model, in order to maintain our model's efficiency and effectiveness.

Error Analysis
We manually checked 100 wrong predictions of our model and found that most of the errors are caused by overly general roles of some event types. Specifically, some roles' linguistic meanings are so general that a model, not knowing their detailed event-type-dependent semantics, tends to assign them to arguments that should have been assigned other roles. An example is shown in Figure 7. The example describes a Justice:Arrest-Jail event, which is associated with the following roles: "Person," "Agent," "Crime," "Time," and "Place." "Person" refers to the person who is jailed or arrested, whereas "Agent" refers to the jailer or the arresting agent. In the example, the argument span's true role should be "Agent" according to the detailed event-type-dependent semantics of "Person" and "Agent." However, our approach is zero-shot and directly models all role labels as natural language words, without incorporating the detailed event-type-dependent semantics of overly general roles (e.g., "Person"). Therefore, our model assigns "Person" to "Police", since this is reasonable from the perspective of linguistic and commonsense knowledge, and "Person" is much more common than "Agent" in the pre-training corpus of the PTLM in the prompting module, which gives it a much higher likelihood in the language modeling process. Incorporating the event-type-dependent semantics of overly general roles into our model is left as future work.

Global constraint (effect on overall performance):
- There is at most one Time-Arg in each event. (0.4)
- A Transport event has at most one Origin argument. (-0.1)
- If an Arrest-Jail event and a Charge-Indict event share arguments, Arrest-Jail.Person is the same as Charge-Indict.Defendant and they share the same Crime argument. (0.3)

Related Work
In this section, we introduce related work on constraint modeling, event extraction, and prompt-based Information Extraction (IE).

Constraint Modeling
Constraint modeling, an important technique in machine learning and NLP, aims to improve a model's performance by incorporating domain knowledge as constraints (Ganchev et al., 2010; Chang et al., 2008, 2010, 2012, 2013; Deutsch et al., 2019; Graça et al., 2010). One of its most significant advantages is that it enables a model to capture the expressive and complex dependency structure in structured prediction problems like EAC (Chang et al., 2012). Especially in zero-shot scenarios, constraint modeling can provide a model with useful indirect supervision, which further boosts performance (Ganchev et al., 2010). Some previous works have adopted constraints based on event-related domain knowledge to classify event arguments (Lin et al., 2020; Zhang et al., 2021). However, their constraints either require labor-intensive annotations (Lin et al., 2020) or consider limited global information (e.g., only cross-event relations) (Zhang et al., 2021). In this paper, our model uses global constraints to regularize its predictions by incorporating global insights from cross-task, cross-argument, and cross-event relations.

Event Extraction
Event extraction is a fundamental information extraction task (Sundheim, 1992; Grishman and Sundheim, 1996; Riloff, 1996; Grishman et al., 2005; Chen et al., 2021; Du and Cardie, 2020; Liu et al., 2020), which can be further divided into four subtasks: trigger identification, trigger classification, argument identification, and argument classification. Traditional efforts mostly focus on the supervised setting (Ji and Grishman, 2008; Liao and Grishman, 2010; Liu et al., 2016; Chen et al., 2015; Nguyen et al., 2016; Liu et al., 2018; Zhang et al., 2019; Wadden et al., 2019; Lin et al., 2020). However, these works could suffer from the huge burden of human annotation. In this work, we focus on the argument classification task and propose a model using prompting and global constraints, without annotation or task-specific training.

Prompt-based IE
With the fast development of large PTLMs like T5 (Raffel et al., 2020), GPT-3 (Brown et al., 2020), and PaLM (Chowdhery et al., 2022), prompt-based methods have become an efficient tool for applying those giant models to downstream NLP tasks (Liu et al., 2021). IE is no exception. People have been leveraging prompts and giant models to solve IE tasks like named entity recognition (Cui et al., 2021), semantic parsing (Shin et al., 2021), and relation extraction (Chen et al., 2022; Han et al., 2021) in a zero-shot or few-shot way. However, previous prompting methods for IE need a tedious prompt design for every new type of events and arguments.
In contrast, our model's prompt templates can be adapted to all possible types of events and arguments in a fully automatic way.

Conclusion
We propose a zero-shot EAC model using global constraints with prompting. Compared with previous works, our model does not require any annotation or manual prompt design, and our constraint modeling method can be easily adapted to other datasets. Hence, our model can be easily generalized to open-world event ontologies. Experiments on two standard event extraction datasets demonstrate our model's effectiveness.

Limitations
Our work has the following limitations. One limitation is that our model is not aware of the detailed event-type-dependent semantics of overly general roles, as discussed in Section 3.5. In the future, we will work on enabling our model to capture these semantics. Another limitation is that our model's performance is still unsatisfactory compared with the SOTA supervised model when argument spans are not given, as discussed in Section 3.2. In the future, we will work on designing a more powerful zero-shot event argument identification module for our model, so that we can obtain satisfactory zero-shot EAC performance even when argument spans are not given.

Figure 1 :
Figure 1: An example of EAC. The trigger is in bold face. Arguments are underlined and connected to their roles by arrows.

Figure 2 :
Figure 2: Model overview, using the prediction for one argument span as an example. [T]_1 and [T]_2 are the parts of the input passage before and after the span, respectively. k is the number of candidate roles of the event type.

Figure 4 :
Figure 4: An Example of cross-argument constraint.

Figure 5 :
Figure 5: An Example of cross-event constraint.

Figure 6 :
Figure 6: Comparison between the performance of using different PTLMs in prompting module.

Figure 7 :
Figure 7: An example of a wrong prediction caused by overly general argument roles. The text in bold face denotes the trigger and the underlined text denotes the target argument span.

Table 1 :
Performance of the supervised model, zero-shot baselines, and our model. The best scores among the zero-shot methods are in bold font.

Table 2 :
Results of using different configurations of prefix prompt.

Table 3 :
Results of using different configurations of global constraints.

Table 4 :
Results of three other global constraints. Results of all other global constraints are in Section B.