From Simple to Complex: A Progressive Framework for Document-level Informative Argument Extraction

Document-level Event Argument Extraction (EAE) requires a model to extract the arguments of multiple events from a single document. Considering the underlying dependencies between these events, recent efforts leverage the idea of "memory", where the results of already predicted events are cached and can be retrieved to help the prediction of upcoming events. These methods extract events according to their appearance order in the document; however, an event appearing in the first sentence is not necessarily the easiest to extract, and existing methods may introduce noise into the extraction of upcoming events if they rely on an incorrect prediction of a previous event. In order to provide more reliable memory, we propose a simple-to-complex progressive framework for document-level EAE. Specifically, we first calculate the difficulty of each event and then conduct the extraction in a simple-to-complex order. In this way, the memory stores the most certain results, and the model can use these reliable sources to help the prediction of more difficult events. Experiments on WikiEvents show that our model outperforms SOTA by 1.4% in F1, indicating that the proposed simple-to-complex framework is useful for the EAE task.

1 Introduction

Document-level Event Argument Extraction (EAE) aims at identifying the participants of multiple events in a document and classifying them into proper roles (Li et al., 2021; Du et al., 2022; Xu et al., 2022; Huang et al., 2022; Yang et al., 2023). Understanding events in documents is crucial for a line of downstream tasks, such as machine reading comprehension (Han et al., 2021) and dialogue systems (Zhang et al., 2020).
Generation-based document-level EAE methods are widely used in recent works (Li et al., 2021; Du et al., 2022; Du and Ji, 2022; Huang et al., 2022). Among them, one line of studies (Li et al., 2021; Huang et al., 2022) treats each event independently and ignores the underlying correlations between events in real-world documents. Other works (Du et al., 2022; Du and Ji, 2022) start to consider inter-event dependencies and model them by introducing the idea of "memory", where event predictions (e.g., arguments, roles) are cached and can be retrieved to help the prediction of upcoming events in a document. However, since these methods still use front-to-back prediction, i.e., extracting events according to their appearance order in a document, an event can only rely on the predictions of events that appeared before it. Besides, the prediction of an event is cached regardless of its quality, so false predictions may be cached first, misleading the prediction of the following events.
In general, using current retrieval-based methods to model inter-event dependencies faces two main challenges: (1) front-to-back prediction only partially models inter-event dependencies, where the dependency links from an event to all the events that appeared after it are ignored; (2) incorrect predictions may be cached first and retrieved by the upcoming events.
Considering these challenges, we propose a simple-to-complex framework. First, we calculate the difficulty of each event, where the difficulty is defined via the average probability that the model assigns to the arguments of the event. We also calibrate the argument probabilities before using them, to ensure they truly reflect how certain the model is about each argument. Second, we reorder the events in a document from simple to complex and predict them accordingly. In this way, the model can use the simple instances to help the prediction of the difficult ones, no matter whether the simple events appear before or after the difficult ones in the original document.
We conduct experiments on the widely used benchmark WIKIEVENTS (Li et al., 2021), and our proposed simple-to-complex framework outperforms the previous SOTA by 1.4% in F1, illustrating the effectiveness of our method. Further analyses show that probability calibration is crucial when calculating the difficulty of different events, and that the success of our framework relies on better usage of the dependencies between an event and the events that appear after it in the document.

Task Definition
In this work, we focus on document-level Informative Argument Extraction (IAE) (Li et al., 2021), where informative arguments are far more distant than local/uninformative ones and provide more useful information about an event. We formulate document-level IAE as a generative template-filling task following Li et al. (2021) and Du et al. (2022). Given a document D with triggers marked (using a special token <tgr>), our goal is to extract all the arguments of each event E to fill in the slots of the event template T.
Each event consists of (1) an event trigger, which has a specific type E (we use E to represent an event); and (2) a series of event arguments, each corresponding to a specific role. In the event ontology, event types and argument roles are pre-defined, and event templates depicting the relationship between the argument roles of an event are also provided. For example, the template for E = Attack in the KAIROS ontology is:

<arg1> attacked <arg2> using <arg3> at <arg4> place

where each slot <argx> is a placeholder for arguments with a specific role. We replace all the <argx>s in a template with a special token <arg> before using them as input. If the model extracts an argument, <arg> is replaced by the argument; otherwise, the placeholder <arg> remains.
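To make the template mechanics concrete, here is a minimal Python sketch. The helper names prepare_template and fill_template are ours, not from the paper, and the real model fills slots by generation rather than string replacement; this only illustrates the placeholder convention and the "and"-joining of multiple arguments per slot.

```python
import re

def prepare_template(template: str) -> str:
    """Replace every role-specific <argX> slot with the generic <arg> token,
    as done before feeding the template to the model."""
    return re.sub(r"<arg\d+>", "<arg>", template)

def fill_template(template: str, predictions: list) -> str:
    """Fill <arg> placeholders left to right. Multiple arguments for one
    slot are joined with "and"; a slot with no prediction keeps <arg>."""
    parts = template.split("<arg>")
    fillers = [" and ".join(args) if args else "<arg>" for args in predictions]
    out = parts[0]
    for filler, rest in zip(fillers, parts[1:]):
        out += filler + rest
    return out

raw = "<arg1> attacked <arg2> using <arg3> at <arg4> place"
tpl = prepare_template(raw)
# the second slot has two arguments; the last slot stays unfilled
print(fill_template(tpl, [["Dzhokhar Tsarnaev"],
                          ["264 people", "the victims"],
                          ["a bomb"], []]))
# Dzhokhar Tsarnaev attacked 264 people and the victims using a bomb at <arg> place
```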

Method
In this section, we illustrate our framework based on simple-to-complex prediction (Figure 1). First, we introduce our memory-enhanced IAE model (Section 3.1). Here, we use retrieval to augment the model input and apply constrained decoding to improve the model output, both leveraging inter-event dependencies to benefit prediction. Second, we elaborate on how to define and calculate the difficulty of an event, and how to reorder the events in a document from simple to complex for simple-to-complex prediction (Section 3.2).

Memory-Enhanced IAE Model
Our memory-enhanced IAE model is based on a generative model. We use the same generative model both when calculating the difficulty of each event and when generating (the arguments of) events in a document from simple to complex. In this study, we assume that each event has a prediction order and that events in a document are predicted according to that order. After reordering, the prediction order of an event may change, which in turn may change the prediction retrieved for that event.

Model Input & Output
The generation of an event E in a document D is conditioned on: (1) the prediction order o ∈ {1, 2, ..., n_e}: the order of predicting (the arguments of) E, where n_e denotes the number of events in D; (2) the event context c: the concatenation of E's context words (a continuous span in D close to E's trigger) and E's template; (3) the retrieved prediction m_R: the prediction of an event that appeared before E, retrieved from the document memory, a data structure that caches the predictions of already predicted events in D.
To sum up, the input of event E for the model is the concatenation of the retrieved prediction m_R, the context words x_1, x_2, ..., x_n, and E's unfilled template T, where the context words and the template together form the event context c. The prediction of E is a filled template, with each <arg> placeholder replaced by the predicted argument (or left as is), as shown in Figure 1. If there are multiple arguments for the same slot, the arguments are connected with "and".
Figure 1: Our simple-to-complex progressive framework for document-level IAE. First, we calculate the difficulty of each event in a document D and obtain a new prediction order for that event. Second, we reorder the events in D from simple to complex and predict them accordingly. S2C denotes Simple-to-Complex, while F2B denotes Front-to-Back. Here, we plot the process of predicting the arguments of E_2.

Retrieval-Augmented Generation In the input stage (both for training and testing), we augment our model with similarity-based retrieval following Du et al. (2022) to make it capable of finding argument mentions beyond the context of an event, especially informative ones (Li et al., 2021). When predicting the i-th event E_i in a document D, the snapshot of the document memory is m = {m_1, m_2, ..., m_{i-1}}, where m_k denotes the prediction of the k-th event. We calculate the cosine similarity between E_i's context c_i and each prediction in m using S-BERT (Reimers and Gurevych, 2019) embeddings, and select the prediction with the highest score as additional input to help the prediction of E_i:

m_i^R = argmax_{m_k ∈ m} cos(SBERT(c_i), SBERT(m_k))

where SBERT(·) denotes S-BERT encoding and m_i^R denotes the retrieved prediction that E_i relies on.
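The retrieval step itself reduces to an argmax over cosine similarities. A minimal sketch follows; in the paper the embeddings come from S-BERT, for which toy vectors stand in here, and retrieve_prediction is our name, not the paper's.

```python
import numpy as np

def retrieve_prediction(context_vec, memory_vecs):
    """Pick the cached prediction whose embedding is most cosine-similar
    to the current event's context embedding; returns its index, or None
    when the memory is empty (the first predicted event has no memory)."""
    if not memory_vecs:
        return None
    c = np.asarray(context_vec, dtype=float)
    c = c / np.linalg.norm(c)
    sims = []
    for m in memory_vecs:
        m = np.asarray(m, dtype=float)
        sims.append(float(c @ (m / np.linalg.norm(m))))
    return int(np.argmax(sims))

# toy 3-d "embeddings": the second cached prediction points in almost
# the same direction as the context, so it is retrieved
print(retrieve_prediction([1.0, 0.0, 0.1],
                          [[0.0, 1.0, 0.0], [0.9, 0.1, 0.0]]))  # 1
```

In the actual model the retrieved filled template is then prepended to the input, so the generator can copy informative mentions from it.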
Constrained Decoding In the output stage, we introduce argument pair constraints following Du et al. (2022) to constrain the decoding of arguments with conflicting roles. For example, if the DETAINEE of event E_a is "Mike", then "Mike" cannot be decoded as the ATTACKER of another event E_b (which happened after E_a), since "Mike" is already in jail. Here, "Mike" as DETAINEE and "Mike" as ATTACKER form an argument pair constraint. However, once an incorrect prediction is used to constrain another, it may cause more errors (Du et al., 2022). In the example above, "Mike" will never be decoded as the ATTACKER of E_b once it is decoded as the DETAINEE of E_a, even if the prediction of DETAINEE is incorrect. To alleviate such error propagation, we disable the constraints when the model is not certain about the prediction of an argument. The certainty of an argument can be measured by the calibrated probability of decoding it. Low probability intuitively implies the model is not confident about the prediction, while we have also found that high probability (e.g., ≥ 0.8) corresponds to low prediction accuracy, as shown in Figure 3. Therefore, we set both lower and upper bounds on argument probabilities to exclude possibly incorrect constraints, and we refer to our pruned constraints as bounded constraints.
The heuristics of selecting bounds are discussed in Appendix A.3.
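The resulting pruning rule is a one-liner. Here is a sketch with the bounds obtained in Appendix A.3 (0.5 and 0.8); keep_constraint is our name, not the paper's.

```python
def keep_constraint(p: float, lower: float = 0.5, upper: float = 0.8) -> bool:
    """A constraint stays active only when the calibrated probability of the
    argument that induced it lies strictly between the bounds: p <= lower
    means the model is unsure about that argument, and p >= upper was
    empirically found to correspond to low accuracy (over-confidence),
    so constraints at both extremes are pruned."""
    return lower < p < upper

# "Mike" decoded as DETAINEE with calibrated probability 0.65: the constraint
# forbidding "Mike" as ATTACKER of a later event stays active; at 0.45 or
# 0.90 it would be disabled.
print(keep_constraint(0.65), keep_constraint(0.45), keep_constraint(0.90))
# True False False
```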

Simple-to-Complex Reordering
Since an event usually contains multiple arguments, we reckon that the difficulty of an event lies in the average difficulty of predicting its arguments. In this section, we elaborate on how to calculate the difficulty of an event, as well as how to reorder the events in a document by difficulty and predict them from simple to complex.

Let the set of prediction orders of events in a document D be o = {o_1, o_2, ..., o_{n_e}}, where o_i means the i-th appeared event is the o_i-th to be predicted; front-to-back prediction then satisfies o_i = i, i = 1, 2, ..., n_e. Let the probabilities of decoding the arguments of the i-th event E_i in D be

p_i = {p_i^(1), p_i^(2), ..., p_i^(n_a)}

where n_a denotes the number of predicted arguments of E_i, and p_i^(j) denotes the probability that the generative model assigns to the j-th argument of the i-th event. The argument probability reflects the certainty of the generative model when predicting an argument, inversely related to our desired difficulty. Thus, we define the difficulty of predicting the j-th argument of E_i as

d_i^arg(j) = 1 - p_i^(j)

The difficulty of E_i is defined as the average difficulty of predicting its arguments, so we take the average of the d_i^arg(j) and obtain

d_i^evt = (1/n_a) Σ_{j=1}^{n_a} d_i^arg(j)

If no arguments of E_i are predicted, then d_i^evt = 2. That means E_i will be placed at the rear, since it provides no arguments/roles that might benefit prediction. After calculating the difficulty of every event in D, we obtain a new set of prediction orders o' = {o'_1, o'_2, ..., o'_{n_e}} by sorting events in ascending order of difficulty. Then, we can predict (the arguments of) the events in D from simple to complex according to o'.
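The difficulty computation and reordering described above can be sketched as follows. This assumes an argument's difficulty is one minus its decoded probability, matching the inverse relation stated above; the function names are ours, not from the paper.

```python
def event_difficulty(arg_probs):
    """d_evt = mean over predicted arguments of (1 - p); an event with no
    predicted arguments gets difficulty 2 so it is placed at the rear."""
    if not arg_probs:
        return 2.0
    return sum(1.0 - p for p in arg_probs) / len(arg_probs)

def simple_to_complex_order(all_arg_probs):
    """Return o', where o'[i] is the prediction rank (1-based) of the
    i-th appeared event after sorting by ascending difficulty."""
    diffs = [event_difficulty(p) for p in all_arg_probs]
    ranked = sorted(range(len(diffs)), key=lambda i: diffs[i])
    order = [0] * len(diffs)
    for rank, i in enumerate(ranked, start=1):
        order[i] = rank
    return order

# three events: one hard (low prob), one easy (high probs), one with no
# predicted arguments -> predicted second, first, and last respectively
print(simple_to_complex_order([[0.4], [0.9, 0.8], []]))  # [2, 1, 3]
```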
The method described above assumes that the model providing the probabilities is well calibrated, i.e., that confidence (the probability a model assigns to a prediction) equals or nearly equals accuracy (the real correctness of a prediction). In other words, high confidence corresponds to high accuracy, and vice versa. However, studies reveal that current Deep Neural Networks (DNNs) are prone to over-confidence, which implies that the model's confidence is not reliable (Guo et al., 2017). We have found similar problems in our model, as discussed in Section 5.1. Therefore, we should calibrate these probabilities before using them for our simple-to-complex reordering. Specifically, we adopt temperature scaling (Guo et al., 2017; Desai and Durrett, 2020), a simple and effective calibration method. In this work, the temperature T is selected by minimizing the Expected Calibration Error (ECE) (Pakdaman Naeini et al., 2015) on the validation set, and we denote the temperature with the lowest ECE as T'.
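The temperature search can be sketched as a toy implementation over max-softmax confidences. The names ece and select_temperature are ours, and a simple grid search stands in for whatever optimizer is actually used; this is an illustration of the procedure, not the paper's code.

```python
import numpy as np

def ece(probs, correct, n_bins=10):
    """Expected Calibration Error: bin predictions by confidence and
    average |accuracy - confidence| per bin, weighted by bin size."""
    probs = np.asarray(probs, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    total = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (probs > lo) & (probs <= hi)
        if mask.any():
            total += mask.mean() * abs(correct[mask].mean() - probs[mask].mean())
    return total

def select_temperature(logits, labels, grid=np.arange(0.5, 5.01, 0.1)):
    """Pick T' minimizing ECE of max-softmax confidences on validation data."""
    best_t, best_e = None, float("inf")
    for t in grid:
        z = logits / t
        p = np.exp(z - z.max(axis=1, keepdims=True))
        p /= p.sum(axis=1, keepdims=True)
        conf = p.max(axis=1)
        correct = p.argmax(axis=1) == labels
        e = ece(conf, correct)
        if e < best_e:
            best_t, best_e = t, e
    return best_t
```

In practice T' is fit once on the validation set and then reused unchanged at test time.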
Accounting for calibration, the calculation of p_i^(j) needs a revision. Let the logits vector of the j-th predicted argument of the i-th event be z_i^(j); then

p_i^(j) = max_k softmax(z_i^(j) / T')_k

where k traverses each dimension of z_i^(j).
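A minimal sketch of this temperature-scaled probability (calibrated_prob is our name); dividing the logits by T' > 1 softens an over-confident distribution before taking the max-softmax:

```python
import numpy as np

def calibrated_prob(logits, t_prime):
    """p = max_k softmax(z / T')_k, the calibrated (first-token)
    probability of the decoded argument."""
    z = np.asarray(logits, dtype=float) / t_prime
    p = np.exp(z - z.max())  # shift for numerical stability
    return float((p / p.sum()).max())

print(calibrated_prob([4.0, 1.0, 0.0], 1.0))  # ~0.936 (uncalibrated)
print(calibrated_prob([4.0, 1.0, 0.0], 2.0))  # ~0.736 (softened by T' = 2)
```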

Dataset and Evaluation Metrics
We evaluate our framework on WIKIEVENTS (Li et al., 2021), as it annotates all the events in a document (16 events per document on average), while existing document-level datasets such as DocEE (Tong et al., 2022), RAMS (Ebner et al., 2020) and MUC-4 (Sundheim, 1992) annotate at most 3 events per document. It also provides complete coreference annotation for document-level IAE. Its statistics are shown in Table 1. We measure the Argument Identification (Arg-I) and Argument Classification (Arg-C) capabilities of our model following Li et al. (2013). If an argument span matches any of the gold informative arguments of the event, the argument is correctly identified. If the semantic role also matches, the argument is considered correctly classified. Following previous studies on document-level IAE (Li et al., 2021; Du et al., 2022), we adopt Head Word Match (Head F1) (Huang and Riloff, 2021) and Coreferential Match (Coref F1) (Ji and Grishman, 2008) to judge whether the predicted argument span matches the gold argument span. Head Word Match demands that the first word of the predicted argument match that of the gold argument, while Coreferential Match only requires the extracted argument to be coreferential with the gold argument. We report micro P/R/F1 averaged over three different seeds.
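As described, Head Word Match reduces to a first-word comparison. A sketch follows; head_word_match is our name, and real evaluation additionally handles Coreferential Match via the dataset's coreference annotation, which is not modeled here.

```python
def head_word_match(pred_span: str, gold_span: str) -> bool:
    """Head Word Match as described above: the first word of the predicted
    argument span must match the first word of the gold argument span
    (case-insensitive here for robustness)."""
    pred, gold = pred_span.split(), gold_span.split()
    return bool(pred) and bool(gold) and pred[0].lower() == gold[0].lower()

print(head_word_match("Dzhokhar Tsarnaev", "Dzhokhar"))       # True
print(head_word_match("the bomber", "Dzhokhar Tsarnaev"))     # False
```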

Baselines
We compare our framework with a series of competitive baselines: (1) BERT-CRF (Shi and Lin, 2019), a simple BERT-based model for argument identification and classification that does not incorporate lexical or syntactic features. (2) BART-Gen (Li et al., 2021), a conditional neural text generation model that generates a filled template for each event given the event template and context words. (3) BART-Gen (w/ M+C) (Du et al., 2022), a framework based on BART-Gen that utilizes retrieval to augment the model input and constructs argument pair constraints for decoding. It is the SOTA model on document-level IAE, but still extracts events according to their appearance order in the document. We also report the results of BART-Gen (w/ M) for comparison.

Main Results
The main results for document-level IAE are presented in Table 2. From the results, we can conclude that:
• Our S2C-CD model outperforms all previous methods on WIKIEVENTS for document-level IAE, with an average gain of 1.4% in F1 across all four settings.
• All models augmented with retrieval (i.e., w/ M) perform better compared with BERT-CRF and raw BART-Gen, showing the importance of modeling inter-event dependencies.
• Compared to BART-Gen (w/ M), the addition of simple-to-complex reordering (the S2C model) greatly improves F1, which on average increases by 1.34 in Arg-I and 1.42 in Arg-C. This improvement can mainly be attributed to our simple-to-complex prediction paradigm, since it allows more inter-event dependency links, i.e., from an event to all the events that appeared after it.
• After applying our bounded constraints (S2C-CD model), there is an additional improvement in P and F1, which shows that incorrect constraints are effectively pruned.

Is Calibration Necessary?
What to Calibrate? For each argument, we focus on its first-token probability, and this probability is what we aim to calibrate. The reason for using first-token probabilities is that the generation of the remaining tokens is highly dependent on the first token. As shown in Figure 2, 87% of the non-first token probabilities are ≥ 0.9, while first-token probabilities are better distributed, with only 53% of them ≥ 0.9.
Why Calibrate? Modern DNNs are prone to over-confidence, which implies that the model's confidence is not reliable (Guo et al., 2017). We have found similar problems in our model. As shown in Figure 2, the first and non-first token probability distributions both exhibit severe over-confidence before calibration, with most probabilities ≥ 0.9. This suggests that the model tends to assign a high (i.e., → 1) probability to nearly all of the arguments, which cannot truly reflect how sure the model is of each argument. Over-confidence also leads to miscalibration, which is reflected in the reliability diagram in Figure 3. The reliability diagram plots the relation between confidence and accuracy; its definition is discussed in Appendix A.2. As shown in Figure 3, most points on the orange curve (before calibration) are far from the zero-error curve where confidence exactly equals accuracy, demonstrating that uncalibrated probabilities are unreliable. After calibration, the first-token probability distribution becomes flat (Figure 2) and calibrated (Figure 3). Therefore, we should calibrate probabilities (confidence) to align them with accuracy before using them for our simple-to-complex reordering.

Influence of Uncalibrated Probabilities
We conduct an ablation study to further explore what happens if we reorder events using uncalibrated probabilities. As shown in Table 3, if we order events by uncalibrated probabilities, F1 is only comparable to excluding simple-to-complex reordering from our S2C model entirely. The performance is maximized only when the probabilities are calibrated. Therefore, calibration is essential.

Does Difficulty Calculation Need Memory?
In this section, we present two possible ways of calculating the difficulty of an event and explore their impact on performance. In Figure 4, we define the first inference as steps 1-2 (calculating the difficulty) and the second inference as step 3 (predicting the arguments of the reordered events). The first way is the one utilized in our framework, where we use the same model for both inferences. When calculating the difficulty at the first inference, we also use retrieval, for two reasons. On the one hand, the model input is augmented with retrieval during training, so the input/prompt format should be consistent during testing. On the other hand, the model is trained to predict the arguments of an upcoming event conditioned on the prediction of an already predicted event. Therefore, we should calculate "the difficulty of an event conditioned on the prediction of an earlier predicted event". However, the retrieved predictions of the same event at the two inferences are usually different.
The second way is to use separate models for each inference.At the first inference, we use a model trained without retrieval to calculate the "raw" difficulty of an event (i.e., do not condition on the prediction of an earlier predicted event).At the second inference, we train a retrieval-augmented model.This way removes the possible influence of retrieval on calculating the difficulty of an event, but the training overhead doubles.
We conduct an experiment to compare these two ways, as shown in Table 4, where R1 and R2 represent one model and two models, respectively. R1 is comparable to R2 in Arg-I, while R1 is notably better than R2 in Arg-C. The results suggest that using one model is generally better in terms of performance, so we should calculate the difficulty of an event conditioned on the prediction of an earlier predicted event. Besides, using one model is more time-efficient.

Influence of Bounded Constraints
In this section, we first compare our bounded constraints with those presented in Du et al. (2022), then analyze the impact of the lower and upper bounds individually.
In Table 5, we observe that with the original constraints (Du et al., 2022), the model performs only comparably with our S2C model. This implies the numbers of correct and incorrect constraints are nearly equal when we predict events from simple to complex. By contrast, our bounded constraints perform well, suggesting that the number of incorrect constraints is indeed reduced while the correct constraints are (mostly) kept. Using the steps presented in Appendix A.3, we obtain the lower bound 0.5 and the upper bound 0.8, so we disable a constraint if the probability of decoding the argument is ≤ 0.5 or ≥ 0.8. Based on this, we analyze the influence of the lower and upper bounds individually, as shown in Table 6. We find that whether we remove the lower bound or the upper bound, the performance drops, indicating that both bounds are useful for reducing the number of incorrect constraints.

Case Study
In the case presented in Figure 5, we would like to predict the arguments of event E = Damage, which describes the mental damage that Dzhokhar Tsarnaev brought to the victims as a bomber. In front-to-back prediction, E can only access the predictions of earlier appeared events and retrieves E_1's prediction as additional input. The death of Dzhokhar Tsarnaev in E_1 happened after E, wrongly restricting the prediction of "Dzhokhar Tsarnaev" as the DAMAGER of E. By contrast, with our simple-to-complex prediction, E has the chance to rely on the INJURER argument of a later appeared event E_2 and obtain the correct DAMAGER argument "Dzhokhar Tsarnaev". E_2 describes that Dzhokhar Tsarnaev injured 264 people in the bombing as the INJURER, thus bringing mental damage to them as the DAMAGER.
Comparing these two prediction paradigms, we find that simple-to-complex prediction is better mainly because it allows more inter-event dependency links. In this example, it is intuitive that E is more similar to E_2, since they depict the physical and mental damage Dzhokhar Tsarnaev brought to the victims. However, the dependency link from E to E_2 is disabled when predicting events from front to back.

Error Analysis
Table 7 summarizes the error types of our S2C-CD model. The errors mainly come from the inability to recognize an argument span (around half), while only about 8% of identified arguments are assigned incorrect semantic roles. Therefore, identifying argument spans matters more than assigning more accurate roles to already extracted arguments.
6 Related Work

Document-level EAE
Unlike sentence-level EAE (Li et al., 2014; Du and Cardie, 2020; Xiangyu et al., 2021), in document-level EAE events and their participants usually spread across the document. We focus on document-level IAE (Li et al., 2021). Unlike existing datasets such as DocEE (Tong et al., 2022), RAMS (Ebner et al., 2020) and MUC-4 (Sundheim, 1992), which only annotate at most 3 events per document, WIKIEVENTS annotates all the events in a document, with an average of 16 events per document. It also provides complete coreference annotation for evaluating document-level IAE. Recently, generation-based methods have been proposed for document-level EAE. Among them, template generation-based approaches (Li et al., 2021; Huang et al., 2022; Du et al., 2022) are widely utilized. BART-Gen (Li et al., 2021) conditions generation on event templates and context words but considers each event independently. Further, Du et al. (2022) and Du and Ji (2022) introduced the idea of "memory" to document-level EAE, where the predictions of already predicted events are utilized as additional input. Although these methods model inter-event dependencies to some extent, they ignore the dependency links from an event to all the events that appeared after it in a document. Besides, uncertain/false event predictions may be cached first and retrieved by future events, misleading their prediction.

Confidence Calibration
Studies on the calibration of natural language models have been drawing attention recently (Desai and Durrett, 2020; Park and Caragea, 2022; Kim et al., 2023). Among modern calibration approaches, temperature scaling is a simple and effective method (Desai and Durrett, 2020) that can produce low ECE (Guo et al., 2017; Chen et al., 2023). Due to its low time overhead and low ECE, we adopt it in our work. Other works focus on methods such as label smoothing (Pereyra et al., 2017) and data augmentation (Hendrycks et al., 2020), but these methods cannot produce as low an ECE as temperature scaling (Chen et al., 2023). More recent studies have started to treat calibration as an additional task, which requires collecting data and training extra models (Ye and Durrett, 2022; Zhang et al., 2021). To reduce time overhead, we do not consider these approaches.

Conclusion
In this work, we propose the idea of simple-to-complex prediction for events in a document, where events are reordered from simple to complex and predicted accordingly. Besides, we introduce retrieval to augment the model input and apply constrained decoding to improve the model output. Empirical results and analysis demonstrate that our best model outperforms prior methods by a notable margin, and that our simple-to-complex prediction is beneficial since it allows more inter-event dependency links, i.e., from an event to all the events that appeared after it.

Limitations
Firstly, our framework requires two inference passes: the first calculates the difficulty of the events in a document, and the second predicts the arguments of these events from simple to complex, which increases the inference cost. Secondly, setting lower/upper bounds is a hard pruning strategy that disables constraints whose argument probability is too low or too high. This strategy rigidly excludes constraints about which the model is not sufficiently certain, without explicitly identifying the incorrect constraints caused by wrongly predicted arguments. We leave these problems for future work.

Figure 2: Uncalibrated/calibrated probability distributions. The two diagrams on the left respectively show the uncalibrated first and non-first (remaining) token probability distributions, while the diagram on the right shows the calibrated first-token probability distribution.

Figure 3: The reliability diagram before/after calibration. The dashed line represents zero error.

Figure 4: Two ways of calculating the difficulty.

Figure 5: Case study on simple-to-complex reordering.

Table 6: Influence of the lower and upper bounds.

Table 7: Errors made by our framework under Head Match (HM) and Coref Match (CM).