The Devil is in the Details: On the Pitfalls of Event Extraction Evaluation

Event extraction (EE) is a crucial task that aims to extract events from texts and includes two subtasks: event detection (ED) and event argument extraction (EAE). In this paper, we examine the reliability of EE evaluations and identify three major pitfalls: (1) The data preprocessing discrepancy makes evaluation results on the same dataset not directly comparable, yet data preprocessing details are rarely noted or specified in papers. (2) The output space discrepancy of different model paradigms leaves different-paradigm EE models without common grounds for comparison and also leads to unclear mappings between predictions and annotations. (3) The absence of pipeline evaluation in many EAE-only works makes them hard to compare directly with EE works and may not reflect model performance in real-world pipeline scenarios well. We demonstrate the significant influence of these pitfalls through comprehensive meta-analyses of recent papers and empirical experiments. To avoid these pitfalls, we suggest a series of remedies, including specifying data preprocessing, standardizing outputs, and providing pipeline evaluation results. To help implement these remedies, we develop a consistent evaluation framework, OMNIEVENT, which can be obtained from https://github.com/THU-KEG/OmniEvent.


Introduction
Event extraction (EE) is a fundamental information extraction task aiming at extracting structural event knowledge from plain texts. As illustrated in Figure 1, it is typically formalized as a two-stage pipeline (Ahn, 2006). The first subtask, event detection (ED), is to detect the event triggers (keywords or phrases evoking events, e.g., quitting in Figure 1) and classify their event types (e.g., End-Position). The second subtask, event argument extraction (EAE), is to extract corresponding event arguments and their roles (e.g., Elon Musk and its argument role Person) based on the first-stage ED results.
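For concreteness, the structured output of such a two-stage pipeline for the Figure 1 example can be sketched as follows; this is an illustrative representation only, and the class and field names do not correspond to any particular dataset schema or toolkit.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class Argument:
    span: Tuple[int, int]   # character offsets of the argument mention
    text: str
    role: str

@dataclass
class Event:
    trigger_span: Tuple[int, int]  # character offsets of the trigger
    trigger_text: str
    event_type: str                # predicted by the ED stage
    arguments: List[Argument] = field(default_factory=list)  # filled by the EAE stage

text = "Elon Musk is quitting as Chief Executive of Twitter."
event = Event(
    trigger_span=(13, 21), trigger_text="quitting", event_type="End-Position",
    arguments=[Argument(span=(0, 9), text="Elon Musk", role="Person")],
)
```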
Since events play an important role in human language understanding and broad applications benefit from structural event knowledge (Ji and Grishman, 2011; Glavaš and Šnajder, 2014; Hogenboom et al., 2016; Zhang et al., 2020a), EE has attracted much research attention, and novel models have been continually developed. Beyond the conventional paradigms like classification and sequence labeling (Nguyen et al., 2016; Chen et al., 2018), new model paradigms such as span prediction (Du and Cardie, 2020b) and conditional generation (Lu et al., 2021; Li et al., 2021b) have been proposed. These sophisticated models push evaluation results to increasingly high levels.
However, due to the complex input/output formats and task pipeline of EE, there are some hidden pitfalls in EE evaluations, which are rarely noted and discussed in EE papers. These pitfalls mean that many competing EE methods actually lack grounds for comparison, and the reported scores cannot reflect real-world model performance well.
In this paper, we summarize three major pitfalls: (1) Data preprocessing discrepancy. If two EE works conduct evaluations on the same dataset but adopt different preprocessing methods, their results are not directly comparable. Since EE datasets have complex data formats (involving multiple heterogeneous elements, including event triggers, arguments, entities, temporal expressions, etc.), the data preprocessing methods of existing works often disagree on some design choices, such as whether to include multi-token triggers, which results in major data discrepancies. For instance, for the widely-used English subset of ACE 2005 (Walker et al., 2006), one common preprocessing script yields 5,055 event triggers, while another yields 5,349.
(2) Output space discrepancy. Different model paradigms have inconsistent output spaces, so the evaluation metrics of different-paradigm models are often not calculated on the same bases. For example, the phrase Elon Musk is one argument candidate in the output space of conventional classification-based methods, and it is regarded as one error case when the model misclassifies it. Other model paradigms, such as sequence labeling, have freer output formats and can make two independent predictions for the two tokens Elon and Musk, which then account for two error cases in the metric calculation. The larger output spaces of the new model paradigms also result in unclear mappings between predictions and annotations in some cases, which are often overlooked in EE evaluation implementations and lead to problematic results. These details are presented in § 3.3.
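To make the difference in counting bases concrete, the following toy sketch (with assumed labels and counting rules, for illustration only) contrasts candidate-level error counting, as in classification-based methods, with token-level counting under a BIO sequence-labeling view.

```python
# Toy illustration (assumed counting rules): the same misclassification of the
# argument "Elon Musk" contributes one error at the candidate level but two
# errors at the token level under a BIO sequence-labeling view.

gold_candidate = {"Elon Musk": "Person"}
pred_candidate = {"Elon Musk": "Company"}           # CLS paradigm: one decision per candidate
candidate_errors = sum(pred_candidate[c] != gold_candidate[c] for c in gold_candidate)

gold_tokens = {"Elon": "B-Person", "Musk": "I-Person"}
pred_tokens = {"Elon": "B-Company", "Musk": "I-Company"}  # SL paradigm: one decision per token
token_errors = sum(pred_tokens[t] != gold_tokens[t] for t in gold_tokens)

print(candidate_errors, token_errors)  # 1 vs. 2
```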
(3) Absence of pipeline evaluation. Recent works handling only the EAE subtask often evaluate performance based on gold event triggers (Xi et al., 2021). In contrast, conventional EE works often conduct pipeline evaluation, i.e., evaluate EAE performance based on triggers predicted at the ED stage. The absence of pipeline evaluation makes these EAE-only works hard to compare directly with EE works, which has discouraged the research community from considering all the EE subareas in a holistic view. Moreover, using only gold triggers in evaluation cannot assess EAE models' resistance to the noise of predicted triggers, which is important in real-world application scenarios.
We conduct systematic meta-analyses of EE papers and empirical experiments, demonstrating the pitfalls' broad and significant influence. We suggest a series of remedies to avoid these pitfalls, including specifying data preprocessing methods, standardizing outputs, and providing pipeline evaluation results. To help conveniently achieve these remedies, we develop a consistent evaluation framework, OMNIEVENT, which contains implementations for data preprocessing and output standardization, as well as off-the-shelf predicted triggers on widely-used datasets for easier pipeline evaluation.
To summarize, our contributions are two-fold: (1) We systematically analyze the inconspicuous pitfalls of EE evaluations and demonstrate their significant influence with meta-analyses and experiments.
(2) We propose corresponding remedies to avoid the pitfalls and develop a consistent evaluation framework to help implement them.

Related Work
Traditional methods (Ji and Grishman, 2008; Gupta and Ji, 2009; Hong et al., 2011; Li et al., 2013) rely on human-crafted features and rules to extract events. Most modern EE models automate feature learning with neural networks (Nguyen and Grishman, 2015; Nguyen et al., 2016; Nguyen and Grishman, 2018) and adopt different model paradigms to model the EE task. The most common classification-based methods view EE as classifying given trigger and argument candidates into different labels (Feng et al., 2016; Chen et al., 2017; Liu et al., 2018b; Lai et al., 2020). Sequence labeling methods (Nguyen et al., 2016; Chen et al., 2018; Araki and Mitamura, 2018; Ding et al., 2019; Ma et al., 2020; Nguyen et al., 2021; Guzman-Nateras et al., 2022) perform EE by labeling every word following a certain tagging schema such as BIO (Ramshaw and Marcus, 1995). Recently, some works (Du and Cardie, 2020b; Li et al., 2020a; Liu et al., 2021b) propose to cast the task formalization of EE into resource-rich machine reading comprehension tasks and adopt the span prediction paradigm to predict the starting and ending positions of event trigger and argument spans. With the development of generative pre-trained language models (Lewis et al., 2020; Raffel et al., 2020; Brown et al., 2020), there have been works (Lu et al., 2021; Xi et al., 2021; Li et al., 2021b; Liu et al., 2022c; Du et al., 2022; Hsu et al., 2022) exploring the conditional generation paradigm to generate sequences indicating EE results.
A few previous works (Lai et al., 2020) have noted that the data preprocessing discrepancy may influence evaluation results, but they did not specifically study its impact with in-depth analyses. To the best of our knowledge, we are the first to study all three kinds of pitfalls of EE evaluation and propose comprehensive remedies for them.

Pitfalls of Event Extraction Evaluation
We first introduce our investigation setup for metaanalysis and empirical analysis ( § 3.1). Then we analyze the three pitfalls: data preprocessing discrepancy ( § 3.2), output space discrepancy ( § 3.3), and absence of pipeline evaluation ( § 3.4).

Investigation Setup
We adopt the following two investigation methods to analyze the influence of the observed pitfalls.

Meta-Analysis
To comprehensively understand the research status and investigate the potential influence of the evaluation pitfalls, we analyze a broad range of recent EE studies in the meta-analysis. Specifically, we manually retrieve all published papers concerning the EE, ED, and EAE tasks at four prestigious venues from 2015 to 2022 via keyword matching (using event and extraction as search keywords) and manual topic rechecking by the authors. The complete paper list is shown in appendix C, including 44 papers at ACL, 39 at EMNLP, 19 at NAACL, and 14 at COLING.
We conduct statistical analyses of these papers and their released codes (if any) from multiple perspectives. These statistics will be presented to demonstrate the existence and influence of the pitfalls in the following sections, respectively.
Empirical Analysis In addition to the meta-analysis, we conduct empirical experiments to quantitatively analyze the pitfalls' influence on EE evaluation results. We reproduce several representative models covering all four model paradigms mentioned in § 2 to systematically study the influence. Specifically, the models contain: (1) Classification methods, including DMCNN (Chen et al., 2015), DMBERT (Wang et al., 2019a,b), and CLEVE. DMCNN and DMBERT adopt a dynamic multi-pooling operation over hidden representations of convolutional neural networks and BERT (Devlin et al., 2019), respectively. CLEVE is an event-aware pre-trained model enhanced with event-oriented contrastive pre-training. (2) Sequence labeling methods, including BiLSTM+CRF and BERT+CRF, which adopt the conditional random field (Lafferty et al., 2001) over BiLSTM and BERT encoders, respectively. (3) Span prediction methods, including EEQA (Du and Cardie, 2020b). (4) Conditional generation methods, including Text2Event (Lu et al., 2021). Among the surveyed papers, most EE works adopt ACE 2005 (Walker et al., 2006) in their experiments. Hence we also adopt this most widely-used dataset in our empirical experiments to analyze the pitfalls without loss of generality. The reproduction performances are shown in Table 1. Following the conventional practice, we report precision (P), recall (R), and the F1 score. In the following analyses, we show the impact of the three pitfalls by observing how the performances change after controlling for the pitfalls' influence.
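Precision, recall, and F1 are conventionally micro-averaged over exact (offset, label) matches. The following minimal sketch illustrates this conventional computation for ED; it is an illustrative formulation rather than the exact scoring code of any reproduced model.

```python
def micro_prf(pred, gold):
    """Micro precision/recall/F1 over sets of (span, label) tuples.

    A conventional formulation: a prediction is a true positive only if both
    its offsets and its label exactly match a gold annotation.
    """
    pred, gold = set(pred), set(gold)
    tp = len(pred & gold)
    p = tp / len(pred) if pred else 0.0
    r = tp / len(gold) if gold else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

# Example: one correct trigger, one spurious trigger, one missed trigger.
pred = [((13, 21), "End-Position"), ((45, 52), "Attack")]
gold = [((13, 21), "End-Position"), ((60, 66), "Transport")]
print(micro_prf(pred, gold))  # (0.5, 0.5, 0.5)
```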

Data Preprocessing Discrepancy
Due to the inherent task complexity, EE datasets naturally involve multiple heterogeneous annotation elements. For example, besides event triggers and arguments, EE datasets often annotate entities, temporal expressions, and other spans as argument candidates. This complex data format makes data preprocessing methods easily differ in many details, which in turn renders the reported results on the same dataset not directly comparable. However, this pitfall has not received extensive attention.
To carefully demonstrate the differences brought by the data preprocessing discrepancy, we conduct detailed analyses of the preprocessing methods for ACE 2005 used in the surveyed papers. The three most widely-used open-source preprocessing scripts are ACE-DYGIE, ACE-OneIE, and ACE-Full. In addition to these scripts, there are 6 other open-source preprocessing scripts that are only used once. The utilization rates and data statistics of the different preprocessing methods are shown in Table 2. From the statistics, we can observe that: (1) The data differences brought by preprocessing methods are significant. The differences mainly come from different preprocessing implementation choices, as summarized in Table 3. For instance, ACE-DYGIE and ACE-OneIE ignore the annotated temporal expressions and values in ACE 2005, which results in 13 fewer argument roles compared to ACE-Full. Intuitively, this significant data discrepancy may result in inconsistent evaluation results.
(2) Each preprocessing script has a certain utilization rate, and the majority (63%) of papers do not specify their preprocessing methods. The high preprocessing inconsistency and the high Unspecified rate both show that our community has not fully recognized the significance of the discrepancies resulting from differences in data preprocessing.
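To make concrete how such implementation choices (Table 3) reshape a dataset, the following toy sketch filters a couple of illustrative annotations under two assumed flags; the flag and role names are hypothetical and do not correspond to the options of any released script.

```python
# Illustrative sketch of how seemingly small preprocessing choices change the
# resulting dataset; the flag names and roles are hypothetical.

def preprocess(events, keep_multi_token_triggers=True, keep_time_value_args=True):
    kept = []
    for ev in events:
        if not keep_multi_token_triggers and len(ev["trigger"].split()) > 1:
            continue  # e.g., drop triggers like "take over"
        args = [a for a in ev["arguments"]
                if keep_time_value_args or a["role"] not in {"Time-Within", "Price"}]
        kept.append({**ev, "arguments": args})
    return kept

events = [
    {"trigger": "take over", "type": "Start-Position", "arguments": []},
    {"trigger": "quitting", "type": "End-Position",
     "arguments": [{"role": "Person"}, {"role": "Time-Within"}]},
]
print(len(preprocess(events, keep_multi_token_triggers=False)))        # 1 event kept
print(preprocess(events, keep_time_value_args=False)[1]["arguments"])  # [{'role': 'Person'}]
```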

To further empirically investigate the influence of preprocessing, we conduct experiments on ACE 2005. Table 4 shows the F1 differences when keeping all settings unchanged except for the preprocessing scripts. We can observe that the influence of different preprocessing methods is significant and varies across models. This indicates that evaluation results on the same dataset are not necessarily comparable, due to the unexpectedly large influence of preprocessing details.
Moreover, besides ACE 2005, there are also data preprocessing discrepancies in other datasets. For example, in addition to the implementation details, the data split of the KBP dataset is not always consistent (Li et al., 2021a), and some used LDC datasets are not freely available, such as LDC2015E29. Based on all the above analyses, we suggest the community pay more attention to data discrepancies caused by preprocessing, and we propose corresponding remedies in § 4.1.

Output Space Discrepancy
As shown in Figure 3, the diversity of adopted model paradigms in EE studies has substantially increased in recent years. Figure 2 illustrates the different paradigms' workflows in the EAE scenario. The paradigms inherently have very different output spaces, which results in inconspicuous pitfalls in comparative evaluations across paradigms.

Inconsistent Output Spaces between Different Paradigms
As shown in Figure 2, there are substantial differences between the model output spaces of different paradigms. CLS-paradigm models only output a unique label for each candidate in a pre-defined set, while models of the SL and SP paradigms can make predictions for any consecutive span in the input sequence. The output space of CG-paradigm models is even larger, as their vanilla output sequences (i.e., without tricks like vocabulary constraints) are completely free and can even involve tokens unseen in the input. The inconsistent output spaces make the evaluation metrics of different-paradigm models calculated on different bases and not directly comparable. For instance, when calculating the confusion matrices for the prediction as Chief Executive of in Figure 2, the CLS paradigm takes it as one true positive (TP) and two false positives (FP), while the remaining paradigms only count it as one FP. The CLS paradigm may also have an advantage in some cases, since it is constrained by the pre-defined candidate sets and cannot make illegal predictions as the other paradigms may.

Unclear Mappings between Predictions and Annotations
Implementing the mappings between model predictions and dataset annotations is a key component of evaluation. The larger output spaces of the SL, SP, and CG paradigms often produce unclear mappings, which are easily neglected in EE evaluation implementations and influence the final metrics. As shown in Figure 2 (bottom right), we summarize three major unclear mapping issues: ① Prediction span overlaps the gold span. A prediction span of non-CLS paradigm models may overlap but not strictly align with the annotated span, bringing in an unclear implementation choice. As in Figure 2, it is unclear whether the predicted role Position for the span as Chief Executive of should be regarded as a correct prediction for the contained annotated span Chief Executive.

② Multiple predictions for one annotated span. Without special constraints, models of the SP and CG paradigms may make multiple predictions for one span. Figure 2 presents two contradictory predictions (Company and Person) for the annotated span Elon Musk. Crediting only the correct one or penalizing both leads to different evaluation results. ③ Predictions without positions for non-unique spans. Vanilla CG-paradigm models make predictions by generating contents without specifying their positions. When the predicted spans are non-unique in the inputs, it is unclear how to map them to annotated spans at different positions. As in Figure 2, the CG model outputs two Twitter predictions, which can be mapped to two different input spans.

Table 5: The precision, recall, and F1 (%) differences between evaluation with and without our output standardization. The results are evaluated on ACE-OneIE, and the results for other preprocessing methods are in appendix B. Output standardization aligns the output spaces of the other paradigms into that of the CLS paradigm, and thus we do not include the CLS-paradigm models here, whose results are unchanged.
To quantitatively demonstrate the influence of the output space discrepancy, we conduct empirical experiments. Specifically, we propose an output standardization method (details in § 4.2), which unifies the output spaces of different paradigms and handles all the unclear mapping issues. We report the changes in metrics between the original evaluation implementations and the evaluation with our output standardization in Table 5. The results change noticeably, with a maximum increase of +2.8 in ED precision and a maximum decrease of −3.5 in EAE recall. This indicates that the output space discrepancy can lead to highly inconsistent evaluation results. Hence, we advocate for awareness of the output space discrepancy in evaluation implementations and suggest performing output standardization when comparing models of different paradigms.

Absence of Pipeline Evaluation
The event extraction (EE) task is typically formalized as a two-stage pipeline, i.e., first event detection (ED) and then event argument extraction (EAE). In real applications, EAE is based on ED and only extracts arguments for triggers detected by the ED model. Therefore, the conventional evaluation of EAE is based on predicted triggers and considers ED prediction errors, which we call pipeline evaluation. It assesses the overall performance of an event extraction system and is consistent with real-world pipeline application scenarios.
However, as shown in Figure 4, more and more works have focused only on EAE in recent years. For convenience and to set a unified evaluation basis among EAE-only works, 95.45% of them only evaluate EAE taking gold triggers as inputs. We dub this evaluation setting gold trigger evaluation. The conventional pipeline evaluation of EE works is absent in most EAE-only works, which poses two issues: (1) The absence of pipeline evaluation makes the results of EAE-only works hard to cite and compare directly in EE studies. In the papers covered by our meta-analysis, there is nearly no direct comparison between EE methods and EAE-only methods. This indicates that the evaluation setting difference has created a gap between the two closely connected research tasks, which hinders the community from comprehensively understanding the research status.
(2) The gold trigger evaluation may not reflect real-world performance well, since it ignores EAE models' resistance to trigger noise. In real-world applications, the input triggers for EAE models are noisy predicted triggers. A good EAE method should be resistant to trigger noise, e.g., it should not extract arguments for false-positive triggers. The gold trigger evaluation neglects such trigger noise.
To assess the potential influence of this pitfall, we compare the experimental results of various models under gold trigger evaluation and pipeline evaluation in Table 6. We can observe different trends from the results of the two settings. For example, although DMBERT performs much better than BERT+CRF under gold trigger evaluation, they perform nearly the same under pipeline evaluation (47.2 vs. 47.1). This suggests that the absence of pipeline evaluation may bring obvious result divergence, which is rarely noticed in existing works. Based on the above discussions, we suggest that EAE works also conduct pipeline evaluation.

Consistent Evaluation Framework
The above analyses show that the hidden pitfalls substantially harm the consistency and validity of EE evaluation. We propose a series of remedies to avoid these pitfalls and develop a consistent evaluation framework, OMNIEVENT. OMNIEVENT helps achieve the remedies and spares users from handling the inconspicuous preprocessing and evaluation details. It is publicly released and continually maintained to handle emerging evaluation pitfalls. The suggested remedies include specifying data preprocessing ( § 4.1), standardizing outputs ( § 4.2), and providing pipeline evaluation results ( § 4.3). We further re-evaluate various EE models using our framework and analyze the results in § 4.4.

Specify Data Preprocessing
As analyzed in § 3.2, preprocessing discrepancies have an obvious influence on evaluation results. The research community should pay more attention to data preprocessing details and try to specify them. Specifically, we suggest future EE works adopt a consistent preprocessing method on the same dataset. Regarding the example in § 3.2, among the multiple ACE 2005 preprocessing scripts, we recommend ACE-Full since it retains the most comprehensive event annotations, e.g., multi-token triggers and time-related argument roles, which are commonly useful in real-world applications. If a study has to use a different preprocessing method for special reasons, we suggest specifying the preprocessing method with reference to public code. However, there are no widely-used publicly available preprocessing scripts for many EE datasets, which forces many researchers to re-develop their own preprocessing methods. In our consistent evaluation framework, we provide preprocessing scripts for various widely-used datasets, including ACE 2005 (Walker et al., 2006), TAC KBP Event Nugget Data 2014 (Ellis et al., 2014), TAC KBP 2017 (Getman et al., 2017), RichERE (Song et al., 2015), MAVEN, LEVEN (Yao et al., 2022), DuEE (Li et al., 2020b), and FewFC. We will continually add support for more datasets, such as RAMS (Ebner et al., 2020) and WikiEvents (Li et al., 2021b), and we welcome the community to contribute scripts for more datasets.

Standardize Outputs
Based on the discussions about output space discrepancy in § 3.3, we propose and implement an output standardization method in our framework.
To mitigate the inconsistency of output spaces between paradigms, we project the outputs of non-CLS paradigm models onto the most strict CLS-paradigm output space. Specifically, we follow strict boundary-matching rules to assign the non-CLS predictions to each trigger/argument candidate in the pre-defined candidate sets of the CLS paradigm. The final evaluation metrics are computed purely on the candidate sets, and the predictions that fail to be matched are discarded. The intuition behind this operation is that, given that the CLS-paradigm candidate sets are automatically constructed, illegal predictions outside this scope can also be automatically filtered in real-world applications.
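The following simplified sketch illustrates this projection with strict boundary matching; it is an illustrative rendering of the rule above, not the exact implementation in OMNIEVENT.

```python
def standardize(predictions, candidates):
    """Project span-level predictions onto a pre-defined candidate set.

    predictions: list of ((start, end), label) produced by an SL/SP/CG model.
    candidates:  list of (start, end) offsets from the CLS-paradigm candidate set.
    Predictions whose boundaries do not exactly match any candidate are discarded;
    unmatched candidates are treated as negative ("NA") predictions.
    """
    standardized = {span: "NA" for span in candidates}
    for span, label in predictions:
        if span in standardized and standardized[span] == "NA":
            standardized[span] = label  # strict boundary match
        # overlapping or out-of-candidate-set spans are dropped
    return standardized

candidates = [(0, 9), (25, 40)]                             # e.g., "Elon Musk", "Chief Executive"
predictions = [((0, 9), "Person"), ((22, 43), "Position")]  # "as Chief Executive of" only overlaps
print(standardize(predictions, candidates))
# {(0, 9): 'Person', (25, 40): 'NA'} -- the overlapping prediction is discarded
```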
Regarding the unclear mappings between predictions and annotations, we consider the scenario of real-world applications and propose several deterministic mapping rules for consistent evaluation. We respond to the issues mentioned in § 3.3 as follows. ① Prediction span overlaps the gold span. We follow strict boundary-matching rules and discard such overlapping predictions. For example, the SL prediction as Chief Executive of cannot strictly match any candidate in the candidate set of the CLS paradigm; hence it is discarded after output standardization. ② Multiple predictions for one annotated span. If the outputs come with confidence scores, we choose the prediction with the highest confidence as the final prediction; otherwise, we simply choose the first appearing prediction. The remaining predictions are discarded.

③ Predictions without positions for non-unique spans. We assign such predictions to the annotated spans simply by their order of appearance in the output/input sequence to avoid information leakage. We encourage designing new models or post-processing rules that add positional information to CG predictions, so that this issue can be directly solved by strict boundary matching.
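As a simplified sketch (again, not the exact OMNIEVENT implementation), the deterministic rules ② and ③ described above can be rendered as follows.

```python
def resolve_duplicates(predictions):
    """Rule 2: keep one label per span -- the highest-confidence prediction if
    confidence scores are available, otherwise the first appearing one."""
    resolved = {}
    for span, label, score in predictions:
        best = resolved.get(span)
        if best is None or (score is not None and best[1] is not None and score > best[1]):
            resolved[span] = (label, score)
    return {span: label for span, (label, _) in resolved.items()}

def align_by_order(generated_mentions, input_occurrences):
    """Rule 3: map position-less generated mentions to input occurrences of the
    same string in their order of appearance."""
    used, aligned = set(), []
    for mention in generated_mentions:
        for span in input_occurrences.get(mention, []):
            if span not in used:
                used.add(span)
                aligned.append((mention, span))
                break
    return aligned

print(resolve_duplicates([((0, 9), "Company", 0.4), ((0, 9), "Person", 0.9)]))
# {(0, 9): 'Person'}
print(align_by_order(["Twitter", "Twitter"], {"Twitter": [(45, 52), (60, 67)]}))
# [('Twitter', (45, 52)), ('Twitter', (60, 67))]
```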

Provide Pipeline Evaluation Results
The absence of pipeline evaluation ( § 3.4) creates a gap between EE and EAE works and may not reflect EAE models' performance in real-world scenarios well. Therefore, in addition to the common gold trigger evaluation results, we suggest future EAE-only works also provide pipeline evaluation results. However, there are two difficulties: (1) It is an extra overhead for EAE-only works to implement an ED model and obtain predicted triggers on the datasets.
(2) If two EAE models use different predicted triggers, their evaluation results are not directly comparable, since trigger quality influences EAE performance. To alleviate these difficulties, our consistent evaluation framework releases off-the-shelf predicted triggers for the widely-used EE datasets, which will help future EAE works conduct easy and consistent pipeline evaluations. The released predicted triggers are generated with existing top-performing ED models, so the obtained pipeline evaluation results shall help the community understand the possible EE performance of combining top ED and EAE models.
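Conceptually, pipeline evaluation only changes where the triggers fed to the EAE model come from, while all gold arguments remain in the recall denominator. The following sketch contrasts the two settings; the eae_model callable and the tuple format are hypothetical placeholders, not the interface of our framework.

```python
def evaluate_eae(eae_model, sentences, gold_events, predicted_triggers=None):
    """Argument-level micro P/R/F1: pipeline evaluation if predicted_triggers is
    given (e.g., off-the-shelf ED outputs), otherwise gold trigger evaluation."""
    pred_args, gold_args = set(), set()
    for sent in sentences:
        if predicted_triggers is not None:           # pipeline evaluation
            triggers = predicted_triggers[sent]       # possibly noisy ED predictions
        else:                                         # gold trigger evaluation
            triggers = [(e["trigger"], e["type"]) for e in gold_events[sent]]
        for trigger, event_type in triggers:
            for arg_span, role in eae_model(sent, trigger, event_type):
                pred_args.add((sent, trigger, event_type, arg_span, role))
        # all gold arguments stay in the recall denominator, including those of
        # triggers the ED model missed
        for e in gold_events[sent]:
            for arg_span, role in e["arguments"]:
                gold_args.add((sent, e["trigger"], e["type"], arg_span, role))
    tp = len(pred_args & gold_args)
    p = tp / len(pred_args) if pred_args else 0.0
    r = tp / len(gold_args) if gold_args else 0.0
    return p, r, (2 * p * r / (p + r) if p + r else 0.0)
```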

Experimental Results
We re-evaluate various EE models with our consistent evaluation framework. The results are shown in Table 7, and we can observe that: (1) Without awareness of the pitfalls of EE evaluation, one can only understand the EE development status and compare competing models from the "Original Evaluation" results in Table 7. After eliminating the influence of the pitfalls with our framework, the consistent evaluation results change considerably in both absolute performance levels and relative model rankings. This comprehensively demonstrates the influence of the three identified evaluation pitfalls on EE research and highlights the importance of being aware of these pitfalls. Our framework can help avoid the pitfalls and save the effort of handling intensive evaluation implementation details.
(2) Although the changes in F1 scores are minor for some models (e.g., CLEVE), their precision and recall scores vary significantly. In these cases, consistent evaluation is also necessary since real-world applications may have different precision and recall preferences.

Conclusion and Future Work
In this paper, we identify three pitfalls of event extraction evaluation: data preprocessing discrepancy, output space discrepancy, and absence of pipeline evaluation. Meta-analyses and empirical experiments demonstrate the substantial impact of these pitfalls, which calls for the attention of our research community. To avoid the pitfalls, we suggest a series of remedies, including specifying data preprocessing, standardizing outputs, and providing pipeline evaluation results. We develop a consistent evaluation framework, OMNIEVENT, to help future works implement these remedies. In the future, we will continually maintain it to handle more emerging EE datasets, model paradigms, and other possible hidden evaluation pitfalls.

Limitations
The major limitations of our work are three-fold: (1) In the empirical experiments, we only train and evaluate models on English datasets. As the analyzed pitfalls are essentially language-independent, we believe the empirical conclusions could generalize to other languages. The developed consistent evaluation framework now includes multiple English and Chinese datasets, and we will extend it to support more languages in the future. (2) The three pitfalls analyzed in this paper are identified from our practical experiences and may not cover all the pitfalls of EE evaluation. We encourage the community to pay more attention to finding other possible hidden pitfalls of EE evaluation. We will also continually maintain the proposed consistent evaluation framework to support mitigating the influence of newly-found pitfalls.
(3) Our meta-analysis only covers papers published at ACL, EMNLP, NAACL, and COLING on mainstream EE research since 2015. Although we believe that we can obtain representative observations from the 116 surveyed papers, some EE works published at other venues and at earlier times are missed.

Ethical Considerations
We discuss the ethical considerations and broader impact of this work here: (1) Intellectual property. The copyright of ACE 2005 belongs to LDC. We access it through our LDC membership and strictly adhere to its license. We believe the established ACE 2005 dataset is desensitized. In our consistent evaluation framework, we will only provide preprocessing scripts rather than preprocessed datasets for those datasets whose licenses do not permit redistribution. The ACE-DYGIE preprocessing script and the used code repositories for DMCNN, DMBERT, BiLSTM+CRF, BERT+CRF, EEQA, and Text2Event are released under the MIT license. These are all public research resources.
We use them for the research purpose in this work, which is consistent with their intended use.
(2) Intended use. Our consistent evaluation framework implements the suggested remedies to avoid the identified pitfalls in EE evaluation. Researchers are supposed to use this framework to conduct consistent evaluations for comparing various competing EE models.
(3) Misuse risks. The results reported in this paper and the evaluation results produced by our consistent evaluation framework should not be used for offensive arguments or interpreted as implying misconduct of other works. The analyzed pitfalls are inconspicuous and very easy to overlook accidentally, hence the community is generally unaware of them or underestimates their influence. The contribution of our work lies in raising awareness of the pitfalls and helping to avoid them in future works.

A.2 Reproduction Details
In this section, we introduce the reproduction details of all the reproduced models and provide some explanations for the differences between our reproduced results and the originally reported ones. All the reproduction experiments adopt their original evaluation settings, respectively. The number of parameters for each reproduced model is shown in Table 8.

DMCNN The original DMCNN work (Chen et al., 2015) adopts a different EAE evaluation setting: only the argument annotations of the predicted triggers are included in the metric calculation, while the argument annotations of the false-negative trigger predictions are discarded. This setting is also adopted by some other early works like DMBERT, and we call it the "legacy setting". Compared to the common evaluation setting now, which includes all the argument annotations, recall scores under the legacy setting are typically higher. When re-evaluating our reproduced DMCNN under the legacy setting, the EAE F1 score (53.9) is consistent with the originally reported result (53.5).
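For clarity, the difference between the legacy setting and the current common setting mainly lies in the recall denominator, as in the following sketch (the data representation is hypothetical).

```python
# Sketch of the "legacy" EAE setting vs. the current common setting.
# gold_args maps each gold trigger to its argument annotations;
# predicted_triggers is the set of triggers the ED model found.

def eae_recall(correct_args, gold_args, predicted_triggers, legacy=False):
    if legacy:
        # legacy setting: arguments of missed (false-negative) triggers are
        # dropped from the denominator, which typically raises recall
        denom = sum(len(args) for trig, args in gold_args.items() if trig in predicted_triggers)
    else:
        # current common setting: every gold argument counts
        denom = sum(len(args) for args in gold_args.values())
    return correct_args / denom if denom else 0.0

gold_args = {"quitting": ["Person", "Entity"], "hired": ["Person"]}
print(eae_recall(2, gold_args, predicted_triggers={"quitting"}, legacy=True))   # 1.0
print(eae_recall(2, gold_args, predicted_triggers={"quitting"}, legacy=False))  # ~0.67
```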
DMBERT Our DMBERT implementation is mainly based on the publicly released code. We think the differences from the originally reported results come from randomness: when using the same random seed reported by the authors, the reproduction results are nearly the same as the original results.

A.3 Training Details
We run three random trials for all the experiments using three different seeds (seed=0, seed=1, seed=2). The final reported results are the averages over the three random trials. All hyperparameters are the same as those used in the original implementations.

B Additional Experimental Results
This section shows additional experimental results on the differently preprocessed ACE 2005 datasets.
Output Space Discrepancy Table 9 shows the metric differences with and without output standardization on the ACE-DYGIE and ACE-Full preprocessed datasets. We can observe that all evaluation metrics change obviously, which is consistent with the observations in § 3.3. Absence of Pipeline Evaluation Table 10 shows the results using gold trigger evaluation and pipeline evaluation on the ACE-DYGIE and ACE-Full preprocessed datasets. We can observe that the phenomena are consistent with those in § 3.4. Consistent Evaluation Table 11 shows the results using our consistent evaluation on ACE-DYGIE, ACE-OneIE, and ACE-Full. We can observe that the phenomena on ACE-DYGIE and ACE-OneIE are consistent with those in § 4.4.

C Papers for Meta-Analysis
The complete list of papers surveyed in our metaanalysis is shown in Table 12.

D Authors' Contribution
Hao Peng, Feng Yao, and Kaisheng Zeng conducted the empirical experiments. Feng Yao conducted the meta-analyses. Xiaozhi Wang, Hao Peng, and Feng Yao wrote the paper. Xiaozhi Wang designed the project. Lei Hou, Juanzi Li, Zhiyuan Liu, and Weixing Shen advised the project. All authors participated in the discussion.

Table 9: The precision, recall, and F1 (%) differences between evaluation with and without our output standardization. The results are evaluated on the ACE-DYGIE and ACE-Full preprocessed datasets. Output standardization aligns the output spaces of the other paradigms into that of the CLS paradigm, and hence we do not include the CLS-paradigm models here, whose results are unchanged.

Table 11: Experimental results (%) under our consistent evaluation on ACE-DYGIE, ACE-OneIE, and ACE-Full. We report averages and standard deviations over three runs. All the results are under pipeline evaluation.