COFFEE: A Contrastive Oracle-Free Framework for Event Extraction

Event extraction is a complex task that involves extracting events from unstructured text. Prior classification-based methods require comprehensive entity annotations for joint training, while newer generation-based methods rely on heuristic templates containing oracle information such as event type, which is often unavailable in real-world scenarios. In this study, we consider a more realistic task setting, namely the Oracle-Free Event Extraction (OFEE) task, where only the input context is given, without any oracle information including event type, event ontology, or trigger word. To address this task, we propose a new framework, COFFEE. This framework extracts events solely based on the document context, without referring to any oracle information. In particular, COFFEE introduces a contrastive selection model to refine the generated triggers and handle multi-event instances. Our proposed COFFEE outperforms state-of-the-art approaches in the oracle-free setting of the event extraction task, as evaluated on two public variants of the ACE05 benchmark. The code used in our study has been made publicly available.


Introduction
The event extraction task aims to identify events and their arguments from the given textual input context (Nguyen et al., 2016; Wadden et al., 2019; Yang et al., 2019). Conventionally, this task can be decomposed into four sub-tasks (Nguyen et al., 2016): (i) detecting the trigger word that most directly describes the event; (ii) event type classification for defining its event-specific attributes; (iii) argument identification; and (iv) argument classification that maps the argument entities to the corresponding role attributes based on the structure of each event type, namely the event schema. For instance, Figure 1 shows the input context of an event extraction example that contains two events: a 'Transport' event triggered by the trigger word 'went' and an 'Attack' event triggered by the trigger word 'killed', where 'Transport' and 'Attack' are two event types.
Figure 1: An event extraction example with two events: Transport and Attack. In the 'Transport' event, 'went' is the trigger word, and 'home' is the 'Destination' argument. In the 'Attack' event, 'killed' is the trigger word while 'father-in-law' and 'home' are the 'Agent' and 'Place' arguments, respectively.
1 https://github.com/meiru-cam/COFFEE
Many prior studies formulate the event extraction task as a token-level classification problem, which extracts event triggers and arguments using sequence tagging models based on tailor-designed neural networks (Nguyen et al., 2016; Liu et al., 2018; Li et al., 2019; Yang et al., 2019; Wadden et al., 2019; Huang et al., 2020; Lin et al., 2020; Nguyen et al., 2021). However, such methods cannot leverage rich label semantics since the target outputs (e.g., event triggers and arguments) are fixed tagging labels. Recently, with advances in generative pre-trained language models, several generation-based approaches (Hsu et al., 2022; Huang et al., 2022; Li et al., 2021; Zhang et al., 2021) have been applied to solve this problem. These approaches transform the event extraction task into a conditional generation task. By utilizing the autoregressive generation nature of generative pre-trained language models (e.g., BART-Gen (Li et al., 2021), DEGREE (Hsu et al., 2022)) and some manual prompts, it becomes possible to harness the semantics of labels and conduct both entity extraction and classification in an autoregressive manner simultaneously.
While impressive results are reported, we identify two major limitations of the current generation-based event extraction methods. Firstly, most of these methods rely on heuristic templates and extensive human knowledge engineering. According to the experiments conducted by Hsu et al. (2022), a slight change in the template might lead to significant performance changes, thus raising the issue of using sub-optimal templates. Secondly, most of these generation-based approaches still require certain oracle information, such as event type and event schema, which necessitates extensive manual annotations. For example, the DEGREE model's inference process, as demonstrated by Hsu et al. (2022), requires manually designed event-specific templates for each example and iterates over all event types. Similarly, Text2Event (Lu et al., 2021) constrains the generation with manually designed templates, which require the event schema to be given. However, obtaining this oracle information, such as event type and schema, is unrealistic for a real-world inference system to achieve automatically. Hence, this paper aims to address the Oracle-Free Event Extraction (OFEE) task where only the input context is given.
In this study, we propose a novel Contrastive Oracle-Free Framework for Event Extraction (COFFEE), which addresses the event extraction task without using any oracle information. Our COFFEE consists of two parts, a generator that performs the extraction of events and a selector that aims to refine the generated results. The generator of our COFFEE generates both the candidate triggers and event arguments, where the shared generator allows for cross-task knowledge sharing between these sub-tasks. The selector of our COFFEE, inspired by Su et al. (2021), learns to re-rank and select the candidate triggers to obtain more accurate trigger predictions. One challenge of sentence-level event extraction is that a sentence may contain more than one event record (Si et al., 2022; Subburathinam et al., 2019) (e.g., the example in Figure 1), and event-specific templates can help the model to identify and extract events in a targeted manner. Prior approaches tackling this challenge have necessitated either multi-label tagging (Ramponi et al., 2020; Lin et al., 2020), event-specific templates (Hsu et al., 2022), or multi-turn question answering techniques (Du and Cardie, 2020; Li et al., 2020). In contrast, our proposed model can concurrently generate and select multiple event candidates, encompassing both the event trigger and its associated type, thereby effectively addressing the aforementioned challenge.
The contributions of this work are as follows:
• We highlight the challenge of the current event extraction task setting and introduce the oracle-free setting of this task, which requires the model to produce the structural event without using oracle information beyond the context.
• We propose COFFEE, a novel Contrastive Oracle-Free Framework for Event Extraction, which uses a generator and a selector to generatively obtain structural event information from context without using any oracle information.
• We conduct experiments on two variants of the ACE05 benchmark under the oracle-free setting to evaluate our COFFEE. The results demonstrate that the template-based baselines heavily rely on the additional oracle information, whereas our COFFEE exhibits superior empirical performance over these baselines in the absence of an oracle.
• Input Context: The input sentence or sentences that contain one or more events.
• Trigger Word: The main word that most clearly expresses the occurrence of an event (e.g., words 'went' and 'killed' in Figure 1).
• Event Type: The event type that defines the semantic structure of a specific event (e.g., events 'Transport' and 'Attack' in Figure 1).
• Event Argument: Event arguments identify the entities involved in events and their roles based on their relationships with the event triggers. An entity can be an object, place or person that participates in the event. For example, 'home' is an entity that serves as both the 'Destination' argument of the 'Transport' event and the 'Place' argument of the 'Attack' event in Figure 1.
Given the input context $c$, which is a sequence of tokens $[c_1, \cdots, c_n]$, the conventional event extraction task aims to identify the trigger words, classify the events triggered by these words and extract the arguments in each of the events with their corresponding roles (Nguyen et al., 2016; Chen et al., 2015). Assume that an input sentence context $c$ contains $|e|$ different events; then its ground truth triggers $y^t$ can be represented as $y^t = [y^t_1, \cdots, y^t_{|e|}]$, where $y^t_i$ denotes the $i$-th trigger word and event type of the given context sentence. For each event, there is a list of ground truth arguments, denoted by $y^a$, which is a list of ⟨role, argument⟩ pairs, i.e., $y^a = [\langle r_1, a_1 \rangle, \cdots, \langle r_m, a_m \rangle]$, where $a_j$ is the $j$-th entity participating in the event and $r_j$ is the corresponding role type for that entity.
Figure 2: Overview of our proposed COFFEE framework. We first train $G$ to generate trigger candidates $\hat{y}^t$ that contain the trigger word and event type. These trigger candidates are then used to train $S$ to select the final trigger predictions $\tilde{y}^t$. In the argument prediction stage, the trained generator is re-used to generate arguments $\tilde{y}^a$ based on the $\tilde{y}^t$ selected by $S$. Only the input context $c$ is required to predict events.
For this conventional event extraction task, the current state-of-the-art generation-based approaches rely on manual templates, which require trigger words or event types to be given, to simplify this task (Hsu et al., 2022; Lu et al., 2021). However, in a realistic scenario, although argument roles are event-specific, gold trigger words or event type information may not be readily available during event argument extraction. We focus on the Oracle-Free Event Extraction (OFEE) task, which presents a more practical scenario by only providing the input context during inference. The goal of OFEE is to infer event triggers and arguments without relying on pre-defined event-specific templates, making it more challenging to solve due to the absence of external guidance or oracle information.
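As a minimal sketch, the notation of the task definition can be mirrored in a small data structure. The class and field names (`Event`, `trigger`, `event_type`, `arguments`) are illustrative assumptions and are not taken from the released code; the values are the two events of Figure 1.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Event:
    """One event record: a trigger y^t_i plus its <role, argument> pairs y^a."""
    trigger: str                      # trigger word
    event_type: str                   # event type
    arguments: List[Tuple[str, str]] = field(default_factory=list)  # (role, argument)


# The two events of Figure 1, written in this hypothetical structure.
figure1_events = [
    Event("went", "Transport", [("Destination", "home")]),
    Event("killed", "Attack", [("Agent", "father-in-law"), ("Place", "home")]),
]

# In the oracle-free setting, only the context string is available at inference;
# none of the fields above may be supplied to the model as input.
for ev in figure1_events:
    print(ev.event_type, ev.trigger, ev.arguments)
```

The same entity ('home') appears with a different role in each event, which is why arguments are stored per event rather than per sentence.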

Methodology
As mentioned in the task definition, our goal is to extract event frames without using any templates. This adds complexity to the generation model, particularly when dealing with contexts containing multiple events, such as the example given in Figure 1. To address the challenging OFEE task, we propose a novel contrastive framework called COFFEE, which comprises two primary components: a generator $G$, responsible for generating event frames present in the provided context, and a selector $S$, which re-ranks and selects the triggers generated by $G$. In our proposed COFFEE framework, $G$ is fine-tuned using ground truth triggers and arguments (i.e., $y^t$ and $y^a$) to generate candidate triggers $\hat{y}^t$ and arguments $\tilde{y}^a$ (see §3.1). Next, $S$ is fine-tuned to refine and select final trigger predictions $\tilde{y}^t$ based on the generated candidate triggers $\hat{y}^t$ and gold triggers $y^t$ on the training set (see §3.2). The final trigger predictions are forwarded to $G$ for argument prediction (see §3.1). We next present the details of COFFEE's components, i.e., the generator and the selector.

Generator
The generator is fine-tuned on both trigger prediction and argument prediction simultaneously by training on the pairs of instances with different prefixes 'TriggerEvent: ' and 'Argument: ' (see §4.4).
In order to take the context as input and generate structured event frames, the generator $G$ of COFFEE is implemented as an encoder-decoder transformer model, such as BART, T5 and mT5 (Lewis et al., 2020; Raffel et al., 2020; Xue et al., 2021). We adopt T5 (Raffel et al., 2020) as the base model and encode only '[and]' and '[none]' as additional special tokens based on experimental results.
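The use of two task prefixes with one shared generator can be sketched as follows. This is only an illustration under stated assumptions: the helper names are hypothetical, and the way the selected trigger is appended to the context (a simple space-separated concatenation) is an assumption, since the exact input layout is only shown in Figure 3. The example sentence is a fragment of the Table 3 context.

```python
# Prefixes as spelled in this section; everything else here is hypothetical.
TRIGGER_PREFIX = "TriggerEvent: "
ARGUMENT_PREFIX = "Argument: "


def build_trigger_input(context: str) -> str:
    """Input for trigger generation: the prefix plus the raw context only."""
    return TRIGGER_PREFIX + context


def build_argument_input(context: str, trigger: str) -> str:
    """Input for argument generation: the context with one selected trigger
    appended (assumed layout, not the authors' exact format)."""
    return f"{ARGUMENT_PREFIX}{context} {trigger}"


context = ("The United States is demanding that Russia , France and Germany "
           "pay for the Iraqi war .")
print(build_trigger_input(context))
print(build_argument_input(context, "pay"))
```

Because both inputs go through the same model and differ only in the prefix, a single set of weights serves both sub-tasks, which is what allows the cross-task knowledge sharing described above.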
During the inference stage, we apply beam search (Jelinek, 1976) to generate candidate triggers $\hat{y}^t$ and output the beam score of these triggers. Given the context $c$, the generator outputs the top-$l$ triggers with the highest beam scores, denoted by $\hat{y}^t = g(c; G)$, where $\hat{y}^t$ is a list of triggers and $\hat{y}^t_i$ represents a generated candidate trigger in the context $c$. After obtaining the list of candidate triggers, we use a contrastive-learning based selector $S$ (see §3.2 for details) to further re-rank the generated candidates using $f(c, \hat{y}^t; S)$ and select the final set of trigger predictions $\tilde{y}^t$. The predicted trigger words are then concatenated to the context iteratively, and the generator performs argument prediction on each event using h(c,

Selector
In our approach, we employ contrastive learning to re-rank the candidate triggers $\hat{y}^t$. Contrastive learning (Chen et al., 2020) is a technique that aims to learn meaningful representations by maximizing the similarity between positive pairs while minimizing the similarity between negative pairs. In the context of our problem, we define the ground truth triggers $y^t$ for context $c$ as the positive anchors, while the negative samples are the other incorrect candidates generated, i.e., $\hat{y}^t_j \notin y^t$. To apply contrastive learning for re-ranking, we first encode the context and candidate triggers using a shared encoder. Specifically, given a list of candidate triggers $\hat{y}^t$, for each $\hat{y}^t_i \in \hat{y}^t$ we concatenate it to the context and use $S$ to map the concatenated text $[c : \hat{y}^t_i]$ into a real-valued ranking score by performing a linear projection, $f(c, \hat{y}^t_i; S)$. In this study, we employ RoBERTa (Liu et al., 2019) as the backbone selector model to encode the text input, and $S$ predicts the ranking score for each of the candidate triggers in $\hat{y}^t$ through optimizing over a contrastive objective $\mathcal{L}_S$, which encourages $S$ to predict higher scores for true trigger candidates and lower scores for false trigger candidates.
Formally, given a context $c$ and the generated candidate triggers $\hat{y}^t$, $S$ is fine-tuned to optimize the margin-based contrastive objective $\mathcal{L}_S$ over the gold triggers and the negative candidates $\hat{y}^t_j \notin y^t$, where $\rho \in [-1, 1]$ is a pre-defined margin and $k$ represents the number of negatives sampled from $\hat{y}^t$. By taking into account the implicit correlation between the context and generated candidates, $S$ captures the semantic relevance between them.
Since the number of events in the context is unknown, we use a threshold to automatically control the number of events predicted. Let $\alpha$ represent the weight parameter and $\theta$ represent the threshold parameter in our model. These hyperparameters are used for combining the beam score $b_i$ with the ranking score $s_i$ and filtering out the false candidate triggers, respectively. We determine the threshold $\theta$ and the weight $\alpha$ on the development set, which is exclusively utilized for hyperparameter tuning, to ensure an unbiased evaluation on the test set. The final set of trigger predictions $\tilde{y}^t$ consists of the candidates whose combined score, computed from $b_i$ and $s_i$ with weight $\alpha$, reaches the threshold $\theta$; the scores are normalized with the softmax function $\sigma$.
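The two steps of the selector can be sketched in plain Python. Both functions are assumptions made for illustration rather than the paper's exact formulas: the loss is a standard margin (hinge) ranking loss built from the symbols defined above ($\rho$, $k$ negatives, ranking scores $f$), and the selection rule assumes that softmax is applied separately to the beam scores and the ranking scores before they are mixed as $(1-\alpha)\,\sigma(b)_i + \alpha\,\sigma(s)_i$, which matches the description that a weight of 0 uses only the generation score and a weight of 1 only the ranking score.

```python
import math
from typing import List, Sequence


def margin_ranking_loss(pos_score: float, neg_scores: Sequence[float],
                        margin: float) -> float:
    """Hinge loss over k negatives: zero once the gold trigger outscores
    every negative candidate by at least `margin` (assumed form)."""
    return sum(max(0.0, margin - pos_score + neg) for neg in neg_scores)


def softmax(xs: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def select_triggers(candidates: Sequence[str], beam_scores: Sequence[float],
                    rank_scores: Sequence[float], alpha: float,
                    theta: float) -> List[str]:
    """Keep candidates whose mixed probability reaches the threshold theta.
    alpha=0 uses only beam scores, alpha=1 only ranking scores."""
    p_beam = softmax(beam_scores)
    p_rank = softmax(rank_scores)
    mixed = [(1 - alpha) * b + alpha * s for b, s in zip(p_beam, p_rank)]
    return [c for c, p in zip(candidates, mixed) if p >= theta]


# Toy scores (made up): the selector strongly prefers the first two candidates.
print(select_triggers(["a", "b", "c"], [0.0, 0.0, 0.0], [5.0, 5.0, -5.0],
                      alpha=0.4, theta=0.3))
```

Because the threshold acts on each candidate independently, the number of selected triggers is not fixed in advance, which is how a sentence with several events can yield several predictions.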

Dataset
In this work, we evaluate our COFFEE on the public event extraction benchmark ACE05 (Walker and Consortium, 2005), which consists of 599 English documents, 33 event types, and 22 argument roles. Building upon previous works (Wadden et al., 2019; Lin et al., 2020) that split and preprocess this dataset, we use two variants of the event extraction dataset, namely ACE05-E and ACE05-E+.
Detailed split and statistics of the two datasets can be found in Table 1.

Evaluation Metrics
The evaluation of trigger identification, event type classification, argument identification, and argument role classification tasks utilizes the F1-score metric, consistent with the previous studies (Zhang et al., 2019; Wadden et al., 2019). A correct trigger classification prediction requires accurate trigger word and event type prediction, i.e., $\tilde{y}^t_i = y^t_i$. Correct argument identification necessitates accurate classification of the event type and argument entity, while a correct argument role classification demands accurate identification of the argument and role type prediction. Specifically, a predicted event type $\tilde{t}_e$, argument $\tilde{a}$, and role type $\tilde{r}$ are considered correct if $(\tilde{a}, \tilde{r}, \tilde{t}_e) = (a, r, t_e)$.
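The matching criteria above amount to exact-match F1 over sets of tuples. The following is a small sketch for a single sentence; the function name is hypothetical, the predicted 'Die' label is a made-up error for illustration, and a corpus-level score would normally accumulate the counts over all sentences before computing precision and recall.

```python
from typing import Hashable, Set


def f1_score(predicted: Set[Hashable], gold: Set[Hashable]) -> float:
    """Exact-match F1 between a set of predicted items and a set of gold items."""
    true_positives = len(predicted & gold)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


# Gold triggers of Figure 1 as (trigger word, event type) pairs.
gold = {("went", "Transport"), ("killed", "Attack")}
# Hypothetical prediction: both trigger words found, one event type wrong.
pred = {("went", "Transport"), ("killed", "Die")}

# Trigger identification compares trigger words only.
trigger_id_f1 = f1_score({w for w, _ in pred}, {w for w, _ in gold})
# Trigger classification requires the word and the event type to match.
trigger_cls_f1 = f1_score(pred, gold)
print(trigger_id_f1, trigger_cls_f1)
```

The same function applies to arguments by scoring (argument, event type) tuples for identification and (argument, role, event type) tuples for classification.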

Baselines
To validate the effectiveness of our proposed method, we compared our COFFEE with five state-of-the-art baselines:
• OneIE (Lin et al., 2020) is a joint neural model that simultaneously extracts entities and relations using a dynamic relation graph.
• Text2Event (Lu et al., 2021) is a sequence-to-structure controlled generation model with constrained decoding for event extraction. It focuses on the structured generation that uses event schema to form event records.
• BART-Gen (Li et al., 2021) is designed for document-level event extraction that can deal with the long-distance dependence issue and coreference problem. Constrained generation is applied for argument extraction that requires event-specific templates.
• DEGREE (Hsu et al., 2022) is a generative event extraction approach that highly relies on the designed template.
• TANL (Paolini et al., 2021) is a model that extracts event triggers and arguments by so-called augmented translation that embeds target outputs into the context sentence.

Implementation
We preprocess the data by separating original samples into event samples and inserting placeholders for target entities. The instances are processed with distinct prefixes for subtasks: 'TriggerEvent: ' and 'Arguments: '. Figure 3 shows a data preprocessing example. Details pertaining to our pipeline training and inference process, including specifics about the two-stage fine-tuning, such as the learning rate and batch size, as well as the beam search strategy employed during inference, are elaborated in Appendix A.1.

OFEE performance
As described in Section 4.3, Text2Event, BART-Gen and DEGREE utilize different oracle information. To compare the performance of our COFFEE framework with these methods under the OFEE setting, we implemented the following adaptations to these baseline approaches:
• Text2Event (Lu et al., 2021) relies on a complex constrained decoding mechanism that depends on the event schema. For the oracle-free setting, we utilized the default decoding of the T5 model to generate results.
• BART-Gen (Li et al., 2021) adopts a constrained generation mechanism, which necessitates the use of templates. We removed the template and the constrained decoding, so that the model runs with the context as its only input. The trigger extraction performance of BART-Gen is not reported in our study due to an implementation error stemming from different preprocessing methods, which prevented us from applying this approach to the ACE05-E+ dataset. Consequently, we depended on the ground truth triggers for argument extraction in this instance.
• The DEGREE (Hsu et al., 2022) model is designed to generate 'invalid' instances during both the training and inference phases, wherein event-specific knowledge is combined with context even if no such event is mentioned in the context. We eliminated these event-specific templates, leaving only the context sentence as input.
As presented in Table 2, we report F1 scores of the compared methods over four sub-tasks described in 4.1, namely trigger identification, trigger classification, argument identification, and argument classification. We observe the following:
• Firstly, it is crucial to highlight that the oracle-free setting poses a more challenging scenario. When all oracle information is removed, generation-based baselines relying on templates exhibit a varying degree of performance decline on both datasets (↓ 0.5% to 37.42% in argument classification). Although DEGREE is effective with the oracle information, it struggles to filter out the 'invalid' events in the oracle-free setting, resulting in an almost zero (2.18%) trigger classification F1. This indicates that the information leaked in the template significantly contributes to the performance of DEGREE.
• Our proposed COFFEE outperforms the classification-based approach OneIE and the generation-based approaches Text2Event, BART-Gen, and DEGREE in both the presence and absence of oracle information across all four metrics. This demonstrates that our COFFEE can effectively leverage the input context to extract event frames.
• In comparison to TANL, our COFFEE achieves similar results in trigger extraction, with a difference of only 1.36%. One possible explanation is that the threshold-based method results in a smaller recall value due to more false negatives, since correct candidates scoring below the threshold are discarded. However, our model possesses robust argument extraction capabilities and attained superior performance in argument extraction with these extracted triggers (↑ 3.33% and ↑ 2.46% on ACE05-E and ACE05-E+, respectively). These findings corroborate the effectiveness of the shared generator on trigger and argument prediction.

Ablation study
We conducted an ablation study on the threshold and weight parameters to demonstrate the effectiveness of our selector $S$ and the influence of these parameters on the COFFEE performance. Figure 4 shows the effect of the threshold, which is the minimum score a candidate must achieve to be selected. Increasing the threshold results in fewer candidates being selected but with higher accuracy.
However, an overly high threshold could filter out some of the correct candidates, decreasing performance. The optimal threshold value is 0.2, which achieves the best performance on all four subtasks. In addition, Figure 5 demonstrates the influence of the weight parameter on COFFEE. The weight represents the ratio of combining the ranking score and generation score. When the weight is set to 0, only the generation score is considered, while a weight of 1 means that only the ranking score is considered. As depicted in Figure 5, the best extraction performance is achieved with a fixed threshold and an optimal weight value of $\alpha = 0.4$. The initial improvement in the F1 score with increasing weight suggests that the ranking score can effectively refine the results of the beam search. However, the ranking scores exhibit significant variations, leading to a corresponding fluctuation in softmax probability as the weight increases. As the final probability becomes increasingly reliant on the ranker probability, fewer candidates are selected at the same threshold, resulting in a decline in performance.
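The tuning of the weight and threshold on a development set can be sketched as a small grid search. This is an illustration under assumptions, not the authors' implementation: the mixing rule $(1-\alpha)\,\sigma(b)_i + \alpha\,\sigma(s)_i \ge \theta$ is assumed, all function and key names are hypothetical, and the scores in the toy development example are made up and are not experimental results.

```python
import math
from itertools import product


def softmax(xs):
    """Numerically stable softmax."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def select(example, alpha, theta):
    """Candidates whose mixed beam/ranking probability reaches theta (assumed rule)."""
    p_beam, p_rank = softmax(example["beam"]), softmax(example["rank"])
    return {cand for cand, b, s in zip(example["candidates"], p_beam, p_rank)
            if (1 - alpha) * b + alpha * s >= theta}


def micro_f1(dev, alpha, theta):
    """Trigger classification F1 with counts accumulated over all dev examples."""
    tp = n_pred = n_gold = 0
    for ex in dev:
        pred, gold = select(ex, alpha, theta), set(ex["gold"])
        tp += len(pred & gold)
        n_pred += len(pred)
        n_gold += len(gold)
    if tp == 0:
        return 0.0
    precision, recall = tp / n_pred, tp / n_gold
    return 2 * precision * recall / (precision + recall)


def grid_search(dev, alphas, thetas):
    """Return (best F1, alpha, theta); ties keep the first setting tried."""
    return max(((micro_f1(dev, a, t), a, t) for a, t in product(alphas, thetas)),
               key=lambda item: item[0])


# Toy development example with made-up scores: beam search is over-confident in
# the first candidate, and the ranking scores recover the second gold trigger.
dev = [{
    "candidates": [("went", "Transport"), ("killed", "Attack"), ("home", "Transport")],
    "beam": [-0.1, -2.0, -3.0],
    "rank": [3.0, 2.5, -4.0],
    "gold": [("went", "Transport"), ("killed", "Attack")],
}]
best_f1, best_alpha, best_theta = grid_search(dev, [0.0, 0.5, 1.0], [0.2, 0.3, 0.4])
print(best_f1, best_alpha, best_theta)
```

In this toy case a weight of 0 never recovers the second event at any of the tried thresholds, which mirrors the qualitative observation that beam scores alone can under-rank a correct second candidate.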

Qualitative Case Analysis
In order to demonstrate the ability of our model to select event candidates, we analyze the results of two instances selected from the test set. For comparison, we select COFFEE without ranking and TANL, given its high performance. As shown in Table 3, our proposed model successfully extracts the missing events not detected by the baselines. The re-ranking mechanism enables the model to select more accurate candidates.
In particular, only COFFEE successfully predicts all the events within the context. In Example 1, both TANL and COFFEE without ranking fail to extract E1, triggered by 'pay', suggesting that the baselines may have difficulty identifying complex event triggers. In this case, there is not a specific amount of money to be paid, but a mention of cost. In Example 2, TANL fails to extract E2, which is triggered by 'becoming', and COFFEE without ranking fails to extract E1, highlighting the inability of the baselines to identify events and their corresponding arguments consistently. In contrast, our COFFEE successfully identifies the events and extracts the target arguments, demonstrating its superior performance.
Comparing COFFEE with and without ranking, we can conclude that re-ranking in the selector is crucial. In both examples, COFFEE fails to detect all events without re-ranking. Even though both candidates are the correct targets, the beam scores differ more than expected, which leads to incorrect ranking. The re-ranking can increase the probability of the second candidate, thus allowing it to be selected under the chosen threshold.
These examples demonstrate the improvements in event extraction offered by our selector $S$, which allows the framework to re-rank and select the correct triggers for multi-event instances, outperforming the baselines and establishing our model as a more effective and reliable solution for OFEE tasks.

Related Work

Event Extraction
Early event extraction research primarily relied on rule-based methods involving hand-written patterns to identify event triggers and arguments in text (Li et al., 2013, 2015). Supervised machine learning techniques then became popular, with various feature-based classification models employed (Hsi et al., 2016). However, these methods faced limitations due to manual feature engineering and the need for large annotated datasets. Researchers then turned to deep learning approaches, utilizing convolutional neural networks (CNNs) (Chen et al., 2015; Nguyen and Grishman, 2015; Björne and Salakoski, 2018; Yang et al., 2019), recurrent neural networks (RNNs) (Nguyen et al., 2016), and Tree-LSTM (Li et al., 2019) for event extraction, which automatically learned relevant features and improved performance.
The introduction of pre-trained language models revolutionized event extraction. Fine-tuning these models achieved state-of-the-art performance across various benchmarks (Lin et al., 2020; Ramponi et al., 2020; Wadden et al., 2019; Yang et al., 2021). These models captured deep contextual information and benefited from knowledge transfer, enhancing performance with limited annotated data. Some studies framed event extraction as a multi-turn question answering task (Du and Cardie, 2020; Li et al., 2020; Liu et al., 2020; Zhou et al., 2021), while others approached it as a sequence-to-sequence generation task (Hsu et al., 2022; Lu et al., 2021; Li et al., 2021). Although effective, these methods heavily relied on manually designed prompts and templates, except for TANL (Paolini et al., 2021), which depended solely on context information. In contrast, our work focuses on oracle-free event extraction and addresses the task via generation without tem-

Ethics Statement
In preparing and submitting this research paper, we affirm that our work adheres to the highest ethical standards and is devoid of any ethical issues. The study presented in the manuscript was conducted in a manner that respects the principles of academic integrity, transparency, and fairness.

A Appendices
A.1 Implementation Details
Our pipeline training comprises two stages: generation model fine-tuning and re-ranking model fine-tuning. In the first stage, the T5-base model (Raffel et al., 2020) is fine-tuned with the HuggingFace Transformers library (Wolf et al., 2019) on an RTX 3090 GPU, using an AdamW optimizer (Loshchilov and Hutter, 2017), with a learning rate of 0.0001 with a decay schedule of 1e-5. We set a batch size of 8 and maximum input/output sequence lengths at 650/200. For inference, we generate candidate triggers using a beam search strategy with 10 beams. These candidates are then re-ranked and filtered by the selector $S$, based on optimal thresholds and weights derived through grid search.
In the second stage, we fine-tune a RoBERTa-base model (Liu et al., 2019) for re-ranking. This stage reduces the maximum input length to 512 and sets the number of negative candidates for contrastive learning to 5, with a learning rate of 0.005.
After the triggers are refined, they are concatenated to the context and passed back to the generator for argument generation via greedy search. The final extraction of entities and roles is conducted using regular expressions. Our COFFEE generator and selector take approximately 4 hours and 2 hours to train, respectively.
Performance evaluation of the trigger and argument extraction is based on regular expressions to detect entities extracted from the placeholders.The results can be observed in Table 2.

Figure 3: Example of input and target for the model.

Figure 4: Effect of threshold in COFFEE framework.

Figure 5: The influence of the weight α on performance.

Table 1: Statistics of the datasets used.

Table 2: Performance comparison of COFFEE and SOTA generation-based approaches. † The trigger classification F1 of DEGREE is nearly zero because the model cannot exclude the negative samples constructed without a template. ♮, ♢, and ♭ denote the model that requires a manually designed template, example keywords, and event description, respectively. The highest results are in bold and the second highest results are underlined.
Table 3: Event extraction examples from the test set using COFFEE, COFFEE without ranking and TANL+COFFEE. The triggers and arguments missed by the baselines but captured by COFFEE are highlighted. It is evident that COFFEE is generally more effective in detecting the events.
Context: Kommersant business daily joined in , declaring in a furious front -page headline : " The United States is demanding that Russia , France and Germany pay for the Iraqi war .

Our results show that this reliance on templates and human-designed trigger sets is unnecessary, and a pure oracle-free model applied directly can perform very well on general event extraction. In the future, we plan to extend sentence-level event extraction to document-level and explore zero-shot settings to handle the emergence of unseen events.

Despite its promising results, our study has limitations. Our model primarily works with English text, limiting its applicability to other languages. Its focus on sentence-level extraction does not consider document context, which could be investigated in future research. The employed training dataset is relatively small, potentially not encompassing all possible event types, thus affecting the model's performance and generalizability. Additionally, our two-stage inference framework, while enhanced by a ranking module, is prone to error propagation. If a trigger is not identified in the first stage, its associated arguments cannot be extracted. Future work should address these issues for improved performance and broader applicability.