Efficient Zero-shot Event Extraction with Context-Definition Alignment

Event extraction (EE) is the task of identifying event mentions of interest from text. Conventional efforts mainly focus on the supervised setting. However, these supervised models cannot generalize to event types outside the pre-defined ontology. To fill this gap, many efforts have been devoted to the zero-shot EE problem. This paper follows the trend of modeling event-type semantics but moves one step further. We argue that using the static embedding of the event type name might not be enough because a single word could be ambiguous, and we need a sentence to define the type semantics accurately. To model the definition semantics, we use two separate transformer models to project the contextualized event mentions and corresponding definitions into the same embedding space and then minimize their embedding distance via contrastive learning. On top of that, we also propose a warming phase to help the model learn the minor differences between similar definitions. We name our approach Zero-shot Event extraction with Definition (ZED). Experiments on the MAVEN dataset show that our model significantly outperforms all previous zero-shot EE methods with fast inference speed due to the disjoint design. Further experiments also show that ZED can be easily applied to the few-shot setting when annotation is available and consistently outperforms baseline supervised methods.


Introduction
Event extraction, the task of identifying event mentions from documents and classifying them into pre-defined event types, is a fundamental NLP problem (Grishman et al., 2005). As a central information extraction task, event extraction is the foundation of a series of event-centric NLP applications (Chen et al., 2021), including event relation extraction (Wang et al., 2020a), event schema induction (Li et al., 2020), and missing event prediction (Chaturvedi et al., 2017).
Figure 1: Zero-shot event extraction task demonstration. Given a corpus, the goal is to identify all event mentions that fit the target event definitions without using any annotation. The event definitions and corresponding mentions are indicated in the same color.
Traditional event extraction efforts (Wadden et al., 2019; Wang et al., 2019; Lin et al., 2020) mostly focus on learning to identify and classify events under a supervised learning setting, where a pre-defined event ontology and large-scale expert annotations are available. However, the learned supervised models cannot be easily applied to new event types outside the pre-defined ontology, limiting these models' usage in real applications.
Recently, large-scale pre-trained language models have demonstrated strong semantic representation capabilities and motivated a series of works to extract events in a zero-shot setting. For example, Du and Cardie (2020) propose to manually design templates for each event type to convert the event extraction problem into a question-answering (QA) task and then leverage QA models to extract events. Following that, Lyu et al. (2021) propose to verbalize candidate triggers and event types into hypotheses and premises and leverage pre-trained textual entailment models to extract events. However, as analyzed by Lyu et al. (2021), these models heavily rely on the template design and often suffer from the domain-shifting problem between the original training task and the new task. Moreover, as these models require jointly encoding the event mentions and event types, the time complexity is O(N * T), where N is the number of event mention candidates and T is the number of event types. Considering the low inference speed and high computation cost of inference with a deep model, such complexity could be a massive burden for real-time EE systems.
To avoid manually designing templates and to improve the inference efficiency, another line of work (Zhang et al., 2021) tries to leverage pre-trained language representation models (i.e., BERT (Devlin et al., 2019)) to acquire a contextualized event type representation. The model can decouple the mention and label representations at inference time and assign each candidate trigger to the most similar event type based on the cosine similarity. As a result, this method could significantly reduce the inference time complexity from O(N * T) to O(N + T). However, as the experiments show, using only the label name might not lead to a good event-type representation because the selected words could be ambiguous.
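The gap between the two complexity classes can be illustrated by counting forward passes of the deep encoder. The sketch below is a toy cost model, not code from any of the cited systems: the function names are hypothetical, and it assumes that one forward pass dominates the cost of one cosine similarity.

```python
def joint_encoder_calls(num_mentions: int, num_types: int) -> int:
    """Joint encoding (QA or entailment style): one forward pass per (mention, type) pair."""
    return num_mentions * num_types


def disjoint_encoder_calls(num_mentions: int, num_types: int) -> int:
    """Disjoint encoding: every mention and every type is encoded exactly once.

    The N * T cosine similarities that follow are cheap vector operations,
    not forward passes of the deep model.
    """
    return num_mentions + num_types


# Illustrative sizes only; 1000 candidate mentions is an assumption.
n_mentions, n_types = 1000, 168
print(joint_encoder_calls(n_mentions, n_types), disjoint_encoder_calls(n_mentions, n_types))
```

Under this cost model the saving grows with the number of event types, which is why the disjoint design matters most for large ontologies.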
In this work, we follow the trend of representation learning (Zhang et al., 2021; Gao et al., 2021) and move forward from representing each event type with a single name to a definition sentence. Specifically, we propose a three-stage event representation learning framework. In the offline pre-training phase, we leverage auto-extracted context-definition alignments to learn a definition encoding model that can encode the contextualized mentions and definitions into the same embedding space. In the second warming phase, we use the target event types to retrieve hard negative examples to further polish the model. In the end, we identify and classify event mentions based on the cosine similarity between the mention representation and corresponding event-type representations. As our system is a disjoint model, the inference time complexity is also O(N + T). Experiments on MAVEN (Wang et al., 2020b), the largest EE dataset to the best of our knowledge, show that ZED outperforms all previous zero-shot approaches with high inference efficiency. Further experiments show that ZED could also be applied to the supervised setting, where it achieves comparable performance in the fully supervised setting and consistently outperforms baseline supervised models in the data-scarce learning settings. Specifically, with 10% of the training data, ZED could achieve over 95% of the full performance. All the collected alignment data, created definitions, and the code are available at: https://github.com/tencentailab/ZED.

Related Works
In this section, we introduce related works on event extraction, contrastive representation learning, and definition modeling.

Event Extraction
As a fundamental information extraction task (Chen et al., 2021), event extraction has attracted many efforts in the NLP community (Sundheim, 1992; Grishman and Sundheim, 1996; Riloff, 1996; Grishman et al., 2005; Chen et al., 2021; Hong et al., 2022). Recent success on the event extraction task mostly relies on employing either symbolic features (Ji and Grishman, 2008; Liao and Grishman, 2010; Liu et al., 2016) or distributed features (Chen et al., 2015; Nguyen et al., 2016; Liu et al., 2018; Zhang et al., 2019; Wadden et al., 2019; Lin et al., 2020) to learn supervised models with large-scale high-quality annotations. However, the requirement of a pre-defined ontology and corresponding annotations limits the application of these models in real applications.
To address this issue and extract unseen event types, Huang et al. (2018) propose a zero-shot event extraction task and use a transfer-learning framework to apply the model trained with seen event types to unseen ones. However, the prerequisite of their high performance is the similarity between seen and unseen event types. Recently, with the fast development of large-scale language models, several works (Du and Cardie, 2020; Lyu et al., 2021; Zhang et al., 2021) propose to leverage the pre-trained models to encode the label semantics either with templates or contextualized embeddings. In this work, we follow the effort of using deep models to model the label semantics but go a step further. Instead of directly using a pre-trained model, we train a disjoint context-to-definition alignment encoding model, which can effectively map the candidate event mentions and definitions into the same embedding space and thus more accurately and efficiently extract events for any arbitrarily defined event types.

Contrastive Representation Learning
The contrastive loss (Chopra et al., 2005) is one of the most popular training objectives for representation learning. The original contrastive loss and its variations (e.g., triplet loss (Schroff et al., 2015), lifted structured loss (Oh Song et al., 2016), N-pair loss (Sohn, 2016), and NCE loss (Gutmann and Hyvärinen, 2010)) have been shown helpful for a series of vision applications (Radford et al., 2021). After being introduced to the NLP community, contrastive learning-based methods have also led to the success of a series of representation learning tasks such as sentence representation (Gao et al., 2021). Different from previous works, where the anchors and positive/negative examples typically belong to the same category (e.g., image/sentence), we propose to use the contextualized token representation as the anchor and event type definition representations as the positive/negative examples to better solve the zero-shot event extraction task. Moreover, motivated by the success of the "pre-training + fine-tuning" paradigm, we propose a novel three-stage representation learning framework.
Figure 2: Overall framework of ZED. In the offline training phase, we train the separate context and definition encoders with auto-extracted context-definition alignment data. In the second warming phase, after knowing the target event types, as no annotation is provided, we first retrieve similar concepts from WordNet (Miller, 1998) and use corresponding alignment data to polish the representation model. In the last inference phase, after encoding all candidate event type definitions, for each candidate event mention, we will encode it with the context encoder and determine whether it belongs to one of the target event types based on the cosine similarity.

Definition Modeling
Humans are capable of understanding new concepts by reading their glosses or definitions. Thus, how to leverage the definitions and explanations from dictionaries to help understand human language is a long-standing question in the NLP community. Most of the previous efforts in this direction focus on the word sense disambiguation (WSD) task (Luo et al., 2018; Huang et al., 2019; Blevins and Zettlemoyer, 2020; Kumar et al., 2019; Bevilacqua and Navigli, 2020; Yao et al., 2021; Su et al., 2022a,b). These models learn to map a token into the correct pre-defined synset by either jointly or disjointly encoding the tokens and definitions. Even though the settings of our model and these WSD models are similar, identifying event mentions that satisfy an arbitrary event type definition is a more challenging task (Senel et al., 2022). WSD aims to learn to distinguish the correct synset from several (typically fewer than 10) other pre-defined synsets, while our goal is to align an event mention and the corresponding definition, where all other arbitrary definitions are considered to be the negative candidates. Because the full set of negative candidates exceeds the GPU memory limit, we propose a coarse-to-fine negative sampling strategy to help models learn the minor differences between similar definitions without forgetting the big picture.

Task Definition
We define the zero-shot event extraction task as follows. We are given a document in the form of a sentence set $\mathcal{S}$ and an event type set $\mathcal{E}$, where each event type $E \in \mathcal{E}$ is defined with a natural-language sentence $d$. The task is to identify, for each $E \in \mathcal{E}$, all mentions $M_E$ in $\mathcal{S}$ that satisfy the definition of $E$, without using direct annotations during the training phase.

Model
We present the model framework in Figure 2. Motivated by the success of the "pre-training + fine-tuning" learning paradigm, we propose to address the zero-shot event extraction problem with a three-stage framework. Technical details are as follows.

Offline Pre-training
The offline pre-training step aims to train a decent definition encoder to map the target mention representation and corresponding definitions into the same embedding space. To achieve this goal, as no annotation is provided, we first collect context-definition alignments and then train the encoder with a contrastive learning loss.

Data Preparation
We select all verbal synsets from the WordNet ontology (Miller, 1998) to form our open-world event definition set. In total, we collect 13,814 event synsets. After that, to collect large-scale alignment data between context and definitions, we apply the current state-of-the-art word sense disambiguation model (Yao et al., 2021) to the NYT corpus (Sandhaus, 2008) to align tokens in NYT with their correct definitions. We randomly select 10 context instances for each synset to speed up the training process. As a result, we collect 775K context-definition alignments. Examples of extracted alignments are presented in Table 1.
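The per-synset sampling step can be sketched as follows. This is a minimal illustration under stated assumptions, not the released pipeline: the function name, the `(synset_id, sentence, start, end)` tuple layout, and the fixed seed are all hypothetical, and the alignments are assumed to come from an upstream word sense disambiguation model.

```python
import random
from collections import defaultdict


def sample_contexts_per_synset(alignments, k=10, seed=0):
    """Keep at most k randomly chosen contexts for every synset.

    alignments: iterable of (synset_id, sentence, start, end) tuples.
    Returns a dict mapping synset_id to a list of (sentence, start, end).
    """
    rng = random.Random(seed)
    by_synset = defaultdict(list)
    for synset_id, sentence, start, end in alignments:
        by_synset[synset_id].append((sentence, start, end))

    sampled = {}
    for synset_id, contexts in by_synset.items():
        # Synsets with fewer than k contexts keep everything they have.
        if len(contexts) > k:
            contexts = rng.sample(contexts, k)
        sampled[synset_id] = contexts
    return sampled
```

Capping the number of contexts per synset keeps frequent senses from dominating the alignment data while bounding the training cost.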

Context-definition Alignment Encoding with Contrastive Learning
The goal of the context-definition alignment encoding is to encode the contextualized representation of the target mention and the sentence representation of the definition into the same embedding space and to push them closer to each other, because they should have similar semantic meanings. As this objective aligns well with the learning objective of the contrastive learning framework, we follow the standard contrastive learning framework (Chopra et al., 2005). Specifically, we denote the pre-processed context-definition alignment set as $\mathcal{T}$, where each instance $(S, i, j, D) \in \mathcal{T}$ contains a context sentence $S$, which is a list of tokens $w^S_1, w^S_2, ..., w^S_n$, the target word starting position $i$, the target word ending position $j$, and a definition sentence $D$, which is also a list of tokens $w^D_1, w^D_2, ..., w^D_m$. We follow the standard approach to get the contextualized word representation as the mean pooling of all sub-token representations:

$$c_{S,i,j} = \frac{1}{j - i + 1} \sum_{k=i}^{j} e_k, \tag{1}$$

where $e_k$ is the contextualized representation of token $k$ produced by a transformer-based language model (e.g., BERT (Devlin et al., 2019)). For the sentence encoding, we choose to use the average representation of all tokens as follows:

$$d_D = FFN\left(\frac{1}{m} \sum_{k=1}^{m} e_k\right), \tag{2}$$

where $FFN$ represents a two-layer feed-forward neural network and $e_k$ is the token representation of token $w^D_k$.
Following the contrastive learning framework, during this step, we optimize the marginal ranking loss. Assume that the set of randomly sampled negative definitions is $\mathcal{D}'$; for each $D' \in \mathcal{D}'$, we follow Equation 2 to compute its representation $d_{D'}$. For each instance $(S, i, j, D) \in \mathcal{T}$ and a randomly sampled negative definition set $\mathcal{D}'$, we minimize the following marginal ranking loss:

$$L = \sum_{D' \in \mathcal{D}'} \max\left(0,\ \epsilon - \cos(c_{S,i,j}, d_D) + \cos(c_{S,i,j}, d_{D'})\right), \tag{3}$$

where $\max$ means the maximum operation, $\cos$ is the cosine similarity, and $\epsilon$ is the margin.
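The mean pooling of the mention span and the marginal ranking loss described above can be sketched in plain Python. This is an illustration under assumptions rather than the released training code: the function names are hypothetical, vectors are plain lists standing in for encoder outputs, the hinge form with a sum over negatives is the standard formulation and is assumed here, and the default margin of 0.2 is the value reported in the implementation details.

```python
import math


def mean_pool(vectors):
    """Average a non-empty list of equal-length vectors element-wise."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def margin_ranking_loss(mention, positive, negatives, margin=0.2):
    """Sum over negatives of max(0, margin - cos(m, d_pos) + cos(m, d_neg))."""
    pos_sim = cosine(mention, positive)
    return sum(max(0.0, margin - pos_sim + cosine(mention, neg)) for neg in negatives)


# Toy example: a two-sub-token mention, one matching and one unrelated definition.
mention_vec = mean_pool([[1.0, 0.0], [0.8, 0.2]])
print(margin_ranking_loss(mention_vec, [1.0, 0.1], [[0.0, 1.0]]))
```

The loss is zero once the correct definition is closer to the mention than every sampled negative by at least the margin, so only violating negatives contribute gradient in a real training loop.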

Query-specific Warming
After the pre-training phase, the model has a rough understanding of how to project the contextualized event mentions and corresponding definitions into nearby positions in the embedding space. However, its capability of distinguishing similar definitions is still limited because the previous negative sampling strategy does not encourage such capabilities. To address this issue, we introduce an additional warming phase to help models learn the minor differences between similar definitions.
Similar to how human beings understand new concepts by recalling relevant knowledge, we also retrieve relevant knowledge from $\mathcal{T}$ to further fine-tune the model. Specifically, assume that the set of event definitions of interest is $\hat{\mathcal{D}}$; for each $\hat{D} \in \hat{\mathcal{D}}$, we first retrieve the most similar definition $\tilde{D}$ from the original definition set $\mathcal{D}$ by:

$$\tilde{D} = \arg\max_{D \in \mathcal{D}} sim(PLM(\hat{D}), PLM(D)), \tag{4}$$

where $sim$ is the similarity measurement and $PLM$ represents the encoding with a pre-trained language model. In our experiment, we select cosine similarity as the similarity measurement and the average contextualized token embedding encoded with BERT-base (Devlin et al., 2019) as the encoding, but other techniques could also be applied.
We thus denote the set of all retrieved relevant definitions as $\tilde{\mathcal{D}}$ and select a subset $\tilde{\mathcal{T}}$ of $\mathcal{T}$ such that all definitions in $\tilde{\mathcal{T}}$ belong to $\tilde{\mathcal{D}}$ to further fine-tune the model. After generating all the data, we fine-tune all models following the loss function in Equation 3.
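The retrieval and subset selection of the warming phase can be sketched as below. This is a hedged illustration, not the released code: the function names and the dictionary and tuple layouts are hypothetical, and the embeddings are assumed to have been computed beforehand (the paper uses averaged BERT-base token embeddings).

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def retrieve_similar_definitions(target_vecs, pool_vecs):
    """Map every target definition id to the id of its most similar pool definition."""
    retrieved = {}
    for target_id, target_vec in target_vecs.items():
        retrieved[target_id] = max(
            pool_vecs, key=lambda pool_id: cosine(target_vec, pool_vecs[pool_id])
        )
    return retrieved


def select_warming_subset(alignments, retrieved_ids):
    """Keep only alignments whose definition id (first field) was retrieved."""
    keep = set(retrieved_ids)
    return [alignment for alignment in alignments if alignment[0] in keep]
```

A typical call would pass the values of the retrieval result to `select_warming_subset`, so that only contexts aligned with retrieved definitions are used for further fine-tuning.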

Inference
During inference, we compute the representations of each candidate event mention in $\mathcal{S}$ and of the target event type descriptions. After that, for each candidate mention, we compute its cosine similarity with all the target event-type representations. If the largest similarity is larger than a threshold t, this mention is identified and labeled as the most similar event type. Assume that the numbers of candidate mentions and target event types are N and T, respectively. Compared with previous zero-shot models that rely on the joint encoding of the candidate mention and target event types (Du and Cardie, 2020; Lyu et al., 2021; Yao et al., 2021), we successfully reduce the computational complexity from O(N * T) to O(N + T). A numerical evaluation of the computation efficiency is shown in Section 6.2.
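The thresholded nearest-type decision can be sketched as follows. The function names and data layout are hypothetical, the vectors stand in for the outputs of the context and definition encoders, and the default threshold of 0.7 is the value reported in the implementation details.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def classify_mentions(mention_vecs, type_vecs, threshold=0.7):
    """Label each mention with its most similar event type, or None if below threshold.

    type_vecs is encoded once and reused for every mention, which is what
    keeps the number of encoder calls at N + T.
    """
    predictions = []
    for mention in mention_vecs:
        best_type, best_sim = None, -1.0
        for type_name, type_vec in type_vecs.items():
            sim = cosine(mention, type_vec)
            if sim > best_sim:
                best_type, best_sim = type_name, sim
        predictions.append(best_type if best_sim > threshold else None)
    return predictions
```

Returning `None` for low-similarity candidates is how identification and classification are handled by a single similarity computation in this sketch.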

Experiments
This section introduces experiment details, including the selected baseline methods, experiment datasets, and implementation details.

Baseline Methods
In the past two years, the community has been devoting significant effort to solving the zero-shot event extraction problem with different approaches. Specifically, we select the following best-performing models as our baselines. [...] based on the similarity between the mention representation and event type representations.
Besides these baselines, we also present the "Chance" performance, where a mention is randomly selected following the percentage of gold mentions and randomly assigned an event type, and the "Most Popular Event Type" performance, where a mention is also randomly selected following the percentage of gold mentions and is always predicted to be the most popular event type.

Evaluation Dataset
We select MAVEN (Wang et al., 2020b) as the evaluation dataset due to its large scale and balanced distribution. Specifically, MAVEN contains 168 unique event types selected from FrameNet (Baker et al., 1998) and 118,732 annotated event mentions, which is almost two orders of magnitude larger than previous datasets such as ACE (Grishman et al., 2005). Moreover, MAVEN provides the official event mention candidates to evaluate the mention understanding capability of all event extraction models more fairly. As the original dataset only provides the event name in the format of a phrase (e.g., "Body_movement"), we directly use definitions from WordNet as the description. Examples of the event types and corresponding definitions are presented in Appendix Table 5.

Implementation Details
For baseline models, we conduct experiments with officially released code, hyperparameters, templates, and pre-trained models. For ZED, we use two separate encoders for the context and definition encoding. Both of them are initialized with BERT-base (Devlin et al., 2019) [...] set is needed in the zero-shot setting and the test set of MAVEN is not publicly available, we report the performance on the dev set. Specifically, we set the margin to 0.2 for the marginal ranking loss and set the number of negative examples to 2. The selection threshold at the inference phase is set to 0.7. We train the model for ten epochs in both the pre-training and warming phases. We directly evaluate the last checkpoint to simulate the real application, where no dev set is available. All models are trained on a Tesla P40 with batch size 16. The pre-training and warming phases take around 200 and 3 hours on a single GPU, respectively, but training could be sped up with multiple GPUs.

Zero-shot Performance
The zero-shot performance of all models is presented in Table 2, from which we can make the following observations:
1. All models significantly outperform the naive baselines even though they do not use any annotations. This observation shows that current deep models can indeed learn rich semantics that could generalize outside of their original training goal.
2. The overall performance of pre-trained QA, TE, and WSD models is not satisfying because they suffer from domain shifting. For example, even though current deep-model-driven QA models have surpassed human performance on several [...]
3. Compared with other methods, contextualized label embedding (CLE) achieves lower identification F1 but higher classification accuracy, which aligns with the original observation in (Zhang et al., 2021). The reason is that, due to the cone property of the BERT representation (i.e., most of the token representations of BERT are grouped in a small region), it is tough to determine the cosine similarity boundary of whether an event mention fits a specific event type. As a result, even though CLE could accurately identify high-confidence mentions, it cannot handle boundary ones very well.
4. Compared with baseline methods, ZED could perform better on both the identification and classification tasks. The main reason is that we are using definitions to model the label semantics, which is more accurate than a single word.

Ablation Study
From the ablation study results in Table 3, we can see that if we remove the warming phase, the model's performance will drop on both the identification and classification, especially the classification step.This aligns well with our assumption that the model can learn to model the general definition semantics after the pre-training step but cannot distinguish minor differences very well.The performance drop of removing the strong negative sampling module indicates that strong negatives are crucial for the success of representation learning, which aligns well with previous observations (Clark et al., 2020).
Besides those ablation studies, we also show the impact of the pre-training data scale in Figure 3. As expected, the more data we use, the better performance we will get. However, the performance gain after 10 instances per synset is limited. As a result, we select 10 instances for each synset as the pre-training data for training efficiency.

Inference Efficiency
We present the inference speed of all evaluated models in Figure 4. As ZED adopts a disjoint encoding design, we successfully reduce the computational complexity from O(N * T) to O(N + T), where N is the number of event mentions and T is the number of event types. On MAVEN, which has 168 different event types, ZED could speed up inference by almost two orders of magnitude.
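As a rough sanity check on the size of this speed-up, one can compare the number of encoder forward passes under the two designs. This is back-of-the-envelope arithmetic that assumes forward passes dominate the cost and that N is much larger than T; it is not a measurement taken from Figure 4.

$$\frac{N \cdot T}{N + T} = \frac{T}{1 + T/N} \;\longrightarrow\; T \quad (N \gg T), \qquad T = 168 \approx 10^{2.2}$$

Under these assumptions the reduction in forward passes approaches a factor of T, which for the 168 MAVEN event types is on the order of two orders of magnitude.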

Warming with Gold Annotation
ZED can also be adapted to a fully supervised or few-shot learning setting when annotation is available. Specifically, during the warming phase of our model, we can replace the auto-retrieved examples with the annotated ones and fine-tune the model. In this section, we follow the benchmark paper (Wang et al., 2020b) to compare with the recent language model-driven baselines: DMBERT (Wang et al., 2019) and BERT (Devlin et al., 2019) + CRF (Lafferty et al., 2001), which also achieved the previous best performance. Please refer to the original papers for technical details of these baseline models. We implement all models with the officially released code and report the average performance of five trials on the development set. All models are trained for ten epochs, and the final model is evaluated. Like the zero-shot setting, we also report the micro precision, recall, and F1 for both the "identification" and "identification+classification" settings. All hyper-parameters are based on the officially released code.
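The two evaluation settings can be sketched with a small scoring helper. This is an illustrative implementation of micro precision, recall, and F1 under assumptions, not the official MAVEN scorer: the function name and the dictionary layout (mention id mapped to event type, containing only mentions labeled as events) are hypothetical.

```python
def micro_prf(gold, pred, with_type=True):
    """Micro precision, recall, and F1 over event mentions.

    gold / pred: dicts mapping a mention id to its event type; mentions that
    are not events are simply absent. with_type=False scores identification
    only, with_type=True scores identification+classification.
    """
    if with_type:
        correct = sum(1 for mention, label in pred.items() if gold.get(mention) == label)
    else:
        correct = sum(1 for mention in pred if mention in gold)
    precision = correct / len(pred) if pred else 0.0
    recall = correct / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```

By construction, the identification+classification score can never exceed the identification score for the same predictions, since a correctly typed mention is also a correctly identified one.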
Results in Table 4 show that with the help of the pre-training step, our model can outperform all previous supervised models on the identification task and achieve comparable performance on the classification task. This makes sense because a carefully designed deep model could learn to identify and classify event mentions well with the large-scale annotation provided by MAVEN.
However, we argue that such large-scale annotation is often expensive in terms of money and time, and the data-scarce learning setting might be more applicable in real applications. Thus, we also test the performance of these supervised models under the data-scarce learning setting. Specifically, we randomly sample 1% to 10% of the training sentences and report the performance in Figure 5. Our model consistently outperforms baseline models with a small number of annotations. In particular, when only 1% of the data is available, which amounts to only 7.07 mentions per event type, ZED could achieve over 50 F1. With 10% of the training data, ZED could achieve over 95% of the fully supervised performance. These observations show that, besides the zero-shot setting, our framework could be applied to broader applications where limited or sufficient annotations are available. Another interesting finding is that even though "BERT+CRF" could outperform "DMBERT" slightly when enough annotation is available, which is consistent with the observations in (Wang et al., 2020b), its performance is worse under the data-scarce setting. This observation indicates that using CRF might not be the optimal option when the annotation scale is limited.
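The data-scarce splits can be simulated with a simple sentence-level subsampling routine. This is a minimal sketch under assumptions: the function name, the rounding rule, and the fixed seed are hypothetical, and the paper does not specify how its subsets were drawn beyond random selection of training sentences.

```python
import random


def subsample_sentences(sentences, fraction, seed=0):
    """Randomly keep the given fraction of training sentences (at least one)."""
    rng = random.Random(seed)
    k = max(1, round(len(sentences) * fraction))
    return rng.sample(sentences, k)


# Build the 1% to 10% splits from a toy training set of sentence ids.
training_sentences = list(range(1000))
splits = {pct: subsample_sentences(training_sentences, pct / 100) for pct in range(1, 11)}
```

Sampling at the sentence level rather than the mention level keeps each retained sentence fully annotated, which matches the description of selecting training sentences.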

Conclusion
This paper proposes a novel zero-shot event extraction framework, ZED. Given a set of event types of interest in the format of definitions, ZED could automatically extract all the event mentions that fit the definitions from raw documents much better than previous methods. Experiments show that the proposed warming phase and the mixed strong negative example sampling strategy contribute to the success of ZED. Additional experiments also show that ZED could be applied to the supervised setting. Thanks to the pre-training phase, it could achieve good performance under both the fully supervised and data-scarce settings.

Figure 3: Effect of training instance number per definition. (Zero-shot performance on the Identification+Classification F1 is reported.)

Figure 4: Inference speed of all zero-shot models. For a fair comparison, we evaluate all models with the same GPU and use batch size 1. As our model is smaller than baseline models, we could use a larger batch size in real applications to further boost efficiency.

Figure 5: Model performance with limited annotation.

Table 1: Demonstration of collected context and definition alignments. Target mentions are underlined.

Table 2: Identification and classification results on MAVEN (Wang et al., 2020b), which has 168 event types. Best F1 performances are indicated with bold font.
2. Pre-trained Textual Entailment Models (Lyu et al., 2021) (TE): Motivated by the QA approach, Lyu et al. (2021) explore the possibility of utilizing a pre-trained textual entailment (TE) model to automatically extract events.Specifically, for each target event type, Lyu et al. (

Table 3: Ablation study. "I" and "C" represent identification and classification, respectively.

Table 4: Model performance with full annotations. Best F1 performances are indicated with bold font. "I" and "C" indicate event identification and classification, respectively.