CLEVE: Contrastive Pre-training for Event Extraction

Event extraction (EE) has considerably benefited from fine-tuning pre-trained language models (PLMs). However, existing pre-training methods do not involve modeling event characteristics, so the resulting EE models cannot take full advantage of large-scale unsupervised data. To this end, we propose CLEVE, a contrastive pre-training framework for EE that better learns event knowledge from large unsupervised data and its semantic structures (e.g., AMR) obtained with automatic parsers. CLEVE contains a text encoder to learn event semantics and a graph encoder to learn event structures, respectively. Specifically, the text encoder learns event semantic representations by self-supervised contrastive learning, representing the words of the same events closer to each other than to unrelated words; the graph encoder learns event structure representations by graph contrastive pre-training on parsed event-related semantic structures. The two complementary representations then work together to improve both conventional supervised EE and unsupervised "liberal" EE, which requires jointly extracting events and discovering event schemata without any annotated data. Experiments on the ACE 2005 and MAVEN datasets show that CLEVE achieves significant improvements, especially in the challenging unsupervised setting. The source code and pre-trained checkpoints can be obtained from https://github.com/THU-KEG/CLEVE.


Introduction
Event extraction (EE) is a long-standing and crucial information extraction task, which aims at extracting event structures from unstructured text. As illustrated in Figure 1, it contains the event detection task to identify event triggers (the word "attack") and classify event types (Attack), as well as the event argument extraction task to identify entities serving as event arguments ("today" and "Netanya") and classify their argument roles (Time-within and Place) (Ahn, 2006). By explicitly capturing the event structure in text, EE can benefit various downstream tasks such as information retrieval (Glavaš and Šnajder, 2014) and knowledge base population (Ji and Grishman, 2011). Existing EE methods mainly follow the supervised-learning paradigm to train advanced neural networks (Chen et al., 2015; Nguyen et al., 2016; Nguyen and Grishman, 2018) with human-annotated datasets and pre-defined event schemata. These methods work well on many public benchmarks such as ACE 2005 (Walker et al., 2006) and TAC KBP (Ellis et al., 2016), yet they still suffer from data scarcity and limited generalizability. Since annotating event data and defining event schemata are especially expensive and labor-intensive, existing EE datasets typically contain only thousands of instances and cover limited event types. Thus they are inadequate for training large neural models and for developing methods that can generalize to continually-emerging new event types (Huang and Ji, 2020).
Inspired by the success of recent pre-trained language models (PLMs) on NLP tasks, some pioneering work (Wang et al., 2019a; Wadden et al., 2019) attempts to fine-tune general PLMs (e.g., BERT (Devlin et al., 2019)) for EE. Benefiting from the strong general language understanding ability learnt from large-scale unsupervised data, these PLM-based methods have achieved state-of-the-art performance on various public benchmarks.
Although leveraging unsupervised data with pre-training has gradually become a consensus in the EE and NLP communities, there still lacks a pre-training method oriented to event modeling that takes full advantage of the rich event knowledge lying in large-scale unsupervised data. The key challenge here is to find reasonable self-supervised signals (Chen et al., 2017; Wang et al., 2019a) for the diverse semantics and complex structures of events. Fortunately, previous work (Aguilar et al., 2014; Huang et al., 2016) has suggested that sentence semantic structures, such as abstract meaning representation (AMR) (Banarescu et al., 2013), contain broad and diverse semantic and structural information relating to events. As shown in Figure 1, the parsed AMR structure covers not only the annotated event (Attack) but also the event that is not defined in the ACE 2005 schema (Report).
Considering the fact that the AMR structures of large-scale unsupervised data can be easily obtained with automatic parsers (Wang et al., 2015), we propose CLEVE, an event-oriented contrastive pre-training framework utilizing AMR structures to build self-supervision signals. CLEVE consists of two components, including a text encoder to learn event semantics and a graph encoder to learn event structure information. Specifically, to learn effective event semantic representations, we employ a PLM as the text encoder and encourage the representations of the word pairs connected by the ARG, time, location edges in AMR structures to be closer in the semantic space than other unrelated words, since these pairs usually refer to the trigger-argument pairs of the same events (as shown in Figure 1) (Huang et al., 2016). This is done by contrastive learning with the connected word pairs as positive samples and unrelated words as negative samples. Moreover, considering event structures are also helpful in extracting events (Lai et al., 2020) and generalizing to new event schemata (Huang et al., 2018), we need to learn transferable event structure representations. Hence we further introduce a graph neural network (GNN) as the graph encoder to encode AMR structures as structure representations. The graph encoder is contrastively pre-trained on the parsed AMR structures of large unsupervised corpora with AMR subgraph discrimination as the objective.
By fine-tuning the two pre-trained models on downstream EE datasets and jointly using the two representations, CLEVE can benefit the conventional supervised EE suffering from data scarcity. Meanwhile, the pre-trained representations can also directly help extract events and discover new event schemata without any known event schema or annotated instances, leading to better generalizability. This is a challenging unsupervised setting named "liberal event extraction" (Huang et al., 2016). Experiments on the widely-used ACE 2005 and the large MAVEN datasets indicate that CLEVE can achieve significant improvements in both settings.

Related Work
Event Extraction. Most existing EE works follow the supervised learning paradigm. Traditional EE methods (Ji and Grishman, 2008; Gupta and Ji, 2009; Li et al., 2013) rely on manually-crafted features to extract events. In recent years, neural models have become mainstream; they automatically learn effective features with neural networks, including convolutional neural networks (Nguyen and Grishman, 2015; Chen et al., 2015), recurrent neural networks (Nguyen et al., 2016), and graph convolutional networks (Nguyen and Grishman, 2018; Lai et al., 2020). With the recent success of BERT (Devlin et al., 2019), PLMs have also been used for EE (Wang et al., 2019a,b; Yang et al., 2019; Wadden et al., 2019; Tong et al., 2020). Although achieving remarkable performance on benchmarks such as ACE 2005 (Walker et al., 2006) and similar datasets (Ellis et al., 2015, 2016; Getman et al., 2017), these PLM-based works solely focus on better fine-tuning rather than pre-training for EE. In this paper, we study pre-training to better utilize the rich event knowledge in large-scale unsupervised data.
Event Schema Induction. Supervised EE models cannot generalize to continually-emerging new event types and argument roles. To this end, Chambers and Jurafsky (2011) explore inducing event schemata from raw text by unsupervised clustering. Follow-up works introduce more features like coreference chains (Chambers, 2013) and entities (Sha et al., 2016). Recently, Huang and Ji (2020) study a semi-supervised setting allowing the use of annotated data of known types. Following Huang et al. (2016), we evaluate the generalizability of CLEVE in the most challenging unsupervised "liberal" setting, which requires inducing event schemata and extracting event instances from raw text at the same time.
Contrastive Learning. Contrastive learning was initiated by Hadsell et al. (2006) with the intuitive motivation of learning similar representations for "neighbors" and distinct representations for "non-neighbors", and has been widely used for self-supervised representation learning in various domains, such as computer vision (Wu et al., 2018; Oord et al., 2018; Hjelm et al., 2019; He et al., 2020) and graphs (Qiu et al., 2020; You et al., 2020). In the context of NLP, many established representation learning works can be viewed as contrastive learning methods, such as Word2Vec (Mikolov et al., 2013), BERT (Devlin et al., 2019; Kong et al., 2020) and ELECTRA (Clark et al., 2020). Similar to this work, contrastive learning is also widely used to help specific tasks, including question answering (Yeh and Chen, 2019), discourse modeling (Iter et al., 2020), natural language inference (Cui et al., 2020), and relation extraction.

Methodology
The overall CLEVE framework is illustrated in Figure 2. As shown in the illustration, our contrastive pre-training framework CLEVE consists of two components: event semantic pre-training and event structure pre-training, whose details are introduced in Section 3.2 and Section 3.3, respectively. We first introduce the required preprocessing in Section 3.1, including AMR parsing and how we modify the parsed AMR structures for our pre-training.

Preprocessing
CLEVE relies on AMR structures (Banarescu et al., 2013) to build broad and diverse self-supervision signals for learning event knowledge from large-scale unsupervised corpora. To do this, we use automatic AMR parsers (Wang et al., 2015) to parse the sentences in unsupervised corpora into AMR structures. Each AMR structure is a directed acyclic graph with concepts as nodes and semantic relations as edges. Moreover, each node typically corresponds to at most one word, and a multi-word entity is represented as a list of nodes connected with name and op (conjunction operator) edges. Considering that pre-training entity representations will naturally benefit event argument extraction, we merge these lists into single nodes representing multi-word entities (like "CNN's Kelly Wallace" in Figure 1) during both event semantic and structure pre-training. Formally, given a sentence s in the unsupervised corpora, we obtain its AMR graph g_s = (V_s, E_s) after AMR parsing, where V_s is the node set after word merging and E_s denotes the edge set.
Each edge in E_s is a triplet (u, v, r), where u, v ∈ V_s and r ∈ R, with R being the set of defined semantic relation types.
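As a toy illustration of the merging step, the name/op node lists can be collapsed with a small union-find pass over an edge-list graph (a sketch with hypothetical node ids and relation names; real AMR graphs carry richer node and edge attributes):

```python
def merge_entity_nodes(nodes, edges):
    """Merge nodes connected by 'name' or 'op' edges into single multi-word
    entity nodes. nodes: a set of node ids; edges: (head, tail, relation)
    triples. Returns the node->merged-node map and the rewritten edge set."""
    # Union-find over nodes linked by name/op edges.
    parent = {v: v for v in nodes}

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]  # path halving
            v = parent[v]
        return v

    for u, v, r in edges:
        if r in ("name", "op"):
            parent[find(u)] = find(v)

    merged = {v: find(v) for v in nodes}
    # Keep only edges between distinct merged nodes, dropping name/op links.
    new_edges = {(merged[u], merged[v], r) for u, v, r in edges
                 if r not in ("name", "op") and merged[u] != merged[v]}
    return merged, new_edges
```

For example, the node list for a multi-word entity such as "CNN's Kelly Wallace" collapses into one node, while the semantic edges attached to any of its parts survive on the merged node.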

Event Semantic Pre-training
To model diverse event semantics in large unsupervised corpora and learn contextualized event semantic representations, we adopt a PLM as the text encoder and train it with the objective to discriminate various trigger-argument pairs.

Text Encoder
Like most PLMs, we adopt a multi-layer Transformer (Vaswani et al., 2017) as the text encoder for its strong representation capacity. Given a sentence s = {w_1, w_2, ..., w_n} containing n tokens, we feed it into the multi-layer Transformer and use the last layer's hidden vectors as token representations. Moreover, a node v ∈ V_s may correspond to a multi-token text span in s, and we need a unified representation for the node in pre-training. As suggested by Baldini Soares et al. (2019), we insert two special markers [E1] and [/E1] at the beginning and end of the span, respectively, and use the hidden vector of [E1] as the span representation x_v of the node v. Different marker pairs are used for different nodes.
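The marker insertion can be sketched in a few lines; the token list, span boundaries, and marker strings below are illustrative, and a real implementation would operate on the subword tokens of the PLM:

```python
def insert_span_markers(tokens, spans):
    """Insert [Ei]/[/Ei] marker pairs around each (start, end) span
    (end exclusive), as in Baldini Soares et al. (2019). Returns the new
    token list and, for each span, the index of its opening marker, whose
    hidden vector serves as the span representation."""
    # Process spans right-to-left so earlier indices stay valid.
    order = sorted(range(len(spans)), key=lambda i: spans[i][0], reverse=True)
    tokens = list(tokens)
    for i in order:
        start, end = spans[i]
        tokens[end:end] = [f"[/E{i + 1}]"]
        tokens[start:start] = [f"[E{i + 1}]"]
    # Recompute marker positions after all insertions.
    positions = [tokens.index(f"[E{i + 1}]") for i in range(len(spans))]
    return tokens, positions
```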
As our event semantic pre-training focuses on modeling event semantics, we start our pre-training from a well-trained general PLM to obtain general language understanding abilities. CLEVE is agnostic to the model architecture and can use any general PLM, like BERT (Devlin et al., 2019) and RoBERTa (Liu et al., 2019).

Trigger-Argument Pair Discrimination
We design trigger-argument pair discrimination as the contrastive task for event semantic pre-training. The basic idea is to learn closer representations for the words in the same events than for unrelated words. We note that the words connected by ARG, time, and location edges in AMR structures are quite similar to the trigger-argument pairs in events (Huang et al., 2016, 2018), i.e., the key words evoking events and the entities participating in events. For example, in Figure 1, "Netanya" is an argument of the "attack" event, while the disconnected "CNN's Kelly Wallace" is not. With this observation, we can use these special word pairs as positive trigger-argument samples and train the text encoder to discriminate them from negative samples, so that the encoder learns to model event semantics without human annotation. Let R_p = {ARG, time, location} and let P_s = {(u, v) | ∃(u, v, r) ∈ E_s, r ∈ R_p} denote the set of positive trigger-argument pairs in sentence s.
For a specific positive pair (t, a) ∈ P_s, as shown in Figure 2, we construct its corresponding negative samples with trigger replacement and argument replacement. Specifically, in trigger replacement, we construct m_t negative pairs by randomly sampling m_t negative triggers t̂ ∈ V_s and combining them with the positive argument a. A negative trigger t̂ must not have a directed ARG, time, or location edge to a, i.e., (t̂, a, r) ∉ E_s for any r ∈ R_p. Similarly, we construct m_a more negative pairs by randomly sampling m_a negative arguments â ∈ V_s satisfying (t, â, r) ∉ E_s for any r ∈ R_p. As in the example in Figure 2, ("attack", "reports") is a valid negative sample for the positive sample ("attack", "Netanya"), but ("attack", "today's") is not valid since there is an ("attack", "today's", time) edge.
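The negative sampling procedure can be sketched as follows over a toy edge-list representation of the parsed AMR graph; treating all ARG-prefixed relations (ARG0, ARG1, ...) as positive in `is_pos_rel` is our reading of the ARG edge family:

```python
import random

def sample_negative_pairs(t, a, nodes, edges, m_t, m_a, rng=random):
    """Construct negative pairs for a positive trigger-argument pair (t, a).
    A replacement node is invalid if it forms a directed ARG*/time/location
    edge with the kept element of the pair."""
    def is_pos_rel(r):
        return r.startswith("ARG") or r in ("time", "location")

    pos_edges = {(u, v) for u, v, r in edges if is_pos_rel(r)}
    # Negative triggers: nodes with no positive edge pointing to argument a.
    neg_triggers = sorted(v for v in nodes
                          if v not in (t, a) and (v, a) not in pos_edges)
    # Negative arguments: nodes trigger t has no positive edge to.
    neg_args = sorted(v for v in nodes
                      if v not in (t, a) and (t, v) not in pos_edges)

    negatives = [(t_hat, a) for t_hat in
                 rng.sample(neg_triggers, min(m_t, len(neg_triggers)))]
    negatives += [(t, a_hat) for a_hat in
                  rng.sample(neg_args, min(m_a, len(neg_args)))]
    return negatives
```

On the Figure 1 example, ("attack", "reports") comes out as a valid argument-replacement negative, while "today" is rejected because of its time edge to "attack".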
To learn to discriminate the positive trigger-argument pair from the negative pairs, and thereby model event semantics, we define the training objective for a positive pair (t, a) as a cross-entropy loss of classifying the positive pair correctly:

L(t, a) = −log [ exp(x_t^⊤ W x_a) / ( exp(x_t^⊤ W x_a) + Σ_{t̂} exp(x_{t̂}^⊤ W x_a) + Σ_{â} exp(x_t^⊤ W x_â) ) ],

where the two sums run over the m_t negative triggers t̂ and the m_a negative arguments â, m_t and m_a are hyper-parameters for negative sampling, and W is a trainable matrix learning the similarity metric. We adopt the cross-entropy loss here since it is more effective than other contrastive loss forms (Oord et al., 2018).
We then obtain the overall training objective for event semantic pre-training by summing up the losses of all the positive pairs of all sentences s in the mini-batch B_s:

L_sem(θ) = Σ_{s ∈ B_s} Σ_{(t,a) ∈ P_s} L(t, a),

where θ denotes the trainable parameters, including those of the text encoder and W.
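The cross-entropy objective can be illustrated with plain Python lists (a minimal sketch; a real implementation would use a tensor library with batched bilinear scoring):

```python
import math

def bilinear_score(x_t, x_a, W):
    """s(t, a) = x_t^T W x_a over plain Python lists."""
    Wx = [sum(W[i][j] * x_a[j] for j in range(len(x_a)))
          for i in range(len(W))]
    return sum(x_t[i] * Wx[i] for i in range(len(x_t)))

def pair_loss(x_t, x_a, neg_trigger_reps, neg_arg_reps, W):
    """Cross-entropy loss of classifying the positive pair (t, a) correctly
    against its trigger-replaced and argument-replaced negatives."""
    pos = bilinear_score(x_t, x_a, W)
    scores = [pos]
    scores += [bilinear_score(x, x_a, W) for x in neg_trigger_reps]
    scores += [bilinear_score(x_t, x, W) for x in neg_arg_reps]
    log_z = math.log(sum(math.exp(s) for s in scores))
    return log_z - pos  # = -log softmax(positive score)
```

The loss shrinks toward zero as the positive pair's bilinear score dominates all negative scores.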

Event Structure Pre-training
Previous work has shown that event-related structures are helpful for extracting new events (Lai et al., 2020) as well as for discovering and generalizing to new event schemata (Huang et al., 2016, 2018; Huang and Ji, 2020). Hence we conduct event structure pre-training on a GNN graph encoder to learn transferable event-related structure representations, following recent advances in graph contrastive pre-training (Qiu et al., 2020; You et al., 2020). Specifically, we pre-train the graph encoder with the AMR subgraph discrimination task.

Graph Encoder
In CLEVE, we utilize a GNN to encode an AMR (sub)graph and extract the event structure information of the text. Given a graph g, the graph encoder represents it with a graph embedding g = G(g, {x_v}), where G is the graph encoder and {x_v} denotes the initial node representations fed into the graph encoder. CLEVE is agnostic to the specific model architecture of the graph encoder.
Here we use a state-of-the-art GNN model, the Graph Isomorphism Network (Xu et al., 2019), as our graph encoder for its strong representation ability. We use the corresponding text span representations {x_v} produced by our pre-trained text encoder (introduced in Section 3.2) as the initial node representations for both pre-training and inference of the graph encoder. This node initialization also implicitly aligns the semantic spaces of the event semantic and structure representations in CLEVE, which helps them cooperate better.

AMR Subgraph Discrimination
To learn transferable event structure representations, we design the AMR subgraph discrimination task for event structure pre-training. The basic idea is to learn similar representations for the subgraphs sampled from the same AMR graph by discriminating them from subgraphs sampled from other AMR graphs (Qiu et al., 2020).
Given a batch of m AMR graphs {g_1, g_2, ..., g_m}, each graph corresponds to a sentence in the unsupervised corpora. For the i-th graph g_i, we randomly sample two subgraphs from it to obtain a positive pair a_{2i−1} and a_{2i}. All the subgraphs sampled from the other AMR graphs in the mini-batch serve as negative samples. As in Figure 2, the two green (w/ "attack") subgraphs are a positive pair, while the two subgraphs sampled from the purple (w/ "soldier") graph are negative samples. Here we use the subgraph sampling strategy introduced by Qiu et al. (2020), whose details are shown in Appendix C.
Similar to event semantic pre-training, we adopt the graph encoder to represent the samples, a_i = G(a_i, {x_v}), and define the training objective as:

L_str(θ) = − Σ_{i=1}^{m} log [ exp(a_{2i−1}^⊤ a_{2i}) / Σ_{j=1}^{2m} 1_{[j ≠ 2i−1]} exp(a_{2i−1}^⊤ a_j) ],

where 1_{[j ≠ 2i−1]} ∈ {0, 1} is an indicator function evaluating to 1 iff j ≠ 2i − 1, and θ denotes the trainable parameters of the graph encoder.
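A minimal sketch of this batch-wise objective, assuming the common NT-Xent-style formulation in which each positive pair is scored against every other subgraph embedding in the batch (the paper's exact indicator term may differ slightly):

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def subgraph_discrimination_loss(reps):
    """reps holds 2m subgraph embeddings; (reps[2i], reps[2i+1]) is the
    positive pair sampled from graph i. Each positive pair's dot product is
    pushed to dominate the dot products with all other subgraphs."""
    two_m = len(reps)
    total = 0.0
    for i in range(0, two_m, 2):
        pos = math.exp(dot(reps[i], reps[i + 1]))
        # Denominator: every sample except the anchor reps[i+1] itself.
        denom = sum(math.exp(dot(reps[j], reps[i + 1]))
                    for j in range(two_m) if j != i + 1)
        total += -math.log(pos / denom)
    return total
```

A batch whose positive pairs are aligned yields a lower loss than one whose pairs point in unrelated directions.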

Experiment
We evaluate our methods in both the supervised setting and unsupervised "liberal" setting of EE.

Pre-training Setup
Before the detailed experiments, we introduce the pre-training setup of CLEVE. We adopt the New York Times Corpus (NYT) 1 (Sandhaus, 2008) as the unsupervised pre-training corpus for CLEVE. It contains over 1.8 million articles written and published by the New York Times between January 1, 1987, and June 19, 2007. We only use its raw text and obtain the AMR structures with a state-of-the-art AMR parser. We choose the NYT corpus because (1) it is large and diverse, covering a wide range of event semantics, and (2) its text domain is similar to our principal evaluation dataset ACE 2005, which is helpful (Gururangan et al., 2020). To prevent data leakage, we remove all the articles appearing in ACE 2005 from the NYT corpus during pre-training. Moreover, we also study the effect of different AMR parsers and pre-training corpora in Section 5.2 and Section 5.3, respectively.
For the text encoder, we use the same model architecture as RoBERTa (Liu et al., 2019), with 24 layers, 1024 hidden dimensions, and 16 attention heads, and we start our event semantic pre-training from the released checkpoint 2 . For the graph encoder, we adopt a graph isomorphism network (Xu et al., 2019) with 5 layers and 64 hidden dimensions, and pre-train it from scratch. For the detailed hyperparameters for pre-training and fine-tuning, please refer to Appendix D.

Adaptation of CLEVE
As our work focuses on pre-training rather than fine-tuning for EE, we use straightforward and common techniques to adapt pre-trained CLEVE to downstream EE tasks. In the supervised setting, we adopt the dynamic multi-pooling mechanism (Chen et al., 2015; Wang et al., 2019a,b) for the text encoder and encode the corresponding local subgraphs with the graph encoder. Then we concatenate the two representations as features and fine-tune CLEVE on supervised datasets. In the unsupervised "liberal" setting, we follow the overall pipeline of Huang et al. (2016) and directly use the representations produced by pre-trained CLEVE as the required trigger/argument semantic representations and event structure representations. For the details, please refer to Appendix A.

Supervised EE Dataset and Evaluation
We evaluate our models on the most widely-used ACE 2005 English subset (Walker et al., 2006) and the newly-constructed large-scale MAVEN dataset. ACE 2005 contains 599 English documents, which are annotated with 8 event types, 33 subtypes, and 35 argument roles. MAVEN contains 4,480 documents and 168 event types, and can only evaluate event detection. We split ACE 2005 following previous EE work (Liao and Grishman, 2010; Li et al., 2013; Chen et al., 2015) and use the official split for MAVEN. EE performance is evaluated on two subtasks: event detection (ED) and event argument extraction (EAE). We report the precision (P), recall (R) and F1 scores as evaluation results, among which F1 is the most comprehensive metric.
Baselines. We fine-tune our pre-trained CLEVE and use the original RoBERTa without our event semantic pre-training as an important baseline. For ablation studies, we evaluate two variants of CLEVE on both datasets: the w/o semantic model adopts a vanilla RoBERTa without event semantic pre-training as the text encoder, and the w/o structure model only uses the event semantic representations.

Evaluation Results
The evaluation results are shown in Table 1. We can observe that: (1) CLEVE significantly outperforms all the baselines, including those using dependency parsing information (dbRNN, GatedGCN, SemSynGTN and MOGANED). This demonstrates the effectiveness of our proposed contrastive pre-training method and of the AMR semantic structure. It is noteworthy that RCEE_ER outperforms our method on EAE due to the special advantages brought by reformulating EE as an MRC task, which lets it utilize sophisticated MRC methods and large annotated external MRC data. Considering that our method is essentially a pre-training method learning better event-oriented representations, CLEVE and RCEE_ER can naturally work together to improve EE further. (2) The ablation studies (comparisons between CLEVE and its w/o semantic and w/o structure variants) indicate that both event semantic pre-training and event structure pre-training are essential to our method. (3) From the comparisons between CLEVE and its variants on ACE (golden) and ACE (AMR), we can see that AMR parsing inevitably brings data noise compared to golden annotations, which results in a performance drop. However, this gap can easily be made up by the benefits of introducing large unsupervised data with pre-training.

Dataset and Evaluation
In the unsupervised setting, we evaluate CLEVE on ACE 2005 and MAVEN with both objective automatic metrics and human evaluation. For the automatic evaluation, we adopt the extrinsic clustering evaluation metrics, the B-Cubed metrics (Bagga and Baldwin, 1998), including B-Cubed precision, recall, and F1. The B-Cubed metrics evaluate the quality of clustering results by comparing them to gold-standard annotations and have been shown to be effective (Amigó et al., 2009). As reported by Huang et al. (2016), AMR parsing is significantly superior to dependency parsing and frame semantic parsing on the unsupervised "liberal" event extraction task, hence we do not include baselines using other sentence structures in the experiments.
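The B-Cubed metrics have a compact definition: for each item, precision is the fraction of its predicted cluster sharing its gold label, and recall is the fraction of its gold cluster recovered; both are averaged over all items. A minimal sketch:

```python
def b_cubed(predicted, gold):
    """B-Cubed precision/recall/F1 (Bagga and Baldwin, 1998).
    predicted, gold: dicts mapping each item to its cluster label."""
    items = list(gold)
    prec_sum = rec_sum = 0.0
    for x in items:
        pred_cluster = [y for y in items if predicted[y] == predicted[x]]
        gold_cluster = [y for y in items if gold[y] == gold[x]]
        # Items that are clustered with x and truly share x's gold label.
        correct = len([y for y in pred_cluster if gold[y] == gold[x]])
        prec_sum += correct / len(pred_cluster)
        rec_sum += correct / len(gold_cluster)
    p, r = prec_sum / len(items), rec_sum / len(items)
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1
```

Merging everything into one cluster keeps recall at 1.0 but drives precision down, which is why the F1 of the two is the headline metric.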

Evaluation Results
The automatic evaluation results are shown in Table 3 and Table 5. We can observe that: (1) CLEVE significantly outperforms all the baselines, which shows its superiority in both extracting event instances and discovering event schemata.
(2) RoBERTa ignores the structure information. Although RoBERTa+VGAE encodes event structures with VGAE, the semantic representations of RoBERTa and the structure representations of VGAE are distinct and thus cannot work well together. Hence the two models even underperform LiberalEE, while the two representations of CLEVE collaborate well to improve "liberal" EE. (3) In the ablation studies, discarding event structure pre-training results in a much more significant performance drop than in the supervised setting, which indicates that event structures are essential to discovering new event schemata.

Effect of Supervised Data Size
In this section, we study how the benefits of pre-training change with the available supervised data size. We compare the ED performance on MAVEN of CLEVE, RoBERTa and a non-pre-training model, BiLSTM+CRF, when trained on different proportions of randomly-sampled MAVEN training data in Figure 3. We can see that the improvements of CLEVE over RoBERTa, and of the pre-training models over the non-pre-training model, are generally larger when less supervised data is available. This indicates that CLEVE is especially helpful for low-resource EE tasks, which are common due to the expensive event annotation.

Effect of AMR Parsers
CLEVE relies on automatic AMR parsers to build self-supervision signals from large unsupervised data. Intuitively, the performance of the AMR parser will influence CLEVE's performance. To analyze the effect of different AMR parsing performance, we compare the supervised EE results of CLEVE models using the established CAMR (Wang et al., 2016) and a new state-of-the-art parser during pre-training in Table 6. We can see that a better AMR parser brings better EE performance, as expected, but the improvements are not as significant as the corresponding AMR performance improvement, which indicates that CLEVE is generally robust to errors in AMR parsing.

Effect of Pre-training Domain
Pre-training on similar text domains may further improve performance on corresponding downstream tasks (Gururangan et al., 2020; Gu et al., 2020). To analyze this effect, we evaluate the supervised EE performance of CLEVE pre-trained on NYT and on English Wikipedia.

Conclusion and Future work
In this paper, we propose CLEVE, a contrastive pre-training framework for event extraction to utilize the rich event knowledge lying in large unsupervised data. Experiments on two real-world datasets show that CLEVE can achieve significant improvements in both supervised and unsupervised "liberal" settings. In the future, we will (1) explore other kinds of semantic structures like the frame semantics and (2) attempt to overcome the noise in unsupervised data brought by the semantic parsers.

Acknowledgement
This work is supported by the National Natural Science Foundation of China Key Project (NSFC No. U1736204), grants from Beijing Academy of Artificial Intelligence (BAAI2019ZD0502) and the Institute for Guo Qiang, Tsinghua University (2019GQB0003). This work is also supported by the Pattern Recognition Center, WeChat AI, Tencent Inc. We thank Lifu Huang for his help on the unsupervised experiments and the anonymous reviewers for their insightful comments.

Ethical Considerations
We discuss the ethical considerations and broader impact of the proposed CLEVE method in this section: (1) Intellectual property. NYT and ACE 2005 datasets are obtained from the linguistic data consortium (LDC), and are both licensed to be used for research. MAVEN is publicly shared under the CC BY-SA 4.0 license 3 . The Wikipedia corpus is obtained from the Wikimedia dump 4 , which is shared under the CC BY-SA 3.0 license 5 . The invited expert is fairly paid according to agreed working hours.
(2) Intended use. CLEVE improves event extraction in both supervised and unsupervised settings, i.e., it better extracts structured events from diverse raw text. The extracted events then help people get information conveniently and can be used to build a wide range of application systems for tasks like information retrieval (Glavaš and Šnajder, 2014) and knowledge base population (Ji and Grishman, 2011). As extracting events is fundamental to various applications, failure cases and potential bias in EE methods also have a significant negative impact. We encourage the community to put more effort into analyzing and mitigating bias in EE systems. Considering that CLEVE does not model people's characteristics, we believe CLEVE will not introduce significant additional bias.
(3) Misuse risk. Although all the datasets used in this paper are public and licensed, there is a risk of using CLEVE on private data without authorization for profit. We encourage regulators to make efforts to mitigate this risk. (4) Energy and carbon costs. To estimate the energy and carbon costs, we present the computing platform and running time of our experiments in Appendix E for reference. We will also release the pre-trained checkpoints to avoid the additional carbon costs of potential users. We encourage users to try model compression techniques like distillation and quantization in deployment to reduce carbon costs.

A Downstream Adaptation of CLEVE
In this section, we introduce in detail how to adapt pre-trained CLEVE so that the event semantic and structure representations work together in downstream event extraction settings, including supervised EE and unsupervised "liberal" EE.

A.1 Supervised EE
In supervised EE, we fine-tune the pre-trained text encoder and graph encoder of CLEVE with annotated data. We formulate both event detection (ED) and event argument extraction (EAE) as multi-class classification tasks. An instance is defined as a sentence with a trigger candidate for ED, and a sentence with a given trigger and an argument candidate for EAE. The key question here is how to obtain the features of an instance to be classified. For the event semantic representation, we adopt dynamic multi-pooling to aggregate the embeddings produced by the text encoder into a unified semantic representation x_sem, following previous work (Chen et al., 2015; Wang et al., 2019a,b). Moreover, we also insert special markers to indicate candidates as in pre-training (Section 3.2). For the event structure representation, we parse the sentence into an AMR graph and find the node v corresponding to the trigger/argument candidate to be classified. Following Qiu et al. (2020), we encode v and its one-hop neighbors with the graph encoder to obtain the desired structure representation g_str. The initial node representations are also obtained with the text encoder, as introduced in Section 3.3.
We concatenate x sem and g str as the instance embedding and adopt a multi-layer perceptron along with softmax to get the logits. Then we fine-tune CLEVE with cross-entropy loss.
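Dynamic multi-pooling for ED can be sketched as follows: the token representations are split at the trigger position and each segment is max-pooled (for EAE the sentence is split at both the trigger and the argument, giving three segments; that case is omitted in this sketch):

```python
def dynamic_multi_pooling(hidden, trigger_idx):
    """Dynamic multi-pooling (Chen et al., 2015) for the ED case: split the
    token representations at the trigger position, max-pool each segment
    per dimension, and concatenate into one instance feature."""
    left = hidden[:trigger_idx + 1]   # tokens up to and including trigger
    right = hidden[trigger_idx:]      # trigger and the tokens after it

    def pool(segment):
        return [max(vec[d] for vec in segment)
                for d in range(len(segment[0]))]

    return pool(left) + pool(right)
```

Compared with plain max-pooling over the whole sentence, this preserves positional information relative to the trigger candidate.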

A.2 Unsupervised "Liberal" EE
Unsupervised "liberal" EE requires discovering event instances and event schemata only from raw text. We follow the pipeline of Huang et al. (2016) to parse sentences into AMR graphs and identify trigger and argument candidates with the AMR structures. We then cluster the candidates to obtain event instances and schemata with the joint constraint clustering algorithm (Huang et al., 2016), which requires semantic representations of the trigger and argument candidates as well as event structure representations. The details of this clustering algorithm are introduced in Appendix B. Here we straightforwardly use the corresponding text span representations (Section 3.2) as semantic representations and encode the whole AMR graphs with the graph encoder to obtain the desired event structure representations.

B Joint Constraint Clustering Algorithm
In unsupervised "liberal" event extraction (Huang et al., 2016), the joint constraint clustering algorithm is used to obtain trigger and argument clusters given trigger and argument candidate representations. CLEVE focuses on learning event-specific representations and can use any clustering algorithm. To compare fairly with Huang et al. (2016), we also use the joint constraint clustering algorithm in our unsupervised evaluation. Hence we briefly introduce this algorithm here.

B.1 Preliminaries
The input of this algorithm contains a trigger candidate set T and an argument candidate set A, as well as their semantic representations E_g^T and E_g^A, respectively. There is also an event structure representation E_R^t for each trigger t. We also preset the ranges of the numbers of resulting trigger and argument clusters: the minimal and maximal numbers of trigger clusters K_T^min and K_T^max, as well as the minimal and maximal numbers of argument clusters K_A^min and K_A^max. The algorithm outputs the optimal trigger clusters C^T = {C_1^T, ..., C_{K_T}^T} and argument clusters C^A = {C_1^A, ..., C_{K_A}^A}.

B.2 Similarity Functions
The clustering algorithm requires defining trigger-trigger similarities and argument-argument similarities. Huang et al. (2016) first define a constraint function f over context tuple sets L_1 and L_2: when P_1 and P_2 are two triggers, L_i has tuple elements (P_i, r, id(a)), which means the argument a has a relation r to trigger P_i, and id(a) is the cluster ID of the argument a. When P_1 and P_2 are arguments, the elements of L_i change to the corresponding triggers and semantic relations accordingly.
Hence the similarity functions are defined as:
$$\mathrm{sim}(t_1, t_2) = \mathrm{sim}_{\cos}(\mathbf{E}_g^{t_1}, \mathbf{E}_g^{t_2}) + \lambda \sum_{r \in R_{t_1} \cap R_{t_2}} \mathrm{sim}_{\cos}(\mathbf{E}_r^{t_1}, \mathbf{E}_r^{t_2}) + f(t_1, t_2),$$
$$\mathrm{sim}(a_1, a_2) = \mathrm{sim}_{\cos}(\mathbf{E}_g^{a_1}, \mathbf{E}_g^{a_2}) + f(a_1, a_2),$$
where $\mathbf{E}_g^t$ and $\mathbf{E}_g^a$ are the trigger and argument semantic representations, respectively; $R_t$ is the set of AMR relations in the parsed AMR graph of trigger $t$; $\mathbf{E}_r^t$ denotes the event structure representation of the node that has the semantic relation $r$ to trigger $t$ in the event structure; $\lambda$ is a hyperparameter; and $\mathrm{sim}_{\cos}(\cdot, \cdot)$ is the cosine similarity. Huang et al. (2016) also defines an objective function $O(\cdot, \cdot)$ to evaluate the quality of trigger clusters $\mathcal{C}_T = \{C_1^T, \ldots, C_{K_T}^T\}$ and argument clusters $\mathcal{C}_A = \{C_1^A, \ldots, C_{K_A}^A\}$.
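Under these definitions, the similarities can be computed as in the following NumPy-based sketch; the dictionary layout for the representations `E_g`, `E_r`, and the relation sets `R` is an illustrative assumption, not the paper's data format.

```python
import numpy as np

def sim_cos(x, y):
    """Cosine similarity between two representation vectors."""
    return float(np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y)))

def trigger_similarity(E_g, E_r, R, f, t1, t2, lam=0.5):
    """Trigger-trigger similarity: semantic term + structure term + constraint.

    E_g[t]    -- semantic representation of trigger t
    E_r[t][r] -- structure representation of the node related to t by relation r
    R[t]      -- set of AMR relations attached to trigger t
    f         -- the constraint function f(t1, t2)
    lam       -- the hyperparameter weighting the structure term
    """
    shared = R[t1] & R[t2]
    structure = sum(sim_cos(E_r[t1][r], E_r[t2][r]) for r in shared)
    return sim_cos(E_g[t1], E_g[t2]) + lam * structure + f(t1, t2)

def argument_similarity(E_g, f, a1, a2):
    """Argument-argument similarity: semantic term + constraint only."""
    return sim_cos(E_g[a1], E_g[a2]) + f(a1, a2)
```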

B.3 Objective
The objective function $O(\mathcal{C}_T, \mathcal{C}_A)$ combines $D_{\mathrm{inter}}(\cdot)$, which measures the agreement across clusters, and $D_{\mathrm{intra}}(\cdot)$, which measures the disagreement within clusters. The clustering algorithm iteratively minimizes this objective.

B.4 Overall Pipeline
This algorithm updates its clustering results iteratively. It first uses the Spectral Clustering algorithm (Von Luxburg, 2007) to obtain initial clusters. In each iteration, it updates the clustering results based on the previous ones and records the best objective value so far. Finally, it selects the clusters with the minimum value of $O$ as the final result. The overall pipeline is shown in Algorithm 1.
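A minimal sketch of this loop is given below, assuming hypothetical `spectral_init`, `update_clusters`, and `objective` callables (the pre-set cluster-number ranges are assumed to constrain `update_clusters` internally); it is not the released implementation.

```python
def joint_constraint_clustering(spectral_init, update_clusters, objective,
                                max_iters=20):
    """Iteratively refine clusters, keeping the assignment with minimum O."""
    C_T, C_A = spectral_init()          # initial clusters via Spectral Clustering
    best_obj = objective(C_T, C_A)
    best = (C_T, C_A)
    for _ in range(max_iters):
        # update the clustering results using the previous results
        C_T, C_A = update_clusters(C_T, C_A)
        obj = objective(C_T, C_A)
        if obj < best_obj:              # the objective O is minimized
            best_obj, best = obj, (C_T, C_A)
    return best
```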

C Subgraph Sampling
In the AMR subgraph discrimination task of event structure pre-training, we need to sample subgraphs from the parsed AMR graphs for contrastive pre-training. Here we adopt the subgraph sampling strategy introduced by Qiu et al. (2020), which consists of random walk with restart (RWR), subgraph induction, and anonymization:
• Random walk with restart first randomly chooses a starting node (the ego) from the AMR graph to be sampled from. The ego must be a root node, i.e., no directed edge in the AMR graph points to it. We then treat the AMR graph as an undirected graph and perform a random walk starting from the ego. At each step, the walk returns to the ego and restarts with a certain probability. The RWR ends when all the neighbouring nodes of the current node have been visited.
• Subgraph induction takes the induced subgraph of the node set obtained by RWR as the sampled subgraph.
• Anonymization randomly shuffles the indices of the nodes in the sampled subgraph to avoid overfitting to particular node representations.
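The three steps above can be sketched with `networkx` (an assumed dependency here; the restart probability is an illustrative value, not the one used in our experiments):

```python
import random
import networkx as nx  # assumed available; any graph library would do

def sample_subgraph(g, restart_prob=0.5, seed=None):
    """Sample one subgraph via RWR, subgraph induction, and anonymization."""
    rng = random.Random(seed)
    und = g.to_undirected()
    # 1. RWR: start from a root node (no incoming edge in the directed graph)
    roots = [n for n in g.nodes if g.in_degree(n) == 0]
    ego = rng.choice(roots)
    visited, cur = {ego}, ego
    while True:
        nbrs = list(und.neighbors(cur))
        if all(n in visited for n in nbrs):
            break  # stop once every neighbour of the current node was visited
        if rng.random() < restart_prob:
            cur = ego  # return to the ego and restart
        else:
            cur = rng.choice(nbrs)
            visited.add(cur)
    # 2. Induction: take the subgraph induced by the visited node set
    sub = g.subgraph(visited).copy()
    # 3. Anonymization: shuffle node indices to avoid memorizing identities
    new_ids = list(range(sub.number_of_nodes()))
    rng.shuffle(new_ids)
    return nx.relabel_nodes(sub, dict(zip(sub.nodes, new_ids)))
```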
In our event structure pre-training, we take subgraphs of the same sentence (AMR graph) as positive pairs. Ideally, however, the two subgraphs in a positive pair should come from the same event rather than merely the same sentence. Unfortunately, it is hard to determine without supervision which parts of an AMR graph belong to the same event; we believe this task is almost as hard as event extraction itself. The rule used in the event semantic pre-training only handles the ARG, time, and location relations, and for the roughly 100 other AMR relations, we cannot find an effective method to determine which event their edges belong to. Hence, to take advantage of all the structure information, we adopt the simple assumption that subgraphs from the same sentence express the same event (or at least closely related events) when designing the subgraph sampling here. We will explore more sophisticated subgraph-sampling strategies in future work.

D.1 Pre-training Hyperparameters
During pre-training, we manually tune the hyperparameters and select models by their losses on a held-out validation set of 1,000 sentences. The event structure pre-training hyperparameters mainly follow the E2E model of Qiu et al. (2020). Table 8 and Table 9 show the best-performing hyperparameters used in the experiments for the event semantic pre-training and the event structure pre-training, respectively.

D.2 Fine-tuning Hyperparameters
CLEVE in the unsupervised "liberal" setting directly uses the pre-trained representations and hence does not have additional hyperparameters.
For the fine-tuning in the supervised setting, we manually tune the hyperparameters over 10 trials. In each trial, we train the models for 30 epochs and select models by their F1 scores on the validation set. Table 10 shows the best fine-tuning hyperparameters for the CLEVE models and RoBERTa. For the other baselines, we take their reported results.

E Training Details
For reproducibility and estimating energy and carbon costs, we report the computing infrastructures and average runtime of experiments as well as validation performance.

E.1 Pre-training Details
For pre-training, we use 8 RTX 2080 Ti cards. The event semantic pre-training takes 12.3 hours. The event structure pre-training takes 60.2 hours.

E.2 Fine-tuning/Inference Details
During the fine-tuning in the supervised setting and the inference in the unsupervised "liberal" setting, we also use 8 RTX 2080 Ti cards. For the supervised EE experiments, Table 11 and Table 12 show the runtime and the validation-set results of the models implemented by us.
In the unsupervised "liberal" setting, we only perform inference and do not involve validation. We report the runtime of our models in Table 13.