Few-shot Event Detection: An Empirical Study and a Unified View

Few-shot event detection (ED) has been widely studied, but this has brought noticeable discrepancies, e.g., various motivations, tasks, and experimental settings, that hinder the understanding of models and future progress. This paper presents a thorough empirical study, a unified view of ED models, and a better unified baseline. For fair evaluation, we compare 12 representative methods on three datasets, which are roughly grouped into prompt-based and prototype-based models for detailed analysis. Experiments consistently demonstrate that prompt-based methods, including ChatGPT, still significantly trail prototype-based methods in terms of overall performance. To investigate their superior performance, we break down their design elements along several dimensions and build a unified framework on prototype-based methods. Under this unified view, each prototype-based method can be viewed as a combination of different modules from these design elements. We further combine all advantageous modules and propose a simple yet effective baseline, which outperforms existing methods by a large margin (e.g., a 2.7% F1 gain under the low-resource setting).


Introduction
Event Detection (ED) is the task of identifying event triggers and types in texts. For example, given "Cash-strapped Vivendi wants to sell Universal Studios", the task is to classify the word "sell" into a TransferOwnership event. ED is a fundamental step in various tasks such as event-centric information extraction (Huang et al., 2022; Ma et al., 2022b; Chen et al., 2022), knowledge systems (Li et al., 2020; Wen et al., 2021), story generation (Li et al., 2022a), etc. However, the annotation of event instances is costly and labor-intensive, which motivates research on improving ED with limited labeled samples, i.e., the few-shot ED task.
Extensive studies have been carried out on few-shot ED. Nevertheless, there are noticeable discrepancies among existing methods in three aspects.
Our code will be publicly available at https://github.com/mayubo2333/fewshot_ED.

(1) Motivation (Figure 1): Some methods focus on the model's generalization ability, i.e., learning to classify with only a few samples (Li et al., 2022b). Other methods improve transferability by introducing additional data: they adapt a model trained on a preexisting schema to a new schema using a few samples (Lu et al., 2021). There are also methods considering both abilities (Liu et al., 2020; Hsu et al., 2022).
(2) Task setting: Even when focusing on the same ability, methods may adopt different task settings for training and evaluation. For example, there are at least three settings for transferability: episode learning (EL; Deng et al. 2020; Cong et al. 2021), class-transfer (CT; Hsu et al. 2022) and task-transfer (TT; Lyu et al. 2021; Lu et al. 2022).
(3) Experimental setting: Even under the same task setting, experiments may vary in sample sources (e.g., a subset of a dataset, annotation guidelines, or an external corpus) and sample numbers (shot number or sample ratio). Table 1 provides a detailed comparison of representative methods.
In this paper, we argue the importance of a unified setting for a better understanding of few-shot ED. First, based on an exhaustive investigation of ED and similar tasks (e.g., NER), we conduct an empirical study of ten SOTA methods under two practical settings: a low-resource setting for generalization ability and a class-transfer setting for transferability. We roughly classify the ten methods into two groups: prototype-based models, which learn event-type representations and measure proximity for prediction, and prompt-based models, which convert ED into a task with which Pre-trained Language Models (PLMs) are more familiar.

Table 1: Noticeable discrepancies among existing few-shot ED methods. Explanations of task settings can be found in Section 2.1; they also reflect different motivations: LR for generalization; EL, CT, and TT for transfer abilities. Dataset indicates the datasets on which training and/or evaluation is conducted. Sample Number refers to the number of labeled samples used. Sample Source refers to where training samples come from. Guidelines: example sentences from annotation guidelines. Datasets: subsets of full datasets. Corpus: (unlabeled) external corpus.
The second contribution is a unified view of prototype-based methods, proposed to investigate their superior performance. Instead of merely picking the best-performing method, as in conventional empirical studies, we take one step further: we break down the design elements of these methods along several dimensions, e.g., the source of prototypes and the aggregation form of prototypes. Third, by analyzing each design element, we propose a simple yet effective unified baseline that combines all advantageous elements of existing methods. Experiments validate an average 2.7% F1 gain under the low-resource setting and the best performance under the class-transfer setting. Further analysis also provides valuable insights for future research.

Preliminary
Event detection (ED) is usually formulated as either a span classification task or a sequence labeling task, depending on whether candidate event spans are provided as inputs. We describe the sequence labeling paradigm here, because the two paradigms can easily be converted to each other.
Given a dataset D annotated with schema E (the set of event types) and a sentence X = [x_1, ..., x_N] ∈ D, where x_i is the i-th word and N is the length of the sentence, ED aims to assign a label y_i ∈ (E ∪ {N.A.}) to each x_i in X. We say that word x_i triggers an event of type y_i if y_i ∈ E.
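As a minimal illustration of this formulation, here is the example from the introduction rendered as a word-level label sequence (a sketch; the event-type name follows the ACE05-style schema):

sentence = ["Cash-strapped", "Vivendi", "wants", "to", "sell", "Universal", "Studios"]
labels   = ["N.A.", "N.A.", "N.A.", "N.A.", "TransferOwnership", "N.A.", "N.A."]

assert len(sentence) == len(labels)
for word, label in zip(sentence, labels):
    if label != "N.A.":
        print(f'"{word}" triggers a {label} event')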

Few-shot ED task settings
We categorize few-shot ED settings into four cases: low-resource (LR), class-transfer (CT), episode learning (EL) and task-transfer (TT). The low-resource setting assesses the generalization ability of few-shot ED methods, while the other three settings target transferability. We adopt LR and CT in our empirical study, as they reflect practical scenarios. More details can be found in Appendix A.1.

Low-resource setting assumes access to a dataset D = (D_train, D_dev, D_test) annotated with a label set E, where |D_dev| ≤ |D_train| ≪ |D_test|. It assesses the generalization ability of models by (1) utilizing only a few samples during training, and (2) evaluating on the real and rich test set.

Class-transfer setting assumes access to a source dataset D^(S) annotated with a preexisting schema E^(S) and a target dataset D^(T) annotated with a new schema E^(T). Note that E^(S) and E^(T) contain disjoint event types. D^(S) contains abundant training samples, while D^(T) is a low-resource dataset as described above. Models under this setting are expected to be pre-trained on D^(S) and then further trained and evaluated on D^(T).

Category of existing methods
We roughly group existing few-shot ED methods into two classes: prompt-based methods and prototype-based methods. More details are given in Appendix A.2.

Prompt-based methods leverage the rich language knowledge in PLMs by converting downstream tasks into tasks with which PLMs are more familiar. Such format conversion narrows the gap between pre-training and downstream tasks and benefits knowledge induction from PLMs with limited annotations. Specifically, few-shot ED can be converted to machine reading comprehension (MRC; Du and Cardie 2020; Liu et al. 2020; Feng et al. 2020), natural language inference (NLI; Lyu et al. 2021), conditional generation (CG; Paolini et al. 2021; Lu et al. 2021, 2022; Hsu et al. 2022), or a cloze task (Li et al., 2022b). We give examples of these prompts in Table 6.

Prototype-based methods predict an event type for each word/span mention by measuring the proximity of its representation to prototypes. Here we define prototypes in a generalized format: a prototype is anything that represents an event type. For example, Prototypical Network (ProtoNet; Snell et al. 2017) and its variants (Lai et al., 2020a,b; Deng et al., 2020, 2021; Cong et al., 2021; Lai et al., 2021) construct prototypes from a subset of sample mentions for few-shot ED. Besides event mentions, a line of work leverages related knowledge to learn prototype representations, including AMR graphs (Huang et al., 2018) and definitions (Shen et al., 2021).
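As a minimal illustration of such conversions, the snippet below builds MRC- and NLI-style inputs following the examples later shown in Table 6; the cloze template is an illustrative assumption, not any paper's exact wording:

sentence = "The current government was formed in October 2000."

mrc_input = f"{sentence} What is the trigger in the event?"              # MRC (EEQA-style, Table 6)
nli_input = f"Premise: {sentence} Hypothesis: This text is about a Start-Org event."  # NLI (EDTE-style, Table 6)
cloze_input = f"{sentence} The word 'formed' is a <mask> trigger."       # cloze (illustrative template)

print(mrc_input)
print(nli_input)
print(cloze_input)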
For comprehensiveness, we also include competitive methods from similar tasks, mainly Named Entity Recognition (NER) and Slot Tagging (ST), which are highly adaptable to ED. Such expansion enriches the categorization and enables us to build a unified view in Section 3. For instance, some methods leverage label semantics to enhance prototypes (Hou et al., 2020), and CONTAINER (Das et al., 2022) adopts contrastive learning (Hadsell et al., 2006) for NER. The latter also determines the event type by measuring the distances to other samples and aggregating these distances into an overall distance to each event type; we therefore view it as a generalized form of prototype-based method as well.

A Prototype-based Unified View
Due to their superior performance (Sections 5 and 6), we zoom into prototype-based methods to provide a unified view towards a better understanding. We observe that they share many similar components. As shown in Table 2 and Figure 2, we decompose prototype-based methods into five design elements: prototype source, transfer function, distance function, aggregation form, and CRF module. This unified view enables us to compare the choices for each design element directly. By aggregating the effective choices, we arrive at a unified baseline.

Formally, given an event mention x, prototype-based methods predict the likelihood p(y|x) from P-score(x, y) for each y ∈ (E ∪ {N.A.}). The general framework is as follows. Denote the PLM's output representations of the event mention x and of the data c_y in prototype source C_y as h_x and h_{c_y} respectively, where h ∈ R^m and m is the dimension of the PLM's hidden space. The first step is to convert h_x and h_{c_y} to appropriate representations via a transfer function f(·). Second, the methods maintain either a single or multiple prototypes for each event type, determined by the adopted aggregation form. Third, the distance between f(h_x) and f(h_{c_y}) (single prototype) or the f(h_{c_y})'s (multiple prototypes) is computed.

Table 2: Decomposing five prototype-based methods and the unified baseline along design elements. "Both" in column 1 means both event mentions and label names for y are prototype sources. JSD: Jensen-Shannon divergence. M: projection matrix in TapNet. N(μ(h), Σ(h)): Gaussian distribution with mean μ(h) and covariance matrix Σ(h).
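To make this recipe concrete, below is a minimal PyTorch sketch of the scoring pipeline, covering the transfer function, the distance function, and the feature-/score-level aggregation forms detailed next. The function names, the temperature τ = 0.2, and the toy dimensions are our own illustrative assumptions:

import torch
import torch.nn.functional as F

def transfer(h, mode="normalize", proj=None):
    """Transfer function f: map PLM outputs into the distance space."""
    if mode == "identity":
        return h
    if mode == "normalize":              # L2 normalization, used with scaled distances
        return F.normalize(h, dim=-1)
    if mode == "down_proj":              # TapNet-style projection; proj: (m, n)
        return h @ proj
    raise ValueError(f"unknown transfer: {mode}")

def distance(a, b, metric="scaled_cosine", tau=0.2):
    """Distance function d between transferred representations."""
    if metric == "scaled_euclidean":
        return torch.cdist(a, b) / tau
    if metric == "scaled_cosine":        # negative cosine similarity, scaled by tau
        return -(a @ b.T) / tau
    raise ValueError(f"unknown distance: {metric}")

def p_score(h_x, h_sources, aggregation="feature"):
    """P-score(x, y) for one event type y.
    h_x: (1, m) mention representation; h_sources: (|C_y|, m) prototype sources."""
    fx, fc = transfer(h_x), transfer(h_sources)
    if aggregation == "feature":         # merge sources into a single prototype first
        proto = F.normalize(fc.mean(dim=0, keepdim=True), dim=-1)
        return -distance(fx, proto).item()
    if aggregation == "score":           # distance to every source, then merge scores
        return -distance(fx, fc).mean().item()
    raise ValueError(f"unknown aggregation: {aggregation}")

# Toy usage: score a mention against an event type with three source mentions.
h_x, c_y = torch.randn(1, 768), torch.randn(3, 768)
print(p_score(h_x, c_y, "feature"), p_score(h_x, c_y, "score"))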
Next, we detail the five design elements.

Prototype source C_y (purple circles in Figure 2, same below) indicates the source of data/information used to construct the prototypes. There are mainly two types of sources: (1) Event mentions (purple circles without words): ProtoNet and its variants in Figure 2(b),(c),(d) additionally split a support set S_y from the training data as the prototype source, while the contrastive learning method in Figure 2(a) views every annotated mention (except the query one) as a source.
(2) Label semantics (purple ellipses with words): sometimes, the label name l_y is utilized as a source to enhance or directly construct the prototypes. For example, FSLS in Figure 2(e) views the text representations of type names as prototypes, while L-TapNet-CDT in Figure 2(c) utilizes both kinds of prototype sources.

Transfer function f: R^m → R^n (yellow modules) transfers PLM outputs into the distance space for prototype proximity measurement. Widely used transfer functions include normalization in Figure 2(b), down-projection in Figure 2(c), reparameterization in Figure 2(a), and the identity function.

Distance function d: R^n × R^n → R^+ (green modules) measures the distance between two transferred representations within the same embedding space. Common distance functions are euclidean distance in Figure 2(d) and negative cosine similarity in Figure 2(b),(c),(e).
Aggregation form (blue modules) describes how to compute P-score(x, y) from single or multiple prototype sources. Aggregation may happen at three levels.
(1) feature-level: ProtoNet and its variants in Figure 2(b),(c),(d) construct a single prototype h̄_{c_y} for each event type y by merging various features, which simplifies the calculation to P-score(x, y) = −d(f(h_x), f(h̄_{c_y})).
(2) score-level: CONTAINER in Figure 2(a) views each datum as a prototype (so each type y has multiple prototypes) and computes the distance d(f(h_x), f(h_{c_y})) for each c_y ∈ C_y. These distances are then merged to obtain P-score(x, y).
(3) loss-level: this form maintains multiple parallel branches b for each mention x. Each branch has its own P-score^(b)(x, y) and is optimized with its own loss component during training, so it can be viewed as a form of multi-task learning; see the unified baseline in Figure 2(f).

CRF module (orange modules) adjusts predictions within the same sentence by explicitly considering the label dependencies between sequential inputs, as in the vanilla CRF (Lafferty et al., 2001) and its variants in Figure 2.

Experimental Settings

Dataset source. We utilize ACE05 (Doddington et al., 2004), MAVEN (Wang et al., 2020) and ERE (Song et al., 2015) to construct the few-shot ED datasets in this empirical study. Detailed statistics of the three datasets are in Appendix B.1.

Low-resource setting. We adopt a K-shot sampling strategy to construct the few-shot datasets for the low-resource setting, i.e., we sample K_train and K_dev samples per event type to construct the train and dev sets, respectively. We use three (K_train, K_dev) configurations in our evaluation: (2, 1), (5, 2) and (10, 2). Following Yang and Katiyar (2020), we take a greedy sampling algorithm to approximately select K samples for each event type; see Appendix B.2 for details and the statistics of the sampled few-shot datasets. We inherit the original test set as D_test.

Class-transfer setting. The few-shot datasets are curated in two sub-steps: (1) dividing both the event types and the sentences of the original dataset into two disjoint parts, named the source dataset and the target dataset pool, respectively; (2) sampling few-shot samples from the target dataset pool to construct the target dataset, using the same sampling algorithm as in the low-resource setting. We then have the source dataset and the sampled target dataset. See Appendix B.2 for details and the statistics of the sampled few-shot datasets.

Evaluation metric. We use the micro-F1 score as the evaluation metric. To reduce random fluctuation, the reported value for each setting is the average score, with sample standard deviation, over 10 sampled few-shot datasets.

Implementation details. We unify the PLMs across methods as much as possible for a fair comparison. Specifically, we use RoBERTa-base (Liu et al., 2019) for all prototype-based methods and the three non-generation prompt-based methods.
However, we keep each method's original PLM for the two prompt-based methods with generation prompts, UIE (T5-base; Raffel et al. 2020) and DEGREE (BART-large; Lewis et al. 2020), since we observe that their performance collapses with smaller PLMs. See more details in Appendix B.4.

Evaluated methods
We evaluate 10 representative methods, including 5 prompt-based and 5 prototype-based methods, detailed in Appendix B.3.

Fine-tuning. To validate the effectiveness of few-shot methods, we also fine-tune a supervised classifier as a trivial baseline for comparison.

Perez et al. (2021) oppose introducing an additional dev set for few-shot learning. We agree with their opinion but choose to keep a very small dev set, mainly for feasibility: given the number of experiments in our empirical study, it is infeasible to conduct cross-validation on every single train set for hyperparameter search.

Results: Low-resource Learning

Overall comparison
We first overview the results of the 10 methods under the low-resource setting in Table 3.

Fine-tuning. Despite its simplicity, fine-tuning achieves acceptable performance. In particular, it is even comparable to the strongest existing methods on the MAVEN dataset (only 1.1% and 0.5% lower under the 5-shot and 10-shot settings). One possible reason is that MAVEN has 168 event types, far more than the other datasets: when the absolute number of samples is relatively large, PLMs may capture implicit interactions among different event types, even though the samples per event type are limited. When samples are scarce, however, fine-tuning is much poorer than existing competitive methods (see ACE05). This validates the necessity of, and the progress made by, existing few-shot methods.

Prompt-based methods. Prompt-based methods deliver much poorer results than expected, even compared to fine-tuning, especially when samples are extremely scarce. This shows that designing effective prompts for ED with very limited annotations is still challenging. We speculate that this is due to the natural gap between ED (sequence labeling or span extraction) and the pre-training tasks of PLMs (sentence classification or generation).
Among prompt-based methods, PTE and DEGREE achieve relatively robust performance under all settings. DEGREE is advantageous when the sample size is small, but it cannot handle a dataset with many event types, like MAVEN, well. Note that DEGREE enumerates event types to query their potential triggers, so both efficiency and effectiveness drop as the number of event types increases. When sample sizes are relatively large, EEQA shows competitive performance as well.

Prototype-based methods
Since prototype-based methods achieve overall better results, we zoom into their design elements to search for effective choices based on the unified view.

Transfer function, distance function, and CRF. We compare combinations of transfer and distance functions, and four variants of CRF modules, in Appendices C.1 and C.2. We make two findings: (1) a scaled coefficient in the distance function achieves better performance together with the normalization transfer function;
(2) there is no significant difference between models with or without CRF modules. Based on these findings, we observe a significant improvement in five existing methods by simply substituting more appropriate choices of d and f; see Figure 3 and Appendix C.1. We use these new transfer and distance functions in all further analysis and discussion.

Prototype source. We explore whether label semantics and event mentions are complementary prototype sources, i.e., whether utilizing both achieves better performance than either one.
We choose ProtoNet and FSLS as base models, each of which contains only a single kind of prototype source (mentions or labels). We then combine the two models using the three aggregation forms from Section 3 and show the results in Figure 4. We observe that: (1) leveraging label semantics and mentions as prototype sources simultaneously improves performance under almost all settings, and (2) merging the two kinds of sources at the loss level is the best of the three aggregation alternatives.

Contrastive or prototypical learning. Next, we investigate the effectiveness of contrastive learning (CL; see CONTAINER) and prototypical learning (PL; see ProtoNet and its variants) for event mentions. We compare three label-enhanced methods (since we have already validated the benefits of label semantics) that aggregate event mentions in different ways: Ll-ProtoNet, label-enhanced in-batch CL, and label-enhanced CL under the
MoCo setting (He et al., 2020). The in-batch CL and MoCo CL settings are detailed in Appendix C.4.

Figure 5: Results of (label-enhanced) PL and CL methods on ACE05 and MAVEN few-shot datasets. See full results on three datasets in Table 10.

Figure 5 suggests that CL-based methods outperform Ll-ProtoNet. There are two possible reasons: (1) CL has higher sample efficiency, since every two samples interact during training, whereas PL further splits samples into support and query sets, and samples within the same set do not interact with each other. (2) CL adopts score-level aggregation while PL adopts feature-level aggregation, and the former also slightly outperforms the latter in Figure 4. We also observe that MoCo CL usually performs better than in-batch CL when there are complicated event types (see MAVEN) or when the sample number is relatively large (see ACE05 10-shot). We provide a more detailed explanation in Appendix C.4.

The unified baseline
We summarize our findings as follows: (1) scaled euclidean distance or scaled cosine similarity as the distance measure, combined with a normalization transfer function, benefits existing methods.
(2) CRF modules show no improvement in performance. (3) Label semantics and event mentions are complementary prototype sources, and aggregating them at the loss level is the best choice. (4) For the event-mention branch, CL is more advantageous than PL for few-shot ED. (5) MoCo CL performs better when there is a good number of sentences; otherwise, in-batch CL is better.
Based on these findings, we develop a simple but effective unified baseline as follows. We utilize both label semantics and event mentions as prototype sources. We aggregate the two types of sources at the loss level, merge multiple event mentions at the score level, and adopt CL. Specifically, we assign two branches, with their own losses, to label semantics and event mentions respectively: L = L_label + L_mention. For the label-semantics branch, we follow the design of FSLS. For the event-mention branch, we inherit CONTAINER with a minor change: if the total number of sentences in the train set is larger than 128, we take MoCo CL rather than in-batch CL. Both branches adopt scaled cosine similarity as the distance function and normalization as the transfer function, and we do not add any CRF module. The unified baseline is illustrated in Figure 2(f) and sketched below, and its performance is shown in Table 3. The unified baseline clearly outperforms all existing methods significantly under every low-resource setting, on all datasets.
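Here is a minimal sketch of the two-branch loss, assuming pre-computed mention and label-name representations; the real model encodes both with RoBERTa, and the MoCo variant of the mention branch (with its key queue) is omitted for brevity. All function names are ours:

import torch
import torch.nn.functional as F

def label_branch_loss(h_x, label_emb, y, tau=0.2):
    """FSLS-style branch: encoded event-type names serve as prototypes.
    h_x: (B, m) mention reps; label_emb: (|E|+1, m) label-name reps; y: (B,)."""
    logits = F.normalize(h_x, dim=-1) @ F.normalize(label_emb, dim=-1).T / tau
    return F.cross_entropy(logits, y)

def mention_branch_loss(h_x, y, tau=0.2):
    """CONTAINER-style in-batch CL branch: every other mention with the same
    label is a positive prototype; distances are merged at the score level."""
    z = F.normalize(h_x, dim=-1)
    sim = z @ z.T / tau
    self_mask = torch.eye(len(y), dtype=torch.bool)
    log_prob = sim - torch.logsumexp(sim.masked_fill(self_mask, -1e9), dim=1, keepdim=True)
    pos = ((y[:, None] == y[None, :]) & ~self_mask).float()
    # average log-likelihood over each mention's positives
    return -(log_prob * pos).sum(1).div(pos.sum(1).clamp(min=1)).mean()

def unified_baseline_loss(h_x, label_emb, y):
    """Loss-level aggregation of the two branches: L = L_label + L_mention."""
    return label_branch_loss(h_x, label_emb, y) + mention_branch_loss(h_x, y)

# Toy usage with random representations: 8 mentions, 4 event types plus N.A.
h_x, label_emb = torch.randn(8, 768), torch.randn(5, 768)
y = torch.randint(0, 5, (8,))
print(unified_baseline_loss(h_x, label_emb, y))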

Results: Class-transfer Learning
In this section, we evaluate existing methods and the unified baseline under the class-transfer setting.

Figure 6: Class-transfer results of prompt-based methods. We plot fine-tuning (red dashed lines) and the best and second-best prototype-based methods (black solid/dashed lines) for comparison. See full results in Table 11.

Prompt-based methods
We first focus on the 4 existing prompt-based methods and explore whether they can smoothly transfer event knowledge from a preexisting (source) schema to a new (target) schema. We show the results in Figure 6 and Appendix D.1. Our findings are summarized as follows.
(1) Transferring knowledge from source event types to target event types facilitates model prediction in most scenarios. This verifies that an appropriate prompt usually helps induce the knowledge learned in PLMs.
(2) However, such improvement gradually fades as the number of samples from either the source or the target schema increases. For example, the 5-shot and 10-shot performance of PTE and UIE are highly comparable. We speculate that these prompts act more like a catalyst: they mainly teach the model how to induce knowledge from the PLM itself rather than to learn new knowledge from the samples. Thus performance reaches a standstill once the sample number exceeds some threshold.
(3) Overall, the performance of prompt-based methods remains inferior to that of prototype-based methods in the class-transfer setting (see the black lines in Figure 6).

Prototype-based methods
We further explore the transfer ability of existing prototype-based methods and the unified baseline. Thanks to the unified view, we conduct a more thorough experiment that enumerates all possible combinations of models used in the source and target domains, to assess whether generalization ability affects transferability; the PLM parameters are shared from the source model to the target model. We show the results in Figure 7 and Appendix D.2.
1. Is transfer learning effective for prototype-based methods? It depends on the dataset (compare the first row with the other rows in each column). For the ACE05 and MAVEN datasets, the overall answer is yes. Contrary to our expectation, however, transfer learning affects most target models negatively on the ERE dataset, especially under the 2- and 5-shot settings.
2. Do prototype-based methods perform better than simple fine-tuning? It depends on whether fine-tuning is used as the source or the target model. As a source model (row 2), fine-tuning sometimes achieves performance comparable to or even better than the prototype-based methods (last 4 rows). As a target model (column 1), however, its performance drops significantly. We thus speculate that powerful prototype-based methods are more necessary in the target domain than in the source domain.

3. Is the choice of prototype-based methods important? Yes. Inappropriate prototype-based methods can achieve worse performance than simple fine-tuning, and sometimes even worse than models without class transfer. For example, CONTAINER and L-TapNet are inappropriate source models for the ACE05 dataset.

4. Does using the same source and target model benefit event-related knowledge transfer? No. The figures show that the best model combinations often deviate from the diagonals, indicating that different source and target models sometimes achieve better results.

5. Is there a source-target combination that performs well in all settings? Strictly speaking, no. Nevertheless, we find that adopting FSLS as the source model and our unified baseline as the target model is the most likely to achieve competitive (best or second-best) performance among all alternatives. This indicates that (1) the quality of different combinations shows certain tendencies, though no consistent conclusion can be drawn, and
(2) a model with a moderate inductive bias (like FSLS) may be better for the source dataset with abundant samples, while our unified baseline plays its role during the target stage with limited samples.

Conclusion
We have conducted a comprehensive empirical study comparing ten representative methods under unified low-resource and class-transfer settings. For systematic analysis, we proposed a unified framework of the promising prototype-based methods. Based on it, we presented a simple and effective baseline that significantly outperforms all existing methods under the low-resource setting and is an ideal choice as the target model under the class-transfer setting. In the future, we aim to explore how to leverage unlabeled corpora for few-shot ED, e.g., via data augmentation, weakly-supervised learning, and self-training.

Limitations
We compare ten representative methods, present a unified view of existing prototype-based methods, and propose a competitive unified baseline that combines the advantageous modules of these methods. We test all methods, including the unified baseline, on three commonly-used English datasets under various experimental settings and obtain consistent results. However, we acknowledge the potential disproportionality of our experiments in terms of language, domain, schema type, and degree of data scarcity. For future work, we therefore aim to conduct our empirical studies on more diverse event detection (ED) datasets.
We are fortunate to witness the rapid development of Large Language Models (LLMs; Brown et al. 2020b; Ouyang et al. 2022; Chung et al. 2022) in recent times, especially after the completion and submission of our work (in October 2022, ACL Rolling Review). In our comparison between prompt-based and prototype-based methods, we did not incorporate LLMs. However, we believe this limitation is unlikely to have a significant impact on our findings and conclusions at present, as a series of recent works have shown that LLMs, including ChatGPT, currently face challenges in dealing with Information Extraction (IE) tasks that require structured outputs (Josifoski et al., 2023). Moreover, their performance on ED and other IE tasks remains far from that of supervised small language models (SLMs) (Qin et al., 2023; Gao et al., 2023; Ma et al., 2023; Zhan et al., 2023). We plan to explore ways to combine the strengths of LLMs and SLMs to improve few-shot ED in the near future.
In this work, we focus more on the model aspect of few-shot ED than on the data aspect. We compare all methods under a unified scenario without external knowledge or unlabeled data. In the future, we plan to explore how to utilize annotation guidelines, unlabeled corpora, and LLMs to improve few-shot ED.

A.1 Few-shot ED task settings

1. Low-resource setting. Many methods (Du and Cardie, 2020; Liu et al., 2020; Feng et al., 2020; Paolini et al., 2021; Lu et al., 2021; Hsu et al., 2022; Li et al., 2022b) adopt the low-resource setting to train and evaluate their models.

2. Class-transfer setting has been applied since an early stage (Bronstein et al., 2015; Peng et al., 2016; Zhang et al., 2021), and is often used together with the low-resource setting to additionally evaluate the transferability of models (Paolini et al., 2021; Lu et al., 2021; Hsu et al., 2022).

3. Episode learning setting is a classical setting in few-shot Computer Vision (CV) tasks and has been adapted to NLP tasks. It has two phases, meta-training and meta-testing, each of which consists of multiple episodes. Each episode is a few-shot problem with its own train (support) and test (query) sets and its own event-type classes. Since the sets in each episode are sampled uniformly to contain N different classes with K instances each, episode learning is also known as N-way-K-shot classification.
Many existing few-shot ED methods adopt this setting (Lai et al., 2020a,b; Deng et al., 2020; Cong et al., 2021; Lai et al., 2021; Chen et al., 2021). However, we argue that episode learning assumes an unrealistic scenario. First, a large number of episodes is needed during the meta-training stage, for example, 20,000 in Cong et al. (2021); although the label sets of the meta-training and meta-testing stages are disjoint, the class-transfer setting is more reasonable when many samples are available in another schema. Second, tasks under episode learning are evaluated by performance on the test (query) sets of the meta-testing phase. These test sets are sampled uniformly, leading to a significant discrepancy with the true data distribution of many NLP tasks, and the absence of sentences without any events further distorts the distribution. Furthermore, each episode contains samples from only N different classes, where N is usually much smaller than the number of event types in the target schema. All these factors may lead to an overestimation of the ability of few-shot learning systems. For these reasons, we do not consider this setting in our experiments.

4. Task-transfer setting is very similar to class transfer. The main difference is that it relaxes the constraint in the source phase from the same task with a different schema to different tasks. The development of this setting also relies heavily on the success of PLMs. Liu et al. (2020) recently construct unified generation frameworks over multiple IE tasks, and their experiments reveal that pre-training on these tasks benefits few-shot ED. Though the task-transfer setting is reasonable and promising, we do not include it because of its extreme diversity and complexity: there are (1) too many candidate pre-training tasks, and (2) too many optional datasets for each pre-training task. Thus it is almost infeasible to conduct a comprehensive empirical study under the task-transfer setting.

A.2 Taxonomy of methods
We categorize existing methods into two main classes, prompt-based methods and prototype-based methods, and list them in Table 1. Here we give a detailed introduction to existing methods.
Note that our empirical study also includes some methods that were originally developed for similar few-shot tasks but can be easily adapted to ED; we leave a separate subsection for them.

Few-shot ED methods. Due to the prohibitive cost of labeling large numbers of event mentions, few-shot ED is a long-standing topic in the event-related research community. The proposed solutions mainly fall into two branches. The first branch, prototype-based methods, is a classical approach to few-shot learning. It defines a single prototype or multiple prototypes for each event type, representing label-wise properties, and then learns the embedding representation of each sample by shortening its distance to the corresponding prototypes under a given distance/similarity metric. Bronstein et al. (2015) is an early example of this branch.

The other branch, prompting methods, is made possible by the rapid development of PLMs. Given a specific task, prompting methods map the task format to a new format with which PLMs are more familiar, such as masked word prediction (Schick and Schütze, 2021) or sequence generation (Raffel et al., 2020; Brown et al., 2020a). Such format conversion narrows the gap between pre-training and downstream tasks, which is beneficial for inducing learned knowledge from PLMs with limited annotations. For event detection (and many other IE tasks), however, designing a smooth format conversion is not trivial. One simple idea is to leverage a single template that prompts both event types and their triggers simultaneously (Paolini et al., 2021; Lu et al., 2021). However, such prompting methods show performance far from satisfactory, especially when they are not enhanced by two-stage pre-training and redundant hinting prefixes (Lu et al., 2022). Another natural idea is to enumerate all legal spans and query the PLM whether each span belongs to a given class, or vice versa (Hsu et al., 2022); a major limitation here is the prohibitive time complexity, particularly when there are many event types. Combining the merits of prompting methods and conventional fine-tuning is another solution. Du and Cardie (2020) convert ED into a QA/MRC task with natural-language queries about triggers. Lyu et al. (2021) first segment a sentence into several clauses and view the predicates of the clauses as trigger candidates; they then use the NLI format to query the event types of these candidates. Recently, Li et al. (2022b) propose a strategy combining Pattern-Exploiting Training (PET; Schick and Schütze 2021) with a CRF module: they first conduct sentence-level event detection to determine whether a sentence contains any event types, and for each identified event type, they use a linear-chain CRF to locate the trigger word.
Few-shot NER/ST methods. Several models were originally designed for similar tasks such as Named Entity Recognition (NER) and Slot Tagging (ST) but can be applied to ED. As in ED, one classical paradigm in NER is to utilize ProtoNet (Snell et al., 2017) and its variants to learn one representative prototype for each class with only a few examples.

B Datasets and Models
We curate the few-shot datasets used in this empirical study from three full, commonly-used datasets:

B.1 Full dataset
ACE05 is a joint information extraction dataset with annotations of entities, relations, and events; we only use its event annotations for ED. It contains 599 English documents and 33 event types in total. We split the ACE05 documents following previous work (Li et al., 2013) to construct the train and test sets. MAVEN is a newly-built large-scale ED dataset with 4,480 documents and 168 event types; we use its official split. ERE is another joint information extraction dataset of a similar scale to ACE05 (458 documents, 38 event types); we follow the preprocessing procedure in Lin et al. (2020). Table 4 reports detailed statistics of the three datasets.
ED can be viewed as either a span classification task or a sequence labeling task. In our work, we adopt the span classification paradigm for MAVEN, since it provides official spans for candidate triggers (including negative samples). For the other two datasets, we follow the sequence labeling paradigm and predict the event type word by word.
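To illustrate the conversion between the two paradigms, here is a hedged sketch that maps between word-level labels and typed spans, assuming contiguous words with the same type form a single trigger span (no BIO tagging; the helper names are ours):

def labels_to_spans(labels, na="N.A."):
    """Collapse a word-level label sequence into (start, end, type) spans."""
    spans, i = [], 0
    while i < len(labels):
        if labels[i] != na:
            j = i
            while j + 1 < len(labels) and labels[j + 1] == labels[i]:
                j += 1
            spans.append((i, j, labels[i]))
            i = j + 1
        else:
            i += 1
    return spans

def spans_to_labels(spans, length, na="N.A."):
    """Inverse direction: expand typed spans back into word-level labels."""
    labels = [na] * length
    for start, end, etype in spans:
        for k in range(start, end + 1):
            labels[k] = etype
    return labels

print(labels_to_spans(["N.A.", "TransferOwnership", "N.A."]))  # [(1, 1, 'TransferOwnership')]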

B.2 Dataset construction
This section introduces how we construct the few-shot datasets from the three full ED datasets.

Low-resource setting. We downsample sentences from the original full training set to construct D_train and D_dev, and inherit the original test set as the unified D_test. For D_train and D_dev, we adopt a K-shot sampling strategy in which each event type has (at least) K samples. Since our sampling is at sentence level and each sentence can contain multiple events, the sampling problem is NP-complete, and it is unlikely that a practical solution satisfying exactly K samples per event type exists. Therefore, we follow Yang and Katiyar (2020) and Ma et al. (2022a) and adopt a greedy sampling algorithm to select sentences, as shown in Alg. 1. Note that under this strategy the actual number of samples for an event type can be larger than K. The statistics of the curated datasets are listed in Table 5 (top).

Algorithm 1 Greedy Sampling
Require: shot number K; original full dataset D = {(X, Y)} tagged with label set E
1: Sort E by frequency in {Y} in ascending order
2: S ← ∅, Counter ← dict()
3: for y ∈ E do
4:    Counter(y) ← 0
5: end for
6: for y ∈ E do
7:    while Counter(y) < K do
(The remaining steps follow the greedy strategy of Yang and Katiyar (2020); a runnable reconstruction is given at the end of this subsection.)

Class-transfer setting. This setting has a more complicated curation process, roughly consisting of two sub-steps: (1) dividing both the event types and the sentences of the original dataset into two disjoint parts, named the source dataset and the target dataset pool.
(2) Using the entire source dataset, and selecting few-shot samples from the target pool to construct the target set. For step (1), we follow Huang et al. (2018) and Chen et al. (2021) and pick the most frequent 10, 120, and 10 event types from the ACE05, MAVEN, and ERE datasets, respectively, as E^(S); the remaining types form E^(T). We then take every sentence containing any annotation in E^(T) into D^(T)_full, to enrich the sampling pool of the target dataset as much as possible, applying the relabeling operation R(Y; E^(S)), which substitutes any y_j ∈ E^(S) with N.A. to avoid information leakage. The remaining sentences are collected as D^(S).
For step (2), we adopt the same strategy as in the low-resource setting to sample the K-shot target train and dev sets. The statistics of the curated datasets are listed in Table 5 (bottom).
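For concreteness, here is a runnable Python sketch of Algorithm 1. The loop body, which the algorithm above leaves implicit, follows Yang and Katiyar (2020): repeatedly add a sentence containing the current type and update the counts of all event types it contains. The data format (a list of (sentence, labels) pairs) is an illustrative assumption:

from collections import Counter

def greedy_sample(dataset, event_types, K):
    """Greedy K-shot sampling over sentences (sketch of Alg. 1)."""
    def freq(y):
        return sum(lab == y for _, labels in dataset for lab in labels)

    sampled, counter = [], Counter()
    remaining = list(dataset)
    for y in sorted(event_types, key=freq):          # ascending frequency (step 1)
        while counter[y] < K:                        # step 7
            sent = next(((x, labels) for x, labels in remaining if y in labels), None)
            if sent is None:                         # type exhausted in the pool
                break
            remaining.remove(sent)
            sampled.append(sent)
            # counts of *all* event types in the sentence are updated, so the
            # actual number per type can exceed K, as noted in the text
            counter.update(lab for lab in sent[1] if lab != "N.A.")
    return sampled

# Toy usage: two event types, K = 1.
data = [(["he", "sold", "it"], ["N.A.", "TransferOwnership", "N.A."]),
        (["troops", "attacked"], ["N.A.", "Attack"])]
print(greedy_sample(data, ["TransferOwnership", "Attack"], K=1))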

B.3 Existing methods
We conduct our empirical study on ten representative existing methods, five prompt-based and five prototype-based.

1. Prompt-based methods leverage the rich knowledge in PLMs by converting specific downstream tasks into formats with which PLMs are more familiar. We give examples of the prompt formats of the five prompt-based methods in Table 6.

EEQA/EERC (Du and Cardie, 2020; Liu et al., 2020): a QA/MRC-based method which first extracts the trigger word with a natural language query and then classifies its type with an additional classifier.

EDTE (Lyu et al., 2021): an NLI-based method which enumerates all event types and judges whether each event-type hypothesis is entailed by a clause. The clauses are obtained by SRL processing, and the trigger candidate is the predicate of each clause.

PTE (Schick and Schütze, 2021): a cloze-style prompt method which enumerates each word in the sentence and predicts whether it is the trigger of any event type.

UIE (Lu et al., 2022): a generation-based method that takes in a sentence and outputs a filled universal template indicating the trigger words and their event types.

DEGREE (Hsu et al., 2022): also adopts a generation paradigm, but enumerates all event types with type-specific templates and outputs the related triggers (if any).

Table 6: Prompt examples for different methods, based on an example sentence X: "The current government was formed in October 2000", in which the word "formed" triggers a Start-Org event. The underlined part of the UIE prompt is its designed Structured Schema Instructor (SSI), and DESCRIPTION(y) in the DEGREE prompt is a natural-language description of event type y ∈ E. We refer readers to the original papers for details.

Method | Prompt Input | Output
EEQA (Du and Cardie, 2020) | X. What is the trigger in the event? | formed
EDTE (Lyu et al., 2021) | Premise: X. Hypothesis: This text is about a Start-Org event. | …
… | … | Event trigger is N.A.

2. Prototype-based methods predict an event type for each word or span by measuring the representation proximity between samples and the prototypes of each event type.

Prototypical Network (ProtoNet; Snell et al., 2017): a classical prototype-based method originally developed for episode learning. Huang et al. (2021) adapt it to the low-resource setting by further splitting the training set into a support set S_y and a query set Q_y. The prototype h̄_{c_y} of each event type is constructed by averaging the PLM representations of the samples in S_y.
For a sample x in Q_y during training, or in the test set during inference, P-score(x, y) is defined as the negative euclidean distance between h_x and h̄_{c_y}:
P-score(x, y) = −||h_x − h̄_{c_y}||²

L-TapNet-CDT (Hou et al., 2020): a ProtoNet-variant method with three main improvements: (1) it introduces TapNet, a variant of ProtoNet whose main difference lies in an analytically constructed projection space M; the distance is computed in the subspace spanned by M.

(2) the bases of the column space of M⊥ are aligned with label semantics, so the projection is label-enhanced.
(3) a collapsed dependency transfer (CDT) module is used solely during the inference stage to adjust the event-type scores:
P-score(x, y) ← P-score(x, y) + TRANS(y)

CONTAINER (Das et al., 2022): a contrastive learning method in which every annotated mention serves as a prototype. For a mention x, its score for event type y is calculated as the average distance to the samples in S_y(x), i.e., the other mentions of type y.

FSLS (Ma et al., 2022a): a label-semantic method that encodes event-type names with the PLM and directly uses them as prototypes.
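Here is a hedged sketch of applying the CDT rule above at inference time. Greedy decoding stands in for the actual decoding procedure, and the construction of the collapsed transition scores follows Hou et al. (2020), which we do not reproduce:

import torch

def cdt_decode(p_scores, trans, na_index=0):
    """Left-to-right decoding with the CDT adjustment:
    P-score(x_i, y) <- P-score(x_i, y) + TRANS(y), where TRANS is
    conditioned on the previously decoded label.
    p_scores: (seq_len, |E|+1); trans: (|E|+1, |E|+1) collapsed transitions."""
    preds, prev = [], na_index            # assume decoding starts from N.A.
    for scores in p_scores:
        prev = int((scores + trans[prev]).argmax())
        preds.append(prev)
    return preds

# Toy usage: 4 words, 3 labels (index 0 = N.A.).
print(cdt_decode(torch.randn(4, 3), torch.zeros(3, 3)))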

B.4 Implementation Details
For all methods, we initialize the pre-trained weights and train the models with the Hugging Face library. Each experiment is run on a single NVIDIA V100 GPU, and the final reported performance for each setting (e.g., ACE05 2-shot) is the average over ten distinct few-shot training datasets sampled with different random seeds. We detail the implementation of all methods below.

1. Prompt-based methods
We keep all other hyperparameters the same as in the original papers, except learning rates and epochs. We grid-search the best learning rate in [1e-5, 2e-5, 5e-5, 1e-4] for each setting. As for epochs, we find that the range of appropriate epochs is highly affected by the prompt format, so we search for epochs method by method without a unified range.

2. Prototype-based methods

We build a codebase based on the unified view and implement these methods directly in the unified framework by making different choices for each design element. To ensure the correctness of our codebase, we compare the results obtained from our implementation with those from the original code of each method, and find that they achieve similar performance on the few-shot ED datasets.

For all methods (including the unified baseline), we train with the AdamW optimizer, a linear scheduler, and a 0.1 warmup ratio (see the sketch below). We set the weight-decay coefficient to 1e-5 and the maximum gradient norm to 1.0. We add a 128-word window centered on the trigger words and only encode the words within the window; in other words, the maximum encoding sequence length is 128. The batch size is set to 128, and the number of training steps to 200 if the transfer function is scaled (see Section 5.2) and 500 otherwise. We grid-search the best learning rate in [1e-5, 2e-5, 5e-5, 1e-4] for each setting.

For ProtoNet and its variants, we further split the sentences into a support set and a query set. The numbers in the support set K_S and query set K_Q are (1, 1) for 2-shot settings and (2, 3) for 5-shot settings; the split is (2, 8) for the 10-shot dataset constructed from MAVEN and (5, 5) for the others. For methods adopting the MoCo CL setting (also see Section 5.2), we maintain a queue of sample representations of length 2048 for ACE/ERE 2-shot settings and 8192 for the others. For methods adopting CRF, we follow the default CRF hyperparameters of the original papers. For methods adopting scaled transfer functions, we grid-search the scaling coefficient τ in [0.1, 0.2, 0.3].
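For concreteness, here is a sketch of this shared optimization setup; a tiny linear layer stands in for the RoBERTa-based model, and the loss is a placeholder:

import torch
from transformers import get_linear_schedule_with_warmup

model = torch.nn.Linear(768, 34)           # placeholder for the real scorer
total_steps = 200                          # 200 with scaled transfer, else 500
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5, weight_decay=1e-5)
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=int(0.1 * total_steps),   # 0.1 warmup ratio
    num_training_steps=total_steps,
)

for step in range(total_steps):
    h = torch.randn(128, 768)                  # batch size 128 (placeholder inputs)
    loss = model(h).logsumexp(dim=-1).mean()   # placeholder loss
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)  # max grad norm 1.0
    optimizer.step(); scheduler.step(); optimizer.zero_grad()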
C Low-resource Setting-Extended
C.1 Transfer and distance functions

We conduct experiments with four existing prototype-based methods, changing only their transfer and distance functions, and illustrate the results on the ACE05 dataset in Figure 9. (1) Comparing ProtoNet and TapNet, we find that TapNet, i.e., the down-projection transfer, shows no significant improvement on few-shot ED.
(2) A scaled coefficient in the distance function achieves strong performance together with the normalization transfer function, while performance collapses (training fails to converge) without normalization. (3) For ProtoNet and TapNet, scaled euclidean distance (SEU) is the better choice of distance function, while the other methods prefer scaled cosine similarity (SS). Based on these findings, we substitute the most appropriate d and f into all existing methods and observe significant improvements on all three datasets, as shown in Table 8.

C.2 CRF module
We explore whether CRF modules improve the performance of few-shot ED. Based on the Ll-MoCo model developed in Section 5, we compare variants with and without CRF modules; the results are in Figure 10. Different CRF variants achieve results similar to the model without CRF, while a trained CRF (and its prototype-enhanced variant) slightly benefits multi-word triggers when samples are extremely scarce (see ACE05 2-shot). These results are inconsistent with similar sequence labeling tasks such as NER and slot tagging, in which CRFs usually improve model performance significantly. We speculate that this is because the pattern of triggers in ED is relatively simple. To validate this assumption, we count all triggers in the ACE05 and MAVEN datasets and find that over 96% of triggers are single words, with most of the remaining triggers being verb phrases. Explicitly modeling the transition dependencies among different event types is therefore not very meaningful for few-shot ED, and we drop the CRF module from the unified baseline.

Figure 10: Overall performance of different CRF variants on the ACE05 and MAVEN datasets. We also provide performance grouped by trigger length: = 1: single-word triggers; ≥ 2: trigger phrases.

C.3 Prototype source
We discuss the benefit of combining the two kinds of prototype sources, label semantics and event mentions, in Section 5.2, and show some results in Figure 4. Here we list the full results on all three datasets in Table 9. The results further validate our claims: (1) leveraging both label semantics and mentions as prototype sources improves performance under almost all settings.
(2) Merging the two kinds of sources at the loss-level is the best choice among the three aggregation alternatives.

C.4 Contrastive Learning
Contrastive learning (CL; Hadsell et al., 2006) was initially developed for self-supervised representation learning and has recently been used to facilitate supervised learning as well. It pulls samples with the same label together while pushing samples with different labels apart in the embedding space. We view CL as a generalized form of prototype-based method and include it in the unified view: every sample is a prototype, and a single event type can have multiple prototypes. Given an event mention, its distances to the prototypes are computed and aggregated per event type to determine an overall distance to each event type.

Two types of Contrastive Learning
We call the representation of the current event mention the query and the prototypes (i.e., other event mentions) the keys. CL can then be split into two cases, in-batch CL (Chen et al., 2020) and MoCo CL (He et al., 2020), according to where the keys come from. In-batch CL views the other event mentions within the same batch as keys, and the encoder computing queries and keys is updated end-to-end by back-propagation. In MoCo CL, the key encoder is a momentum-updated copy of the query encoder, and a queue stores previously computed keys so that they can be reused multiple times. We refer readers to He et al. (2020) for the details of in-batch CL and MoCo CL.
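Below is a minimal sketch of the MoCo-style key handling just described, assuming two encoder copies; the class and method names are illustrative, not from any released code:

import torch
import torch.nn.functional as F
from collections import deque

class MoCoKeyStore:
    """Momentum-updated key encoder plus a fixed-length key queue
    (length 2048 or 8192 in our experiments)."""

    def __init__(self, query_enc, key_enc, queue_len=2048, momentum=0.999):
        self.query_enc, self.key_enc, self.m = query_enc, key_enc, momentum
        self.queue = deque(maxlen=queue_len)      # stores (key, label) pairs

    @torch.no_grad()
    def momentum_update(self):
        # The key encoder trails the query encoder; no gradient flows through it.
        for pq, pk in zip(self.query_enc.parameters(), self.key_enc.parameters()):
            pk.mul_(self.m).add_(pq, alpha=1 - self.m)

    @torch.no_grad()
    def push(self, batch, labels):
        # Keys computed at earlier steps stay in the queue and are reused,
        # giving the data-augmentation effect discussed below.
        keys = F.normalize(self.key_enc(batch), dim=-1)
        self.queue.extend(zip(keys, labels.tolist()))

# Toy usage with linear encoders standing in for the PLM.
q_enc, k_enc = torch.nn.Linear(768, 768), torch.nn.Linear(768, 768)
store = MoCoKeyStore(q_enc, k_enc)
store.push(torch.randn(4, 768), torch.tensor([1, 0, 2, 1]))
store.momentum_update()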
CONTAINER (Das et al., 2022) adopts the in-batch CL setting for few-shot NER, and we transfer it to the ED domain in our empirical study. We further compare the two types of CL for our unified baseline with the effective components from Section 5.2 and present the full results in Table 10. We observe that in-batch CL outperforms MoCo CL when the number of sentences is small, and that the situation reverses as the number of sentences increases. We speculate there are two main reasons: (1) when all sentences fit within a single batch, in-batch CL is the better approach, since it computes and updates all key and query representations end-to-end by back-propagation, while MoCo CL computes key representations with a momentum-updated, gradient-stopped encoder. When the number of sentences exceeds the batch size, however, in-batch CL loses the information of some samples at each step, while MoCo CL keeps all samples in the queue and leverages these approximate representations for more extensive comparison and learning. (2) MoCo CL also has a data-augmentation effect in few-shot ED: the number of sentences is usually much smaller than the queue size, so the queue stores multiple representations of each sample, computed at different previous steps. The benefit of such augmentation takes effect when there are relatively abundant, and accordingly diverse, sentences.

D Class-transfer Setting-Extended

D.1 Prompt-based methods
We list the results of existing prompt-based methods under the class-transfer setting in Table 11. See the detailed analysis in Section 6.1.

D.2 Prototype-based methods
We list the results of existing prototype-based methods plus our unified baseline under the class-transfer setting in Table 12. Note that for existing methods we substitute the appropriate distance functions d and transfer functions f obtained in Section 5.2. See the detailed analysis in Section 6.2.

Table 9: Performance with different (1) prototype sources and (2) aggregation forms. ProtoNet: only event mentions. FSLS: only label semantics. Lf-ProtoNet: aggregates the two types of prototype sources at the feature level. Ls-ProtoNet: at the score level. Ll-ProtoNet: at the loss level. The results are averaged over 10 repeated experiments, with sample standard deviations in round brackets.

Table 10: Performance with three label-enhanced approaches. The number in square brackets is the (average) number of sentences under each setting. Averaged F1-scores with sample standard deviations over 10 repeated experiments are shown.

Table 11: Prompt-based methods under the class-transfer setting. Averaged F1-scores with sample standard deviations over 10 repeated experiments are shown. We list results without and with transfer for comparison.