Knowledge-Enhanced Self-Supervised Prototypical Network for Few-Shot Event Detection



Introduction
Event detection is fundamental to information extraction, and consists of two sub-processes, i.e., trigger word identification and event classification. The former extracts the triggers from a piece of text describing events, while the latter classifies them into different event types. For example, in "Our college is to make arrangements for the meeting", the trigger words are "make arrangements", indicating an Arranging event. Event detection benefits many downstream applications, e.g., question answering and information retrieval.

Existing event detection methods have been heavily dependent on a large quantity of labeled data. However, in many real-world scenarios, labeled data are often inadequate, which limits the performance of existing methods. Therefore, researchers have shown increasing interest in event detection with only a few labeled instances, which is thus called Few-Shot Event Detection (FSED).
There are two kinds of approaches to FSED, namely, pipeline ones (Lai et al., 2020; Deng et al., 2020, 2021) and joint ones (Lai et al., 2021; Cong et al., 2021; Chen et al., 2021). The former adopt a two-stage (i.e., identification and classification) process, while the latter regard the two sub-processes as a joint one. Since the joint approaches alleviate the error propagation problem that appears in pipeline ones, they have become the mainstream. In the joint approaches, FSED is formulated as a sequence tagging task, where each word in a sequence is assigned a label. The label consists of two parts: the position part and the type part (Fritzler et al., 2019). There are three types of labels for the position part, i.e., B, I and O, where B and I indicate the beginning and inside positions of the corresponding words in the event triggers, respectively, and O refers to other (i.e., non-trigger) words. The type part indicates the event type of the instance. Moreover, this kind of approach usually adopts the Prototypical Network (PN) (Snell et al., 2017) as the classifier, whose main idea is to learn a metric space in which classification of an instance can be performed by measuring its distance to different prototypes. An example of "BIO"-based sequence tagging PN for FSED is shown in Figure 1. Furthermore, the B-Arranging and I-Arranging prototypes compose the Arranging event prototype.
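The "BIO" labeling scheme described above can be sketched as follows; the helper function and the span encoding are hypothetical illustrations, not the paper's code, but the sentence and event type follow the paper's own example.

```python
def bio_labels(tokens, trigger_span, event_type):
    """Assign B-/I-EventType to trigger tokens and O to non-trigger tokens.

    trigger_span is (start, end) with end exclusive, in token indices.
    """
    start, end = trigger_span
    labels = []
    for i, _ in enumerate(tokens):
        if i == start:
            labels.append(f"B-{event_type}")
        elif start < i < end:
            labels.append(f"I-{event_type}")
        else:
            labels.append("O")
    return labels

tokens = ["Our", "college", "is", "to", "make", "arrangements",
          "for", "the", "meeting"]
labels = bio_labels(tokens, (4, 6), "Arranging")
# "make" -> B-Arranging, "arrangements" -> I-Arranging, the rest -> O
```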
A challenging problem for these PN-based FSED approaches is how to obtain accurate prototype representations, for two main reasons. First, the number of instances for calculating event prototypes is limited in the few-shot scenarios. In FSED, the trigger words account for only a small proportion of all tokens in a word sequence, which makes the tokens with labels B and I even fewer. Hence, the prototype representations for labels B and I become less accurate. Second, existing approaches usually assume by default that event prototypes are independent. Therefore, they fail to capture the relationships (i.e., the parent-child relationship and the sibling relationship) among these prototypes.
To solve the above problem, we propose a novel Knowledge-Enhanced self-supervised Prototypical Network, called KE-PN, for FSED. To obtain more accurate prototype representations, KE-PN adopts a novel knowledge enhancement method which introduces knowledge from an external knowledge base, i.e., FrameNet (Baker et al., 1998), when computing the prototypes. Unlike recent approaches relying on a mixture of string matching and human annotation (Shen et al., 2021), KE-PN applies hybrid rules which align event types to the frames in FrameNet in a completely automatic manner. Then, KE-PN replaces the triggers of the support instances with the LexiUnits of the aligned frames to form new instances. To reduce the noise brought by the above method, KE-PN adopts a self-supervised learning method to filter out noise from the enhanced support set. Moreover, in order to inject relationship information into prototype representations, KE-PN is equipped with an auxiliary event type relationship classification module.
In summary, the main contributions of this paper are three-fold.
1) We propose a novel knowledge-enhanced self-supervised learning method to better calculate the representations of event prototypes for the prototypical network, by introducing knowledge from an external knowledge base, i.e., FrameNet.
2) We adopt event type relationship classification as an auxiliary module, to inject relationship information into prototype representations.
3) Extensive experiments on three benchmark datasets, i.e., FewEvent, MAVEN and ACE2005, demonstrate the state-of-the-art performance of KE-PN.
Related Works

Later on, in order to solve the trigger curse problem in FSED (i.e., overfitting the trigger harms the generalization ability, whilst underfitting it hurts the detection performance), Chen et al. (2021) proposed a structural causal model. These joint approaches usually employ PN (Snell et al., 2017) as their classifier and have achieved promising performance. However, they still suffer from inaccurate prototype representations. To overcome this challenge, we propose KE-PN to enhance the prototype representations and thus obtain more accurate prototypes.

Prototypical Network
The original PN learns a metric space in which classification of an instance can be performed by computing its distances to different prototypes (Snell et al., 2017). The prototype $c_i$ of the $i$-th class in the support set is a representative vector, calculated by averaging the vectors of the support instances: $c_i = \frac{1}{K}\sum_{j=1}^{K} x_i^j$, where $x_i^j$ indicates the representation of the $j$-th instance of the $i$-th class. Then, by calculating the distance between the representation vector of a query instance $q$ and the prototype vectors, we obtain a distance-based distribution over the possible classes in the current episode, $p(y=i \mid q) = \frac{\exp(-d(q, c_i))}{\sum_{i'} \exp(-d(q, c_{i'}))}$, where $d(\cdot,\cdot)$ is a distance function (e.g., Euclidean distance). For sequence-tagging-based PN, the prototype $c_i$ is calculated by averaging the representations of tokens with the $i$-th label, which can be B-EventType, I-EventType or O. The prototypes of B-EventType and I-EventType compose the corresponding event prototype.
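The prototype averaging and the distance-based softmax above can be sketched in a few lines; this is a minimal pure-Python illustration with Euclidean distance, not the paper's implementation, and the toy vectors are invented.

```python
import math

def prototypes(support):
    """c_i = mean of the support vectors of class i.

    support: dict mapping class name -> list of vectors (lists of floats).
    """
    return {c: [sum(dim) / len(vs) for dim in zip(*vs)]
            for c, vs in support.items()}

def euclid(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def classify(query, protos):
    """Distance-based softmax over classes: p(i|q) proportional to exp(-d(q, c_i))."""
    scores = {c: math.exp(-euclid(query, c_vec)) for c, c_vec in protos.items()}
    z = sum(scores.values())
    return {c: s / z for c, s in scores.items()}

support = {"B-Arranging": [[1.0, 0.0], [1.0, 0.2]],
           "O": [[0.0, 1.0], [0.2, 1.0]]}
protos = prototypes(support)
probs = classify([0.9, 0.1], protos)  # close to the B-Arranging prototype
```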

Notations
In FSED, two datasets are given: $D_{train}$ and $D_{test}$, which have disjoint event type sets. Each dataset contains several tasks, and each task consists of a support set and a query set, formulated in the $N$-way $K$-shot paradigm. Given the support set $S = \{(x_i, l_i)\}_{i=1}^{N \times K}$, which has $N$ classes with $K$ labeled instances each, FSED aims to predict the labels of tokens in the query set $Q = \{q_i\}_{i=1}^{U}$. In the support set $S$, $x_i = \{w_i^1, w_i^2, \ldots, w_i^n\}$ denotes an $n$-word sequence, and $l_i = \{l_i^1, l_i^2, \ldots, l_i^n\}$ denotes its label sequence. The query set $Q$ contains $U$ instances, where $q_i$ refers to a sequence of unlabeled tokens. Since FSED is formulated as a sequence tagging process, the total number of prototype labels is $2N+1$ ($N$ for B-EventType, another $N$ for I-EventType, and 1 for label O).
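The 2N+1 label space can be enumerated directly; the event type names below are placeholders used only for illustration.

```python
def label_space(event_types):
    """Build the tagging label set: a B-/I- label per event type, plus O."""
    labels = ["O"]
    for t in event_types:
        labels += [f"B-{t}", f"I-{t}"]
    return labels

# A 5-way episode yields 2 * 5 + 1 = 11 prototype labels.
labels = label_space(["Arranging", "Attack", "Transport", "Meet", "Marry"])
```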

The KE-PN Method
The KE-PN method consists of three modules, i.e., representation learning, event detection and event type relationship classification, as shown in Figure 2.
Representation Learning. This module aims to obtain the representations of event prototypes and query instances. To obtain accurate prototype representations, a knowledge-enhanced self-supervised learning method is applied.
Event Detection. This module takes the representations of prototypes and query instances as its input, and predicts the labels of tokens in the query set. We adopt PA-CRF, the state-of-the-art FSED model, as the event detection method.
Event Type Relationship Classification. This module takes the prototype representations as its input, and predicts whether two prototypes have the concerned relationships. It injects relationship information into prototype representations by working as an auxiliary module.

1. t_i.eql(f)
2. For any ten in p(t_i): ten.eql(f)
3. For any nou in n(t_i): nou.eql(f)
4. For any syn in s(t_i): syn.eql(f)
5. t_i.con(f) or f.con(t_i)
6. For any ten in p(t_i): ten.con(f) or f.con(ten)
7. For any nou in n(t_i): nou.con(f) or f.con(nou)
8. For any syn in s(t_i): syn.con(f) or f.con(syn)
Table 1: The conditions for aligning event types to frames.
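The alignment conditions of Table 1 can be sketched as a single predicate. In this hypothetical sketch, the tense forms, nouns and synonyms that the paper obtains from WordNet are passed in explicitly, and the example strings are illustrative only.

```python
def aligned(ti, f, tense_forms, nouns, synonyms):
    """Return True if event type ti aligns with frame name f under any
    of the eight hybrid rules: exact string equality (rules 1-4) or
    substring containment in either direction (rules 5-8).

    tense_forms/nouns/synonyms stand in for p(ti), n(ti), s(ti).
    """
    def eql(a, b):
        return a == b

    def con(a, b):  # a.con(b): b is a substring of a
        return b in a

    variants = [ti] + list(tense_forms) + list(nouns) + list(synonyms)
    for v in variants:
        if eql(v, f):              # rules 1-4
            return True
        if con(v, f) or con(f, v): # rules 5-8
            return True
    return False

# "arrange" aligns with "Making_arrangements" via the noun "arrangement",
# since it is a substring of the frame name (a rule-7-style match).
ok = aligned("arrange", "Making_arrangements",
             tense_forms=["arranged", "arranging"],
             nouns=["arrangement"], synonyms=["organize"])
```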

Representation Learning
This module includes four components, i.e., the knowledge enhancer, base encoder, noise filter and prototype calculator. Given the support set S, the knowledge enhancer produces the enhanced support set by introducing external knowledge. Then, taking the enhanced support set as input, the base encoder maps these instances into a semantic embedding space. After that, the noise filter removes noise from the enhanced support set. Finally, the prototype calculator computes the prototypes by averaging the instance vectors obtained from the last step.

Knowledge Enhancer
The knowledge enhancer presents a novel knowledge enhancement method for FSED, which is based on hybrid rules. Previous works have aligned event types to external knowledge bases (Shen et al., 2021). However, they rely on manpower, which is time-consuming and expert-driven. For this reason, we design hybrid rules which align the event types to the frames in FrameNet in a completely automatic manner, as shown in Table 1. Let t_i denote the i-th event type, f denote a candidate frame and F_i denote the corresponding frame set. We adopt WordNet (Miller, 1995) to obtain the related forms of a word, where p(·) represents the past tense and the present progressive of a word, n(·) denotes the nouns of a word, and s(·) indicates the synonyms of a word. Moreover, a.eql(b) means that string a is the same as string b, and a.con(b) indicates that b is a substring of a. If t_i and f satisfy any of these rules, f is put into F_i. Then, we replace the triggers in the support instances with the LexiUnits of the aligned frames in FrameNet to obtain the enhanced instances, as shown in Figure 3. Let M denote the maximum number of enhanced instances for each class. To ensure the same number of instances for all classes, we add zero vectors for the classes whose number of enhanced instances is less than M. Therefore, the original instances and the enhanced ones compose the final enhanced support set S′ with N×(K+M) instances.

Base Encoder
The base encoder aims to map the instances in the enhanced support set S′ and the query set Q into the embedding space to express their semantic meanings. Given the input $x_i = \{w_i^1, w_i^2, \ldots, w_i^n\}$, BERT (Kenton and Toutanova, 2019) is employed to obtain the embedding representations of $x_i$ as $\{\mathbf{w}_i^1, \mathbf{w}_i^2, \ldots, \mathbf{w}_i^n\} = \mathrm{BERT}(x_i)$, where $\mathbf{w}_i^j$ denotes the representation of token $w_i^j$, which has B dimensions. Thus, the embedding set $\tilde{S}$ of S′ can be formulated as $\tilde{S} = \{(\mathbf{x}_i, l_i)\}_{i=1}^{N \times (K+M)}$. Similarly, the embedding set $\tilde{Q}$ of Q is $\tilde{Q} = \{\mathbf{q}_i\}_{i=1}^{U}$, where $\mathbf{q}_i$ denotes the embedding representation of $q_i$, obtained by $\mathbf{q}_i = \mathrm{BERT}(q_i)$.

Noise Filter
Given the embeddings from the base encoder, we statistically analyze how many wrong frames are aligned by the knowledge enhancer. The analysis results are shown in Table 2, where we can see that wrong frames account for a large share of the aligned frames on the FewEvent and ACE2005 datasets. Note that the statistics of MAVEN are not presented, as it has gold alignments to FrameNet. The incorrect instances obtained by the knowledge enhancer from the wrong frames are called noise.
Due to the different reliability of the hybrid rules, the enhanced instances can be divided into certain instances and uncertain instances. Among them, the uncertain ones contain more noise. To filter out the noise from the uncertain ones, we propose a self-supervised learning method, which utilizes the certain instances as the training set.
Specifically, we divide the given enhanced support set into the certain set cer_i and the uncertain set unc_i for event type t_i. The positive instances of cer_i include the original support instances of t_i and the instances obtained via hybrid rules Nos. 1, 2, 3 and 4. The negative instances of cer_i include the certain sets of the other event types in the given support set. unc_i is composed of the instances obtained by hybrid rules Nos. 5, 6, 7 and 8. Then, cer_i is exploited to train a binary classifier, for which a Multi-Layer Perceptron (MLP) (Murtagh, 1991) is adopted. Next, we obtain the predicted positive instances in unc_i by inputting unc_i into the trained MLP. Finally, the original positive instances in cer_i and the predicted positive instances in unc_i compose the complete support set CS.
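The certain/uncertain filtering step can be sketched as follows. As a stand-in for the paper's MLP, this minimal sketch trains a logistic-regression classifier in pure Python on the certain positives and negatives, then keeps only the uncertain instances it predicts as positive; the toy 2-D vectors are invented.

```python
import math

def train_filter(cer_pos, cer_neg, lr=0.5, epochs=300):
    """Fit a binary classifier on certain instances.

    cer_pos / cer_neg: lists of feature vectors (lists of floats).
    Returns weights and bias of a logistic-regression stand-in for the MLP.
    """
    dim = len(cer_pos[0])
    w, b = [0.0] * dim, 0.0
    data = [(x, 1.0) for x in cer_pos] + [(x, 0.0) for x in cer_neg]
    for _ in range(epochs):
        for x, y in data:
            p = 1 / (1 + math.exp(-(sum(wi * xi for wi, xi in zip(w, x)) + b)))
            g = p - y  # gradient of the cross-entropy loss w.r.t. the logit
            w = [wi - lr * g * xi for wi, xi in zip(w, x)]
            b -= lr * g
    return w, b

def filter_uncertain(unc, w, b):
    """Keep uncertain instances the trained classifier predicts positive."""
    keep = []
    for x in unc:
        p = 1 / (1 + math.exp(-(sum(wi * xi for wi, xi in zip(w, x)) + b)))
        if p > 0.5:
            keep.append(x)
    return keep

cer_pos = [[1.0, 0.9], [0.9, 1.1]]    # certain positives of t_i
cer_neg = [[-1.0, -0.8], [-0.9, -1.1]]  # certain sets of other types
w, b = train_filter(cer_pos, cer_neg)
kept = filter_uncertain([[1.2, 1.0], [-1.2, -1.0]], w, b)
```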

Prototype Calculator
The prototype calculator obtains the prototype representation by averaging the token vectors for each label. The prototype representation $c_i$ for the $i$-th label ($i \in [1, 2N+1]$) is calculated as $c_i = \frac{1}{|W(CS, l_i)|}\sum_{w \in W(CS, l_i)} \mathbf{w}$, where $W(CS, l_i)$ indicates the set of tokens with label $l_i$ in CS, $\mathbf{w}$ refers to the representation of token $w$, and $|\cdot|$ denotes the number of elements in a set. Finally, the embedding set $P$ of all prototypes is given by $P = \{c_i\}_{i=1}^{2N+1}$.
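The per-label averaging above can be sketched as a small grouping function; the label names and toy vectors are illustrative only.

```python
def label_prototypes(token_reps):
    """Compute c_i as the mean of token vectors carrying each label.

    token_reps: list of (label, vector) pairs, i.e., the tokens of CS.
    Returns a dict mapping label -> prototype vector.
    """
    sums, counts = {}, {}
    for label, vec in token_reps:
        if label not in sums:
            sums[label] = [0.0] * len(vec)
            counts[label] = 0
        sums[label] = [s + v for s, v in zip(sums[label], vec)]
        counts[label] += 1
    return {lab: [s / counts[lab] for s in sums[lab]] for lab in sums}

protos = label_prototypes([
    ("B-Arranging", [1.0, 0.0]),
    ("B-Arranging", [0.0, 1.0]),
    ("O", [0.0, 0.0]),
])
```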

Event Detection
In this module, we adopt PA-CRF for event detection, which mainly consists of three sub-modules, i.e., emission module, transition module and decoding module.
The emission module aims to calculate the emission score for each token in the query set Q. The emission scores are obtained by calculating the similarities between the representations of the query tokens and the prototypes. In practice, the dot product is chosen to measure the similarity.
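The dot-product emission scoring can be sketched directly; the toy vectors below are invented for illustration.

```python
def emission_scores(query_tokens, prototypes):
    """Emission score of each query token against each label prototype,
    measured with the dot product, as in the emission module."""
    return [[sum(q * p for q, p in zip(tok, proto)) for proto in prototypes]
            for tok in query_tokens]

# One query token scored against two label prototypes.
scores = emission_scores([[1.0, 2.0]], [[0.5, 0.5], [1.0, 0.0]])
```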
The transition module is to generate the distributional parameters (i.e., mean and variance) of transition scores based on the label prototypes.
The decoding module derives the probability of a specific label sequence for the query tokens according to the emission scores and the approximated Gaussian distributions of the transition scores. Monte Carlo sampling (Gordon et al., 2019) is employed to approximate the integral. In the inference phase, PA-CRF adopts the Viterbi algorithm (Forney, 1973) to decode the best-predicted label sequence. The event detection process can be simplified as outputting a probability distribution $p_1$ over the possible label sequences for the query tokens. Then, the loss $l_1$ of this module is obtained with the cross entropy loss function $L(\cdot,\cdot)$ as $l_1 = L(p_1, y_1)$, where $y_1$ denotes the ground truth distribution of the query tokens over label sequences.
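The Viterbi decoding step can be sketched as follows. This is a generic additive-score Viterbi in pure Python; PA-CRF's sampled transition distributions are replaced by a fixed transition matrix purely for illustration.

```python
def viterbi(emissions, transitions):
    """Decode the best label sequence from additive scores.

    emissions: list of per-token score lists, one entry per label.
    transitions: transitions[i][j] = score of moving from label i to j.
    Returns the highest-scoring label index sequence.
    """
    n_labels = len(emissions[0])
    best = list(emissions[0])  # best score ending in each label so far
    back = []                  # backpointers per step
    for em in emissions[1:]:
        ptr, new = [], []
        for j in range(n_labels):
            scores = [best[i] + transitions[i][j] for i in range(n_labels)]
            i_star = max(range(n_labels), key=lambda i: scores[i])
            ptr.append(i_star)
            new.append(scores[i_star] + em[j])
        back.append(ptr)
        best = new
    j = max(range(n_labels), key=lambda j: best[j])
    path = [j]
    for ptr in reversed(back):  # follow backpointers to recover the path
        j = ptr[j]
        path.append(j)
    return list(reversed(path))

# With neutral transitions, the path simply follows the emission maxima.
path = viterbi([[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]],
               [[0.0, 0.0], [0.0, 0.0]])
```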

Event Type Relationship Classification
In order to inject relationship information into prototype representations, we exploit event type relationship classification as an auxiliary module for FSED. In this module, the relationships of concern are the parent-child and sibling relationships. Therefore, two prototypes are related if they or their corresponding frames have either of these two relationships.
First of all, a Graph Convolutional Network (GCN) (Scarselli et al., 2009) is pre-trained on the graph which contains these two relationships in FrameNet. The representation of each frame is obtained by inputting its definition into the base encoder.
Given the embedding set $P$ of prototypes and the relationship graph as input, we construct the adjacency matrix $A$ for the prototypes with label B. If two prototypes are related, their adjacency weight is set to 1; otherwise, the weight is 0. Then, we adopt the pre-trained GCN as the encoder, which takes $P$ and the adjacency matrix $A$ as its input, to obtain the updated prototype representations $P'$ as $P' = \mathrm{GCN}(P, A)$. Then, we concatenate any two prototypes by $h_{m,n} = [c'_m; c'_n]$, where $m \neq n$ and $m, n$ index the prototypes with label B. A Convolutional Neural Network (CNN) decoder is employed as the relationship classifier, to predict whether two prototypes are related. It slides a convolutional kernel, whose window size is $k$, over the concatenated embeddings to get the output hidden embeddings $\mathrm{Conv}(h_{m,n})$, where $\mathrm{Conv}(\cdot)$ is a convolutional operation.
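The adjacency construction and pairwise concatenation can be sketched as follows; the toy one-dimensional prototypes are invented, and the GCN/CNN stages are omitted.

```python
def adjacency(n, related_pairs):
    """Build the adjacency matrix A over the N prototypes with label B:
    weight 1 if two prototypes are related, 0 otherwise (symmetric)."""
    A = [[0] * n for _ in range(n)]
    for m, k in related_pairs:
        A[m][k] = A[k][m] = 1
    return A

def pair_concat(protos):
    """Concatenate every ordered pair of distinct prototypes; these
    concatenations are what the relationship classifier consumes."""
    return {(m, n): protos[m] + protos[n]
            for m in range(len(protos))
            for n in range(len(protos)) if m != n}

A = adjacency(3, [(0, 1)])          # prototypes 0 and 1 are related
pairs = pair_concat([[1.0], [2.0], [3.0]])
```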
A max pooling operation is then applied over these hidden embeddings to output the final embedding $h'_{m,n}$. Then, we employ Sigmoid as the activation function and thus obtain the probability distribution $p_2$ of whether two prototypes are related. The loss $l_2$ of the relationship classification module is calculated with the cross entropy loss function $L(\cdot,\cdot)$ as $l_2 = L(p_2, y_2)$, where $y_2$ denotes the ground truth distribution of the relationships between two prototypes. For the MAVEN dataset, the classification labels are parent-child and non-parent-child. Moreover, the relationship labels are sibling and non-sibling for the FewEvent and ACE2005 datasets.

Table 3: Statistics of the class splits in the three benchmark datasets.
Finally, the overall loss $l$ is obtained as the sum of $l_1$ and $l_2$: $l = l_1 + l_2$. The parameters of KE-PN are updated by minimizing the loss $l$ through gradient-based optimization.

Datasets and Evaluation Metrics
As aforesaid, we conduct experiments on the three FSED benchmark datasets, i.e., FewEvent (Deng et al., 2020), MAVEN (Wang et al., 2020), and ACE2005 (Doddington et al., 2004). For FewEvent, we adopt the version split by Cong et al. (2021), which contains 80, 10 and 10 event types for training, validation and test, respectively. To match the number of classes of the standard few-shot dataset, i.e., FewEvent, we adopt the 100 classes in MAVEN which have more than 200 instances and randomly divide them into subsets with 64, 16 and 20 classes for training, validation and test, respectively. For ACE2005, which has 33 classes, we randomly partition it into 13, 10 and 10 classes for training, validation and test, respectively. The statistics of the three datasets are shown in Table 3.
We set up four configurations, namely, 5-way 1-shot, 5-way 5-shot, 10-way 1-shot and 10-way 5-shot, for each FSED task on the three datasets. In addition, following previous works (Chen et al., 2015; Liu et al., 2018; Cui et al., 2020), we adopt the standard micro F1 score as the evaluation metric and report the averages and standard deviations over 5 randomly initialized runs.
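The micro F1 metric can be sketched as follows; predictions are encoded here as hypothetical (sentence, start, end, type) tuples, which may differ from the benchmarks' official scoring code.

```python
def micro_f1(gold, pred):
    """Micro F1 over predicted triggers: precision and recall are
    computed from global true-positive/false-positive/false-negative
    counts rather than per-class averages."""
    gold, pred = set(gold), set(pred)
    tp = len(gold & pred)
    fp = len(pred - gold)
    fn = len(gold - pred)
    if tp == 0:
        return 0.0
    p = tp / (tp + fp)
    r = tp / (tp + fn)
    return 2 * p * r / (p + r)

# One correct trigger and one spurious one: P = 0.5, R = 1.0, F1 = 2/3.
f1 = micro_f1({("s1", 4, 6, "Arranging")},
              {("s1", 4, 6, "Arranging"), ("s2", 0, 1, "Attack")})
```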

Implementation Details and Parameter Setting
The parameter setting is as follows. For the representation learning module, the number of enhanced instances M is set to 25, balancing performance and resource usage. BERT-base-uncased (Kenton and Toutanova, 2019) is employed as the base encoder, whose maximum input length is 128 tokens and whose hidden size B is 768. For the noise filter, we adopt a 3-layer MLP classifier, whose hidden size is 768. For event type relationship classification, we employ a 3-layer GCN, whose hidden size is also 768. Furthermore, for the CNN decoder, the hidden size is 768×2, the kernel size k is 3 and the padding is 1. KE-PN is trained with a 1e-5 learning rate using the AdamW optimizer. We train KE-PN for 10,000 iterations on the training set and evaluate its performance over 3,000 iterations on the test set, following the episodic paradigm (Vinyals et al., 2016), with batch size 1. Moreover, the dropout rate is 0.1. We run all experiments using PyTorch 1.5.1 on an Nvidia V100 GPU with 32GB memory.

Table 4: F1 scores (%) on all tasks on the three benchmark datasets: FewEvent, MAVEN and ACE2005. The best results among all models are marked in bold, which indicates statistically significant improvements over the best baseline with p < 0.01 under a bootstrap test; ± marks the standard deviation.

Baseline Models
In the experiments, we adopt representative and state-of-the-art joint models as baselines in order to verify the effectiveness of KE-PN on different tasks. More specifically, we choose the following baselines, which all employ BERT as their base encoder:
1) Match (Vinyals et al., 2016), which adopts Cosine similarity as the distance function;
2) Proto (Snell et al., 2017), which uses Euclidean distance as the similarity metric;
3) Proto-dot, the Proto method using the dot product to calculate the similarity;
4) Relation (Sung et al., 2018), which adopts a two-layer neural network to measure the similarity;
5) PA-CRF (Cong et al., 2021), the state-of-the-art model on FewEvent to date.

KE-PN outperforms all baselines and achieves the state-of-the-art performance on all datasets and tasks (Table 4). The F1 score of KE-PN increases by 16-30% on FewEvent compared to the baseline models. Furthermore, the improvements on MAVEN and ACE2005 are about 17-33% and 4-8%, respectively. The improvement on MAVEN is larger than on the other two datasets, which is probably due to the strong association between MAVEN and FrameNet. The overall experimental results clearly demonstrate that KE-PN is effective on different datasets and tasks.

Ablation Study
In this subsection, we conduct ablation studies to investigate the effectiveness and impact of both Knowledge-Enhanced Self-Supervised Learning (KESSL) and Event Type Relationship Classification (ETRC) on the performance of KE-PN, on the dev sets of FewEvent and MAVEN. Moreover, KESSL can be further divided into Knowledge Enhancement (KE) and Self-Supervised Learning (SSL). As shown in Table 5, the performance of the ablated models without KE, SSL or ETRC consistently falls on all tasks.
This suggests that KE, SSL and ETRC all contribute to the effectiveness of KE-PN. Besides, it can be observed that the improvement brought by ETRC is relatively small, which may be due to the sparsity of relationships in many tasks, as counted in Table 6. For example, the average number of relationships in 5-way tasks on MAVEN is only 0.12, which indicates that most 5-way tasks do not even include the concerned relationships. As a result, KESSL plays a more important role in KE-PN than ETRC.

Visualization
To investigate the effectiveness and impact of ETRC, we adopt the Embedding Projector to visualize a 5-way 5-shot task on PA-CRF and KE-PN without KESSL, as shown in Figure 4.

Table 7: The case study on PA-CRF and KE-PN without ETRC. The blue labels denote the right answers, and the red ones indicate the wrong answers.

Case Study
To illustrate the effectiveness of KESSL, we choose two cases of the event types Music.Compose and Contact.E-Mail from the FewEvent test set. As shown in Table 7, KE-PN without ETRC correctly predicts all token labels on both instances. Nevertheless, the baseline PA-CRF wrongly classifies the word "music" as I-Music.Compose and "get" as B-Contact.E-Mail. This indicates that KESSL can help the model more effectively distinguish between trigger words and non-trigger words.

Conclusion and Future Work
In this paper, we proposed a novel knowledge-enhanced self-supervised prototypical network, called KE-PN, for FSED. KE-PN proposes hybrid rules which align the event types to FrameNet and then introduces knowledge to obtain more instances. Furthermore, KE-PN presents a novel self-supervised learning method to filter out noise from the enhanced instances. Moreover, KE-PN adopts event type relationship classification as an auxiliary module, to inject relationship information into prototype representations. Extensive experiments on three benchmark FSED datasets, i.e., FewEvent, MAVEN and ACE2005, demonstrate the state-of-the-art performance of KE-PN. In future work, we will extend FSED to a lifelong learning architecture, as continuous FSED is an important problem in the real world.

Limitations
The limitations of KE-PN lie in two aspects, i.e., the method aspect and the resource aspect. From the method aspect, the hybrid rules are currently designed from a literal (string-level) view; semantic matching could be explored in future work.
In addition, we only take the parent-child and sibling relationships into account in KE-PN, while more relationships between event types should be further studied. From the resource aspect, the GPU memory usage for training KE-PN increases due to the instance enhancement. To reduce the GPU memory usage, the batch size is set to 1, which may make the learning process unstable.

Figure 1: An example of sequence-tagging-based PN for FSED.

Few-Shot Event Detection

As aforesaid, there are two kinds of approaches in FSED, i.e., pipeline and joint approaches. Under the pipeline framework, Lai et al. (2020) were the first to apply few-shot learning to event detection, and introduced two regularization matching losses to improve model performance. Then, Deng et al. (2020) proposed a standard FSED dataset, called FewEvent, and designed DMB-PN, a dynamic memory based network. To introduce external knowledge, Shen et al. (2021) presented AKE-BML based on the Bayesian method, which adopts string matching and human annotation to align the event types to FrameNet. However, these pipeline approaches follow the identification-then-classification process and thus suffer from the error propagation problem. For this reason, joint approaches in FSED have attracted much attention. Cong et al. (2021) first solved FSED with the two sub-processes in a unified manner and proposed PA-CRF based on the sequence tagging method.

Figure 2: The diagram of the KE-PN model.

Figure 3: An example aligning an Arranging event to the Making_arrangements frame in FrameNet.

Figure 4: Visualization of instances and their prototypes for PA-CRF (a) and KE-PN without KESSL (b). The dots denote the instance embeddings of B-EventType and the triangles indicate the prototype representations of B-EventType.

Table 2: Statistics of aligned frames and wrong frames in FewEvent and ACE2005 (columns: Dataset, #Class, #Aligned Frames, #Wrong Frames, #Aligned Frames per Class, #Wrong Frames per Class).

Table 5: Results of the ablation study on 5-way tasks on the dev sets of FewEvent and MAVEN.
Table 4 presents the overall experimental results, where KE-PN outperforms all baselines.

Table 6: Statistics of the relationships in the three datasets.