Adaptive Knowledge-Enhanced Bayesian Meta-Learning for Few-shot Event Detection

Event detection (ED) aims to detect event trigger words in sentences and classify them into specific event types. In real-world applications, ED typically lacks sufficient labeled data and can thus be formulated as a few-shot learning problem. To tackle the issue of low sample diversity in few-shot ED, we propose a novel knowledge-based few-shot event detection method that uses a definition-based encoder to introduce external event knowledge as a knowledge prior over event types. Furthermore, as external knowledge typically provides limited and imperfect coverage of event types, we introduce an adaptive knowledge-enhanced Bayesian meta-learning method that dynamically adjusts the knowledge prior of event types. Experiments show that our method consistently and substantially outperforms a number of baselines, by at least 15 absolute F1 points under the same few-shot settings.


Introduction
Event detection is an important task in information extraction, aiming at detecting event triggers from text and then classifying them into event types (Chen et al., 2015). For example, in "The police arrested Harry on charges of manslaughter", the trigger word is arrested, indicating an "Arrest" event. Event detection has been widely applied in Twitter analysis (Zhou et al., 2017), legal case extraction (de Araujo et al., 2017), and financial event extraction (Zheng et al., 2019), to name a few.
Typical approaches to event detection (Chen et al., 2015; McClosky et al., 2011; Liu et al., 2019) generally rely on large-scale annotated datasets for training. Yet in real-world applications, adequate labeled data is usually unavailable. Hence, methods that generalize effectively from small quantities of labeled samples and adapt quickly to new event types are highly desirable for event detection.
Various approaches have been proposed to enable learning from only a few samples, i.e., few-shot learning (Finn et al., 2017; Snell et al., 2017; Zhang et al., 2018a). Yet few-shot event detection (FSED) has been less studied until recently (Lai et al., 2020a; Deng et al., 2020). Although these methods achieve encouraging progress in the typical N-way-M-shot setting (Figure 1), their performance remains unsatisfactory, as the diversity of examples in the support set is usually limited.
Intuitively, introducing high-quality semantic knowledge, such as FrameNet (Baker et al., 1998), is a potential solution to the insufficient diversity issue (Qu et al., 2020; Tong et al., 2020; Liu et al., 2020). However, as shown in Figure 2, such knowledge-enhanced methods also suffer from two major issues: (1) the incomplete coverage of the knowledge base and (2) the uncertainty caused by the inexact alignment between predefined knowledge and diverse applications.
To tackle the above issues, in this paper, we propose an Adaptive Knowledge-Enhanced Bayesian Meta-Learning (AKE-BML) framework. More specifically, (1) we align the event types between the support set and FrameNet via heuristic rules. (2) We propose encoders that encode the samples and the knowledge base in the same semantic space. (3) We propose a learnable offset for revising the aligned knowledge representations, which builds the knowledge prior distribution over event types and yields the posterior distribution over event type prototype representations. (4) In the prediction phase, we adopt the learned posterior distribution over prototype representations to classify query instances into event types.

Figure 2: An example of FrameNet. Left side: the relation between the frame 'Chatting' and its sub-frame. Right side: the definitions and LUs (Lexical Units) of the frames Chatting and Discussion. The blue words mark the mentions of arguments in the definition. In FrameNet, the definition of a sub-frame is similar to the definition of its super-frame. An external knowledge base can provide rich semantic information, yet it is typically incomplete; for example, a desired frame "online-chat" may be missing.
We conduct comprehensive experiments on the aggregated benchmark dataset for few-shot event detection (Deng et al., 2020). The experimental results show that our method consistently and substantially outperforms state-of-the-art methods. In all six N-way-M-shot settings, our model leads by at least 15 absolute F1 points.

Related Work
Event Detection. Recent event detection methods based on neural networks have achieved good performance (Chen et al., 2015; Sha et al., 2016; Nguyen et al., 2016; Lou et al., 2021). These methods use neural networks to construct the context features of candidate trigger words in order to classify events. Pre-trained language models such as BERT (Devlin et al., 2019) have also become an indispensable component of event detection models (Wadden et al., 2019; Shen et al., 2020). However, neural models rely on large-scale labeled event datasets and fail to predict the labels of new event types. A recent study applied a basic metric-based few-shot learning method to event detection (Lai et al., 2020b). Deng et al. (2020) tackle few-shot event classification with a dynamic memory network. Ontology embedding has also been used in ED to enhance background knowledge. These methods have achieved encouraging results in the few-shot learning setting. However, they do not address the problem of insufficient sample diversity in the support set. Our method leverages the knowledge in FrameNet to augment the support set for event detection.
Few-shot Learning and Meta-learning. Few-shot learning trains a model with only a few labeled samples in a support set and predicts the labels of unlabeled samples in the query set. Various approaches have been proposed to solve the few-shot learning problem, which mainly fall into three categories: (1) metric-based methods (Vinyals et al., 2016; Snell et al., 2017; Garcia and Bruna, 2012; Sung et al., 2018), (2) optimization-based methods (Finn et al., 2017; Nichol et al., 2018; Ravi and Larochelle, 2016), and (3) model-based methods (Yan et al., 2015; Zhang et al., 2018b; Sukhbaatar et al., 2015; Zhang et al., 2018a). However, these methods rely heavily on the support set and suffer from poor robustness caused by its insufficient sample diversity.
Bayesian meta-learning (Ravi and Larochelle, 2016;Yoon et al., 2018) can construct the posterior distribution of the prototype vector through external information outside the support set. The effectiveness of this method has been shown in the few-shot relation extraction task (Qu et al., 2020). It inspires us to solve the problem of insufficient sample diversity in the task of few-shot event detection by introducing external knowledge. However, this method ignores the semantic deviation between knowledge and target types. Specifically, a knowledge base may provide incomplete coverage of target types in a given support set, which leads to inaccurate matching between a target type and knowledge.

Problem Definition
In this paper, the Few-Shot Event Detection (FSED) problem is defined as a typical N-way-M-shot problem. Specifically, a tiny labeled support set S is provided for model training. S contains N distinct event types, and each event type has only M labeled samples, where M is typically small (e.g., M = 5, 10, 15). More precisely, in each FSED task we are given a small support set S = {(x_s, y_s)}.
Let X_S = {x_s}_{s∈S} denote the samples in the support set S, where x_s = (I_s, tt_s): I_s is the sentence of sample x_s and tt_s is its candidate trigger word. We denote by Y_S an ordered list of event types, i.e., Y_S = {y_s}_{s∈S}, where each y_s is the ground-truth event type of sample x_s. For each support set S, we only consider a subset of event types T_S from the entire set of event types T. Hence, in the N-way-M-shot setting, |T_S| = N and |X_S| = |Y_S| = N × M.
Moreover, we assume an external knowledge base F that contains a number of frames. Each frame F_t ∈ F consists of three parts, F_t = (D_t, A_t, L_t), where D_t, A_t and L_t are the definition, arguments, and lexical units (LUs) of the frame, respectively. Please see Appendix A for details of FrameNet.
For each support set S, we are also given a query set Q composed of unlabeled samples X_Q = {x_q}_{q∈Q}, where x_q = (I_q, tt_q), I_q is the sentence of sample x_q, and tt_q is its candidate trigger word. Our goal is to learn a neural classifier over these event types by using the external knowledge and the support set. We will apply the classifier to predict the labels of the query samples in Q, i.e., Y_Q = {y_q}_{q∈Q} with each y_q ∈ T_S. We do this by learning p(Y_Q | X_Q, X_S, Y_S, F).
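The N-way-M-shot episode structure above can be sketched in a few lines of code. This is an illustrative sampler, not the paper's released code; the field names (`sentence`, `trigger`, `event_type`) and the helper `sample_episode` are our own.

```python
import random
from dataclasses import dataclass

@dataclass
class Sample:
    sentence: str     # I: the sentence containing the candidate trigger
    trigger: str      # tt: the candidate trigger word
    event_type: str   # y: the ground-truth event type (known for support samples)

def sample_episode(corpus, n_way, m_shot, q_queries, rng):
    """Draw one few-shot episode: a support set with N event types x M samples
    each, plus a query set drawn from the same N types."""
    types = rng.sample(sorted(corpus), n_way)        # pick N event types
    support, query = [], []
    for t in types:
        picked = rng.sample(corpus[t], m_shot + q_queries)
        support += picked[:m_shot]                   # M labeled shots per type
        query += picked[m_shot:]                     # held-out query samples
    return support, query
```

Here `corpus` maps each event type to its list of samples; in the N-way-M-shot setting the returned support set always has |X_S| = N × M elements.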

Adaptive Knowledge-Enhanced Bayesian Meta-Learning
We now present our adaptive knowledge-enhanced few-shot event detection approach. The overall structure of our method is shown in Figure 3. Our method represents each event type t with a prototype vector v_t, which is then used to classify the query sentences. We use V_{T_S} = {v_t}_{t∈T_S} to denote the collection of prototype vectors for all event types in T_S. The conditional distribution p(Y_Q | X_Q, X_S, Y_S, F) can then be written as:

p(Y_Q | X_Q, X_S, Y_S, F) = ∫ p(Y_Q | X_Q, V_{T_S}) p(V_{T_S} | X_S, Y_S, F) dV_{T_S}.   (1)

To calculate Eq. 1, we first introduce a sample encoder and a knowledge encoder that produce vector representations of the samples and of the knowledge about event types. We then use the sample representations and knowledge representations to construct the adaptive knowledge-enhanced posterior distribution p(V_{T_S} | X_S, Y_S, F) of V_{T_S}, and define the likelihood p(Y_Q | X_Q, V_{T_S}) in terms of V_{T_S} and the sample representations. Finally, we leverage Monte Carlo sampling to approximate the posterior distribution, drawing each prototype sample via stochastic gradient Langevin dynamics (Welling and Teh, 2011), so that the model parameters can be optimized in an end-to-end fashion. We now explain the framework in more detail.

Sample and Knowledge Encoder
The purpose of encoding knowledge is to make up for the lack of diversity and coverage in the support set. We therefore align the knowledge and sample encodings and map them into the same semantic space. Intuitively, the trigger and arguments are the main factors for event detection. Hence, to align the trigger and arguments from samples and from external knowledge, we design two encoders, one for the knowledge and one for the samples, generating a final knowledge encoding h_t and a sample encoding E(x) with the same dimensions.

Knowledge Encoder. Given a knowledge frame F_t = (D_t, A_t, L_t) for event type t, we encode it into a real-valued vector representing the semantics of t. As shown in Figure 2, for a frame F_t, the lexical units L_t characterize the trigger words, the arguments A_t represent the context of the trigger words in samples, and D_t describes the semantic relationship between A_t and t.
For each event type t, the proposed knowledge encoder uses BERT to generate text encodings E_{D_t} and E_{L_t} from the definition D_t and the LUs L_t, respectively. Moreover, the argument encoding E_{A_t} is a sequence of vectors e^(i)_{A_t}, the average token encoding of the i-th argument mention in D_t, which ensures that the encoding of A_t fully captures the semantics of event type t. Then, as shown in Figure 3, the trigger word prior encoding and the argument prior encoding are generated as follows:
• Trigger word prior encoding. We use attention to compute a weighted sum of the words in L_t as the trigger word prior encoding e*_{L_t}. The attention query is E_{D_t}; the keys and values are both E_{L_t}.
• Argument prior encoding. An attention mechanism aggregates the argument information into e*_{A_t}, where the attention query is e*_{L_t} and the keys and values are both E_{A_t}.
Finally, we concatenate the trigger word prior encoding e*_{L_t} and the argument prior encoding e*_{A_t}, and use a feed-forward network f_h to generate the knowledge encoding vector h_t of event type t:

h_t = f_h([e*_{L_t} ; e*_{A_t}]).   (2)

Figure 3: Framework overview. Our method combines both the external knowledge and the support set into a prior distribution over event prototypes. We customize two encoders to generate sample representations and knowledge representations. We then utilize the support set to generate a learnable offset that revises the aligned knowledge representations and yields the prior distribution for prototype representations. Finally, we use Monte Carlo sampling and stochastic gradient Langevin dynamics to draw samples of prototypes for prediction.
Sample encoder. We follow the same strategy to build the sample encoder. Given a sample x = (I, tt), i.e., a candidate trigger word tt and its context I, we first utilize BERT to encode x and select the encoding of tt as the trigger representation e*_tt. As the arguments are not explicitly given in x, we use an attention mechanism to aggregate the implicit argument information for the current trigger tt, in which the query is e*_tt and the keys and values are both the token encodings generated from I. We denote the resulting argument encoding by e*_a. Finally, we concatenate the trigger word encoding e*_tt and the argument encoding e*_a, and use a feed-forward network f_E to generate the sample encoding vector E(x):

E(x) = f_E([e*_tt ; e*_a]).   (3)
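Both encoders reduce to the same attend-then-concatenate pattern. The sketch below shows the sample-encoder side under simplifying assumptions: plain dot-product attention, numpy arrays standing in for BERT token encodings, and a fixed random projection standing in for the trained feed-forward network f_E. The names `attend` and `encode_sample` are ours.

```python
import numpy as np

def attend(query, tokens):
    """Dot-product attention: softmax(tokens . query) weighted sum of tokens."""
    scores = tokens @ query                  # one score per token
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                 # softmax over tokens
    return weights @ tokens                  # aggregated encoding

def encode_sample(trigger_enc, sentence_tokens):
    """e*_tt is the trigger's token encoding; e*_a aggregates implicit argument
    information with the trigger as the attention query; a projection (random
    here, trained f_E in the paper) maps the concatenation to E(x)."""
    e_a = attend(trigger_enc, sentence_tokens)
    concat = np.concatenate([trigger_enc, e_a])
    rng = np.random.default_rng(0)           # placeholder weights for f_E
    W = rng.standard_normal((trigger_enc.size, concat.size))
    return W @ concat
```

The knowledge-encoder side follows the same pattern, with E_{D_t} as the query over E_{L_t}, and then e*_{L_t} as the query over E_{A_t}.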

Adaptive Knowledge-Enhanced Posterior
The posterior distribution can be factorized into a prior distribution (given the event knowledge) and a likelihood on the support set (Qu et al., 2020):

p(V_{T_S} | X_S, Y_S, F) ∝ p(Y_S | X_S, V_{T_S}) p(V_{T_S} | F),   (4)

where p(Y_S | X_S, V_{T_S}) is the likelihood on the support set, and p(V_{T_S} | F) is the adaptive knowledge-based prior for the prototype vectors. We describe these two components as follows.

Adaptive Knowledge-based Prior. As discussed in Section 1, an event type t may not have an exact match in the knowledge base F. In such situations, we resort to the super-ordinate frame of t, which is semantically closest to t. For instance, in Figures 1 and 2, the event type 'online-chat' in the support set is matched to the super-ordinate frame 'Chatting' in FrameNet. In order to enable the knowledge encoding to accurately reflect the characteristics of the corresponding event type, we add a learnable knowledge offset to h_t. We denote the offset between event type t and its knowledge encoding h_t by ∆h_t. Recall that the knowledge in h_t is encoded from the exactly-matched frame or the super-ordinate frame. ∆h_t is defined as:

∆h_t = λ_t ⊙ (m_t − h_t),   (5)

where ⊙ is the element-wise product, and m_t is the mean of the encodings E(x) of all samples x of type t in the support set. λ_t ∈ [0, 1]^{|h_t|} is the adaptive weight (gate), obtained from the sample encoding m_t and the knowledge encoding h_t:

λ_t = σ(W_λ [m_t ; h_t] + b_λ),   (6)

where σ is the sigmoid function, and W_λ and b_λ are trainable parameters.
Putting it all together, the knowledge prior distribution has the form

p(V_{T_S} | F) = ∏_{t∈T_S} N(v_t | h_t + ∆h_t, I),   (7)

where N(v_t | h_t + ∆h_t, I) is a multivariate Gaussian with mean h_t + ∆h_t and covariance I (the identity matrix). Each prototype vector thus has a prior distribution carrying knowledge from FrameNet, adaptively adjusted according to the support set.
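Under our reading of Eqs. 5-6, the gate interpolates element-wise between the knowledge encoding h_t and the mean support encoding m_t, since h_t + λ_t ⊙ (m_t − h_t) = (1 − λ_t) ⊙ h_t + λ_t ⊙ m_t. A minimal sketch of the adapted prior mean, with `W_lam` and `b_lam` standing in for the trainable W_λ and b_λ:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def prior_mean(h_t, m_t, W_lam, b_lam):
    """Mean of the adapted prior N(v_t | h_t + Delta h_t, I)."""
    lam = sigmoid(W_lam @ np.concatenate([m_t, h_t]) + b_lam)  # gate in (0, 1)
    delta = lam * (m_t - h_t)                                  # Delta h_t (Eq. 5)
    return h_t + delta
```

When λ_t → 0 the prior mean stays at the knowledge encoding h_t; when λ_t → 1 it moves fully to the support-set mean m_t.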
Likelihood. Given prototype vectors V_{T_S} distributed according to p(V_{T_S} | X_S, Y_S, F), the likelihood of a sample x belonging to event type t is defined as

p(y = t | x, V_{T_S}) = exp(E(x) · v_t) / Σ_{t'∈T_S} exp(E(x) · v_{t'}).   (8)

The dot product of the sample encoding E(x) and the event type prototype vector v_t estimates their similarity; the softmax normalizes these similarities into a probability distribution over the event types.
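The likelihood above is an ordinary prototype softmax, which can be written directly as:

```python
import numpy as np

def event_type_probs(x_enc, prototypes):
    """Probability of each event type for one query encoding.
    prototypes: (N, d) array of prototype vectors v_t, one row per type."""
    scores = prototypes @ x_enc     # dot-product similarity to each prototype
    scores -= scores.max()          # subtract max for numerical stability
    p = np.exp(scores)
    return p / p.sum()              # softmax over the N event types
```

The predicted event type is simply the argmax of the returned distribution.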

Optimization and Prediction
For prediction, the model computes and maximizes the log-probability log p(Y_Q | X_Q, X_S, Y_S, F). According to Eqn (1), however, this log-probability involves an integral over the prototype vectors, which is difficult to compute. Hence, we estimate it with Monte Carlo sampling (Qu et al., 2020):

p(Y_Q | X_Q, X_S, Y_S, F) ≈ (1 / N_s) Σ_{n=1}^{N_s} p(Y_Q | X_Q, V̂^(n)_{T_S}),   (9)

where N_s is the number of samples, each V̂^(n)_{T_S} is drawn from the posterior p(V_{T_S} | X_S, Y_S, F), and p(Y_Q | X_Q, V̂^(n)_{T_S}) is the likelihood for query samples, which has the same form as Eqn 8. To sample from the posterior, we use stochastic gradient Langevin dynamics (Welling and Teh, 2011) with multiple stochastic updates. Formally, we initialize the sample V̂_{T_S} and iteratively update it as

V̂_{T_S} ← V̂_{T_S} + (ε/2) ∇_{V̂_{T_S}} log [ p(Y_S | X_S, V̂_{T_S}) p(V̂_{T_S} | F) ] + √ε · z,   (10)

where z ∼ N(0, I) and ε is a small real number representing the update step size. The gradient in Eqn 10 balances the effect of the knowledge and the support set on the prototype vectors. Please see Appendix B for derivation details and intuitive explanations of its influence.
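A minimal SGLD loop matching the update rule above. For illustration the posterior is a stand-in Gaussian N(mu, I), whose log-density gradient is mu − v; in the paper the gradient combines the support-set likelihood and the knowledge prior, as derived in Appendix B.

```python
import numpy as np

def sgld(v0, log_post_grad, eps, steps, rng):
    """Stochastic gradient Langevin dynamics: each step adds half the
    log-posterior gradient scaled by eps, plus sqrt(eps)-scaled Gaussian noise."""
    v = v0.copy()
    for _ in range(steps):
        z = rng.standard_normal(v.shape)                     # z ~ N(0, I)
        v = v + (eps / 2.0) * log_post_grad(v) + np.sqrt(eps) * z
    return v
```

Run for enough steps, the chain's samples concentrate around the posterior mode; averaging the per-sample likelihoods over N_s such draws gives the Monte Carlo estimate of Eqn (9).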
The Langevin dynamics requires a burn-in period. To speed up convergence, we follow the previous method (Qu et al., 2020) and initialize each prototype sample v̂_t to m, the mean encoding of all the samples in the support set.

Algorithm 1 (sketch): encode the support set and the knowledge; initialize the prototype samples; update the prototype vectors iteratively (Eqn 10); compute and maximize the log-likelihood (Eqn 9).

After obtaining prototype samples from the posterior, log p(Y_Q | X_Q, X_S, Y_S, F) is approximated end to end according to Eqn (9). During training, we optimize the log-likelihood of the query set and update the model parameters by gradient descent. At prediction time, the log-likelihood determines the probability that a query sample belongs to each event type. The training process is shown in Algorithm 1.

Experiments
We conduct evaluation with the following goals: (1) to compare our adaptive knowledge-enhanced Bayesian meta-learning method with existing fewshot event detection methods and few-shot learning baseline methods; (2) to assess the effectiveness of introducing external knowledge in different Nway-M -shot settings; and (3) to provide empirical evidence that our adaptive knowledge offset can flexibly adjust the impact of the support set and prior knowledge on event prototypes, making the model more accurate and generalizable.

Experimental Settings
We evaluate our method on FewEvent (Deng et al., 2020), an aggregated few-shot event detection dataset. FewEvent combines two widely-used event detection datasets, the ACE-2005 corpus and the TAC-KBP-2017 Event Track data, and adds event types from specific domains including music, film, sports, and education (Deng et al., 2020). In total, FewEvent contains 70,852 samples for 19 event types, which are further divided into 100 event subtypes.
To match the few-shot settings, we use 88 event types covering a total of 15,681 samples to construct the experimental data: 68 event types are selected for training, 10 for validation, and the remaining 10 for testing. There is no overlap between the event types in the training, validation, and test sets. To obtain convincing results, we perform 5 random train/test splits over all event types and report the averaged results.
The comparisons with our AKE-BML are performed along two dimensions, the sample encoder and the few-shot learner. We combine different encoders and few-shot learners to obtain the baseline models. We consider four sample encoders: CNN (Kim, 2014), Bi-LSTM (Huang et al., 2015), DMN (Kumar et al., 2016), and our trigger-attention-based sample encoder TA. For few-shot learners, we consider Matching Networks (MN) (Vinyals et al., 2016) and Prototypical Networks (PN) (Snell et al., 2017). We also compare against the SOTA few-shot event detection method DMN-MPN (Deng et al., 2020), which uses a dynamic memory network (DMN) as the sample encoder and a memory-based prototypical network as the few-shot learner. In addition, to verify the effectiveness of our proposed method, we perform an ablation study that evaluates the model without external knowledge and without dynamic knowledge adaptation.
As a result, the following methods are compared in our experiments:
• AKE-BML, our adaptive knowledge-enhanced Bayesian meta-learning method, which uses the TA encoder as the sample encoder.
• KB-BML, a variant of AKE-BML without dynamic knowledge adaptation.
• TA-BML, a variant of AKE-BML using our TA encoder but without external knowledge.
• DMN-MPN, the dynamic-memory-based prototypical network (Deng et al., 2020).
• Encoder+Learner, combinations of the various sample encoders and few-shot learners (e.g., CNN+MN and TA+PN).
We use stochastic gradient descent (Bottou, 2012) as the optimizer, with a learning rate of 1 × 10^-5. The number of Monte Carlo samples N_s and the update step size ε are set to 10 and 0.01, respectively. The number of stochastic gradient Langevin dynamics updates M is set to 5. We apply dropout (rate 0.5) after the sample encoder and the knowledge encoder to avoid over-fitting. We evaluate event detection performance with F1 and Accuracy scores.

In Table 1, we compare the methods on F1 and Accuracy. We observe the following:
• Our full model AKE-BML outperforms all other methods on both Accuracy and F1 across all settings. Compared with the SOTA method DMN-MPN, AKE-BML achieves a substantial improvement of 15-23 absolute F1 points in all N-way-M-shot settings. This shows that our adaptive knowledge-enhanced Bayesian meta-learning method can effectively utilize external knowledge and adjust it according to the support set, and thus builds better prototypes of event types. Please see Appendices C and D for a detailed performance analysis over the various N-way and M-shot settings.
• With the sample encoder (Bi-LSTM, CNN, DMN, or TA) fixed, prototypical networks (PN) consistently outperform matching networks (MN). DMN-MPN performs better than the PN-based methods, because the dynamic memory network can extract key information from the support set through multiple iterations. However, DMN-MPN only considers the information of a few samples in each support set, and hence suffers from insufficient sample diversity, similar to the PN- and MN-based methods.
• TA-BML performs similarly to DMN-MPN under the N-way-5-shot and N-way-10-shot settings, but slightly worse under the N-way-15-shot setting. One possible explanation is that when the support set contains more samples, MPN can generate higher-quality prototypes.
In addition, TA-BML performs worse than KB-BML, which shows the importance of introducing external knowledge.
• Compared with KB-BML, our full model AKE-BML can effectively resolve the deviation between knowledge and event types, and generates event prototypes that generalize better through knowledge. Compared with TA-BML, which does not incorporate external knowledge, AKE-BML achieves an even larger performance advantage, which further demonstrates the effectiveness of external knowledge.

Case Study
We present a case study on the dynamic knowledge adaptation between the support set and the corresponding event knowledge, to demonstrate our model's ability to learn robust event prototypes.

Table 1: Accuracy and F1 scores of all compared methods under the 5-way and 10-way, 5/10/15-shot settings. § denotes results taken directly from the original paper (Deng et al., 2020), due to the unavailability of the source code.

Predictions for Specific Cases
We

Visualization of Prototypes
We use Latent Dirichlet Allocation (LDA) (Blei et al., 2003) to reduce the dimensionality of the prototypes, sample encodings, and prior knowledge encodings. Figure 4 visualizes five event type prototypes (large solid shapes), their aligned frames in FrameNet (large solid shapes with circle outlines), and some corresponding samples (small solid shapes). Each event type and its samples are coded with the same color.

Figure 4: Visualization of event prototypes, prior knowledge, and event samples learned by AKE-BML in the 5-way-5-shot setting. The large solid shapes denote event prototypes, the large shapes with circle outlines denote the prior knowledge, and the small shapes denote samples. Samples are marked by the color of their corresponding event types. The arrows indicate the adaptation of prior knowledge toward the prototype. Note that Music.Compose and Film.Film-Production share the same frame Behind the scenes.
In general, the samples and prototypes belonging to one event type are close in the space and different event types are far away from each other. Prior knowledge is distributed in different places in the space, which roughly determines the distribution of event prototypes. For example, the samples of Life.Pregnancy and Sports.Fair-Play are close to their respective event prototypes. Meanwhile, the distances between their prior knowledge is large, making their prototypes easily distinguishable.
It can also be seen that the event prototypes are closer to their samples than to the prior knowledge, which reflects the benefits of our proposed learnable knowledge offset. The visualization demonstrates the effectiveness of introducing external  knowledge and our adaptive knowledge offset's ability to balance the impact of the support set and prior knowledge on the event prototypes.

λ t of Different Event Types
As shown in Formula (5), we use the learnable parameter λ_t to generate knowledge offsets. λ_t accounts for the deviation of the prior knowledge (i.e., a frame) from the event type it represents, and adaptively corrects this deviation using information from the support set. When the frame corresponding to the event type accurately expresses its semantics, the λ_t value should be small. When the knowledge is the super-ordinate frame of the event type (i.e., the frame cannot accurately describe the event semantics), the λ_t value should be large, so that the support set can be used to modify the prior knowledge and ensure that the prototype precisely represents the current event type. Table 3 shows four event types, their corresponding frames, and their λ_t values. The λ_t of Conflict.Attack is small, at 0.132, as the event type Conflict.Attack closely matches the frame Attack. The event type Contact.Letter-Communication matches the frame Communication. Communication does not contain the semantics of "by writing letters", but its core semantics is the same as Contact.Letter-Communication. Therefore, its λ_t is small, at 0.228, though still larger than the λ_t of Conflict.Attack. The event types Film.Film-Production and Music.Compose share the same super-ordinate frame Behind the scenes as prior knowledge, but the semantics of Behind the scenes is too abstract for them. Accordingly, the λ_t values of these two event types are relatively large: 0.386 for Film.Film-Production and 0.421 for Music.Compose.
The above cases demonstrate that our model is able to balance the influence of the support set and the knowledge on event prototypes through λ t , and consequentially obtain highly accurate and generalizable prototypes.

Conclusion
In this paper, we proposed an Adaptive Knowledge-Enhanced Bayesian Meta-Learning (AKE-BML) method for few-shot event detection. We alleviate the insufficient sample diversity problem in few-shot learning by leveraging the external knowledge base FrameNet to learn prototype representations for event types. We further tackle the uncertainty and incompleteness issues in knowledge coverage with a novel knowledge adaptation mechanism.
The comprehensive experimental results demonstrate that our proposed method substantially outperforms state-of-the-art methods, achieving an improvement of at least 15 absolute F1 points. In the future, we plan to extend our proposed AKE-BML method to the few-shot event extraction task, which considers both event detection and argument extraction. We also plan to explore the zero-shot and incremental event extraction scenarios.

A FrameNet
An important problem in the few-shot event detection task is the insufficient diversity of support set samples. There are only a few labeled samples in the support set, leaving the model unable to construct high-quality prototype features of event types. To address this problem, we introduce FrameNet (Baker et al., 1998) as an external knowledge base of event types. FrameNet is a linguistic resource storing information about lexical and predicate-argument semantics. Each frame in FrameNet can be taken as a semantic frame of an event type, which can serve as background knowledge for event types to assist event detection (Fillmore et al., 2006). Figure 2 shows an example frame defining Attack, from which we can see the arguments involved in an Attack event and their roles. The lexical units (LUs) of the frame Attack are the possible trigger words for the corresponding event. The frame is thus an important complementary source of knowledge to the support set. We match a frame in FrameNet to each event type, based on the event name, as its knowledge. In practice, FrameNet does not provide complete coverage of all event types, nor does every event type have an exactly matching frame in FrameNet. For event types that cannot be exactly matched, we assign the frame corresponding to their super-ordinate event. For example, there is no corresponding frame for Contact.Online-Chat, so we assign it the frame Chatting, which corresponds to the event type Contact.Chat.
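The fallback matching described above amounts to a walk up the event-type hierarchy. The sketch below uses toy stand-ins for FrameNet and the event ontology; the tables `FRAMES` and `PARENT` and the helper `match_frame` are illustrative, not the paper's actual heuristic rules.

```python
# Toy frame table: event types with an exactly-matching FrameNet frame.
FRAMES = {"Contact.Chat": "Chatting", "Conflict.Attack": "Attack"}
# Toy ontology: each event type's super-ordinate (parent) event type.
PARENT = {"Contact.Online-Chat": "Contact.Chat"}

def match_frame(event_type):
    """Return the frame for an event type, falling back to the frame of its
    super-ordinate event type when no exact match exists."""
    t = event_type
    while t is not None:
        if t in FRAMES:
            return FRAMES[t]
        t = PARENT.get(t)      # climb to the super-ordinate event type
    return None                # no frame found anywhere up the hierarchy
```

With these tables, Contact.Online-Chat resolves to the frame Chatting via its parent Contact.Chat, mirroring the example in the text.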

B Gradient of posterior distribution
To show how the prototype vector changes after adding the knowledge offset, we expand the gradient ∇_{V̂_{T_S}} log [ p(Y_S | X_S, V̂_{T_S}) p(V̂_{T_S} | F) ] used in the Langevin update. For ease of explanation, we only consider the gradient with respect to a single prototype vector v̂_t. We denote the gradient under the original (non-adapted) posterior by g^o_{v̂_t}, and the gradient under the knowledge-adapted posterior by g^s_{v̂_t}.

We first calculate g^o_{v̂_t}:

g^o_{v̂_t} = g^{o,l}_{v̂_t} + g^{o,p}_{v̂_t},

where g^{o,l}_{v̂_t} = ∇_{v̂_t} log ∏_{s∈S, y_s=t} p(y_s | x_s, v̂_t) is the gradient of the log-likelihood on the support set, and g^{o,p}_{v̂_t} is the gradient of the logarithm of the prior. Since the non-adapted prior is N(v̂_t | h_t, I),

log N(v̂_t | h_t, I) = −(1/2) ||v̂_t − h_t||^2 + C,   where C = −(d/2) log(2π)

is a constant and d is the dimension of the prototype, so

g^{o,p}_{v̂_t} = h_t − v̂_t.

The gradient of the log-likelihood with respect to v̂_t is

g^{o,l}_{v̂_t} = Σ_{s∈S, y_s=t} (1 − p^(t)_s) E(x_s),

where p^(t)_s = p(y_s | x_s, v̂_t) is the probability of correctly classifying sample s in the support set. We therefore obtain

g^o_{v̂_t} = Σ_{s∈S, y_s=t} (1 − p^(t)_s) E(x_s) + h_t − v̂_t.

We then calculate g^s_{v̂_t}. The only difference is that g^s_{v̂_t} uses the knowledge-adapted prior N(v̂_t | h_t + ∆h_t, I). As in the original posterior gradient, the likelihood term is unchanged, g^{s,l}_{v̂_t} = Σ_{s∈S, y_s=t} (1 − p^(t)_s) E(x_s), while the gradient of the logarithm of the knowledge-adapted prior is

g^{s,p}_{v̂_t} = h_t + ∆h_t − v̂_t = (1 − λ_t) ⊙ h_t + λ_t ⊙ m_t − v̂_t,

where 1 is the |h_t|-dimensional all-ones vector. We thus get

g^s_{v̂_t} = Σ_{s∈S, y_s=t} (1 − p^(t)_s) E(x_s) + (1 − λ_t) ⊙ h_t + λ_t ⊙ m_t − v̂_t.

Substituting m_t = (1/M) Σ_{s∈S, y_s=t} E(x_s) into the above formula, we get

g^s_{v̂_t} = Σ_{s∈S, y_s=t} [ (1 − p^(t)_s) E(x_s) + (λ_t / M) ⊙ E(x_s) ] + (1 − λ_t) ⊙ h_t − v̂_t.

Comparing g^s_{v̂_t} with g^o_{v̂_t}, we see that the posterior without knowledge adaptation cannot dynamically balance the influence of the knowledge and the support set on the prototype vector, whereas the knowledge-adapted posterior adjusts their contributions through λ_t. The parameters in Eqn 6 are updated via the log-likelihood on the query set. This allows the model to reasonably weigh the knowledge against the support set, and to obtain prototype vectors with better generalization.

C M -shot Evaluation
In this section, we illustrate the effectiveness of adaptive knowledge-enhanced Bayesian meta-learning under different M-shot settings: N-way-5-shot, N-way-10-shot, and N-way-15-shot. As shown in Table 1 in the main paper, as M increases, the performance of all models improves, which shows that increasing the number of samples in the support set provides more pertinent event-type-related features. At the same time, going from 15-shot down to 5-shot, the previous methods suffer a significantly larger performance degradation than AKE-BML. This observation shows our model's strong robustness against low sample diversity, owing to the incorporation of external knowledge.
The performance of KB-BML is close to that of DMN-MPN in the N-way-5-shot and N-way-10-shot cases, and better in the N-way-15-shot case. This can be attributed to two factors: (1) the introduction of knowledge can improve the generalization of event prototypes; and (2) increasing the number of samples can reduce the impact of the deviation between knowledge and event types. When the support set is sufficiently large, its samples can compensate for the deviation between knowledge and event types, and the knowledge can also improve the generalization of the prototype vectors. However, when M is small, the deviation between knowledge and event types will affect the quality of the prototype vectors.
AKE-BML can well balance the effects of samples and knowledge on the event type prototypes. It can be seen that when M is small, the performance of AKE-BML does not decline as quickly as other models, which also proves the effectiveness of knowledge in dealing with the problem of insufficient diversity of the support set. At the same time, compared with KB-BML, our adaptive knowledge offset can effectively use the information in the support set to correct the knowledge deviation.

D N -Way Evaluation
Figure 5 also illustrates model performance with respect to different way values (i.e., N), while fixing the shot values. It can be seen from the figure that when N increases, the performance of previous models decreases faster than that of AKE-BML, which shows that those models, relying only on the support set, cannot generate sufficiently recognizable event prototypes. The performance of KB-BML also declines significantly when N increases. This is because many event types can only be partially aligned in FrameNet, to a super-ordinate frame, which causes their event prototypes to be indistinguishable from those of similar event types.
On the contrary, the performance of AKE-BML does not decrease significantly when N increases, which shows that our adaptive knowledge-enhanced Bayesian meta-learning method can enhance the distinguishability of prototype vectors through the learnable knowledge offset. These results indicate that our adaptive knowledge-enhanced Bayesian meta-learning is more robust to changes in the number of ways.