Learning Prototype Representations Across Few-Shot Tasks for Event Detection

We address the sampling bias and outlier issues in few-shot learning for event detection, a subtask of information extraction. We propose to model the relations between training tasks in episodic few-shot learning by introducing cross-task prototypes. We further propose to enforce prediction consistency among classifiers across tasks to make the model more robust to outliers. Our extensive experiments show consistent improvements on three few-shot learning datasets. The findings suggest that our model is more robust when labeled data for novel event types is limited. The source code is available at http://github.com/laiviet/fsl-proact.


Introduction
In Information Extraction, Event Detection (ED) is an important task that aims to identify and classify event triggers of predefined event types in text (Walker et al., 2006). Event triggers are words/phrases that most clearly indicate the occurrence of events. For example, an event detector should recognize the word homicide in the following sentence as a trigger word of event type life.die.death-caused-by-violent-events: ... the medical examiner believed the manner of death was an accident rather than a homicide.
Typical ED systems follow a supervised learning scheme that requires a large amount of labeled data for each predefined event type (Ji and Grishman, 2008; Nguyen and Grishman, 2015; Chen et al., 2015; Nguyen et al., 2021). Unfortunately, this requirement is usually too costly to achieve in real applications where novel event types emerge and only a few examples are available (Huang et al., 2018). As such, an ED model should be prepared to extract triggers of novel event types (i.e., beyond those provided in the training data) for which only a few examples are provided. This learning schema is known as Few-Shot Learning (FSL) for ED.
To emulate the learning from few examples in ED, N-way K-shot episodic training is often used to exploit existing datasets (Lai et al., 2020b; Deng et al., 2020; Lai et al., 2020a, 2021). In each training iteration, a small subset (i.e., support set) of N event types with K examples per type is sampled from the training data. Unfortunately, the sample size is so small (K ∈ [1, 10]) that FSL models might suffer from sample bias, thus hindering generalization to novel event types.
The prototypical network is a popular metric-based few-shot learning model (Snell et al., 2017) that has been explored for FSL ED (Lai et al., 2020b; Deng et al., 2020). It introduces a prototype vector for each event type by averaging the representations of the instances of that type. A non-parametric classifier then predicts the event type of a query instance based on its distances to the prototypes (Snell et al., 2017). Hence, an outlier in the support set might significantly shift the prototypes and flip the label of a query instance. In addition, in ED, a NULL class is introduced to represent non-eventive mentions. This class covers every domain and every surface form except the relevant event types. Thus, this unbounded class might also present a great source of outliers for the support set.
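The prototypical classifier described above can be sketched in a few lines of PyTorch. This is a minimal illustration of the standard Snell et al. (2017) formulation, not the authors' released code; the function names are ours.

```python
import torch
import torch.nn.functional as F

def prototypes(support: torch.Tensor, labels: torch.Tensor, n_classes: int) -> torch.Tensor:
    """Average the support embeddings of each class to obtain its prototype."""
    return torch.stack([support[labels == j].mean(dim=0) for j in range(n_classes)])

def proto_log_probs(query: torch.Tensor, protos: torch.Tensor) -> torch.Tensor:
    """Label distribution: softmax over negative squared Euclidean distances
    from each query vector to the class prototypes."""
    d = torch.cdist(query, protos) ** 2          # (n_query, n_classes)
    return F.log_softmax(-d, dim=1)
```

Because each support vector contributes equally to the mean, a single outlier can drag a prototype toward the wrong region of the embedding space, which is the failure mode the paper targets.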
In this paper, we mitigate the effects of poor sampling and outliers by modeling cross-task relations. First, we propose to augment the support data of the current task with data from prior tasks, which increases the effective size of the current support set and thereby mitigates sample bias. Second, since the averaging in the prototypical network lets outliers contribute equally to the prototype representation, we propose to use soft-attention to select the most related data samples and reduce the contribution of outliers to the prototype. Third, an FSL model that is resistant to outliers should produce consistent predictions regardless of the support data. To implement this, we build two prototype-based classifiers from the support sets of the two tasks and then enforce the consistency of their predictions on the query instances.

Model
Preliminary: In this paper, the event detection problem is formulated as an (N+1)-way K-shot episodic few-shot learning problem (Vinyals et al., 2016; Lai et al., 2020b). The model is given two sets of data: a support set $S$ of labeled data and a query set $Q$ of unlabeled data. $S$ consists of $(N+1) \times K$ data points, where $N$ is the number of positive event types and $K$ is the number of samples per event type. The model is supposed to predict the labels of the data in the query set based on the observation of the novel event types given in the support set. Formally, an FSL task with a support set and a query set is defined as:

$T = (S, Q), \quad S = \{(s_i^j, a_i^j, y^j) \mid i \in [1, K],\ j \in [0, N]\}, \quad Q = \{(s_q, a_q, y_q)\}$

where a data point $(s_i^j, a_i^j, y^j)$ denotes a sentence $s_i^j$ with trigger candidate $a_i^j$ and event type $y^j$. Similar to prior studies in event detection, we add $y^0 = NULL$ to represent the non-eventive type.
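The episodic sampling described above can be sketched as follows. This is an illustrative sketch of (N+1)-way K-shot episode construction under our own naming assumptions (the `dataset` layout and `q_size` parameter are not from the paper):

```python
import random

def sample_episode(dataset, n_way: int, k_shot: int, q_size: int):
    """Sample an (N+1)-way K-shot episode: N event types plus the NULL class.

    `dataset` maps each label to a list of (sentence, trigger_candidate)
    pairs; the key "NULL" holds non-eventive trigger candidates.
    """
    event_types = random.sample([y for y in dataset if y != "NULL"], n_way)
    classes = event_types + ["NULL"]
    support, query = [], []
    for y in classes:
        picks = random.sample(dataset[y], k_shot + q_size)
        support += [(s, a, y) for (s, a) in picks[:k_shot]]
        query += [(s, a, y) for (s, a) in picks[k_shot:]]
    return support, query
```

The support set thus contains $(N+1) \times K$ labeled instances, matching the formulation above.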
During training, development, and testing, the task T is sampled from three sets of data D train , D dev , and D test whose sets of classes are Y train , Y dev , and Y test , respectively. These sets of classes are mutually disjoint to ensure that the model observes no more than K examples from a novel class.
A typical FSL model has two main modules: an encoder and a few-shot classifier. The encoder, denoted $\phi$, encodes an instance into a fixed-dimension vector $v_i^j = \phi(s_i^j, a_i^j) \in \mathbb{R}^u$, where $u$ is the dimension of the representation vector. The few-shot classifier classifies a query instance among the classes appearing in the support set. For instance, in a prototypical network, a prototype $v^j$ is a class-representative vector obtained by averaging all vectors of the $j$-th class:

$v^j = \frac{1}{K} \sum_{i=1}^{K} v_i^j$

Then the distance-based label distribution of a query instance $q = (s_q, a_q, y_q)$ is (Snell et al., 2017):

$P(y = j \mid q) = \frac{\exp(-d(\phi(s_q, a_q), v^j))}{\sum_{j'} \exp(-d(\phi(s_q, a_q), v^{j'}))} \quad (1)$

The training minimizes the cross-entropy loss, denoted $L_{ce}$, over all query instances:

$L_{ce} = -\frac{1}{|Q|} \sum_{q \in Q} \log P(y = y_q \mid q)$

Cross-task data augmentation: In conventional episodic training, two consecutive training tasks $T_1$ and $T_2$ are unlikely to share an identical set of event types, i.e., $Y_1 \neq Y_2$. We assume that our training process has a memory that saves the latest samples of every event type used in prior tasks. Using this memory, after a certain number of training iterations, for a new task $T_1$, a second task $T_2$ can always be sampled from the memory such that $Y_2 = Y_1$. Based on 1M simulations, the expected delay for the 5-way setting is 13 iterations (stdev = 4) on the ACE dataset and 98 iterations (stdev = 24) on the RAMS dataset.
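The expected delay above can be checked with a back-of-the-envelope simulation. This sketch makes simplifying assumptions (classes are sampled uniformly, and a class is "in memory" once any earlier episode used it), so its estimates need not exactly match the paper's figures:

```python
import random

def expected_delay(n_classes: int, n_way: int, trials: int = 10_000) -> float:
    """Estimate how many episodes must pass before every class of a fresh
    n_way task has appeared in at least one earlier episode (i.e. is in the
    memory). Illustrative class counts: 33 for ACE, 139 for RAMS."""
    total = 0
    for _ in range(trials):
        target = set(random.sample(range(n_classes), n_way))
        seen, steps = set(), 0
        while not target <= seen:
            seen |= set(random.sample(range(n_classes), n_way))
            steps += 1
        total += steps
    return total / trials
```

With `n_classes=33` (ACE) and `n_way=5`, this toy model gives an expectation in the low teens, broadly consistent with the reported value of 13.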
Prototype Across Task: We are given two tasks $T_1 = (S_1, Q_1)$ and $T_2 = (S_2, Q_2)$ sampled with the same set of event types $Y$, and the prototypes are induced from both tasks. An attention module, denoted $att$, induces intermediate representations for the support and query instances of $T_1$ via weighted sums of the support vectors of $T_2$, and vice versa:

$\hat{H}^{(1)} = att(E^{(1)}, E^{S_2}), \qquad \hat{H}^{(2)} = att(E^{(2)}, E^{S_1}) \quad (2)$

The final representations for both tasks are then the sum of their original representations and the cross-task representations: $H^{(\cdot)} = E^{(\cdot)} + \hat{H}^{(\cdot)}$. The prototypes for tasks $T_1$ and $T_2$ are computed by averaging vectors of the same class from $H^{S_1}$ and $H^{S_2}$, respectively (Snell et al., 2017).

Cross Task Consistency: The Cross Task Consistency (CTC) loss further reduces sample bias by introducing prediction consistency between the classifiers generated from the two tasks. Without loss of generality, we assume that one of the classifiers is impaired by poor sampling. We employ the knowledge distillation technique (Hinton et al., 2015), which transfers knowledge from the stronger classifier to the weaker one, thus making the model more robust to sample bias. We enforce cross-task consistency by minimizing the difference between the predicted label distributions of the two tasks' classifiers:

$L_{ctc} = \sum_{q \in Q_1 \cup Q_2} KL\big(f_{S_1}(q) \,\|\, f_{S_2}(q)\big) \quad (3)$

where $f_S$ is a prototypical classifier built from a support set $S$ and $KL$ denotes the Kullback-Leibler divergence. Finally, to train the model, we minimize the total loss, where $\alpha$ is a hyper-parameter:

$L = L_{ce} + \alpha L_{ctc}$

Testing: As the model does not have access to a prior task for the novel classes, the prototypes are computed from the vectors of the current task only. Hence, at test time the model reduces to the original Prototypical Network (Snell et al., 2017). Our proposed methods apply only to the training process, so the comparison with prior FSL ED models is fair.
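The cross-task attention and the consistency loss can be sketched as below. This is our own minimal reading of the two components, not the released implementation: the dot-product attention form and the symmetric direction of the KL term are assumptions (the paper does not fully specify either).

```python
import torch
import torch.nn.functional as F

def cross_task_repr(e: torch.Tensor, support_other: torch.Tensor) -> torch.Tensor:
    """Residual soft-attention over the other task's support vectors:
    H = E + softmax(E S^T) S. Outliers in S receive low attention weight."""
    attn = F.softmax(e @ support_other.t(), dim=-1)   # (n, m)
    return e + attn @ support_other                   # (n, u)

def consistency_loss(logp1: torch.Tensor, logp2: torch.Tensor) -> torch.Tensor:
    """Symmetric KL between the two classifiers' predicted distributions
    (both inputs are log-probabilities over the same classes)."""
    p1, p2 = logp1.exp(), logp2.exp()
    return (F.kl_div(logp2, p1, reduction="batchmean")
            + F.kl_div(logp1, p2, reduction="batchmean"))
```

At training time the total loss would combine the cross-entropy of both tasks with `alpha * consistency_loss(...)`, mirroring the total loss above.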

Experiment
Dataset: We evaluate the proposed model on three event detection datasets. RAMS is a recently released large-scale dataset; it provides 9124 human-annotated event triggers for 139 event subtypes (Ebner et al., 2020). ACE is a benchmark dataset in event extraction with 33 event subtypes (Walker et al., 2006). LR-KBP is a large-scale event detection dataset for FSL. It merges the ACE-2005 and TAC-KBP datasets and extends some event types by automatically collecting data from Freebase and Wikipedia (Deng et al., 2020).

FSL setting: We evaluate the model in 5+1-way 5-shot and 10+1-way 10-shot FSL settings. As it has been observed that training with more classes helps improve model performance, we use 18+1 classes during training, while keeping 5+1 and 10+1 novel classes during testing.

Baseline: We consider three strong baselines for FSL ED. Proto features a prototype for each novel class and a Euclidean distance function, as presented in Equation 1 (Snell et al., 2017). InterIntra is an extension of the prototypical network with two auxiliary training signals: it minimizes the distances among data points of the same class and maximizes the distances among prototypes (Lai et al., 2020b). DMB-Proto extends the prototypical network so that the representation vector for each data point is induced by a dynamic memory network running on the data of the same class (Deng et al., 2020). Since the source code of DMB-Proto is not published, we reimplement the few-shot classifier with a dynamic memory module (Xiong et al., 2016). We examine two state-of-the-art BERT-based sentence encoders $\phi$ for ED, i.e., BERTMLP (Yang et al., 2019) and BERTGCN (Lai et al., 2020c).
Hyperparameters: In this paper, a stochastic gradient descent optimizer is used with learning rate 1e-4. Training and evaluation are set to 6,000 and 500 iterations respectively; evaluation is done after every 500 training iterations. The dimension of the final representation is set to 512. We use a dropout rate of 0.5 to prevent overfitting. The coefficient of the cross-task consistency loss is set to α = 10 based on the best development performance (α ∈ {1, 10, 100, 1000}).
We evaluate our ED model using the micro F1-score. Training and evaluation are done on a single Nvidia GTX 2080Ti with 11GB of GPU RAM and take approximately 4 hours. We implement the model using PyTorch version 1.6.0.
Result: Table 2 reports the F-scores on the development and test sets. First, ProAcT consistently improves over the baselines, including on the 10-shot setting. Second, the F-score margin between ProAcT and Proto decreases as the number of shots increases, indicating that the proposed model performs better when the number of observed samples is small; as the number of shots increases, the improvement saturates. This finding is in line with the fact that sample bias is more likely when the number of shots is small. Hence, our proposed method is particularly suitable for event detection in the few-shot learning schema, especially when the number of shots is limited.
Ablation study: Our proposed model involves three factors: the cross-task data (data), the cross-task attentive prototype (attention), and the cross-task consistency (consistency). To analyze the contribution of these modules, we incrementally eliminate them from the full ProAcT model and evaluate the remaining model in the 5+1-way 5-shot setting. If attention and consistency are removed while data remains, the model becomes a prototypical network trained in a 5+1-way 10-shot setting; this model sees the same amount of support data as ours during training. Note that testing with novel classes remains 5+1-way 5-shot for every model. If the cross-task data is eliminated, the attentive prototype and consistency loss are also removed, and the model returns to a prototypical network with a 5+1-way 5-shot setting.

Table 3 reports the performance in the 5+1-way 5-shot FSL setting on RAMS with the BERTGCN encoder. As shown in the table, removing any single module decreases performance by 0.8%-1.3%. When both attention and consistency are eliminated, performance drops by 2.3%; a further drop of 2.4% is seen if the cross-task data is also eliminated. This suggests that the improvement originates from the use of cross-task data, the attention for prototype computation, and the consistency of cross-task predictions.

Table 3: Ablation study of our proposed components in the 5+1-way 5-shot setting on the RAMS dataset with the BERTGCN encoder. P, R, and F denote precision, recall, and F-score.
Analysis: To further analyze the efficiency of our proposed method, we aim to discover which classes benefit the most. To do so, we compute two confusion matrices, one for the ProAcT model and one for the Proto model, on the test set of RAMS. We fix the random seed so that the sampling during testing is identical between the two runs, ensuring that the class proportions are identical. Figure 1 presents the difference of the two confusion matrices produced by the proposed model ProAcT and the prototypical network Proto. There are two major observations. First, overall, ProAcT produces more accurate predictions than Proto, as shown on the diagonal. Second, ProAcT produces remarkably more correct predictions for negative examples than Proto. Meanwhile, it makes significantly fewer false-positive and false-negative errors related to the NULL class (the Other class in Figure 1), suggesting that our proposed model effectively mitigates the effect of noise introduced by the NULL class.
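The confusion-matrix difference in this analysis is straightforward to compute. A minimal NumPy sketch (function names are ours, not from the paper's code):

```python
import numpy as np

def confusion(preds, golds, n_classes: int) -> np.ndarray:
    """Confusion matrix with gold labels on rows and predictions on columns."""
    m = np.zeros((n_classes, n_classes), dtype=int)
    for p, g in zip(preds, golds):
        m[g, p] += 1
    return m

def diff_matrix(preds_a, preds_b, golds, n_classes: int) -> np.ndarray:
    """Difference of confusion matrices for two models on the same test set.
    Positive diagonal cells: model A is more accurate for that class;
    negative off-diagonal cells: model A makes fewer of that error."""
    return confusion(preds_a, golds, n_classes) - confusion(preds_b, golds, n_classes)
```

Fixing the random seed, as done above, guarantees both models see the same gold labels, so the two matrices are directly comparable cell by cell.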
Related Work

FSL has been extensively studied in computer vision (Vinyals et al., 2016; Snell et al., 2017; Finn et al., 2017; Lee et al., 2019; Fei et al., 2021). Recent work has also considered FSL for tasks in natural language processing (Han et al., 2018; Bao et al., 2020). For ED, prior FSL work has mostly relied on the prototypical network (Lai et al., 2020b; Deng et al., 2020). However, these models do not explore cross-task modeling as we do.

Conclusion
In this paper, we propose to exploit the relationship between training tasks for few-shot learning event detection. We compute prototypes based on cross-task modeling and present a regularization that enforces prediction consistency of classifiers across tasks. The experimental results show that exploiting cross-task relations can alleviate poor sampling and outliers in the support set for FSL in ED. In the future, we will extend our method to other tasks in information extraction, such as named entity recognition and argument extraction.

Figure 1: The difference of the confusion matrices between the ProAcT and Proto models. On the main diagonal, a positive value implies that ProAcT predicts more accurately than Proto, whereas on the rest of the matrix, a negative value indicates that ProAcT makes fewer errors than Proto. Visually, a green cell indicates that ProAcT is more accurate than Proto; red cells mark the cases where Proto is better than ProAcT.