Unleash GPT-2 Power for Event Detection

Event Detection (ED) aims to recognize mentions of events (i.e., event triggers) and their types in text. Recently, several ED datasets in various domains have been proposed. However, the major limitation of these resources is the lack of sufficient training data for individual event types, which hinders the efficient training of data-hungry deep learning models. To overcome this issue, we propose to exploit the powerful pre-trained language model GPT-2 to generate training samples for ED. To prevent the noise that is inevitable in automatically generated data from hampering the training process, we propose a teacher-student architecture in which the teacher learns anchor knowledge from the original data. The student is then trained on the combination of the original and GPT-generated data while being guided by the anchor knowledge from the teacher. Optimal transport is introduced to facilitate the anchor knowledge-based guidance between the two networks. We evaluate the proposed model on multiple ED benchmark datasets, gaining consistent improvement and establishing state-of-the-art results for ED.


Introduction
An important task of Information Extraction (IE) involves Event Detection (ED) whose goal is to recognize and classify words/phrases that evoke events in text (i.e., event triggers). For instance, in the sentence "The organization donated 2 million dollars to humanitarian helps.", ED systems should recognize "donated" as an event trigger of type Pay. We differentiate two subtasks in ED, i.e., Event Identification (EI): a binary classification problem to predict if a word in text is an event trigger or not, and Event Classification (EC): a multi-class classification problem to classify event triggers according to predefined event types.
Several methods have been introduced for ED, extending from feature-based models (Ahn, 2006; Liao and Grishman, 2010a) to advanced deep learning methods (Nguyen et al., 2016c; Zhang et al., 2020b; Nguyen et al., 2021). Although deep learning models have achieved substantial improvement, their requirement of large training datasets, together with the small sizes of existing ED datasets, constitutes a major hurdle to building high-performing ED models.
Recently, there have been some efforts to enlarge training data for ED models by exploiting unsupervised or distantly supervised (Araki and Mitamura, 2018) techniques. The common strategy in these methods is to exploit unlabeled text data that are rich in event mentions to aid the expansion of training data for ED. In this work, we explore a novel approach for training data expansion in ED by leveraging the existing pre-trained language model GPT-2 (Radford et al., 2019) to automatically generate training data for models. Motivated by the promising performance of GPT models for text generation, we expect our approach to produce effective data for ED in different domains. Specifically, we aim to fine-tune GPT-2 on existing training datasets so that it can generate new sentences annotated with event triggers and/or event types, serving as additional training data for ED models. One way to realize this idea is to explicitly mark event triggers along with their event types in the sentences of an existing ED dataset, which can then be used to fine-tune the GPT model for new data generation. However, one issue with this direction is that, in existing ED datasets, the number of examples for some rare event types might be small, potentially leading to poor fine-tuning of GPT and impairing the quality of generated examples for such rare events. In addition, the large number of event types in some ED datasets might make it more challenging for the fine-tuned GPT to differentiate event types and produce high-quality data. Therefore, instead of directly generating data for ED, we propose to use GPT-2 to generate samples only for the event identification task, which simplifies the generation and yields data with more reliable labels (i.e., output sentences are only marked with the positions of event triggers).
As such, to effectively leverage the generated EI data to improve ED performance, we propose a multi-task learning framework to train the ED models on the combination of the generated EI data and the original ED data. In particular, for every event trigger candidate in a sentence, our framework seeks to perform two tasks, i.e., EI to predict a binary label for being an event trigger or not, and ED to predict the event type (if any) evoked by the word via a multi-class classification problem. An input encoder is shared by both tasks, which allows training signals from both the generated EI data and the original ED data to contribute to the representation learning in the encoder (i.e., transferring knowledge in the generated EI data to ED models).
Despite the simplification to EI for better annotated labels, the generated sentences might still be noisy due to the inherent nature of language generation, e.g., ungrammatical sentences, inconsistent information, or incorrect event trigger annotations. As such, it is crucial to introduce mechanisms that filter the noise in the generated data to enable effective transfer learning from the generated EI data. To this end, prior work on GPT-based data generation for other tasks has attempted to directly remove noisy generated examples via heuristic rules before using them for model training (Anaby-Tavor et al., 2020). However, heuristic rules are brittle and restricted in their coverage, so they might overly filter the generated data or incorrectly retain some noisy generated samples. To address this issue, we propose to preserve all generated data for training and devise methods to explicitly limit the impact of noisy generated sentences on the models. In particular, we expect that including the generated EI data in the training process for ED models might help to shift the representations of the models to better regions for ED. However, we argue that this representation transition should only occur at a reasonable rate, as a drastic divergence of representations caused by the generated data is likely to be associated with noise in the data. Motivated by this intuition, we propose a novel teacher-student framework for our multi-task learning problem where the teacher is trained on the original clean ED datasets to induce anchor representation knowledge for the data. The student, on the other hand, is trained on both the generated EI data and the original ED data to accomplish transfer learning. Here, the anchor knowledge from the teacher is leveraged to guide the student and prevent drastic divergence of its representation vectors, thereby penalizing noisy information.
Consequently, we propose a novel form of anchor information to implement this idea, seeking to maintain the same level of difference between the generated and original data (in terms of representation vectors) for both the teacher and the student (i.e., the generated-vs-original data difference serves as the anchor). At the core of this technique is the computation of the distance/difference between samples in the generated and original data. In this work, we envision two types of information that models should consider when computing such distances for our problem: (1) the representation vectors of the models for the examples, and (2) the event trigger likelihood scores of the examples based on the models (i.e., two examples in the generated and original data are more similar if they both correspond to event triggers). As such, we propose to cast this distance computation between generated and original data as an Optimal Transport (OT) problem. OT is an established method to compute the optimal transportation between two data distributions based on the probability masses of data points and their pair-wise distances, thus facilitating the integration of the two criteria of event trigger likelihoods and representation vectors into the distance computation between sets of data points.
Extensive experiments and analysis reveal the effectiveness of the proposed approach for ED in different domains, establishing new state-of-the-art performance on the ACE 2005, CySecED and RAMS datasets.

Model
We formulate the task of Event Detection as a word-level classification problem as in prior work (Ngo et al., 2020). Formally, given the sentence $S = [w_1, w_2, \ldots, w_n]$ and the candidate trigger word $w_t$, the goal is to predict the event type $l$ from a pre-defined set of event types $L$. Note that if the word $w_t$ is not a trigger word, the gold event type is None. Our proposed approach for this task consists of two stages: (1) Data Augmentation: employing natural language generation to augment existing training datasets for ED, and (2) Task Modeling: proposing a deep learning model for ED that exploits the available training data.
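The word-level formulation above can be illustrated with a minimal sketch. The helper name `make_examples` and the dictionary-based trigger annotation are assumptions made for illustration only; they are not taken from the original implementation.

```python
from typing import Dict, List, Tuple

NONE_LABEL = "None"  # gold label for words that are not event triggers


def make_examples(words: List[str],
                  triggers: Dict[int, str]) -> List[Tuple[List[str], int, str]]:
    """Turn one sentence into word-level classification examples.

    `triggers` maps a word index t to its event type (hypothetical format).
    Every word is treated as a trigger candidate; non-triggers get NONE_LABEL.
    """
    return [(words, t, triggers.get(t, NONE_LABEL)) for t in range(len(words))]


sentence = "The organization donated 2 million dollars".split()
examples = make_examples(sentence, {2: "Pay"})
for _, t, label in examples:
    print(sentence[t], label)
```

Each tuple pairs the sentence $S$ with a candidate position $t$ and its gold type $l$, which is the input/output shape assumed by the models discussed in the rest of this section.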

Data Augmentation
As presented in the introduction, our motivation in this work is to explore a novel approach for training data augmentation for ED based on GPT-2, the powerful pre-trained language model for text generation. Our overall strategy involves using an existing training dataset O for ED (i.e., original data) to fine-tune GPT-2. The fine-tuned model is then employed to generate a new labeled training set G (i.e., synthetic data) that will be combined with the original data O to train models for ED.
To simplify the training data generation task and enhance the quality of the synthetic data, we seek to generate data only for the subtask EI of ED, where synthesized sentences are annotated with the positions of their event triggers (i.e., event types for triggers are not required for the generation, which avoids the complication with rare event types during fine-tuning). To this end, we first enrich each sentence $S \in O$ with the positions of the event triggers that it contains to facilitate the GPT fine-tuning process. Formally, assume that $S = w_1, w_2, \ldots, w_n$ is a sentence of $n$ words with only one event trigger word located at $w_t$; the enriched sentence $S'$ for $S$ would have the form: $S' = [BOS, w_1, \ldots, TRG_s, w_t, TRG_e, \ldots, w_n, EOS]$, where $TRG_s$ and $TRG_e$ are special tokens that mark the position of the event trigger, and $BOS$ and $EOS$ are special tokens that identify the beginning and the end of the sentence. Next, the GPT-2 model is fine-tuned on the enriched sentences $S'$ of O in an auto-regressive fashion (i.e., predicting the next token in $S'$ given the prior ones). Finally, using the fine-tuned GPT-2, we generate a new dataset G of $|O|$ sentences ($|G| = |O|$) to achieve a balanced size. Here, we ensure that only generated sentences that contain the special tokens $TRG_s$ and $TRG_e$ (i.e., involving event trigger words) are added into G, allowing us to identify the candidate trigger word in our word-level classification formulation for ED. As such, the combination A of the synthetic data G and the original data O ($A = O \cup G$) will be leveraged to train our ED model in the next step.
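The enrichment and filtering steps described above can be sketched as follows. The token spellings (`<BOS>`, `<TRG_s>`, and so on), the function names, and the restriction to single-word triggers are assumptions for illustration; the fine-tuning and sampling of GPT-2 itself is not shown.

```python
from typing import List, Optional, Tuple

# Hypothetical spellings of the special tokens; the paper only names them.
BOS, EOS, TRG_S, TRG_E = "<BOS>", "<EOS>", "<TRG_s>", "<TRG_e>"
SPECIALS = {BOS, EOS, TRG_S, TRG_E}


def enrich(words: List[str], t: int) -> List[str]:
    """Wrap the trigger word at index t with TRG_s/TRG_e and add BOS/EOS."""
    return [BOS] + words[:t] + [TRG_S, words[t], TRG_E] + words[t + 1:] + [EOS]


def parse_generated(tokens: List[str]) -> Optional[Tuple[List[str], List[int]]]:
    """Recover (words, trigger indices) from a generated token sequence.

    Returns None when the sequence has no trigger marking or a malformed one,
    so that such sentences can be left out of the synthetic set G.
    """
    words: List[str] = []
    triggers: List[int] = []
    i = 0
    while i < len(tokens):
        tok = tokens[i]
        if tok in (BOS, EOS):
            i += 1
        elif tok == TRG_S:
            # Expect exactly: TRG_s, one ordinary word, TRG_e.
            ok = (i + 2 < len(tokens)
                  and tokens[i + 1] not in SPECIALS
                  and tokens[i + 2] == TRG_E)
            if not ok:
                return None
            triggers.append(len(words))
            words.append(tokens[i + 1])
            i += 3
        elif tok == TRG_E:
            return None  # closing marker without an opening one
        else:
            words.append(tok)
            i += 1
    return (words, triggers) if triggers else None


enriched = enrich(["He", "sued", "them"], 1)
print(enriched)
print(parse_generated(enriched))
```

Returning `None` for unmarked or malformed sequences mirrors the rule that only generated sentences containing both trigger markers are added to G.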
To assess the quality of the synthetic data, we randomly select 200 sentences from G (generated by the GPT-2 model fine-tuned on the popular ACE 2005 training set for ED) and evaluate them regarding grammatical soundness, meaningfulness, and the inclusion and correctness of annotated event triggers (i.e., whether the words between the tokens $TRG_s$ and $TRG_e$ evoke events or not). Among the sampled set, we find that 17% of the sentences contain at least one type of such errors.

Task Modeling
This section describes our model for ED, which is designed to overcome the noise in the generated data G during model training. As discussed in the introduction, we employ a teacher-student framework with multi-task learning to achieve this goal. In the proposed framework, the teacher and the student employ a base deep learning model with the same architecture but different parameters.
Base Model: Following prior work (Wang et al., 2019), our base model uses the BERT$_{base}$ model to represent each word $w_i$ in the input sentence $S$ with a vector $e_i$. Formally, the input sentence $[[CLS], w_1, w_2, \ldots, w_n, [SEP]]$ is fed into BERT$_{base}$ and the hidden states of the last layer of BERT are taken as the contextualized embeddings of the input words, i.e., $E = [e_1, e_2, \ldots, e_n]$. Note that if $w_i$ contains more than one word-piece, the average of its word-piece embeddings is used for $e_i$. In our experiments, we find that fixing the BERT$_{base}$ parameters achieves higher performance. As such, to adapt the contextualized embeddings $E$ to ED, we employ a Bi-directional Long Short-Term Memory (BiLSTM) network to consume $E$; its hidden states, i.e., $H = [h_1, h_2, \ldots, h_n]$, are then employed as the final representations for the words in $S$. Finally, to create the final vector $V$ for ED prediction, the max-pooled representation of the sentence, i.e., $\bar{h} = \text{MAX\_POOL}(h_1, h_2, \ldots, h_n)$, is concatenated with the representation of the trigger candidate, i.e., $h_t$. $V$ is consumed by a feed-forward network, whose last layer has $|L|$ neurons, followed by a softmax layer to predict the distribution $P(\cdot|S, t)$ over possible event types in $L$. To train the model, we use the negative log-likelihood as the loss function: $L_{pred} = -\log P(l|S, t)$, where $l$ is the gold label.
As the synthetic sentences in G only involve information about the positions of event triggers (i.e., no event types are included), we cannot directly combine G with O to train ED models with the loss $L_{pred}$. To facilitate the integration of G into the training process, we introduce an auxiliary task of EI for multi-task learning, seeking to predict the binary label $l_{aux}$ for the trigger candidate $w_t$ in $S$, i.e., $l_{aux} = 1$ if $w_t$ is an event trigger and $l_{aux} = 0$ otherwise. To perform this auxiliary task, we employ another feed-forward network, i.e., $FF_{aux}$, which also consumes the overall vector $V$ as input. This feed-forward network has one neuron with the sigmoid activation function in the last layer to estimate the event trigger likelihood score: $P(l_{aux} = 1|S, t) = FF_{aux}(V)$. Finally, to train the base model with the auxiliary task, we exploit the binary cross-entropy loss: $L_{aux} = -\left( l_{aux} \log P(l_{aux} = 1|S, t) + (1 - l_{aux}) \log(1 - P(l_{aux} = 1|S, t)) \right)$. Note that the main ED task and the auxiliary EI task are performed jointly in a single training process where the loss $L_{pred}$ for ED is computed only for the original data O. The loss $L_{aux}$, in contrast, is computed for both the original and the synthetic data in A.
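The way the two losses are applied to original and synthetic examples can be sketched in plain Python. The dictionary layout of an example, the function names, and the use of a weight `alpha` on the auxiliary term (taken from the combined loss and hyper-parameter settings reported later) are assumptions for illustration, not the authors' code.

```python
import math
from typing import Dict


def nll(type_probs: Dict[str, float], gold_type: str) -> float:
    """Negative log-likelihood of the gold event type (main task ED)."""
    return -math.log(type_probs[gold_type])


def bce(p_trigger: float, l_aux: int, eps: float = 1e-12) -> float:
    """Binary cross-entropy for the auxiliary EI task."""
    p = min(max(p_trigger, eps), 1.0 - eps)  # guard against log(0)
    return -(l_aux * math.log(p) + (1 - l_aux) * math.log(1.0 - p))


def base_loss(example: dict, alpha: float = 0.1) -> float:
    """Multi-task loss for one example.

    The EI loss is used for every example in A = O u G; the ED loss is added
    only when the example comes from the original data O, since synthetic
    sentences carry no event types.
    """
    loss = alpha * bce(example["p_trigger"], example["l_aux"])
    if example["source"] == "O":
        loss += nll(example["type_probs"], example["gold_type"])
    return loss


original = {"source": "O", "p_trigger": 0.5, "l_aux": 1,
            "type_probs": {"Pay": 0.5, "None": 0.5}, "gold_type": "Pay"}
generated = {"source": "G", "p_trigger": 0.5, "l_aux": 1}
print(base_loss(original), base_loss(generated))
```

In a real implementation the probabilities would come from the softmax and sigmoid heads on top of $V$, and the losses would be averaged over a mini-batch.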
Knowledge Consistency: The generated data G is not noise-free. As such, training the ED model on A could lead to inferior performance. To address this issue, as discussed in the introduction, we propose to first learn the anchor knowledge from the original data O, then use that to lead the model training on A to prevent drastic divergence from the anchor knowledge (i.e., knowledge consistency promotion), thus constraining the noises. Hence, we propose a teacher-student network, in which the teacher is first trained on O to learn the anchor knowledge. The student network will be trained on A afterward leveraging the consistency guidance with the induced anchor knowledge from the teacher. We will also use the student network as the final model for our ED problem in this work.
In our framework, both the teacher and the student networks are trained in the multi-task setting with the ED and EI tasks. In particular, the training losses for both ED and EI are computed based on O for the teacher (the loss to train the teacher is: $L_{pred} + \alpha L_{aux}$). In contrast, the combined data A is used to compute the EI loss for the student, while the ED loss for the student can only be computed on the original data O. As such, we propose to enforce the knowledge consistency between the two networks for both the main task ED and the auxiliary task EI during the training of the student model. First, to achieve the knowledge consistency for ED, we seek to minimize the KL divergence between the teacher-predicted and the student-predicted label-probability distributions. Formally, for a sentence $S \in O$, the label-probability distributions of the teacher and the student, i.e., $P_t(\cdot|S, t)$ and $P_s(\cdot|S, t)$ respectively, are employed to compute the KL-divergence loss $L_{KL} = \sum_{l \in L} P_t(l|S, t) \log \frac{P_t(l|S, t)}{P_s(l|S, t)}$. By decreasing the KL divergence during the student's training, the model is encouraged to make predictions similar to those of the teacher for the same original sentence, thereby preventing noise from misleading the student. Note that, different from traditional teacher-student networks that employ KL to achieve knowledge distillation on unlabeled data (Hinton et al., 2015), the KL divergence in our model is leveraged to enforce knowledge consistency to prevent noise in the labeled data automatically generated by GPT-2.
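For concreteness, the KL-based consistency term can be computed as in the following sketch, which uses the standard non-negative definition $KL(P_t \| P_s) = \sum_l P_t(l) \log (P_t(l) / P_s(l))$. The function name and the `eps` safeguard are illustrative assumptions.

```python
import math
from typing import Sequence


def kl_divergence(p_teacher: Sequence[float],
                  p_student: Sequence[float],
                  eps: float = 1e-12) -> float:
    """KL(P_t || P_s) over the event types for one trigger candidate.

    Terms with zero teacher probability contribute nothing; `eps` keeps the
    ratio finite when the student assigns (near-)zero probability to a label.
    """
    return sum(pt * math.log(pt / max(ps, eps))
               for pt, ps in zip(p_teacher, p_student)
               if pt > 0.0)


teacher = [0.5, 0.5]
print(kl_divergence(teacher, teacher))      # identical predictions
print(kl_divergence(teacher, [0.9, 0.1]))   # student deviates from teacher
```

The term is zero when the student reproduces the teacher's distribution and grows as the student's predictions on an original sentence drift away from it.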
Second, for the auxiliary task EI, instead of enforcing the student-teacher knowledge consistency via the similarity of predictions, we argue that it is more beneficial to leverage the difference between the original data O and the generated data G as anchor knowledge to promote consistency. In particular, we expect that the student, which is trained on A, should discern the same difference between G and O as the teacher, which is trained only on the original data O. Formally, during student training, for each mini-batch, the distances between the original data and the generated data detected by the teacher and the student are denoted by $d^T_{O,G}$ and $d^S_{O,G}$, respectively. To enforce the O-G distance consistency between the two networks, the following loss is added into the overall loss function: $L_{dist} = \frac{|d^T_{O,G} - d^S_{O,G}|}{|B|}$, where $|B|$ is the mini-batch size. The advantage of this novel knowledge consistency enforcement compared to the KL divergence is that it explicitly exploits the different nature of the original and generated data to facilitate the mitigation of noise in the generated data.
A remaining question for our proposed knowledge consistency concerns how to assess the difference between the original and the generated data from the perspective of the teacher, i.e., $d^T_{O,G}$, and of the student, i.e., $d^S_{O,G}$. In this section, we describe our method from the perspective of the student (the same method is employed for the teacher network). In particular, we define the difference between the original and the generated data as the cost of transforming O to G such that, for the transformed data, the model makes the same predictions as for G. How can we compute the cost of such a transformation? To answer this question, we propose to employ Optimal Transport (OT), which is an established method to find the most efficient transportation (i.e., the transformation with the lowest cost) of one probability distribution to another. Formally, given the probability distributions $p(x)$ and $q(y)$ over the domains $\mathcal{X}$ and $\mathcal{Y}$, and the cost function $C(x, y): \mathcal{X} \times \mathcal{Y} \to \mathbb{R}_+$ for mapping $\mathcal{X}$ to $\mathcal{Y}$, OT finds the optimal joint distribution $\pi^*(x, y)$ (over $\mathcal{X} \times \mathcal{Y}$) with marginals $p(x)$ and $q(y)$, i.e., the cheapest transportation from $p(x)$ to $q(y)$, by solving the following problem: $\pi^*(x, y) = \operatorname*{arg\,min}_{\pi \in \Pi(x, y)} \int_{\mathcal{Y}} \int_{\mathcal{X}} \pi(x, y) C(x, y) \, dx \, dy$ subject to $x \sim p(x)$ and $y \sim q(y)$ (Equation 1), where $\Pi(x, y)$ is the set of all joint distributions with marginals $p(x)$ and $q(y)$. Note that if the distributions $p(x)$ and $q(y)$ are discrete, the integrals in Equation 1 are replaced with sums and the joint distribution $\pi^*(x, y)$ is represented by a matrix whose entry $(x, y)$ represents the probability of transforming the data point $x \in \mathcal{X}$ to $y \in \mathcal{Y}$ to convert the distribution $p(x)$ to $q(y)$. By solving the problem in Equation 1, the cost of transforming the discrete distribution $p(x)$ to $q(y)$ (i.e., the Wasserstein distance $Dist_W$) is defined as: $Dist_W = \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} \pi^*(x, y) C(x, y)$.
In order to utilize OT to compute the transformation cost between O and G, i.e., $d^S_{O,G}$, we propose to define the domains $\mathcal{X}$ and $\mathcal{Y}$ as the representation spaces of the sentences in O and G, respectively, obtained from the student network. In particular, a data point $x \in \mathcal{X}$ represents a sentence $X_o \in O$. Similarly, a data point $y \in \mathcal{Y}$ stands for a sentence $Y_g \in G$. To define the cost function $C(x, y)$ for OT, we compute the Euclidean distance between the representation vectors of the sentences $X_o$ and $Y_g$ (obtained by max-pooling over the representations of their words): $C(x, y) = \| \text{MAX\_POOL}(h_{X_o,1}, \ldots, h_{X_o,|X_o|}) - \text{MAX\_POOL}(h_{Y_g,1}, \ldots, h_{Y_g,|Y_g|}) \|$, where $h_{X_o,i}$ and $h_{Y_g,i}$ are the representation vectors of the $i$-th words of $X_o$ and $Y_g$, respectively, obtained from the student's BiLSTM. Also, to define the discrete distribution $p(x)$ for OT over $\mathcal{X}$, we employ the event trigger likelihood $Score_{X_o}$ for the trigger candidate of each sentence $X_o$ in $\mathcal{X}$ that is returned by the feed-forward network $FF^S_{aux}$ for the auxiliary task EI in the student model, i.e., $Score_{X_o} = FF^S_{aux}(X_o)$. Afterward, we apply the softmax function over the scores of the original sentences in the current mini-batch to obtain $p(x)$, i.e., $p(x) = \text{Softmax}(Score_{X_o})$. Similarly, the discrete distribution $q(y)$ is defined as $q(y) = \text{Softmax}(Score_{Y_g})$. With this setup, by solving the OT problem in Equation 1 and obtaining the efficient transport plan $\pi^*(x, y)$, we can obtain the distance $d^S_{O,G}$. It is worth mentioning that this problem is intractable, so we solve its entropy-based approximation using the Sinkhorn algorithm (Peyre and Cuturi, 2019). In the same way, the distance $d^T_{O,G}$ can be computed using the representations and event trigger likelihoods from the teacher network. Note that in this way, we can integrate both the representation vectors of sentences and the event trigger likelihoods into the distance computation between data, as motivated in the introduction.
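The distance computation described above can be sketched in plain Python with a basic Sinkhorn iteration. This is an illustrative approximation under stated assumptions, not the authors' implementation: the function names, the regularization strength `reg`, and the fixed iteration count are all hypothetical choices, and a production version would work in log space for numerical stability.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def softmax(scores: Sequence[float]) -> List[float]:
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]


def max_pool(word_vectors: Sequence[Vector]) -> List[float]:
    """Element-wise max over the word representations of one sentence."""
    return [max(column) for column in zip(*word_vectors)]


def euclidean(u: Vector, v: Vector) -> float:
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def sinkhorn_distance(p, q, cost, reg: float = 0.1, n_iters: int = 200) -> float:
    """Entropy-regularized OT cost between discrete distributions p and q.

    Large cost/reg ratios can underflow exp(); this sketch does not guard
    against that.
    """
    n, m = len(p), len(q)
    kernel = [[math.exp(-c / reg) for c in row] for row in cost]
    u, v = [1.0] * n, [1.0] * m
    for _ in range(n_iters):
        u = [p[i] / sum(kernel[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [q[j] / sum(kernel[i][j] * u[i] for i in range(n)) for j in range(m)]
    # Transport plan pi(i, j) = u_i * K_ij * v_j; return its total cost.
    return sum(u[i] * kernel[i][j] * v[j] * cost[i][j]
               for i in range(n) for j in range(m))


def og_distance(orig_sents, gen_sents, orig_scores, gen_scores, reg: float = 0.1) -> float:
    """d_{O,G} for one mini-batch, from one network's word vectors and EI scores."""
    xs = [max_pool(s) for s in orig_sents]
    ys = [max_pool(s) for s in gen_sents]
    cost = [[euclidean(x, y) for y in ys] for x in xs]
    return sinkhorn_distance(softmax(orig_scores), softmax(gen_scores), cost, reg)


# Toy mini-batch: one original and one generated sentence with 2-d word vectors.
d = og_distance([[[0.0, 0.0], [1.0, 0.0]]], [[[1.0, 3.0]]], [0.7], [0.2])
print(d)
```

Calling `og_distance` once with the teacher's representations and scores and once with the student's gives the two quantities $d^T_{O,G}$ and $d^S_{O,G}$ that the consistency loss compares.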
Finally, to train the student model, the following combined loss function is used in our framework: $L = L_{pred} + \alpha L_{aux} + \beta L_{KL} + \gamma L_{dist}$, where $\alpha$, $\beta$, and $\gamma$ are the trade-off parameters.
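As a small worked example, the combined objective can be assembled as below. The sketch assumes that the distance-consistency term is the absolute difference between the teacher's and the student's O-G distances normalized by the mini-batch size; the function names are hypothetical, and the default weights are the trade-off values reported in the hyper-parameter settings.

```python
def distance_consistency(d_teacher: float, d_student: float, batch_size: int) -> float:
    """Assumed form of L_dist: |d^T_{O,G} - d^S_{O,G}| / |B|."""
    return abs(d_teacher - d_student) / batch_size


def student_loss(l_pred: float, l_aux: float, l_kl: float,
                 d_teacher: float, d_student: float, batch_size: int,
                 alpha: float = 0.1, beta: float = 0.05, gamma: float = 0.08) -> float:
    """L = L_pred + alpha * L_aux + beta * L_KL + gamma * L_dist."""
    l_dist = distance_consistency(d_teacher, d_student, batch_size)
    return l_pred + alpha * l_aux + beta * l_kl + gamma * l_dist


# Example values for one mini-batch of 50 sentences.
total = student_loss(l_pred=1.0, l_aux=2.0, l_kl=4.0,
                     d_teacher=3.0, d_student=1.0, batch_size=50)
print(total)
```

With these inputs the four terms contribute 1.0, 0.2, 0.2 and 0.0032 respectively, which shows how lightly the consistency terms are weighted relative to the main ED loss under the reported settings.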

Datasets, Baselines & Hyper-Parameters
To evaluate the effectiveness of the proposed model, called the GPT-based data augmentation model for ED with OT (GPTEDOT), we conduct experiments on the following ED datasets:
ACE 2005 (Walker et al., 2006): This dataset annotates 599 documents for 33 event types that cover different text domains (e.g., news, weblog or conversation documents). We use the same preprocessing script and data split as prior works (Lai et al., 2020c; Tong et al., 2020b) to achieve fair comparisons. In particular, the data split involves 529/30/40 articles for the train/dev/test sets respectively. For this dataset, we compare our model with prior state-of-the-art models reported in recent works (Lai et al., 2020c; Tong et al., 2020b), including BERT-based models such as DMBERT, AD-DMBERT (Wang et al., 2019), DRMM, EKD (Tong et al., 2020b), and GatedGCN (Lai et al., 2020c).
CySecED (Man Duc Trong et al., 2020): This dataset provides 8,014 event triggers for 30 event types from 300 articles of the cybersecurity domain (i.e., cybersecurity events). We follow the same pre-processing and data split as the original work (Man Duc Trong et al., 2020) with 240/30/30 documents for the train/dev/test sets. To be consistent with other experiments and facilitate the data generation based on GPT-2, the experiments on CySecED are conducted at the sentence level, where the inputs for models are sentences. As such, we employ the state-of-the-art sentence-level models reported in (Man Duc Trong et al., 2020), i.e., DMBERT (Wang et al., 2019) and BERT-ED, as the baselines for CySecED.
RAMS (Ebner et al., 2020): This dataset annotates 9,124 event triggers for 38 event types. We use the official data split with 3,194, 399, and 400 documents for training, development, and testing respectively for RAMS. We also perform ED at the sentence level in this dataset. For the baselines, we utilize recent state-of-the-art BERT-based models for ED, i.e., DMBERT (Wang et al., 2019) and GatedGCN (Lai et al., 2020c). For a fair comparison, the performance of such baseline models is obtained via their official implementations from the original papers that are fine-tuned for RAMS.
For each dataset, we use its training and development data to fine-tune the GPT-2 model. We tune the hyper-parameters for the proposed teacher-student architecture using a random search. All the hyper-parameters are selected based on the F1 scores on the development set of the ACE 2005 dataset. The same hyper-parameters from this tuning are then applied to the other datasets for consistency. In our model, we use the small version of GPT-2 to generate data. In the base model, we use BERT$_{base}$, 300 dimensions for the hidden states of the BiLSTM, and 2 layers of feed-forward neural networks with 200 hidden dimensions to predict events. The trade-off parameters τ, α, β and γ are set to 0.1, 0.1, 0.05, and 0.08, respectively. The learning rate is set to 0.3 for the Adam optimizer and a batch size of 50 is employed during training. Finally, note that we do not update the BERT model for word embeddings in this work due to its better performance on the development data of ACE 2005.
Results of experiments on the CySecED test set are presented in Table 2. This table reveals that the teacher-student architecture GPTEDOT significantly improves the performance over previous state-of-the-art models for ED in the cybersecurity domain. This is important as it shows that the proposed model is effective in different domains. In addition, our results suggest that GPT-2 can be employed to generate effective data for ED in domains where data annotation requires extensive domain expertise and is expensive to obtain, such as cybersecurity events. Moreover, the higher margin of improvement for GPTEDOT on CySecED compared to that on the ACE 2005 dataset suggests the necessity of using more training data for ED in technical domains.
Finally, results of experiments on the RAMS test set are reported in Table 3. Consistent with our experiments on ACE 2005 and CySecED, our proposed model achieves significantly higher performance than existing state-of-the-art models (p < 0.01), thus further confirming the advantages of GPTEDOT for ED.

Ablation Study
This ablation study evaluates the effectiveness of different components in GPTEDOT for ED. First, for the importance of the generated data G from GPT-2 and the teacher-student architecture to mitigate noise, we examine the following baselines: (1) Base$_O$: This baseline is the base model trained only on the original data O, thus being equivalent to the teacher model and not using the student model; and (2) Base$_A$: This baseline trains the base model on the combination of the original and generated data, i.e., A, using the multi-task learning setting (i.e., the teacher model is excluded).
Second, for the multi-task learning design in the teacher network, we explore the following ablated models: (3) Teacher$_{-A}$: This baseline removes the auxiliary task EI in the teacher from GPTEDOT. As such, the OT-based knowledge consistency for EI is also eliminated; (4) Teacher$_{-M}$: In this model, the main task ED is not utilized to train the teacher, so the corresponding KL-based knowledge consistency for ED is also removed.
Third, for the design of the knowledge consistency losses in the student network, we evaluate the following baselines: (5) Student$_{-OT}$: This ablated model eliminates the OT-based knowledge consistency loss for the auxiliary task EI in the student's training of GPTEDOT (the auxiliary task is still employed for the teacher and the student); (6) Student$_{-KL}$: For this model, the KL-based knowledge consistency for the main task ED is ignored in the student's training; (7) Student$_{+OT}$: In this baseline, we use OT for the knowledge consistency on both the main and the auxiliary tasks. Here, for the main task ED, the cost function $C(x, y)$ for OT is still obtained via the Euclidean distances between representation vectors, while the distributions $p(x)$ and $q(y)$ are based on the maximum probabilities of the label-probability distributions $P_s(\cdot|X_o, t_o)$ and $P_s(\cdot|Y_g, t_g)$ for the ED task; and (8) Student$_{+KL}$: This baseline employs the KL divergence between the models' predicted distributions to enforce the teacher-student consistency for both the main task and the auxiliary task. To this end, for the auxiliary task EI, we convert the final activation of $FF_{aux}$ into a distribution with two data points (i.e., $[FF_{aux}(X), 1 - FF_{aux}(X)]$) to compute the KL divergence between the teacher and the student. Finally, for the importance of Euclidean distances and event trigger likelihoods in the OT-based distance between O and G for knowledge consistency in EI, we investigate two baselines: (9) OT$_{-Rep}$: Here, to compute OT, we use a constant cost between every pair of sentences, i.e., $C(x, y) = 1$ (i.e., ignoring representation-based distances); and (10) OT$_{-Score}$: This model uses uniform distributions for $p(x)$ and $q(y)$ to compute the OT (i.e., ignoring event trigger likelihoods).
We report the performance of the models (on the ACE 2005 development set) for the ablation study in Table 4. There are several observations from this table. First, the generated data G and the teacher-student architecture are necessary for GPTEDOT to achieve the highest performance. In particular, compared with Base$_O$, the better performance of GPTEDOT indicates the benefits of the GPT-generated data. Moreover, the better performance of Base$_O$ over Base$_A$ reveals that the simple combination of the synthetic and original data without any effective method to mitigate noise might be harmful. Second, the lower performance of Teacher$_{-A}$ and Teacher$_{-M}$ shows that both the auxiliary and the main task (i.e., multi-task learning) in the teacher are integral to producing the best performance. Third, the choice of methods to promote knowledge consistency is important, and the proposed combination of KL and OT for the ED and EI tasks (respectively) is necessary. In particular, removing either of them (i.e., Student$_{-OT}$ and Student$_{-KL}$) or replacing one with the other (i.e., Student$_{+OT}$ and Student$_{+KL}$) would decrease the performance significantly. Finally, in the proposed OT-based consistency method for EI, it is beneficial to employ both representation-level distances (ablated in OT$_{-Rep}$) and the models' predictions for event trigger likelihoods (ablated in OT$_{-Score}$), as removing either of them hurts the performance.

Table 5: Sentences generated by the fine-tuned GPT-2 model for each dataset.
Dataset: Sentence
ACE 2005: I was totally shocked by the court's decision to agree with Sam Sloan after he TRG_s sued TRG_e his children.
CySecED: According to the last update by the company, the following techniques are used to protect against such TRG_s malware TRG_e.
RAMS: The Russian officials TRG_s vowed TRG_e to bomb the ISIS bases after the last week's TRG_s attack TRG_e.

Table 6: Error types in the sentences generated for ACE 2005, with an example and the proportion for each type.
Error type: Example (Proportion)
[label missing]: Do you think the TRG_s attack TRG_e will happen to you or do you think the TRG_s attack TRG_e will happen to you? (15%)
Inconsistency: this morning we were watching the news and heard the news about the tragic TRG_s death TRG_e of a young boy and her mother in Iraq. (12%)
Missing Labels: Aaron Tramailer's story is the story of a woman who was forced into suicide. (29%)
Incorrect Labels: The SEC is a very good place to TRG_s hide TRG_e money. (26%)

Analysis
To provide more insight into the quality of the synthetic data G, we provide samples of sentences that are generated by the fine-tuned GPT-2 model on each dataset in Table 5. This table illustrates that the generated sentences also belong to the domains of the original data (e.g., the cybersecurity domain for CySecED). As such, combining synthetic data with original data is promising for improving ED performance, as demonstrated in our experiments.
As discussed earlier, the generated data G is not free of noise. In order to better understand the types of errors existing in generated sentences, we manually assess 200 sentences randomly selected from the set G generated by the fine-tuned GPT-2 model on the ACE 2005 dataset. We categorize the errors into five types and provide their proportions along with an example for each error type in Table 6. This table shows that the majority of errors are due to missing labels (i.e., no special tokens $TRG_s$ and $TRG_e$ are generated) or incorrect labels (i.e., marked words are not event triggers of the types of interest) generated by the language model. Finally, to study the importance of the size of the generated data used to augment the training set for ED, we conduct an experiment in which different numbers of generated samples in G (for the ACE 2005 dataset) are combined with the original data O. The results are shown in Table 7. According to this table, the highest performance of the proposed model is achieved when the numbers of generated and original samples are equal. More specifically, decreasing the number of generated samples potentially limits the benefits of data augmentation. On the other hand, increasing the size of the generated data might introduce extensive noise and become harmful to the ED models.

Related Work
In this work, we propose a novel method to augment training data for ED by exploiting the powerful language model GPT-2 to automatically generate new samples.
Leveraging GPT-2 for augmenting training data has also been studied for other NLP tasks recently (e.g., relation extraction, commonsense reasoning) (Papanikolaou and Pierleoni, 2020; Zhang et al., 2020a; Madaan et al., 2020; Bosselut et al., 2019; Kumar et al., 2020; Anaby-Tavor et al., 2020; Peng et al., 2020). However, none of those works has explored GPT-2 for ED. In addition, existing methods only resort to heuristics to filter out noisy samples generated by GPT-2. In contrast, we propose a novel differentiable method capable of preventing noise from causing the representation vectors of the ED models to diverge.

Conclusion
We propose a novel method for augmenting training data for ED using the samples generated by the language model GPT-2. To avoid noises in the generated data, we propose a novel teacher-student architecture in a multi-task learning framework. We introduce a mechanism for knowledge consistency enforcement to mitigate noises from generated data based on optimal transport. Experiments on various ED benchmark datasets demonstrate the effectiveness of the proposed method.