Continual Event Extraction with Semantic Confusion Rectification

We study continual event extraction, which aims to continually extract information for emerging events while avoiding forgetting. We observe that semantic confusion on event types stems from the annotations of the same text being updated over time, and that the imbalance between event types further aggravates this issue. This paper proposes a novel continual event extraction model with semantic confusion rectification. We mark pseudo labels for each sentence to alleviate semantic confusion, transfer pivotal knowledge between the current and previous models to enhance the understanding of event types, and encourage the model to focus on the semantics of long-tailed event types by leveraging other associated types. Experimental results show that our model outperforms state-of-the-art baselines and performs well on imbalanced datasets.


Introduction
Event extraction (Grishman, 1997; Ahn, 2006) aims to detect event types and identify their event arguments and roles in natural language text. Given the sentence "The Oklahoma City bombing conspirator is already serving a life term in federal prison", an event extraction model is expected to identify "bombing" and "serving" as the event triggers of the "Attack" and "Sentence" types, respectively. The model should also identify the arguments and roles of the corresponding event types, e.g., "conspirator" and "Oklahoma City" are two arguments of "Attack" with the argument roles "Attacker" and "Place", respectively.
Conventional studies (McClosky et al., 2011; Nguyen et al., 2016; Du and Cardie, 2020; Lin et al., 2020; Nguyen et al., 2021; Wang et al., 2022) model event extraction as extracting from a set of pre-defined event types and argument roles. In practice, however, new event types and argument roles emerge continually. We define a new problem called continual event extraction for this scenario. Compared to conventional studies, continual event extraction expects the model not only to detect new types and identify the corresponding event arguments and roles, but also to remember the learned types and roles. This scenario belongs to continual learning (Ring, 1994), which learns from data streams containing newly emerging data.
To alleviate the so-called catastrophic forgetting problem (Thrun and Mitchell, 1995; French, 1999), existing works focus on event detection and apply knowledge transfer or prompt engineering (Cao et al., 2020; Yu et al., 2021; Liu et al., 2022). On one hand, they do not consider the task of argument extraction, making them incomplete for event extraction. On the other hand, they ignore that the model's semantic understanding deviates from the correct semantics when new types emerge, which we call semantic confusion.
First, semantic confusion arises because the annotations of previous types and new types are not produced at the same time. As shown in Figure 1(a), a sentence may have multi-type annotations. However, the current training data only contains new annotations, so the model misinterprets the text "died", which carries the previous annotation "Die", as a negative label "NA". Similarly, a model trained only on the previous data would identify the new types as negative labels. Existing works (Cao et al., 2020; Yu et al., 2021; Liu et al., 2022) simply transfer all learned knowledge to the current model, which disturbs the learning of new types. The second problem is the imbalanced distribution of event types in natural language text. Figure 1(b) shows the distribution of event type frequencies in three widely used event extraction datasets. The model is confused by the semantics of the long-tailed event types in two respects. On one hand, it suffers from a lack of training on long-tailed event types due to their few instances. On the other hand, the semantics of long-tailed types can be disturbed by popular types during training.

This paper proposes a novel continual event extraction method to rectify semantic confusion and address the imbalance issue. Specifically, we propose a data augmentation strategy that marks pseudo labels for each sentence to avoid the disturbance of semantic confusion. We apply pivotal knowledge distillation to further encourage the model to focus on vital knowledge during training at both the feature and prediction levels. Moreover, we propose prototype knowledge transfer, which leverages the semantics of other associated types to enrich the semantics of long-tailed types.
Our main contributions are outlined as follows.


Related Work

Event Extraction
Conventional event extraction models (McClosky et al., 2011; Li et al., 2013; Nguyen et al., 2016; Lin et al., 2020; Wang et al., 2021) regard event extraction as a multi-class classification task. In recent years, several new paradigms have been proposed. Some works (Du and Cardie, 2020; Liu et al., 2020; Li et al., 2020; Lyu et al., 2021) treat event extraction as a question-answering task; they take advantage of pre-defined question templates and have a certain knowledge transfer ability across event types. Another work (Wang et al., 2022) reformulates event extraction as a query-and-extract process by leveraging the rich semantics of event types and argument roles. These models cannot be applied to continual event extraction, as they learn all event types at once.

Continual Learning
Mainstream continual learning methods fall into three families: regularization-based methods (Li and Hoiem, 2017; Kirkpatrick et al., 2017), dynamic architecture methods (Aljundi et al., 2017; Rosenfeld and Tsotsos, 2018; Qin et al., 2021), and memory-based methods (Lopez-Paz and Ranzato, 2017; Rebuffi et al., 2017; Castro et al., 2018; Wu et al., 2019). For many NLP tasks, memory-based methods (Wang et al., 2019; de Masson d'Autume et al., 2019; Cao et al., 2020) show superior performance to the other families. Existing works (Cao et al., 2020; Yu et al., 2021; Liu et al., 2022) use knowledge transfer to alleviate catastrophic forgetting in event detection. KCN (Cao et al., 2020) employs memory replay and hierarchical distillation to preserve old knowledge. KT (Yu et al., 2021) transfers knowledge between related types to enhance the learning of both old and new event types. EMP (Liu et al., 2022) leverages soft prompts to preserve the knowledge learned from each task and transfer it to new tasks. None of the above models can identify event arguments and roles, so they are incomplete for continual event extraction. Furthermore, they ignore semantic confusion on event types during training. We address these problems and propose a new model.

Overall Framework
Our framework for continual event extraction consists of two models: an event detection model F_i and an argument extraction model G_i. When a new task T_i comes, we detect the candidate event types for each sentence with F_i. The framework of our proposed model F_i is shown in Figure 2. We first augment the current training data with pseudo labels. Then, we train the current model on the augmented data and the memory data with pivotal knowledge distillation. For long-tailed event types, we enhance their semantics with event prototypes. At last, we pick and store a few instances for the new types and augment them with pseudo labels for the next task T_{i+1}. The parameters are updated during training. After predicting each candidate event type, we train G_i to obtain the corresponding event arguments and roles. Similar to event detection, we also pick and store a few instances. Note that the accuracy of argument extraction highly depends on correctly detected event types.

Base Model and Experience Replay
Base model. Following (Cao et al., 2020; Yu et al., 2021; Liu et al., 2022; Du and Cardie, 2020), we use the pre-trained language model BERT (Devlin et al., 2019) as the encoder to extract the hidden representation of text.
Given a sentence w, we first use the BERT encoder to get the hidden representation h_{x_j} for each token x_j in w. Then, we obtain the feature representation f_{x_j} of x_j by

f_{x_j} = LayerNorm(W h_{x_j} + b),

where W ∈ R^{h×d} and b ∈ R^h are trainable parameters, h and d are the dimensions of the feature representations and the hidden layers in BERT, respectively, and LayerNorm(·) is the normalization operation.
We use a linear softmax classifier to get x_j's output probability distribution on the basis of f_{x_j}. The cross-entropy classification loss on the current dataset is defined as follows:

L_c = − Σ_{x∈N} Σ_{e∈E} y_{x,e} log P(e | x; F_i),

where N is the token set from the current training data, E is the set of seen event types Ẽ_i plus "NA", y_{x,e} indicates whether the reference type of x is e, and P(e | x; F_i) is the probability of x being classified as e by the event detection model F_i.
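The base model above can be sketched in a few lines of NumPy. This is a simplified illustration, not the actual implementation: `token_features` applies the projection and LayerNorm to (hypothetical) BERT hidden states, and `type_probs` is the linear softmax classifier with an assumed weight matrix `C`.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize each feature vector to zero mean and unit variance.
    mu = x.mean(axis=-1, keepdims=True)
    sigma = x.std(axis=-1, keepdims=True)
    return (x - mu) / (sigma + eps)

def token_features(H, W, b):
    # H: (n_tokens, d) hidden states from the encoder; W: (h, d); b: (h,).
    # f_x = LayerNorm(W h_x + b), as in the base model.
    return layer_norm(H @ W.T + b)

def type_probs(F, C):
    # Linear softmax classifier over event types; C: (n_types, h).
    logits = F @ C.T
    logits -= logits.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(logits)
    return e / e.sum(axis=-1, keepdims=True)
```

The cross-entropy loss L_c is then the negative log-probability of the reference type, summed over tokens.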
Experience replay. Inspired by previous works on continual learning (Wang et al., 2019; de Masson d'Autume et al., 2019; Yu et al., 2021; Liu et al., 2022), we pick and store a small number m of instances for each event type. At the i-th stage, the memory space storing the current training data is denoted by δ_i, so the accumulated memory space is δ̃_i = ∪_{t=1}^{i} δ_t. Note that we do not store negative instances in δ̃_i, because negative instances are prevalent at every stage. At the i-th stage, we train the model with the current training data D^train_i and the memory space δ̃_{i−1}. We leverage the k-means algorithm to cluster the feature representations of each event type's instances, where the number of clusters equals the memory size m. We then select the instance closest to the centroid of each cluster and store it.
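The exemplar-selection step can be sketched as follows. This is a minimal NumPy k-means, assuming Euclidean distance and a fixed iteration count; the actual implementation may use a library clusterer.

```python
import numpy as np

def select_memory(features, m, n_iter=20, seed=0):
    """Pick up to m exemplar indices for one event type: cluster the
    feature vectors into m groups with k-means and keep the instance
    closest to each centroid (a sketch of the experience-replay step)."""
    rng = np.random.default_rng(seed)
    n = len(features)
    if n <= m:
        return list(range(n))          # fewer instances than memory slots
    centroids = features[rng.choice(n, m, replace=False)]
    for _ in range(n_iter):
        # Assign each instance to its nearest centroid, then update.
        d = np.linalg.norm(features[:, None] - centroids[None], axis=-1)
        assign = d.argmin(axis=1)
        for k in range(m):
            members = features[assign == k]
            if len(members):
                centroids[k] = members.mean(axis=0)
    # For each centroid, store the single closest instance.
    d = np.linalg.norm(features[:, None] - centroids[None], axis=-1)
    return sorted(set(d.argmin(axis=0).tolist()))
```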

Data Augmentation
In the event detection task, a sentence may have several annotations of event types. For example, in the sentence "Melony Marshall was married before she left for Iraq", "married" and "left" indicate the event types "Marry" and "Transport", respectively. Let us assume that "Marry" is a previously seen type and "Transport" is a newly emerging type. If the current memory space does not include this sentence, the annotation corresponding to "married" would be "NA". Thus, the model would treat the text "married" as a negative label "NA" at the current stage. However, "married" was considered the event trigger of "Marry" at the previous stage. Notably, the semantics of "married" are thus confused between these two stages. Previous works (Cao et al., 2020; Yu et al., 2021; Liu et al., 2022) simply ignore this problem and suffer from semantic confusion.
To address this issue, we propose a data augmentation strategy with pseudo labels to excavate potential semantics and rectify semantic confusion.
Before training the current model, we first augment the training data with the previous model. Each instance in the training set D^train_i is annotated only with the new event types E_i. We regard such a sentence as a test instance and leverage the previous model to predict the event types of each token. Once the prediction confidence exceeds a threshold τ, we mark the token with the predicted type, which serves as a pseudo label. The augmented data can then be used to train the current model. After training, we also leverage the trained model to obtain pseudo labels for the memory data. Note that we use the augmented task data and memory data only for training, not for prototype generation in prototype knowledge transfer, since the pseudo labels are not completely reliable.
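The augmentation step above can be sketched as a simple relabeling pass. The per-token probability dicts and the "NA" label string are assumed shapes for illustration only.

```python
def add_pseudo_labels(labels, prev_probs, tau=0.9, na="NA"):
    """Augment one sentence's token labels with the previous model's
    confident predictions. labels: gold labels for the current task;
    prev_probs: per-token dicts {event_type: probability} produced by
    the previous model F_{i-1}."""
    out = []
    for y, probs in zip(labels, prev_probs):
        best, p = max(probs.items(), key=lambda kv: kv[1])
        if y == na and best != na and p >= tau:
            out.append(best)   # pseudo label from the previous model
        else:
            out.append(y)      # keep the current annotation
    return out
```

For the "married"/"left" example, a token annotated "NA" at the current stage would recover the previously seen "Marry" label when the previous model predicts it with confidence above τ.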

Pivotal Knowledge Distillation
Knowledge distillation (Hinton et al., 2015) reminds the current model of learned knowledge by leveraging the knowledge of the previous model. It is important to transfer precise learned knowledge; otherwise, it would lead to semantic confusion as in previous works (Cao et al., 2020; Yu et al., 2021; Liu et al., 2022). In this paper, we propose pivotal knowledge distillation, which enables the model to focus on critical knowledge and transfers precise knowledge between the previous and current models at both the feature and prediction levels.
Attention feature distillation. At the feature level, we expect the features extracted by the current model to be similar to those extracted by the previous model. Unlike existing works (Lin et al., 2020; Cao et al., 2020; Liu et al., 2022), we consider that the tokens in a sentence should not contribute equally to the feature of an event trigger: the tokens closely associated with the trigger are more important than the others. To capture such context information, we propose attention feature distillation. We first apply in-context attention to obtain an attentive feature A_{x_j} for each token x_j in a sentence:

A_{x_j} = Σ_{x∈W} φ(x_j, x) f_x,

where W denotes all tokens in the sentence, φ(·) is an attention function calculated as the average of the self-attention weights from the last L layers of BERT (L is a hyperparameter), and f_x is the feature representation of x.
After capturing the attentive features A_{x_j}, we preserve the previous features through an attention feature distillation loss:

L_fd = Σ_x (1 − cos(A^i_x, A^{i−1}_x)),

where cos(·) is the cosine similarity between two features, and A^i_x and A^{i−1}_x are the attentive features computed by F_i and F_{i−1}, respectively.
During feature distillation, the current model pays more attention to the associated tokens and obtains the critical and precise knowledge of these tokens from the previous model, thereby remembering the seen event types. Moreover, by paying less attention to irrelevant tokens, the current model avoids being confused by irrelevant semantics.
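Attention feature distillation can be sketched as below, assuming the averaged self-attention matrix `A` (rows attend over tokens) and token features `F` are given; how they are extracted from the encoder is abstracted away.

```python
import numpy as np

def cos(a, b):
    # Cosine similarity with a small epsilon for numerical safety.
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def attentive_features(F, A):
    # F: (n, h) token features; A: (n, n) attention weights averaged
    # over the last L encoder layers. Row j of the result is
    # A_{x_j} = sum_x phi(x_j, x) f_x.
    return A @ F

def afd_loss(F_cur, F_prev, A_cur, A_prev):
    # Pull the current model's attentive features toward the previous
    # model's, token by token: sum_x (1 - cos(A^i_x, A^{i-1}_x)).
    Ac = attentive_features(F_cur, A_cur)
    Ap = attentive_features(F_prev, A_prev)
    return sum(1.0 - cos(Ac[j], Ap[j]) for j in range(len(Ac)))
```

When the two models produce identical features and attention, the loss is zero; it grows as the attentive features drift apart.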
Selective prediction distillation. At the prediction level, we enforce that the probability distribution predicted by the current model F_i does not deviate from that of the previous model F_{i−1}. Previous methods (Yu et al., 2021; Song et al., 2021; Liu et al., 2022) transfer the probability distribution of every token in a sentence. However, we argue that this brings semantic confusion from the previous model: the tokens of emerging event types should not be transferred. The previous model F_{i−1} gives inaccurate probability distributions for these tokens because it has not been trained on the emerging event types. Therefore, transferring a wrong probability distribution to the current model F_i would confuse it about the semantics of the new event types. To overcome this problem, we only use the tokens of the previously seen types and the "NA" type to transfer knowledge. Furthermore, we do not transfer the probability distribution of "NA", since negative training data is available in every task.
Based on the above observation, we propose selective prediction distillation to avoid semantic confusion in knowledge distillation:

L_pd = − Σ_{x∈Ñ} Σ_{e∈Ẽ} P(e | x; F_{i−1}) log P(e | x; F_i),

where Ñ is the token set excluding the tokens of new types, and Ẽ is the previously seen type set.
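The selectivity has two parts: tokens of new types are skipped, and only the columns of previously seen types (without "NA") are distilled. A sketch under assumed data shapes (per-token probability dicts and a simple "new"/"old" token tag):

```python
import numpy as np

def spd_loss(p_prev, p_cur, token_tags, old_types, eps=1e-12):
    """Selective prediction distillation sketch: cross-entropy between
    the previous and current models' distributions, restricted to
    tokens NOT annotated with a new type and to previously seen event
    types (the "NA" column is excluded from old_types)."""
    loss, n = 0.0, 0
    for j, tag in enumerate(token_tags):
        if tag == "new":          # skip tokens of emerging types
            continue
        q = np.array([p_prev[j][e] for e in old_types])  # teacher
        p = np.array([p_cur[j][e] for e in old_types])   # student
        loss += -np.sum(q * np.log(p + eps))
        n += 1
    return loss / max(n, 1)
```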
Inspired by (Yu et al., 2021), we optimize the classification loss and the distillation losses with multi-task learning. The final loss is

L = L_c + α L_fd + β L_pd,

where α and β are hyperparameters.

Prototype Knowledge Transfer
The distribution of event types is naturally imbalanced in the real world, where the majority of instances belong to a few types. Due to the lack of instances, the semantics of long-tailed event types are difficult for the model to capture. Moreover, if a long-tailed event type is analogous to a popular event type with many instances, its semantics are likely to be biased toward those of the popular one. Consequently, the model obtains confused semantics for long-tailed event types.
To address this issue, we propose a prototype knowledge transfer method that enhances the semantics of long-tailed types with associated event prototypes. In our view, the prototypes capture the semantics of their event types. To get accurate prototypes of the emerging event types, we first train the base model with the current training data D^train_i.
For each event type e in the seen event types Ẽ_i, we calculate the average µ_e and the standard deviation σ_e of the feature representations of the corresponding tokens in the current training data D^train_i or the memory space δ̃_{i−1}. If the event type is newly emerging, we calculate its prototype by

µ_e = (1 / |N_e|) Σ_{x∈N_e} f_x,    σ_e = std({f_x | x ∈ N_e}),

where N_e is the set of tokens of event type e in D^train_i. For the previously seen event types, we compute their prototypes from the tokens in the memory space in the same way.
For each token x of a long-tailed type e, we clarify its semantics through the associated event prototypes. We first measure the similarity s_{e,e′} of e and every other event type e′ in the seen event types Ẽ_i by the cosine distance between their prototypes. Then, we calculate the associated standard deviation as a similarity-weighted combination over the associated event prototypes:

σ̃_e = Σ_{(µ_{e′}, σ_{e′})∈P} (s_{e,e′} / Σ_{e″} s_{e,e″}) σ_{e′},

where P is the set of prototypes of all seen event types.
We assume that the hidden representations of an event type follow a Gaussian distribution and generate an intensive vector f̃_e by

f̃_e ∼ G(0, σ̃²_e).

We add the intensive vector f̃_e to the feature vector f_x to obtain the final representation f*_x of this long-tailed token, i.e., f*_x = f̃_e + f_x. We use the final representation for further learning, just like the feature representation of a popular type. Note that we do not use the average to generate the intensive vector: using the average would align the semantics of long-tailed event types with those of their associated event types, causing semantic confusion to a certain extent. In this paper, we categorize the last 80% of types in terms of the number of instances as long-tailed types.
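A sketch of this transfer step is given below. The similarity weighting in `associated_sigma` is an assumption made for illustration (cosine similarities clipped at zero and normalized); the paper's exact weighting may differ.

```python
import numpy as np

def associated_sigma(proto_mu, proto_sigma, tail_type, seen_types):
    """Similarity-weighted standard deviation for a long-tailed type,
    built from the prototypes (mean mu, std sigma) of the other seen
    types. proto_mu/proto_sigma: dicts mapping type name -> vector."""
    mu_e = proto_mu[tail_type]
    sims, sigmas = [], []
    for e in seen_types:
        if e == tail_type:
            continue
        mu = proto_mu[e]
        s = mu_e @ mu / (np.linalg.norm(mu_e) * np.linalg.norm(mu) + 1e-8)
        sims.append(max(s, 0.0))      # ignore dissimilar types
        sigmas.append(proto_sigma[e])
    w = np.array(sims)
    w = w / (w.sum() + 1e-8)
    return (w[:, None] * np.array(sigmas)).sum(axis=0)

def enrich(f_x, sigma_assoc, rng):
    # f*_x = f_x + f~_e, with f~_e ~ N(0, diag(sigma_assoc^2)):
    # a sampled intensive vector, not the (semantics-aligning) mean.
    return f_x + rng.normal(0.0, sigma_assoc)
```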

Argument Extraction
After obtaining the event types of a sentence, we further identify their arguments and roles based on the pre-defined argument roles of each event type. The arguments are usually identified from the entities in the sentence. Here, we first recognize the entities in each sentence with a BERT-BiLSTM-CRF model, which is optimized on the same current training data, and treat these entities as candidate arguments. Then, we leverage another BERT encoder to encode each candidate argument and concatenate its starting and ending token representations as the final feature vector. Finally, for each candidate event type, we classify each entity into argument roles with a linear softmax classifier. The training objective is to minimize the following cross-entropy loss:

L_a = − Σ_{e∈Q} Σ_{r∈R} y_{e,r} log P(r | e),

where Q is the set of candidate entities, R is the set of pre-defined argument roles of the corresponding event type, y_{e,r} denotes whether the reference role of e is r, and P(r | e) denotes the probability of e being classified as r.
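The span-feature and role-classification steps can be sketched as follows; the role weight matrix `R` is a hypothetical trainable parameter, one per event type.

```python
import numpy as np

def entity_feature(H, start, end):
    # Concatenate the starting- and ending-token representations of a
    # candidate entity span; H: (n_tokens, d) encoder outputs.
    return np.concatenate([H[start], H[end]])

def role_probs(feat, R):
    # Linear softmax classifier over the event type's argument roles;
    # R: (n_roles, 2d) is an assumed trainable weight matrix.
    logits = R @ feat
    logits -= logits.max()            # numerical stability
    e = np.exp(logits)
    return e / e.sum()
```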
For argument extraction, we also apply another memory space to store a few instances to alleviate catastrophic forgetting.We pick and store instances in the same way as described in Section 4.2.1.

Experiments and Results
In this section, we evaluate our model and report the results. The source code is accessible online.

Experiment Setup
Datasets. We carry out our experiments on three widely-used benchmark datasets. (1) ACE05-EN+ (Doddington et al., 2004) is a classic event extraction dataset containing 33 event types and 22 argument roles. We follow (Lin et al., 2020) to pre-process the dataset. Since several event types are missing in the development and test sets, we re-split the dataset for a better distribution of event types. (2) ERE-EN (Song et al., 2015) is another popular event extraction dataset, which contains 38 event types and 21 argument roles. We pre-process and re-split the data in the same way as ACE05-EN+. (3) MAVEN (Wang et al., 2020) is a large-scale dataset with 168 event types. Due to the lack of argument annotations, we can only evaluate the event detection task on it. Since the original dataset does not provide annotations for the test set, we re-split this dataset as well. More details about the dataset statistics and splits can be found in Appendix A.
Implementation. As defined in Section 3, we conduct a sequence of event extraction tasks. Following (Yu et al., 2021; Liu et al., 2022), we partition each dataset into 5 subsets to simulate 5 disjoint tasks, denoted by {T_1, ..., T_5}. As the majority of event types have more than 10 training instances, we set the memory size to 10 for all competing models. To reduce randomness, we run every experiment 6 times with different subset permutations. See Appendix B for hyperparameters.
Competing models. We compare our model with 7 competitors. Fine-tuning simply trains on the current training set without any memory; it severely forgets the previously learned knowledge. Joint-training stores all previous training data as memory and trains the model on all training data for each new task; it simulates the behavior of re-training and is viewed as an upper bound in continual event extraction. To reduce catastrophic forgetting, KCN (Cao et al., 2020) combines experience replay with hierarchical knowledge distillation. KT (Yu et al., 2021) transfers knowledge between learned event types and new event types. EMP (Liu et al., 2022) uses prompting methods to instruct the model. For ChatGPT, following (Li et al., 2023), we use the same prompt in the experiments.
Evaluation metrics. We assess the performance of the models with F1 scores and report the average F1 scores over all permutations for each model. For event detection, we follow (Cao et al., 2020; Yu et al., 2021; Liu et al., 2022) to calculate F1 scores.
For argument extraction, an argument is correctly classified if its event type, offsets, and role label all match the ground truth.

Main Results
Table 1 shows the main results of event detection and argument extraction. Note that the results on T_1 are based on each model's own baseline.
For continual event detection, our model outperforms all competitors on the three datasets by a large margin. After training on all tasks, compared with the state-of-the-art models KT and EMP, our model gains 14.89%, 4.78%, and 10.89% F1 score improvements on ACE05-EN+, ERE-EN, and MAVEN, respectively. This significant gap demonstrates that semantic confusion on event types is key to continual event extraction. We can also conclude that our confusion rectification adapts effectively to different datasets.

For continual argument extraction, our model also achieves superior performance. This indicates that, based on accurately detected event types, our model can extract the corresponding arguments well with a few memorized instances. Thus, the performance of continual event extraction relies heavily on the effectiveness of continual event detection.
Regarding BERT_QA_Arg, it performs poorly under our setting. This shows that conventional event extraction models may not handle continual event extraction well, even though they have transferability across event types. The results of ChatGPT indicate that large language models, which can be viewed as zero-shot event extraction methods, also struggle with continual event extraction.

Ablation Study
To validate the effectiveness of each module in our model, we conduct an ablation study on continual event detection, and the results are listed in Table 2. Specifically, for "w/o DA", we use the original dataset rather than the data augmented with pseudo labels; for "w/o AFD", we disable the attention feature distillation module; for "w/o SPD", we disable the selective prediction distillation module; for "w/o PKD", we remove the entire pivotal knowledge distillation module; and for "w/o PKT", we directly train the model without transferring the knowledge of event prototypes to long-tailed event types. We observe that all modules are effective.

Analysis of Long-Tailed Event Types
We analyze the performance on long-tailed event types in continual event detection, where our model achieves the best performance on all datasets (see Table 3). Compared with the competing models, ours has the lowest performance decline on ACE05-EN+ and even gains improvements on ERE-EN and MAVEN. Thus, our model is effective in saving long-tailed event types from semantic confusion and classifying them correctly.

Analysis of Knowledge Transfer Ability
We use the widely-used backward transfer (BWT) metric (Lopez-Paz and Ranzato, 2017) to measure knowledge transfer ability and how well the model alleviates catastrophic forgetting. The BWT score is defined as follows:

BWT = (1 / (K−1)) Σ_{j=1}^{K−1} (F1_{K,j} − F1_{j,j}),

where K is the number of tasks and F1_{i,j} is the F1 score on the test set of task T_j after training the model on task T_i. Note that BWT scores are negative due to catastrophic forgetting; a higher score indicates better performance.
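The metric is straightforward to compute from the task-by-task score matrix:

```python
def bwt(F1):
    """Backward transfer. F1 is a K x K matrix (list of lists) where
    F1[i][j] is the F1 score on task T_{j+1}'s test set after training
    on task T_{i+1}. BWT averages, over the first K-1 tasks, the gap
    between the final score and the score right after learning the task."""
    K = len(F1)
    return sum(F1[K - 1][j] - F1[j][j] for j in range(K - 1)) / (K - 1)
```

For example, if a model scores 80 F1 on T_1 right after learning it but only 65 after all training, T_1 contributes −15 to the average.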
Table 4 shows the results. Our model performs best, indicating its superiority in alleviating catastrophic forgetting. Benefiting from semantic confusion rectification, our model has better transferability than the competing models.

Analysis of Memory Size Influence
The performance of memory-based models is highly related to the memory size. We conduct an experiment with different memory sizes; Table 5 lists the results on ACE05-EN+, ERE-EN, and MAVEN. We observe that our model maintains state-of-the-art performance across different memory sizes. Compared to the other models, the performance gap of our model between memory sizes 5 and 20 is the smallest, which demonstrates its robustness to changes in memory size.

Limitations
Our model may have two limitations: (1) It requires an additional memory space to store a few instances, which makes it sensitive to storage capacity. (2) It relies on the selection of instances in the memory space. The prototype knowledge transfer may suffer from low-quality selected instances, causing a performance decline.
Figure 1(b): Imbalanced number distribution of event types.

Figure 2: Overall framework of our event detection model, comprising data augmentation, pivotal knowledge distillation, and prototype knowledge transfer.
Continual event extraction is formalized as a sequence of tasks {T_1, T_2, ..., T_K}. Each individual task T_i is a conventional event extraction task that contains its own event type set E_i, role type set R_i, and respective training set D^train_i, development set D^dev_i, and test set D^test_i. At the i-th stage, the continual event extraction model is trained on D^train_i. The event type set E_i of each task T_i is disjoint from those of the other tasks. D^train_i contains instances for E_i and negative instances for "NA". D^dev_i and D^test_i only contain sentences for E_i.

Table 2 :
F1 scores of the ablation study on event detection. All models have the same results on T_1, since continual learning has not yet been executed.

Table 3 :
F1 scores of event detection on long-tailed event types. All models are trained with all event types but evaluated only on long-tailed event types.

Table 5 :
F1 scores of event detection w.r.t. memory size on ACE05-EN+, ERE-EN, and MAVEN.


Conclusion
In this paper, we study continual event extraction and propose a novel continual event extraction model. Specifically, we mark pseudo labels in the training data for previously seen types. For newly emerging types, we select accurate knowledge to transfer. For long-tailed types, we enhance their semantic representations with the semantics of associated event types. Experiments on three benchmark datasets show that our model achieves superior performance, especially on long-tailed types. The results also verify the effectiveness of our model in alleviating catastrophic forgetting and rectifying semantic confusion. In future work, we plan to study continual few-shot event extraction and other classification-based continual learning tasks.