Teamwork Is Not Always Good: An Empirical Study of Classifier Drift in Class-incremental Information Extraction

Class-incremental learning (CIL) aims to develop a learning system that can continually learn new classes from a data stream without forgetting previously learned classes. When learning classes incrementally, the classifier must be constantly updated to incorporate new classes, and the drift in the decision boundary may lead to severe forgetting. This fundamental challenge, however, has not yet been studied extensively, especially in the setting where no samples from old classes are stored for rehearsal. In this paper, we take a closer look at how drift in the classifier leads to forgetting and accordingly design four simple yet (super-) effective solutions to alleviate classifier drift: an Individual Classifiers with Frozen Feature Extractor (ICE) framework, where we train a classifier individually for each learning session, and its three variants ICE-PL, ICE-O, and ICE-PL&O, which further take the logits of previously learned classes from old sessions or a constant logit of an Other class as a constraint on the learning of new classifiers. Extensive experiments and analysis on 6 class-incremental information extraction tasks demonstrate that our solutions, especially ICE-O, consistently show significant improvement over previous state-of-the-art approaches, with up to a 44.7% absolute F-score gain, providing a strong baseline and insights for future research on class-incremental learning.


Introduction
Conventional supervised learning assumes the data are independent and identically distributed (i.i.d.) and usually requires a pre-defined ontology, which may not be realistic in many natural language processing (NLP) applications. For instance, in event detection, the topics of interest may keep shifting over time (e.g., from attack to pandemic), and new event types and annotations could emerge incessantly. Previous studies (Ring et al., 1994; Kirkpatrick et al., 2017; Lopez-Paz and Ranzato, 2017) therefore proposed continual learning (CL), a.k.a. lifelong learning or incremental learning, a learning paradigm aiming to train a model from a stream of learning sessions that arrive sequentially. In this work, we focus on the class-incremental learning (CIL) setting (Wang et al., 2019), where a new session is composed of previously unseen classes and the goal is to learn a unified model that performs well on all seen classes.
When new learning sessions arrive sequentially, the classification layer must be constantly updated and/or expanded to accommodate new categories. The change of the classifier between different sessions, i.e., classifier drift, can disturb or overwrite the classifier trained on previous classes, which consequently causes catastrophic forgetting (Biesialska et al., 2020). On the other hand, in many NLP tasks such as information extraction, the model also needs to classify negative instances into the Other type (i.e., none-of-the-above). The Other type adds extra difficulty to classification, and even worse, the meaning of Other varies as the model learns new sessions (Zheng et al., 2022). The CIL problem thus becomes even more challenging when Other is involved. We illustrate the event detection task in CIL (Yu et al., 2021) and the classifier drift problem in Figure 1.
Despite the progress achieved in CIL (Zhao et al., 2022; Zheng et al., 2022), two critical limitations remain: (1) Most previous CIL approaches rely heavily on a rehearsal-based strategy that stores samples from previously learned sessions and keeps re-training the model on these examples in subsequent sessions to mitigate catastrophic forgetting; this requires high computation and storage costs and raises concerns about privacy and data leakage (Shokri and Shmatikov, 2015). (2) Previous approaches have mainly focused on regularizing or expanding the overall model, especially the feature extractor, to tackle the forgetting issue (Cao et al., 2020), but they rarely investigate whether the drift of the classifier also leads to forgetting, especially in classification tasks that involve the Other category. In this work, we aim to tackle these limitations by answering the following two research questions. RQ1: how does classifier drift lead to forgetting in the setting where no samples are stored from old sessions for rehearsal? RQ2: how can we devise an effective strategy to alleviate classifier drift, especially when an Other category is involved?
In this paper, we aim to answer the two research questions above. First, to study how classifier drift alone affects the model, we build a baseline where we use a pre-trained language model as a fixed feature extractor, such that only the parameters in the classification layer are updated. Second, to alleviate classifier drift, we propose a simple framework named Individual Classifiers with Frozen Feature Extractor (ICE). Instead of collectively tuning the whole classification layer, we individually train a classifier for the classes in each new session without updating old classifiers, and combine all learned classifiers to classify all seen classes during inference. As individually trained classifiers may lack the context of all learned sessions (Zhang et al., 2021), they may not be comparable to each other. We therefore devise a variant, ICE-PL, which takes the logits of previous classifiers as constraints to encourage contrastivity among all the classes when learning a new classifier for a new session. Third, neither ICE nor ICE-PL can be applied to detection tasks where an Other class is involved, so we further design two variants of them, ICE-O and ICE-PL&O, which introduce a constant logit for the Other class and use it to enforce that each individual classifier is bounded by a constraint shared across different learning sessions during training.
We extensively investigate classifier drift and evaluate our approach on 6 essential information extraction tasks across 4 widely used benchmark datasets under the CIL setting. Our major findings and contributions are: (1) By comparing the drifted baseline and our ICE, we find that classifier drift alone can be a significant source of forgetting, and that our approaches effectively mitigate the drift and forgetting. Our results reveal that training the classifier individually can be a superior solution to training the classifier collectively in CIL. (2) We find that the Other type can effectively improve individually trained classifiers, and it is also helpful when we manually introduce negative instances during training on tasks that do not have Other.
(3) Experimental results demonstrate that our proposed approaches, especially ICE-O, significantly and consistently mitigate the forgetting problem without rehearsal and outperform previous state-of-the-art approaches by a large margin. (4) Our study builds a benchmark for 6 class-incremental information extraction tasks and provides a super-strong baseline and insights for future studies on class-incremental information extraction.

Related Work
Existing approaches for CIL can be roughly categorized into three types (Chen et al., 2022). Rehearsal-based approaches (a.k.a. experience replay) (Lopez-Paz and Ranzato, 2017; de Masson d'Autume et al., 2019; Guo et al., 2020; Madotto et al., 2021; Qin and Joty, 2021) select some previous examples (or generate pseudo examples) for rehearsal in subsequent tasks. While such approaches are effective in mitigating forgetting, they require high computation and storage costs and suffer from data leakage risk (Shokri and Shmatikov, 2015; Smith et al., 2021; Wang et al., 2022). Regularization-based approaches (Chuang et al., 2020) aim to regularize the model's update by only updating a subset of parameters. Architecture-based approaches (Lee et al., 2020; Ke et al., 2021a,b,c; Feng et al., 2022; Zhu et al., 2022) adaptively expand the model's capacity via parameter-efficient techniques (e.g., adapters, prompts) to accommodate more data. While most existing approaches consider alleviating the forgetting of the whole model or transferring previous knowledge to new sessions, few of them thoroughly investigate how the classification layer of the model is affected as it expands to incorporate more classes. Wu et al. (2019) find that the classification layer has a strong bias towards new classes, but they only study this issue in image recognition, which does not contain the Other class. To fill this gap, we take a closer look at how drift in the classifier alone affects the model under the CIL setting, especially when Other is involved.
For class-incremental information extraction, several studies tackle the CIL problem in relation learning (Wu et al., 2021), and many of them apply prototype-based approaches equipped with memory buffers to store previous samples (Han et al., 2020; Cui et al., 2021; Zhao et al., 2022). Others investigate how to detect named entities (Monaikul et al., 2021; Xia et al., 2022) or event triggers (Cao et al., 2020; Yu et al., 2021; Liu et al., 2022) in the CIL setting. For instance, Zheng et al. (2022) propose to distill causal effects from the Other type in continual named entity recognition. One critical disadvantage of existing approaches for continual IE is that they rely heavily on storing previous examples to replay, whereas our method does not require any exemplar rehearsal.

Problem Formulation
Class-incremental learning requires a learning system to learn from a sequence of learning sessions D = {D_1, ..., D_T}, where each session D_k = {(x_k, y_k) | y_k ∈ C_k}, x_k is an input instance for session D_k, and y_k ∈ C_k denotes its label. The label set C_k for session D_k does not overlap with that of any other session, i.e., ∀k, j with k ≠ j, C_k ∩ C_j = ∅. Given a test input x and a model that has been trained on up to t sessions, the model needs to predict a label ŷ from a label space that contains all learned classes, i.e., C_1 ∪ ... ∪ C_t, and optionally the Other class. Generally, the training instances of old classes are not available in future learning sessions.
We consider a learning system consisting of a feature extractor and a classifier. Specifically, we use a linear layer G_{1:t} ∈ R^{c×h} as the classification layer, where c is the number of classes that the model has learned up to session t and h is the hidden dimension size of the features. We denote the number of classes in a learning session k as n_k, i.e., n_k = |C_k|. The classification layer G_{1:t} can be viewed as a concatenation of the classifiers of all learned sessions, i.e., G_{1:t} = [W_1; ...; W_t], where each classifier W_k ∈ R^{n_k×h} is in charge of the classes in C_k. The linear layer outputs the logits o_{1:t} ∈ R^c for the learned classes, where o_k refers to the logits for the classes in C_k. The term logit in this paper refers to the raw scores before applying the Softmax normalization.
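As a concrete toy illustration of this formulation (not code from the paper; the dimensions and values below are made up), the sketch stacks per-session weight matrices into G_{1:t} and computes the logits o_{1:t} for one encoded feature:

```python
import numpy as np

rng = np.random.default_rng(0)
h_dim = 8              # hidden size h (toy value)
n = [3, 2, 4]          # n_k = |C_k| for t = 3 learned sessions (toy values)

# One classifier W_k of shape (n_k, h) per learned session.
W = [rng.normal(size=(n_k, h_dim)) for n_k in n]

# The classification layer is their concatenation: G_{1:t} = [W_1; ...; W_t].
G = np.concatenate(W, axis=0)          # shape (c, h) with c = sum(n_k)

feat = rng.normal(size=h_dim)          # encoded feature of one mention
o = G @ feat                           # logits o_{1:t} over all c learned classes
assert o.shape == (sum(n),)
```

Note that the first n_1 entries of `o` are exactly `W[0] @ feat`, i.e., the block structure of G_{1:t} carries over to the logits o_k.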
In this work, we focus on studying the class-incremental problem in information (entity, relation, and event) extraction tasks. We consider two settings for each task: the detection task, which requires the model to identify and classify candidate mentions or mention pairs into one of the target classes or Other, and the classification task, which directly takes the identified mentions or mention pairs as input and classifies them into the target classes without considering Other.

RQ1: How does Classifier Drift Lead to Forgetting?
We first design a DRIFTED-BERT baseline to investigate how classifier drift alone leads to forgetting, and then provide an analysis of how classifier drift happens, especially in the class-incremental continual learning setting.

DRIFTED-BERT Baseline
In the current dominant continual learning frameworks, both the feature extractor and the classifier are continually updated, which results in drift in both components and degrades the model's predictions on old classes. To measure how classifier drift alone leads to forgetting, we build a simple baseline that consists of a pretrained BERT (Devlin et al., 2019) as the feature extractor and a single linear classification layer that is continually expanded and updated. For the Other class, which has different meanings in different sessions, we follow Yu et al. (2021) and set the logit for Other to a constant value δ, i.e., o_0 = δ. We combine o_0 and o_{1:t} and pick the label with the maximum logit as the prediction. That is, we predict a sample as Other if and only if max(o_{1:t}) < δ.
We freeze the parameters in the feature extractor so that the encoded features of a given sample remain unchanged across learning sessions. In this way, the updates in the classification layer become the only source of forgetting. Note that we do not apply any continual learning techniques (e.g., experience replay) to DRIFTED-BERT. We denote by p(x_t) the predicted probability used to compute the training loss, where p(x_t) = Softmax(o_{0:t}). At learning session t, the model is trained on D_t with the Cross-Entropy (CE) loss L_CE = -log p_{y_t}(x_t).

A Closer Look at Classifier Drift

When the model has learned t sessions and needs to extend to the (t+1)-th session, the classification layer G_{1:t} must be expanded to G_{1:t+1} to incorporate the new classes. As we assume that all previous training instances in D_{1:t} are no longer accessible, solely training the model on D_{t+1} leads to an extreme class-imbalance problem (Cao et al., 2020), which consequently causes catastrophic forgetting. However, most existing works rarely discuss how drift in the classifier alone leads to forgetting, especially when the Other class is involved. We define the classifier drift between two consecutive learning sessions D_t and D_{t+1} as the change from G_{1:t} to G_{1:t+1} that makes the model lose (part of) its acquired capability on the seen classes in C_{1:t}. Intuitively, the CE loss aims to maximize the probability of the correct label while minimizing the probabilities of all other labels. Thus, there are two possible causes of classifier drift: (1) new logit explosion: the new classifier W_{t+1} tends to predict logits o_{t+1} that are higher than those of all previous classes o_{1:t} so that the model can trivially discriminate new classes, which causes the old classes to be overshadowed by the new classes; (2) diminishing old logits: as the old instances are not accessible in future learning sessions, the parameters of previous classifiers are updated from the previous local optimum to a drifted sub-optimum, such that the classifier outputs low logits for old
classes and cannot predict them correctly. We empirically analyze the DRIFTED-BERT baseline to investigate classifier drift in Section 5.2 and discuss the drifting patterns in different classification and detection tasks in Section 5.4.
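The baseline's Other handling and training loss can be sketched in a few lines. This is an illustrative reimplementation, not the authors' code: the function names are ours, class index -1 stands in for Other, and δ = 0 follows the setting stated later in the paper:

```python
import numpy as np

DELTA = 0.0  # constant logit o_0 = delta for the Other class

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def predict(logits, delta=DELTA):
    """Pick the max logit among [delta; o_{1:t}]; -1 denotes Other."""
    return -1 if logits.max() < delta else int(np.argmax(logits))

def ce_loss(logits, y, delta=DELTA):
    """Cross-entropy over p(x) = Softmax(o_{0:t}); y = -1 means Other."""
    p = softmax(np.concatenate(([delta], logits)))
    return -np.log(p[0] if y == -1 else p[y + 1])

assert predict(np.array([-1.5, -0.2])) == -1   # all logits below delta -> Other
assert predict(np.array([-1.5, 0.7])) == 1     # class 1 beats delta
```

The prediction rule makes the claimed equivalence explicit: the sample is labeled Other exactly when max(o_{1:t}) < δ, since o_0 = δ then dominates.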

RQ2: How to Alleviate Classifier Drift?
To alleviate classifier drift, we introduce two solutions, ICE and its variant ICE-PL, for the classification tasks without Other, and further design two additional variants, ICE-O and ICE-PL&O, for detection tasks where Other is involved. We illustrate the training process in a new learning session for ICE and its variants in Figure 2. Note that we only focus on the setting of continual learning without experience replay, i.e., the model does not have access to the data of old sessions.

ICE: Individual Classifiers with Frozen Feature Extractor

We revisit the idea of classifier ensembles (Dietterich, 2000) and separated output layers in multi-task learning (Zhang and Yang, 2018), where task-specific parameters for one task do not affect those for other tasks. Inspired by this, we propose to individually train a classifier for each session without updating or using the previously learned classifiers G_{1:t} (shown in Figure 2 (b)). In this way, previous classifiers avoid being drifted to a sub-optimum, and the new classifier is less prone to output larger logits that overshadow old classes. Specifically, for an incoming session t+1, we initialize a set of new weights and train the new classifier W_{t+1} on D_{t+1}. We only use the logits for the classes in the new session, o_{t+1}, to compute the Cross-Entropy loss in optimization, i.e., p(x_{t+1}) = Softmax(o_{t+1}). During inference, as we need to classify all seen classes without knowing the session identity of each instance, we combine the logits from all classifiers W_1, ..., W_{t+1} to get the prediction over all learned classes, i.e., o_{1:t+1} = [o_1; ...; o_{t+1}], where each classifier yields logits via o_k = W_k · h given the encoded feature h of each mention.
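The asymmetry between ICE's training and inference can be sketched as follows. This is a toy illustration under our own naming (the `ICEHead` class is not from the paper), omitting the actual gradient updates:

```python
import numpy as np

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

class ICEHead:
    """One independently trained linear classifier per learning session (sketch)."""

    def __init__(self, h_dim, rng):
        self.h_dim, self.rng, self.W = h_dim, rng, []   # W[k] has shape (n_k, h)

    def new_session(self, n_k):
        # Fresh weights for the new session; earlier W[k] stay frozen and untouched.
        self.W.append(self.rng.normal(scale=0.02, size=(n_k, self.h_dim)))
        return self.W[-1]

    def train_probs(self, feat):
        # Training loss sees ONLY the new session's logits: Softmax(o_{t+1}).
        return softmax(self.W[-1] @ feat)

    def infer_logits(self, feat):
        # Inference concatenates logits from ALL classifiers: [o_1; ...; o_{t+1}].
        return np.concatenate([Wk @ feat for Wk in self.W])

head = ICEHead(h_dim=8, rng=np.random.default_rng(0))
head.new_session(3)       # session 1: 3 classes
head.new_session(2)       # session 2: 2 new classes
feat = np.ones(8)
assert head.train_probs(feat).shape == (2,)      # loss over new classes only
assert head.infer_logits(feat).shape == (5,)     # prediction over all 5 classes
```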
ICE+Previous Logits (ICE-PL)

One limitation of ICE is that a classifier individually trained in one session may not be comparable to the others. To provide contrastivity between classifiers, we first explore a variant named ICE-PL, where we preserve the previous classifiers and freeze their parameters, such that the new classifier is aware of previous classes during training (shown in Figure 2 (c)). That is, the model uses the logits from all classifiers, o_{1:t+1}, to compute the Cross-Entropy loss, i.e., p(x_{t+1}) = Softmax(o_{1:t+1}), while only the parameters of the new classifier are trainable. ICE-PL uses the same inference process as ICE.
ICE+Other (ICE-O)

Both ICE and ICE-PL can only be applied to classification tasks. Handling the Other category in detection tasks is challenging because each session D_t only contains the annotated mentions for the classes C_t, while mentions from all other classes, such as C_{1:t-1}, are labeled as Other, making the meaning of Other vary across sessions. To tackle this problem, we propose the ICE-O variant (shown in Figure 2 (d)), where we assign a constant value δ as the logit of the Other category. Specifically, for each prediction, we combine the logit of Other with the logits of the new session o_{t+1} to obtain the output probability, i.e., p(x_{t+1}) = Softmax([δ; o_{t+1}]), and then compute the Cross-Entropy loss to train the classifier to make predictions for both positive classes and Other. During inference, we combine the Other logit δ with the logits from all trained classifiers o_{1:t+1}, i.e., o_{0:t+1} = [δ; o_1; ...; o_{t+1}], to predict over all learned positive types and Other. We select the label with the highest logit among o_{0:t+1} as the prediction, and a candidate is predicted as Other if and only if max(o_{1:t+1}) < δ.
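A minimal sketch of how δ enters ICE-O's training and inference, under our own function names and toy weights (not the paper's code; -1 again denotes Other):

```python
import numpy as np

DELTA = 0.0   # shared constant logit for Other across all sessions

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def ice_o_train_probs(W_new, feat, delta=DELTA):
    # Training: only the new session's logits compete with delta,
    # i.e., p(x_{t+1}) = Softmax([delta; o_{t+1}]).
    return softmax(np.concatenate(([delta], W_new @ feat)))

def ice_o_predict(sessions, feat, delta=DELTA):
    # Inference: delta is prepended to the concatenated logits of all sessions,
    # o_{0:t+1} = [delta; o_1; ...; o_{t+1}]; -1 denotes Other.
    o = np.concatenate([Wk @ feat for Wk in sessions])
    return -1 if o.max() < delta else int(np.argmax(o))

W1 = np.array([[1.0, 0.0]])      # session 1: one class, toy weights
W2 = np.array([[0.0, 1.0]])      # session 2: one new class
assert ice_o_predict([W1, W2], np.array([-1.0, -1.0])) == -1   # all logits < delta
assert ice_o_predict([W1, W2], np.array([0.2, 0.9])) == 1      # session-2 class wins
```

Because every session's classifier is trained against the same fixed δ, the constraint max(o_k) < δ on negatives is shared across sessions, which is the weak contrastivity discussed next.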
While the Other class introduces additional difficulties to CIL, we argue that it can also be a good remedy for classifier drift. In particular, in each learning session k, while the classifier W_k is independently trained on D_k, its output logits o_k must also satisfy the constraint max(o_k) < δ when the classifier is trained on negative instances. Although the logits from two distinct classifiers W_k and W_j (k ≠ j) do not have explicit contrastivity, both classifiers are trained under the constraints max(o_k) < δ and max(o_j) < δ, which provides a weak contrastivity between them.
While ICE-O and ICE-PL&O naturally apply to detection tasks, for classification tasks without the Other class, we can also manually create negative instances from the tokens or entity pairs without positive labels. Section 5.1 provides more details on how to apply ICE-O and ICE-PL&O to classification tasks.

Datasets and Experiment Setup
We use Few-NERD (Ding et al., 2021) for class-incremental named entity recognition and split all 66 fine-grained types into 8 learning sessions, following Yu et al. (2021), who apply a greedy algorithm to split the types into sessions and ensure that each session contains roughly the same number of training instances. We use two benchmark datasets, MAVEN (Wang et al., 2020) and ACE-05 (Doddington et al., 2004), for class-incremental event trigger extraction, and follow the same setting as Yu et al. (2021) to split each of them into 5 learning sessions. For class-incremental relation extraction, we use TACRED (Zhang et al., 2017) and follow the same setting as Zhao et al. (2022) to split the 42 relations into 10 learning sessions.
For each dataset, we construct two settings: (1) detection, where the model classifies each token (or a candidate entity pair in the relation extraction task) in a sentence into a particular class or Other; and (2) classification, where the model directly takes in a positive candidate (i.e., an entity, a trigger, or a pair of entities) and classifies it into one of the classes. For the classification setting, as there are no negative candidates labeled as Other, we automatically create negative candidates and introduce the Other category so that we can investigate the effect of Other using ICE-O and ICE-PL&O. Specifically, we assign the Other label to tokens that are not labeled with any class for entity and event trigger classification, and to pairs of entity mentions that are not labeled with any relation for relation classification. When we apply ICE-O and ICE-PL&O to classification tasks, we do not consider the logit of the Other class during inference.
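The negative-candidate construction above amounts to a default-label rule, sketched below on a made-up sentence (the example mention and class name are hypothetical, not from the datasets):

```python
OTHER = "Other"

def add_negative_candidates(tokens, positive_labels):
    """Label every token without a positive class as Other (sketch).

    positive_labels: dict mapping token index -> class name.
    """
    return [positive_labels.get(i, OTHER) for i in range(len(tokens))]

tokens = ["Police", "arrested", "the", "suspect"]
labels = add_negative_candidates(tokens, {1: "Arrest"})
assert labels == [OTHER, "Arrest", OTHER, OTHER]
```

The same rule applies to entity-mention pairs for relation classification, with pair indices in place of token indices.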
Evaluation We use the same evaluation protocol as previous studies (Yu et al., 2021; Liu et al., 2022). Every time the model finishes training on Session t, we evaluate it on all test samples from Session 1 to Session t for classification tasks. For detection tasks, we evaluate the model on the entire test set, where we treat the mentions or mention pairs of unlearned classes as Other. Following Yu et al. (2021), we randomly sample 5 permutations of the order of learning sessions and report the average performance.
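For reference, one common way to score such detection outputs is micro F1 over the positive classes, where Other counts as "no mention"; the sketch below reflects our reading of that convention, not the authors' exact scorer:

```python
import numpy as np

def micro_f1(gold, pred, other=-1):
    """Micro F1 over positive classes; the `other` label counts as no mention."""
    gold, pred = np.asarray(gold), np.asarray(pred)
    tp = np.sum((gold == pred) & (gold != other))      # correctly typed mentions
    n_pred, n_gold = np.sum(pred != other), np.sum(gold != other)
    if tp == 0 or n_pred == 0 or n_gold == 0:
        return 0.0
    p, r = tp / n_pred, tp / n_gold
    return 2 * p * r / (p + r)

# One correct positive, one missed positive, one correct Other:
assert abs(micro_f1([1, 2, -1], [1, -1, -1]) - 2 / 3) < 1e-9
```

Under this metric, predicting Other for an unlearned class neither helps nor hurts precision, which is why the full test set can be used after every session.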
Baselines We compare our approaches with the DRIFTED-BERT baseline and several state-of-the-art methods for class-incremental information extraction: ER (Wang et al., 2019), KCN (Cao et al., 2020), KT (Yu et al., 2021), EMP (Liu et al., 2022), and CRL (Zhao et al., 2022). All these methods adopt experience replay to alleviate catastrophic forgetting. We also design two approaches to show the performance in the conventional supervised learning setting, where the model is trained with the annotated data from all sessions, as approximate upper bounds for the continual learning approaches: (i) BERT-FFE, which consists of a pre-trained BERT as the feature extractor and a classifier, where, during training, we fix the feature extractor and only tune the classifier; and (ii) BERT-FT, which shares the same architecture as BERT-FFE but tunes both the feature extractor and the classifier during training. More details about the datasets, baselines, and model implementation can be found in Appendix A.

RQ1: How does Classifier Drift Lead to Forgetting?
We conduct an empirical analysis on the event detection and classification tasks on MAVEN to answer RQ1 and gain more insight into classifier drift.

Analysis of Old and New Classes Performance
Our first goal is to analyze the classifier drift during the incremental learning process. (4) Previous methods generally perform worse than our solutions even with experience replay. Possible reasons include overfitting to the stored examples in the small memory buffer, or that the regularization from replay may not be effective enough to mitigate the forgetting.
Comparison with CRL (Zhao et al., 2022) Note that, among all the baselines, CRL consistently outperforms the others on the classification tasks. CRL is based on a prototypical network where each class is represented by a prototype computed in an embedding space, and classification is performed with a nearest class mean (NCM) classifier. Compared with other Softmax-based classification approaches, CRL can accommodate new classes more flexibly without any change to the architecture. However, it still suffers from the semantic drift problem (Yu et al., 2020), as the embedding network must be continually updated to learn new classes, and it is non-trivial to adapt it to detection tasks under the class-incremental learning setting, where an Other class is involved and the meaning of Other differs across learning sessions.

Comparison with Trainable Feature Extractor
We also investigate whether our proposed approaches can be further improved by tuning the BERT-based feature extractor. However, doing so naturally leads to forgetting, as demonstrated by previous studies (Wang et al., 2019; Cao et al., 2020; Yu et al., 2021).
Thus, following these studies, we adopt experience replay and design a new variant named ICE-O with Tunable Feature Extractor and Experience Replay (abbreviated as ICE-O+TFE&ER), which tunes the BERT-based feature extractor and adopts the same replay strategy as ER, preserving 20 samples for each class. From Tables 2, 3, and 4, ICE-O+TFE&ER significantly improves over ICE-O and achieves performance comparable to the supervised BERT-FT upper bound on all the classification tasks. However, ICE-O+TFE&ER performs much worse than ICE-O on all the detection tasks. We hypothesize that this is due to the meaning shift of the Other class when the model is incrementally trained on a sequence of learning sessions. Experience replay may not be enough to constrain the feature extractor to handle the Other class properly.

Analysis of Drifting Patterns
To take a closer look at how classifier drift leads to forgetting and to verify the two hypothetical drifting patterns discussed in Section 4.1, we analyze the output logits (i.e., the scores before Softmax) from the old and new classifiers for DRIFTED-BERT and our ICE, ICE-PL, and ICE-O. Specifically, we take the test samples whose ground-truth labels are learned in Session 1 (denoted as X^1_test) for analysis. Every time the classifier is trained on a new session, we evaluate it on X^1_test and then take (1) the logit of the gold class (Gold) and (2) the maximum logit from the new classifier (NCP, i.e., New Classifier's Prediction) for analysis. For each type of logit, we report the average over all samples in X^1_test. We have the following findings: (1) By examining the Gold logits and the logits from the new classifier (NCP) of DRIFTED-BERT, we observe that every time a new classifier is added and trained on the new session, it outputs incrementally higher logits than those of the previous session on X^1_test (blue solid line), whereas the Gold logits first decline slightly and then stay at a certain level in the remaining sessions (blue dashed line). This observation confirms that the two hypothesized drifting patterns (i.e., new logit explosion and diminishing old logits) exist, and that they can happen simultaneously, causing the new classifier to overshadow the previously learned classifiers, which consequently leads to forgetting. (2) We find that while the old classifiers are not updated in ICE-PL, the new logit explosion issue becomes even more severe (orange solid line), which explains why ICE-PL performs worse than ICE and ICE-O. We hypothesize that the presence of previous logits may encourage the new classifier to predict larger logits. (3) When the classifier in each session is trained individually instead of collectively (i.e., in ICE and ICE-O), the Gold logits from the old classifiers stay at a constant level (red dashed
lines), whereas the logits from the new classifier remain at a relatively lower level (green and red solid lines). As such, the new classifier's logits do not have much impact on those of the old classes, which mitigates the drift and forgetting.

The Effect of the Logit for the Other Class
Throughout all the experiments, we set the logit for the Other class, δ, to a constant 0. In this section, we further discuss the effect of the value of δ and the effect of tuning the Other classifier. We show the results of event detection on MAVEN with different fixed values or a tunable value of δ in Table 5. We find that the value of the Other class's logit does not affect the model's performance much as long as it is fixed. However, we observe a significant performance decrease if we continually tune it with a classifier, demonstrating that it is necessary to fix the Other class's logit during the continual learning process in our approach.

Table 5: Results (Micro-F1 score, %) on the effect of the Other class's logit on the event detection task on MAVEN. We show the performance of the models that have learned all 5 sessions. "Tune" means we use a tunable logit for the Other class instead of a fixed value.

Comparison with Recent LLMs
More recently, very large language models (LLMs) such as ChatGPT (OpenAI, 2022) have demonstrated strong in-context learning ability without the need for gradient updates. Thus, class-incremental learning may also be tackled as a sequence of in-context learning problems. However, several recent studies (Gao et al., 2023; Qin et al., 2023) have benchmarked LLMs with in-context few-shot learning on various IE tasks and show that they perform worse than our approach. Our approach can efficiently achieve performance close to the supervised upper bound by only fine-tuning the last linear layer on top of a much smaller frozen BERT backbone. More critically, the knowledge of LLMs is often bounded by their training data, whereas our continual learning approach focuses on incorporating up-to-date information into models.

Conclusion
In this paper, we investigate how classifier drift alone affects a model in the class-incremental learning setting and how to alleviate the drift without retraining the model on previous examples. We therefore propose to train a classifier individually for each session and combine the classifiers during inference, such that we maximally avoid drift in the classifier. Extensive experiments show that our proposed approaches significantly outperform all the considered baselines on both class-incremental classification and detection benchmarks and provide super-strong baselines. We hope this work can shed light on future research on continual learning in broader research communities.

Limitations
Our approaches mainly leverage a fixed feature extractor together with a set of individually trained classifiers to mitigate catastrophic forgetting, whereas a tunable feature extractor may also be helpful and complement the individually trained classifiers; a future direction is therefore to design advanced strategies to efficiently tune the feature extractor in combination with our proposed ICE-based classifiers. In addition, we mainly investigate classifier drift and demonstrate the effectiveness of our solutions under the class-incremental continual learning setting. Another future direction is to explore similar ideas under other continual learning settings, e.g., task-incremental learning, online learning, or the setting where new sessions also contain annotations for old classes.
A More Details on Experiment Setup

(2) a classification task where the model only needs to classify a positive trigger mention into a learned event type without considering Other. We did not construct the classification task for the ACE dataset, as the majority of its instances contain only the Other type and removing such instances would result in a very small dataset.

A.2 Baselines
We use the following baselines in our experiments: (1) DRIFTED-BERT: we build a baseline with a fixed pre-trained BERT as the feature extractor and only train its classification layer. We do not apply any other continual learning techniques to it. We primarily use this baseline to study the classifier drift discussed in this work.
(2) ER (Wang et al., 2019): experience replay was introduced to continual IE by Wang et al. (2019). In this work, we use the same strategy as Liu et al. (2022) to select examples to store in the memory and replay them in subsequent sessions. (3) KCN (Cao et al., 2020): the original work proposes a prototype-based method to sample examples to store for replay, as well as a hierarchical knowledge distillation (KD) to constrain the model's update. We adopt their hierarchical distillation along with ER as the KCN baseline. (4) KT (Yu et al., 2021): a framework that transfers knowledge between new and old event types. (5) EMP (Liu et al., 2022): a prompt-based technique that dynamically expands the model architecture to incorporate more classes. (6) CRL (Zhao et al., 2022): proposes consistent representation learning to keep the embeddings of historical relations consistent. Since CRL is designed for classification tasks without Other, we only evaluate this baseline on the classification tasks we build. (7) Upperbound: we train a model jointly on all classes in the dataset as an upper bound in the conventional supervised learning setting. We devise two different upper bounds: (i) BERT-FFE, the upper bound of our ICE-O model, where we only train the classifier and the feature extractor is fixed; the negative instances are used in the classification tasks without Other; and (ii) BERT-FT, the upper bound that trains both the whole BERT and the classifier.

A.3 Implementation Details
We use the pre-trained BERT-large-cased (Devlin et al., 2019) as the fixed feature extractor. We use AdamW (Loshchilov and Hutter, 2019) as the optimizer, with the weight decay set to 1e-2 and a learning rate of 1e-4 for detection tasks and 5e-4 for classification tasks. We apply gradient accumulation with the step set to 8. In each learning session D_k, we limit training to a maximum of 15 epochs. We also adopt an early stopping strategy with a patience of 3, where training is halted if there is no improvement in performance on the development set for 3 epochs. We set the constant logit value δ for the Other class to 0. We apply the experience replay strategy with the same setting as Liu et al. (2022) to ER, KCN, KT, and EMP as an auxiliary technique to mitigate forgetting. We store 20 examples for each class using the herding algorithm (Welling, 2009) and replay one stored instance in each batch during training to limit the computational cost brought by rehearsal.
For CRL, we use the same sample selection and replay strategy as in the original work. For the baselines, we adopt a frozen pre-trained BERT-large together with a trainable Multi-Layer Perceptron (MLP) as the feature extractor.
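The herding-based exemplar selection can be sketched as follows; this is a minimal numpy version of the greedy mean-matching procedure of Welling (2009), with the function name and array shapes assumed for illustration:

```python
import numpy as np

def herding_select(features, m):
    """Greedily select m exemplars whose running mean best approximates the
    class mean (herding). `features` is an (n, d) array of one class's
    feature vectors; returns the indices of the chosen exemplars."""
    mu = features.mean(axis=0)                 # class prototype
    chosen, running_sum = [], np.zeros_like(mu)
    for k in range(1, m + 1):
        # candidate exemplar means if each sample were added next
        candidates = (running_sum + features) / k
        dists = np.linalg.norm(candidates - mu, axis=1)
        dists[chosen] = np.inf                 # forbid repeats
        idx = int(np.argmin(dists))
        chosen.append(idx)
        running_sum += features[idx]
    return chosen
```

In this sketch, the selector would be run once per class at the end of its session, and the 20 selected exemplars per class would populate the replay memory.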

B More Discussions
B.1 More Analysis on Old and New Type Performance
Tables 6 and 7 show the performance on old and new classes for each learning session of the class-incremental named entity detection and classification tasks and the class-incremental relation detection and classification tasks.

Figure 1 :
Figure 1: Illustration of class-incremental event detection, where the model needs to classify each candidate mention into a label from all learned types or Other. The figure shows two classifiers that are incrementally trained in Session 1 and Session 2 and are evaluated on the same sample. After training on Session 2, the classifier mistakenly predicts Other for an Arrest mention due to the classifier drift. The model here uses pre-trained features, and only the classifier is trained.

Figure 2 :
Figure 2: Illustration of the training process in a new learning session for DRIFTED-BERT as well as ICE and its variants. "FE" stands for feature extractor. "O" stands for the Other type. Each circle in the classifier represents a category. The models have learned a classifier (W_1) with 3 classes in Session 1 (S1) and are learning a new classifier (W_2) with 2 classes in Session 2 (S2). ICE and ICE-PL can only handle classification tasks without Other, whereas ICE-O and ICE-PL&O are devised for detection tasks involving Other, where we use a fixed value as the logit of the Other class since it has distinct meanings in different sessions. Note that DRIFTED-BERT is applied to both classification (w/o the Other classifier) and detection (w/ the Other classifier) tasks.

Figure 3 :
Figure 3: Analysis of output logits on the event trigger classification task on MAVEN. Gold refers to the gold logit and NCP refers to the maximum logit from the new classifier. We keep track of how these two types of logits change throughout 5 learning sessions.

Table 1 :
After each learning session k, we compute (1) the F-score on the new classes (C_k) learned in the current session, (2) the accumulated F-score on the old classes (C_1:k-1) from all previous sessions, and (3) the F-score on the old classes (C_k-1) from the previous session, respectively.
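A minimal sketch of how such subset-restricted micro-F1 scores could be computed at the instance level (a hypothetical helper; the actual evaluation may operate on predicted spans and exclude Other):

```python
def micro_f1(gold, pred, classes):
    """Micro-averaged F1 restricted to a subset of classes, e.g. the new
    classes C_k of the current session or the accumulated old classes
    C_1:k-1. `gold` and `pred` are parallel label sequences; only labels
    in `classes` are scored."""
    classes = set(classes)
    tp = sum(1 for g, p in zip(gold, pred) if g == p and g in classes)
    fp = sum(1 for g, p in zip(gold, pred) if p in classes and g != p)
    fn = sum(1 for g, p in zip(gold, pred) if g in classes and g != p)
    if tp == 0:
        return 0.0
    prec, rec = tp / (tp + fp), tp / (tp + fn)
    return 2 * prec * rec / (prec + rec)
```

Passing C_k, C_1:k-1, or C_k-1 as `classes` yields the three scores reported after each session.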

Table 2 :
Results (Micro-F1 score, %) on event detection and classification over 5 learning sessions. We highlight the best scores in bold and the second best with underline. † indicates approaches with experience replay.

Table 3 :
Results (Micro-F1 score, %) on named entity recognition and classification over 8 learning sessions. We highlight the best scores in bold and the second best with underline. † indicates approaches with experience replay.

Table 4 :
Results (Micro-F1 score, %) on relation detection and classification over 10 learning sessions. We highlight the best scores in bold and the second best with underline. † indicates approaches with experience replay.
Following Yu et al. (2021), we use the same train/dev/test split and the same ontology partition to create 5 incremental learning sessions for each dataset, where each session contains approximately the same number of training instances. We create two settings for event trigger: (1) two event detection tasks, where the model is required to evaluate each token in the sentence and assign it a learned event type or Other, and;
A.1 Details of the Datasets
Named Entity We use Few-NERD (Ding et al.