Distilling Causal Effect from Miscellaneous Other-Class for Continual Named Entity Recognition

Continual Learning for Named Entity Recognition (CL-NER) aims to learn a growing number of entity types over time from a stream of data. However, simply learning Other-Class in the same way as new entity types amplifies catastrophic forgetting and leads to a substantial performance drop. The main cause is that Other-Class samples usually contain old entity types, and the old knowledge in these Other-Class samples is not preserved properly. Through causal inference, we identify that the forgetting is caused by the missing causal effect from the old data. To this end, we propose a unified causal framework to retrieve the causality from both new entity types and Other-Class. Furthermore, we apply curriculum learning to mitigate the impact of label noise and introduce a self-adaptive weight for balancing the causal effects between new entity types and Other-Class. Experimental results on three benchmark datasets show that our method outperforms the state-of-the-art method by a large margin. Moreover, our method can be combined with existing state-of-the-art methods to further improve performance in CL-NER.


Introduction
Named Entity Recognition (NER) is a vital task in various NLP applications (Ma and Hovy, 2016).
Traditional NER aims at extracting entities from unstructured text and classifying them into a fixed set of entity types (e.g., Person, Location, Organization). However, in many real-world scenarios, the training data are streamed, and NER systems are required to recognize new entity types to support new functionalities, which can be formulated into the paradigm of continual learning (CL, a.k.a. incremental learning or lifelong learning) (Thrun, 1998; Parisi et al., 2019). For instance, voice assistants such as Siri or Alexa are often required to extract new entity types (e.g., Song, Band) for grasping new intents (e.g., GetMusic) (Monaikul et al., 2021).
Figure 1: Suppose that a model learns four entity types in CoNLL2003 sequentially. "LOC": Location; "MISC": Miscellaneous; "ORG": Organisation; "PER": Person.
However, as is well known, continual learning faces a serious challenge called catastrophic forgetting (McCloskey and Cohen, 1989; Robins, 1995; Goodfellow et al., 2013; Kirkpatrick et al., 2017). More specifically, simply fine-tuning a NER system on new data usually leads to a substantial performance drop on previous data. In contrast, a child can naturally learn new concepts (e.g., Song and Band) without forgetting learned concepts (e.g., Person and Location). Therefore, continual learning for NER (CL-NER) is a ubiquitous issue and a big challenge on the way to human-level intelligence.
In the standard setting of continual learning, only new entity types are recognized by the model in each CL step. For CL-NER, the new dataset contains not only new entity types but also Other-class tokens which do not belong to any new entity type. For instance, about 89% of tokens belong to Other-class in OntoNotes5 (Hovy et al., 2006). Unlike accuracy-oriented tasks such as image/text classification, NER inevitably introduces a vast number of Other-class samples in the training data. As a result, the model is strongly biased towards Other-class (Li et al., 2020). Even worse, the meaning of Other-class varies along with the continual learning process. For example, "Europe" is tagged as Location if and only if the entity type Location is learned in the current CL step. Otherwise, the token "Europe" will be tagged as Other-class. An illustration of Other-class in CL-NER is given in Figure 1. In a nutshell, the continually changing meaning of Other-class, together with the imbalance between entity and Other-class tokens, amplifies the forgetting problem in CL-NER.
Figure 2 illustrates the impact of Other-class samples. We divide the training set into 18 disjoint splits, each corresponding to one entity type to learn. Then, we only retain the labels of the corresponding entity type in each split, while the other tokens are tagged as Other-class.
Next, the NER model learns the 18 entity types one after another, as in CL. To eliminate the impact of forgetting, we assume that all recognized training data can be stored. Figure 2 shows two scenarios in which Other-class samples are additionally annotated with ground-truth labels or not. The results show that ignoring the different meanings of Other-class affects the performance dramatically. The main cause is that Other-class contains old entities. From another perspective, the old entities in Other-class are similar to the reserved samples of old classes in the data replay strategy (Rebuffi et al., 2017). Therefore, we raise a question: how can we learn from Other-class samples for anti-forgetting in CL-NER?
In this study, we address this question with a Causal Framework for CL-NER (CFNER) based on causal inference (Glymour et al., 2016; Schölkopf, 2022). Through causal lenses, we determine that the crux of CL-NER lies in establishing causal links from the old data to new entity types and Other-class. To achieve this, we utilize the old model (i.e., the NER model trained on old entity types) to recognize old entities in Other-class samples and distill causal effects (Glymour et al., 2016) from both new entity types and Other-class simultaneously. In this way, the causality of Other-class can be learned to preserve old knowledge, while the different meanings of Other-class can be captured dynamically. In addition, we design a curriculum learning (Bengio et al., 2009) strategy to enhance the causal effect from Other-class by mitigating the label noise generated by the old model. Moreover, we introduce a self-adaptive weight to dynamically balance the causal effects from Other-class and new entity types. Extensive experiments on three benchmark NER datasets, i.e., OntoNotes5, i2b2 (Murphy et al., 2010) and CoNLL2003 (Sang and De Meulder, 2003), validate the effectiveness of the proposed method. The experimental results show that our method outperforms the previous state-of-the-art method in CL-NER significantly. The main contributions are summarized as follows:
• We frame CL-NER into a causal graph (Pearl, 2009) and propose a unified causal framework to retrieve the causalities from both Other-class and new entity types.
• We are the first to distill causal effects from Other-class for anti-forgetting in CL, and we propose a curriculum learning strategy and a self-adaptive weight to enhance the causal effect from Other-class.
• Through extensive experiments, we show that our method achieves state-of-the-art performance in CL-NER and can be implemented as a plug-and-play module to further improve the performance of other CL methods.
Related Work

Continual Learning for NER
Despite the fast development of CL in computer vision, most of these methods (Douillard et al., 2020; Rebuffi et al., 2017; Hou et al., 2019) are devised for accuracy-oriented tasks such as image classification and fail to preserve the old knowledge in Other-class samples. In our experiments, we find that simply applying these methods to CL-NER does not lead to satisfactory performance.
In CL-NER, a straightforward solution for learning old knowledge from Other-class samples is self-training (Rosenberg et al., 2005; De Lange et al., 2019). In each CL step, the old model is used to annotate the Other-class samples in the new dataset. Next, a new NER model is trained to recognize both old and new entity types in the dataset. The main disadvantage of self-training is that the errors caused by wrong predictions of the old model are propagated to the new model (Monaikul et al., 2021). Monaikul et al. (2021) proposed a method based on knowledge distillation (Hinton et al., 2015) called ExtendNER, where the old model acts as a teacher and the new model acts as a student. Compared with self-training, this distillation-based method takes the uncertainty of the old model's predictions into consideration and reaches the state-of-the-art performance in CL-NER.
Recently, Das et al. (2022) alleviated the problem of Other-class tokens in few-shot NER with contrastive learning and pretraining techniques. Unlike them, our method explicitly alleviates the problem brought by Other-class tokens through a causal framework in CL-NER.

Causal Inference
Causal inference (Glymour et al., 2016; Schölkopf, 2022) has recently been introduced to various computer vision and NLP tasks, such as semantic segmentation (Zhang et al., 2020), long-tailed classification (Tang et al., 2020; Nan et al., 2021), distantly supervised NER (Zhang et al., 2021) and neural dialogue generation (Zhu et al., 2020). Hu et al. (2021) first applied causal inference in CL and pointed out that the vanishing old-data effect leads to forgetting. Inspired by the causal view in Hu et al. (2021), we mitigate the forgetting problem in CL-NER by mining the old knowledge in Other-class samples.

Causal Views on (Anti-) Forgetting
In this section, we explain (anti-) forgetting in CL from a causal perspective. First, we model the causalities among data, feature, and prediction at any consecutive CL step with a causal graph (Pearl, 2009) to identify the forgetting problem. The causal graph is a directed acyclic graph whose nodes are variables and whose directed edges are causalities between nodes. Next, we introduce how causal effects are utilized for anti-forgetting.

Causal Graph
Figure 3a shows the causal graph of CL-NER when no anti-forgetting techniques are used. Specifically, we denote the old data as S; the new data as D; the features of the new data extracted by the old and new models as X_0 and X; and the prediction on the new data as Ŷ (i.e., the probability distribution over scores). The causality between nodes is as follows: (1) D → X → Ŷ: D → X represents that the feature X is extracted by the backbone model (e.g., BERT (Devlin et al., 2019)), and X → Ŷ indicates that the prediction Ŷ is obtained from the feature X with the classifier (e.g., a fully-connected layer); (2) S → X_0 ← D: these links represent that the old feature representation of the new data, X_0, is determined by the new data D and the old model trained on the old data S. Figure 3a shows that forgetting happens because there are no causal links between S and Ŷ. More explanations of forgetting in CL-NER are given in Appendix A.

Colliding Effects
In order to build causal paths from S to Ŷ, a naive solution is to store (a fraction of) the old data, so that a causal link S → D is built. However, storing old data contradicts the scenario of CL to some extent. To deal with this dilemma, Hu et al. (2021) proposed to add a causal path S ↔ D between old and new data by using the Colliding Effect (Glymour et al., 2016). Consequently, S and D become correlated when we control the collider X_0. Here is an intuitive example: a causal graph sprinkler → pavement ← weather represents that the pavement's condition (wet/dry) is determined by both the weather (rainy/sunny) and the sprinkler (on/off). Typically, the weather and the sprinkler are independent of each other. However, if we observe that the pavement is wet and know that the sprinkler is off, we can infer that the weather is likely to be rainy, and vice versa.

A Causal Framework for CL-NER
In this section, we frame CL-NER into a causal graph and identify that learning the causality in Other-class is crucial for CL-NER. Based on the characteristics of CL-NER, we propose a unified causal framework to retrieve the causalities from both Other-class and new entity types. We are the first to distill causal effects from Other-class for anti-forgetting in CL. Furthermore, we introduce a curriculum learning strategy and a self-adaptive weight to enhance the causal effect from Other-class.

Problem Formulation
In the i-th CL step, we are given an NER model M_i trained on a set of old entity types, and the goal is to learn a new model M_{i+1} that recognizes both the old and the new entity types. Suppose the model consists of a backbone network for feature extraction and a classifier for classification. As a common practice, M_{i+1} is first initialized with the parameters of M_i, and then the dimensions of the classifier are extended to accommodate the new entity types. Then, M_i guides the learning process of M_{i+1} through knowledge distillation (Monaikul et al., 2021) or regularization terms (Douillard et al., 2020) to preserve old knowledge. Our method is based on knowledge distillation, where the old model M_i acts as a teacher and the new model M_{i+1} acts as a student. Our method further distills causal effects in the process of knowledge distillation.
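The classifier-extension step can be sketched as follows. This is an illustrative implementation over a plain weight matrix; the initialization of the new rows (small random values) is an assumption, since the text does not prescribe one:

```python
import numpy as np

def extend_classifier(W_old, b_old, num_new, rng=None):
    """Extend a linear classifier (logits = x @ W.T + b) with rows for new
    entity types, keeping the weights learned for old types so that M_{i+1}
    starts from M_i. The initialization of the new rows is an assumption."""
    rng = rng if rng is not None else np.random.default_rng(0)
    in_dim = W_old.shape[1]
    W_new = np.concatenate([W_old, 0.02 * rng.standard_normal((num_new, in_dim))])
    b_new = np.concatenate([b_old, np.zeros(num_new)])
    return W_new, b_new
```

The old-type rows are copied verbatim, so predictions on old entity types are unchanged before training on the new data begins.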

Distilling Colliding Effects in CL-NER
Based on the causal graph in Figure 3a, we observe that the crux of CL lies in building causal paths between the old data and the prediction of the new model. If we utilize colliding effects, the causal path between old and new data can be built without storing old data.
To distill the colliding effects, we first need to find tokens in the new data which have the same feature representation X_0 in the old feature space, i.e., to condition on X_0. However, it is almost impossible to find such exactly matched tokens since features are sparse in high-dimensional space (Altman and Krzywinski, 2018). Following Hu et al. (2021), we approximate the colliding effect with a K-Nearest Neighbor (KNN) strategy. Specifically, we select a token as the anchor token and search for the k nearest neighbor tokens whose features resemble the anchor token's feature in the old feature space. Next, when calculating the prediction of the anchor token, we use the matched tokens for a joint prediction. Note that in backpropagation, only the gradient of the anchor token is computed. Figure 4 shows a demonstration of distilling colliding effects.
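The KNN matching step can be sketched as below; the Euclidean distance metric is an assumption, since the text does not specify one here:

```python
import numpy as np

def knn_match(anchor_feat, candidate_feats, k=3):
    """Find the k candidate tokens closest to the anchor in the OLD feature
    space, approximating conditioning on the collider X_0. Returns indices
    of matched tokens, nearest first (Euclidean distance is an assumption)."""
    dists = np.linalg.norm(candidate_feats - anchor_feat, axis=1)
    return np.argsort(dists)[:k]
```

In practice the candidate pool is restricted to tokens of the same kind as the anchor (new-entity tokens for Effect_E, Defined-Other-Class tokens for Effect_O), and gradients flow only through the anchor's prediction.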
Although Other-class tokens usually do not directly guide the model to recognize new entity types, they contain tokens from old entity types, which allow the model to recall what it has learned. Naturally, we use the old model to recognize the Other-class tokens which actually belong to old entity types. Since these Other-class tokens belong to previously defined entity types, we call them Defined-Other-Class tokens, and we call the remaining Other-class tokens Undefined-Other-Class tokens.
Based on the characteristics of NER, we extend the causal graph in Figure 3a to Figure 3b. The key adjustment is that the node of new data is split into two nodes: new entity tokens D_E and Defined-Other-Class tokens D_O. Then, we apply the colliding effects on D_E and D_O respectively, so that D_E and D_O collide with S on the nodes X_0^E and X_0^O in the old feature space. In this way, we build two causal paths from the old data S to the new predictions Ŷ. In the causal graph, we ignore the tokens from Undefined-Other-Class since they help the model neither learn new entity types nor review old knowledge. Moreover, we expect the model to update, instead of preserve, the knowledge about Other-class in each CL step. Here, we consider the two paths separately because the colliding effects are distilled from different kinds of data and calculated in different ways.

A causal framework for CL-NER
Formally, we define the total causal effect Effect as in Eq. (1), where Effect_E and Effect_O denote the colliding effects of the new entity types and of Defined-Other-Class, respectively.
In Eq. (2), CE(·, ·) and KL(·, ·) represent the cross-entropy and KL divergence losses, and D_i, D_j are the i-th and j-th tokens in the new data. In the cross-entropy loss, Y_i is the ground-truth entity type of the i-th token. In the KL divergence loss, Y_j is the soft label of the j-th token given by the old model over the old entity types. In both losses, the prediction term is the weighted average of the prediction scores over the anchor and matched tokens. When calculating KL(·, ·), we follow the common practice in knowledge distillation and introduce temperature parameters T_t and T_s for the teacher (old model) and the student (new model), respectively. Here, we omit T_t, T_s in KL(·, ·) for notational simplicity. The weighted average score of the i-th token is calculated as in Eq. (3), where the i-th token is the anchor token and K matched tokens are selected according to the KNN strategy. We sort the K matched tokens in ascending order of distance to the anchor token in the old feature space. Ŷ_i and Ŷ_{i,k} are the prediction scores of the i-th token and its k-th matched token, and W_i and W_{i,k} are their respective weights. The weight constraints ensure that tokens closer to the colliding feature have a more significant effect.
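The weighted joint prediction can be sketched as follows, using the default weights W_i = 1/2 and W_{i,k} = 1/(2K) reported in the experimental settings; any distance-based decay over matched tokens is omitted for simplicity:

```python
import numpy as np

def joint_prediction(anchor_scores, matched_scores):
    """Weighted average over the anchor and its K matched tokens: the anchor
    keeps weight 1/2, and the K matched tokens share the other 1/2 equally
    (the paper's default; a distance-based decay would also satisfy the
    stated weight constraints)."""
    K = len(matched_scores)
    combined = 0.5 * np.asarray(anchor_scores, dtype=float)
    for s in matched_scores:
        combined = combined + np.asarray(s, dtype=float) / (2 * K)
    return combined
```

Since the weights sum to 1, the result remains a valid score distribution if the inputs are.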
So far, we have calculated the effects in D_E and D_O while ignoring Undefined-Other-Class. Following Monaikul et al. (2021), we apply standard knowledge distillation to let the new model learn from the old model on these tokens. To this end, we only need to rewrite Effect_O in Eq. (1) as in Eq. (4), where D_UO is the data belonging to Undefined-Other-Class. Eq. (4) can be seen as treating samples from D_O and D_UO in the same way, except that samples from D_UO have no matched tokens. We summarize the proposed causal framework in Figure 5.
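A per-token view of the resulting objective can be sketched as below. This is a simplified illustration that treats each token independently; the teacher/student temperatures follow the ExtendNER settings reported later (1 and 2), and `scores` stands for the joint prediction after KNN matching (or the plain prediction, for Undefined-Other-Class tokens with no matches):

```python
import numpy as np

def softmax(z, T=1.0):
    z = np.asarray(z, dtype=float) / T
    e = np.exp(z - z.max())
    return e / e.sum()

def token_loss(scores, is_entity, target, teacher_scores=None, Tt=1.0, Ts=2.0):
    """Cross-entropy against the gold label for new-entity tokens; KL
    divergence against the old model's soft label for Other-class tokens
    (an illustrative sketch of the combined objective, not the exact Eq. (4))."""
    if is_entity:
        p = softmax(scores)
        return -np.log(p[target])
    p_teacher = softmax(teacher_scores, T=Tt)
    p_student = softmax(scores, T=Ts)
    return float(np.sum(p_teacher * (np.log(p_teacher) - np.log(p_student))))
```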

Mitigating Label Noise in Effect_O
In our method, we use the old model to predict the labels of Defined-Other-Class tokens. However, this inevitably generates label noise when calculating Effect_O. To address this problem, we adopt curriculum learning to mitigate the label noise. Curriculum learning has been widely used for handling label noise (Guo et al., 2018) in computer vision. Arazo et al. (2019) empirically found that networks tend to fit correct samples before noisy samples. Motivated by this, we introduce a confidence threshold δ (δ ∈ [0, 1]) to encourage the model to learn first from clean Other-class samples and then from noisier ones. Specifically, when calculating Effect_O, we only select Defined-Other-Class tokens whose predictive confidences are larger than δ for distilling colliding effects, while the others are used for knowledge distillation. The value of δ changes along with the training process; its value in the i-th epoch is given by Eq. (5), where m, δ_1 and δ_m are predefined hyper-parameters and δ_m should be smaller than δ_1.
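Since Eq. (5) itself is elided in this copy, one plausible schedule consistent with the stated constraints (δ starts at δ_1 and anneals to the smaller δ_m over m epochs) is a linear one:

```python
def confidence_threshold(epoch, delta_1=1.0, delta_m=0.0, m=10):
    """Anneal the confidence threshold from delta_1 down to delta_m over the
    first m epochs. The linear form is an assumption; the exact schedule in
    Eq. (5) is not reproduced here. Epochs are 1-indexed."""
    if epoch >= m:
        return delta_m
    return delta_1 - (delta_1 - delta_m) * (epoch - 1) / (m - 1)
```

With the paper's defaults (δ_1 = 1, δ_m = 0, m = 10), only near-certain Defined-Other-Class tokens contribute to Effect_O early on, and progressively noisier ones are admitted later.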

Balancing Effect_E and Effect_O
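The self-adaptive weight equation of this subsection is elided in this copy. A plausible form, mirroring the adaptive loss weight in LUCIR (Hou et al., 2019) and consistent with the description that follows (λ_base as the initial weight, C_O and C_N as the numbers of old and new entity types), is:

```python
import math

def adaptive_weight(num_old_types, num_new_types, lambda_base=2.0):
    """Self-adaptive weight on Effect_O: grows as old types outnumber new
    ones, so more effort goes to preserving old knowledge. The sqrt form is
    an assumption borrowed from LUCIR, not the paper's exact equation."""
    return lambda_base * math.sqrt(num_old_types / num_new_types)
```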
Figure 5 shows that the total causal effect Effect consists of Effect_E and Effect_O, where Effect_E is for learning new entities while Effect_O is for reviewing old knowledge. As CL proceeds, the need to preserve old knowledge varies (Hou et al., 2019). For example, more effort should be spent on preserving old knowledge when there are 15 old classes and 1 new class than when there are 5 old classes and 1 new class. In response, we introduce a self-adaptive weight for balancing Effect_E and Effect_O, where λ_base is the initial weight and C_O, C_N are the numbers of old and new entity types, respectively. In this way, the causal effects from new entity types and Other-class are dynamically balanced as the ratio of old classes to new classes changes. Finally, the objective of the proposed method is the weighted combination of Effect_E and Effect_O.

Experiments

Settings
Datasets. We conduct experiments on three widely used datasets, i.e., OntoNotes5 (Hovy et al., 2006), i2b2 (Murphy et al., 2010) and CoNLL2003 (Sang and De Meulder, 2003). To ensure that each entity type has enough samples for training, we filter out the entity types which contain fewer than 50 training samples. We summarize the statistics of the datasets in Table 5 in Appendix F. Following Monaikul et al. (2021), we split the training set into disjoint slices, and in each slice, we only retain the labels which belong to the entity types to learn while setting the other labels to Other-class. Different from Monaikul et al. (2021), we adopt a greedy sampling strategy to partition the training set to better simulate real-world scenarios. Specifically, the sampling algorithm encourages the samples of each entity type to be mainly distributed in the slice where that type is learned. We provide more explanations and the detailed algorithm in Appendix B.
Training. We use bert-base-cased (Devlin et al., 2019) as the backbone model and a fully-connected layer for classification. Following previous work in CL (Hu et al., 2021), we predefine a fixed order of entity types to learn. For testing, we retain the labels of all recognized entity types while setting the others to Other-class in the test set.
Table 1: Comparisons with state-of-the-art methods on I2B2 and OntoNotes5. The average results as well as standard deviations are provided. Mi-F1: micro-F1; Ma-F1: macro-F1; Forget: forgetting; ↑: higher is better; ↓: lower is better. The best F1 results are in bold.
Metrics. Considering the class imbalance problem in NER, we adopt Micro F1 and Macro F1 for measuring model performance. We report the average result over all CL steps (including the first step) as the final result.
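For reference, the two metrics can be computed from per-type counts as follows; micro-F1 pools true positives, false positives, and false negatives over all entity types, while macro-F1 averages per-type F1 scores so that rare types count equally:

```python
def f1(tp, fp, fn):
    """Standard F1 from true positives, false positives, false negatives."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    return 2 * p * r / (p + r) if p + r else 0.0

def micro_macro_f1(counts):
    """counts: {entity_type: (tp, fp, fn)}. Micro-F1 pools the counts;
    Macro-F1 averages per-type F1, which is why it is informative under
    class imbalance."""
    tp = sum(c[0] for c in counts.values())
    fp = sum(c[1] for c in counts.values())
    fn = sum(c[2] for c in counts.values())
    micro = f1(tp, fp, fn)
    macro = sum(f1(*c) for c in counts.values()) / len(counts)
    return micro, macro
```

A frequent type with high F1 can mask a rare type with low F1 in micro-F1, whereas macro-F1 exposes it.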
Baselines. We consider four baselines: ExtendNER (Monaikul et al., 2021), Self-Training (ST) (Rosenberg et al., 2005; De Lange et al., 2019), LUCIR (Hou et al., 2019) and PODNet (Douillard et al., 2020). ExtendNER is the previous state-of-the-art method in CL-NER. LUCIR and PODNet are state-of-the-art CL methods in computer vision. Detailed descriptions of the baselines and their training settings are given in Appendix C.
Hyper-Parameters. We set the number of matched tokens K = 3 and the weights W_i = 1/2 and W_{i,k} = 1/(2K). For the parameters in the curriculum learning strategy, we set δ_1 = 1, δ_m = 0 and m = 10. We set the initial value of the balancing weight λ_base = 2. More training details are shown in Appendix D.

Results and Analysis
Comparisons with State-Of-The-Art. We consider two scenarios for each dataset: (1) training the first model the same way as in the following CL steps; (2) training the first model with half of all entity types. The former scenario is more challenging, whereas the latter is closer to real-world scenarios since it allows models to learn enough knowledge before incremental learning. Apart from that, we consider fine-tuning without any anti-forgetting techniques (Finetune Only) as a lower bound for comparison.
The results on I2B2 and OntoNotes5 are summarized in Table 1 and Figure 6. Due to the space limitation, we provide the results on CoNLL2003 in Table 6 and Figure 9 in Appendix F. In most cases, our method achieves the best performance. In particular, our method outperforms the previous state-of-the-art method in CL-NER (i.e., ExtendNER) by a large margin. Besides, we visualize the features of our method and ExtendNER for comparison in Appendix E. The performances of PODNet and LUCIR are much worse than ours when more CL steps are performed. The reason could be that neither of them differentiates Other-class from entity types, so the old knowledge in Other-class is not preserved. Our method encourages the model to review old knowledge from both new entity types and Other-class in the form of distilling causal effects.
Ablation Study. We ablate our method, and the results are summarized in Table 2. To validate the effectiveness of the proposed causal framework, we remove only the colliding effects in Other-class and in new entity types for the settings w/o Effect_O and w/o Effect_E, respectively.
Table 2: The ablation study of our method on three datasets in the setting FG-1-PG-1. AW: adaptive weight; CuL: curriculum learning strategy; Mi-F1: micro-F1; Ma-F1: macro-F1.

Hyper-Parameter Analysis. We provide hyper-parameter analysis on I2B2 with the setting FG-8-PG-2. We consider three hyper-parameters: the number of matched tokens K, the initial value of the balancing weight λ_base, and the initial value of the confidence threshold δ_1. The results in Table 3 show that a larger K is beneficial. However, as K becomes larger, the run time increases correspondingly, because more forward passes are required during training. Therefore, we select K = 3 by default to balance effectiveness and efficiency. The results also show that δ_1 = 1 reaches the best result, which indicates that it is more effective to learn Effect_E first and then gradually introduce Effect_O during training. Otherwise, the old model's wrong predictions will significantly affect the model's performance. Additionally, we find that the performance drops substantially when λ_base is too large. Note that we did not carefully search for the best hyper-parameters, and the default ones are used throughout the experiments. Therefore, elaborately adjusting the hyper-parameters may lead to superior performance on specific datasets and scenarios.

Ethics Statement
For the consideration of ethical concerns, we make the following declarations: (1) We conduct all experiments on existing datasets derived from public scientific research.
(2) We describe the statistics of the datasets and the hyper-parameter settings of our method. Our analysis is consistent with the experimental results.
(3) Our work does not involve sensitive data or sensitive tasks. (4) We will open-source our code for reproducibility.

Limitations
Although the proposed method alleviates the catastrophic forgetting problem to some extent, its performance is still unsatisfactory when more CL steps are performed. Additionally, calculating causal effects from Other-class depends on the old model's predictions, so errors propagate to the following CL steps. Moreover, the proposed method requires more computation and a longer training time since the predictions of matched samples must be calculated.

A Forgetting in CL-NER
To identify the cause of forgetting, we consider the difference in the prediction Ŷ when the old data S exists or not. For each CL step, the effect of the old data S can be calculated as:

Effect_S = P(Ŷ = ŷ | do(S = s)) − P(Ŷ = ŷ | do(S = 0)) = P(Ŷ = ŷ | S = s) − P(Ŷ = ŷ) = 0,    (11)

where do(·) is the causal intervention (Pearl, 2009, 2014), which assigns a certain value to a variable without considering its parent nodes (causes). In the first equality, do(S = 0) represents the null intervention, i.e., setting the old data to null. In the second equality, P(Ŷ = ŷ | do(S = s)) = P(Ŷ = ŷ | S = s) because S has no parent nodes. In the third equality, P(Ŷ = ŷ | S = s) = P(Ŷ = ŷ) since all causal paths from S to Ŷ are blocked by the collider X_0. From Eq. (11), we find that the missing old-data effect causes the forgetting. We neglect the effect of the initial parameters adopted from the old model since it decays exponentially towards 0 during learning (Kirkpatrick et al., 2017).

B Greedy Sampling Algorithm for CL-NER
In real-world scenarios, the new data should focus on the new entity types, i.e., most sentences should contain tokens from new entity types. Suppose we randomly partition a dataset for CL-NER as in Monaikul et al. (2021). In that case, each slice contains a large number of sentences whose tokens all belong to Other-class, so models tend to be biased towards Other-class at inference time. A straightforward solution is to filter out all sentences with only Other-class tokens. However, this brings a new problem: the slices' sizes become imbalanced.
To address this problem, we propose a sampling algorithm for partitioning a dataset in CL-NER (Algorithm 1). Simply put, we preferentially allocate sentences containing low-frequency entity types to the corresponding slice until the slice contains the required number of sentences. If a sentence contains no entity types or the corresponding slices are full, we randomly allocate the sentence to an incomplete slice. In this way, we partition the dataset into slices with balanced sizes, and each slice mainly contains the entity types to learn.
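A simplified sketch of the greedy allocation is given below; the exact tie-breaking and the randomized fallback of Algorithm 1 are omitted, and the data representation (sentence id → list of entity types it contains) is an assumption:

```python
from collections import Counter

def greedy_partition(sentences, slice_types, slice_size):
    """Greedy split sketch: sentences carrying low-frequency entity types
    are allocated to that type's slice first; leftovers (including
    all-Other-class sentences) fill any incomplete slice. Simplified from
    Algorithm 1: tie-breaking is deterministic here, not randomized."""
    freq = Counter(t for ents in sentences.values() for t in ents)
    slices = {e: [] for e in slice_types}
    leftovers = []
    for sid, ents in sentences.items():
        # Prefer the rarest entity type whose slice still has room.
        open_types = [t for t in sorted(ents, key=lambda t: freq[t])
                      if t in slices and len(slices[t]) < slice_size]
        if open_types:
            slices[open_types[0]].append(sid)
        else:
            leftovers.append(sid)
    for sid in leftovers:  # fill incomplete slices with remaining sentences
        for e in slice_types:
            if len(slices[e]) < slice_size:
                slices[e].append(sid)
                break
    return slices
```

Each slice thus reaches the required size while concentrating the sentences that mention its entity type.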
To compare Algorithm 1 with the random sampling in Monaikul et al. (2021), we provide the label distributions in each slice of the training data in Figure 7. Figure 7 shows that the greedy sampling generates more realistic datasets for CL-NER. When we use the randomly partitioned dataset for training in the setting FG-1-PG-1, the micro-F1 score of our method is 16.12, 17.43, and 12.75 (%) on OntoNotes5, I2B2, and CoNLL2003, respectively, indicating that the number of entities in each slice is inadequate for learning a NER model. Therefore, the greedy sampling alleviates the data requirements of CL-NER.

C Baselines Introductions and Settings
The introductions to the baselines in our experiments and their experimental settings are as follows. Note that we do not use any reserved samples from old classes in LUCIR and PODNet for a fair comparison, since our method requires no old data.
• Self-Training (ST) (Rosenberg et al., 2005; De Lange et al., 2019): ST first utilizes the old model to annotate the Other-class tokens with old entity types. Then, the new model is trained on the new data with annotations of all entity types. Finally, the cross-entropy loss on all entity types is minimized.
• ExtendNER (Monaikul et al., 2021): ExtendNER has a similar idea to ST, except that the old model provides the soft labels (i.e., probability distributions) of Other-class tokens. Specifically, the cross-entropy loss is computed for entity-type tokens, and the KL divergence loss is computed for Other-class tokens.
During training, the sum of the cross-entropy loss and the KL divergence loss is minimized. Following Monaikul et al. (2021), we set the temperatures of the teacher (old model) and the student (new model) to 1 and 2, respectively.
• LUCIR (Hou et al., 2019): LUCIR develops a framework for incrementally learning a unified classifier in continual image classification tasks. The total loss consists of three terms: (1) the cross-entropy loss on the new-class samples; (2) the distillation loss between the features extracted by the old model and those extracted by the new one; (3) the margin-ranking loss on the reserved samples of old classes. In our experiments, we compute the cross-entropy loss for new entity types, the distillation loss for all entity types, and the margin-ranking loss for Other-class samples instead of reserved samples. Following Hou et al. (2019), λ_base (the loss weight for the distillation loss) is set to 50, K (the number of top new-class embeddings chosen for the margin-ranking loss) is set to 1, and m (the threshold of the margin-ranking loss) is set to 0.5 for all tasks.
• PODNet (Douillard et al., 2020): PODNet has a similar idea to LUCIR to combat the catastrophic forgetting in continual learning

Figure 2: An illustration of the impact of Other-class samples on OntoNotes5. We consider two scenarios with different extra annotation levels on Other-class samples: (1) annotating all recognized entity types on the data in the current CL step (Current); (2) no extra annotations on Other-class samples (None).

Figure 3: The causal graph for CL-NER: (a) forgetting happens when there are no causal paths from old data to new predictions; (b) anti-forgetting builds causal paths from old data to new predictions through new entities (D_E) and Other-class samples (D_O). We call the causal effects in these two links Effect_E and Effect_O, respectively.
Figure 4: A demonstration of distilling colliding effects in the old and new feature spaces.

Figure 5: A demonstration of the proposed causal framework for CL-NER.

Figure 7: Comparison of the greedy sampling and random sampling on OntoNotes5. Each slice contains one entity type to learn.
Specifically, in the w/o Effect_O setting, we apply knowledge distillation to all Other-class samples, while in the w/o Effect_E setting, we calculate the cross-entropy loss for classification. Note that our model is the same as ExtendNER when no causal effects are used (i.e., w/o Effect_O & Effect_E). The results show that both Effect_O and Effect_E play essential roles in our framework. Furthermore, the adaptive weight and the curriculum learning strategy help the model better learn causal effects in the new data.

Table 4: Combining our method with other baselines on three datasets in the setting FG-1-PG-1. Mi-F1: micro-F1; Ma-F1: macro-F1. CF represents applying causal effects. As shown in Figure 5, Effect_E is based on the cross-entropy loss for classifying new entities, while Effect_O is based on the KL divergence loss for prediction-level distillation. We use LUCIR and ST as the baselines for demonstration. To combine LUCIR with our method, we substitute the classification loss in LUCIR with Effect_E and substitute the feature-level distillation with Effect_O. When combining with ST, we only replace the soft label with the hard label when calculating Effect_O.