Lifelong Event Detection with Knowledge Transfer

Traditional supervised Information Extraction (IE) methods can extract structured knowledge elements from unstructured data, but they are limited to a pre-defined target ontology. In reality, the ontology of interest may change over time, adding emergent new types or more fine-grained subtypes. We propose a new lifelong learning framework to address this challenge. We focus on lifelong event detection as an exemplar case and propose a new problem formulation that also generalizes to other IE tasks. In event detection and more general IE tasks, rich correlations and semantic relatedness exist among hierarchical knowledge element types. In our proposed framework, knowledge is transferred between learned old event types and new event types. Specifically, we update old knowledge with new event types' mentions using a self-training loss, and we aggregate old event types' representations, weighted by their similarities with new event types, to initialize the new event types' representations. Experimental results show that our framework outperforms competitive baselines with a 5.1% absolute gain in F1 score. Moreover, it boosts the F1 score by over 30% absolute on some new long-tail rare event types with few training instances. Our knowledge transfer module improves performance on both learned and new event types under the lifelong learning setting, showing that it helps consolidate old knowledge and improve novel knowledge acquisition.


Introduction
The Information Extraction (IE) task aims to extract informative knowledge elements (e.g., entities, relations, and events) from natural language. In practice, we usually extract knowledge elements for a pre-defined ontology consisting of various types of knowledge elements of interest. In this setting, IE is often formulated as a classification problem over types in the ontology, with an additional Not-Any (NA) type to identify text spans that don't belong to any ontology type (negative instances). Ontology-based supervised IE methods can produce more accurate and structured results with annotated training data than open-domain IE (Yates et al., 2007), while being limited to the ontology. However, the ontology of interest may change over time, adding emergent new types for various knowledge elements (e.g., changing bombing to disease outbreak), or adding more fine-grained subtypes for some existing general types (e.g., breaking justice into acquit, arrest, charge, convict, release, sentence, and trial). In the past twenty-five years, the IE community has been shifting from one ontology to a new one roughly every five years based on consumers' needs, across many shared tasks, including MUC (Grishman and Sundheim, 1996), ACE, TAC-KBP (Ji et al., 2011), DARPA AIDA, and DARPA KAIROS. When we face a new ontology, we need to annotate a new training set for the new types and retrain a new system on the new ontology while discarding the established system for the old ontologies. In contrast, we propose a new and more economical paradigm, Lifelong Event Detection, which can combine the old system and new resources in a never-ending continual learning fashion. We show an example of lifelong event detection in Figure 1.
Such a paradigm is commonly referred to as Lifelong Learning, Continual Learning, or Incremental Learning (Ring, 1995; Thrun, 1998). We formulate lifelong event detection by modifying the commonly studied class-incremental lifelong learning setting, where the model incrementally learns to classify more classes from only positive instances. We add a special NA type denoting negative instances that are not event triggers for any type. For example, injured in the sentence "Bob is injured." is a mention of an Injure event, and is is a negative instance. Whenever training on new event types, the new training data includes instances of the new types and negative instances. This makes lifelong event detection differ from class-incremental learning, where the new data only contains instances for the new classes.
Another challenge in lifelong event detection is the naturally imbalanced distribution of event types in natural language, as shown in Figure 2 (the Y-axis is the number of training mentions for each event type divided by that of the most frequent event type, and the X-axis is the rank of each event type by number of mentions divided by the total number of types in the ontology). Existing methods (Rebuffi et al., 2017; Castro et al., 2018; Wu et al., 2019; Hou et al., 2019) for class-incremental learning usually study relatively balanced classification datasets, and previous attempts (Nguyen et al., 2016) on incremental learning for event detection ignore this problem by only experimenting on frequent types. In light of the challenges above, we propose a knowledge transfer framework that takes advantage of the rich connections between types, such as contextual and semantic similarity. For instance, mentions of both the Trial and the Charge events frequently contain court entities and crime-related content in context, and their respective typical triggers, such as try and charge, usually have similar word embeddings representing their semantics. In our proposed lifelong event detection framework, we measure the relatedness as a model's prediction of new event types' mentions on old event types, i.e., P(c_old | m_c_new), where c_old is an old event type and m_c_new is a mention of a new type. The intuition is that an old event type's (e.g., Trial) identifier should predict a higher score for a related new type such as Charge than for an unrelated new type such as Marry. We then transfer knowledge between related event types in two directions. We use the representations of old event types to help the learning of new event types through knowledge-aware initialization, transferring knowledge from related learned event types to new event types.
We also transfer knowledge from related new event types to the old types by continually training the representations of old events with the data for new event types using a self-training loss. Our proposed framework can improve learning both old and new types, especially for acquiring new long-tail types.
To summarize, our contributions are three-fold:
• We propose a new formulation for lifelong event detection;
• We study the unique challenge in lifelong event detection posed by the heavily imbalanced type distribution;
• We propose a novel framework that transfers knowledge between related types to benefit the learning of both old and new event types, especially new long-tail types. Our framework outperforms state-of-the-art methods by over 5.1% absolute F1 score under our setting, and improves F1 by over 30% on several rare types.

Problem Formulation
In event detection, given a text sequence w_{1:L} and a target text span specified by start and end offsets (s, e), we aim to classify the target span into a type in the ontology, or label it as NA if it is not an event mention. This definition generalizes to many other IE tasks by varying the number m of target spans. For instance, entity recognition follows the same setting as event detection with m = 1, and relation extraction takes m = 2 target entity spans to identify the relation between them. In lifelong event detection, the training phase is separated into a sequence of time stages, which we denote by t. A stream of datasets {D_t} containing training instances in the above form is provided to the model sequentially according to the current time stage t. Each dataset consists of training instances for a set of types C_t = {c_t^1, c_t^2, ..., c_t^{n_t}} and negative instances for NA. We denote NA by c_φ in the equations below. C_t ∩ C_{t'} = ∅ for t ≠ t', meaning that the model continually learns new event types. At stage t, the model needs to detect events for the combined ontology of all seen types, i.e., O_t = C_1 ∪ ... ∪ C_t. Throughout this paper, we don't include the NA type when mentioning the term ontology unless specified. Compared with the traditional supervised learning setting, the main difference of lifelong learning is that the model is exposed only to the training data D_t that covers a subset of ontology O_t, while the traditional setting always trains the model on the full training data ∪_t D_t for the full ontology ∪_t O_t. We refer to this latter setting as "joint training" on all event types in this paper. Since our definition of event detection generalizes to other IE tasks as shown above, this formulation can also generalize to various lifelong IE tasks.
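The stage-by-stage protocol above can be sketched as a small Python loop. The `Model` interface (`train_on`, `evaluate`) is a hypothetical placeholder; the point is that at stage t the model only sees D_t, while evaluation always covers the union O_t of all types seen so far.

```python
def lifelong_loop(model, dataset_stream):
    """dataset_stream yields (D_t, C_t): stage data and its new type set.
    Type sets across stages are disjoint; evaluation at stage t covers O_t."""
    seen_types = set()   # O_t, grown incrementally
    scores = []
    for t, (D_t, C_t) in enumerate(dataset_stream, start=1):
        assert not (seen_types & C_t), "type sets across stages must be disjoint"
        seen_types |= C_t
        model.train_on(D_t)                              # only current-stage data
        scores.append(model.evaluate(types=seen_types))  # full O_t evaluation
    return scores
```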

Baseline Framework
We first introduce a simple baseline framework that applies experience replay and knowledge distillation to a span-based event detection model for lifelong event detection.
Span-based Event Detection Model. Similar to Wadden et al. (2019), we consider x = (w_{1:L}, s, e) as a training instance consisting of the sentence w_{1:L} and span offsets (s, e) as described in Section 2.1, with y the corresponding event type label. We first use BERT (Devlin et al., 2019) to encode the sentence, h_{1:L} = BERT(w_{1:L}), and compute the span feature x = InputMap([h_s; h_e]), where InputMap is a two-layer feedforward neural network that maps the concatenated BERT outputs to a lower-dimensional feature vector. For each type c_i ∈ O_t we also assign a unique type embedding c_i, and the score for each type is computed via the inner product o_i = x · c_i. Since the instances corresponding to NA are not semantically consistent, it could pose an additional challenge to learn a type embedding for c_φ. Hence, we don't compute a score for c_φ as above and instead always set o_φ = 0. In this way, we essentially train the model to output negative scores on all valid types for negative instances. We use a softmax over the scores to obtain the output probability distribution, and the cross-entropy loss L_C is used to train the model on the current dataset. Although some of the methods designed explicitly for event detection (Ji and Grishman, 2008; Chen et al., 2015; Feng et al., 2016; Liu et al., 2017; Yan et al., 2019; Tong et al., 2020) may have better detection performance, this model is more flexible in that (1) by taking event detection as label prediction for text spans, many existing lifelong learning methods for classification become applicable; and (2) as described in Section 2.1, this architecture can handle a variety of IE tasks without significant modification.
Experience Replay. An exemplar set containing training instances for all learned types is kept and continually updated to remind the model when old training data is no longer available. We use E_t to denote the exemplar set for types in ontology O_t.
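The span-based scorer described above can be sketched in PyTorch as follows. The class name, dimensions, and the precomputed-encodings interface are our own assumptions (the encoder is frozen, so BERT outputs can be treated as given); `input_map` corresponds to the two-layer InputMap, and NA receives a fixed score of 0.

```python
import torch
import torch.nn as nn

class SpanEventDetector(nn.Module):
    """Span classifier sketch. `type_emb` holds one embedding per event type
    in the current ontology O_t. NA (c_phi) has no embedding: its score is
    fixed to 0, so the model must push negative instances below 0 on all
    valid types."""

    def __init__(self, bert_dim=1024, feat_dim=512, num_types=0):
        super().__init__()
        # InputMap: maps the concatenated start/end token encodings
        # (2 * bert_dim) to a lower-dimensional feature vector.
        self.input_map = nn.Sequential(
            nn.Linear(2 * bert_dim, feat_dim), nn.ReLU(),
            nn.Linear(feat_dim, feat_dim),
        )
        self.type_emb = nn.Parameter(torch.empty(num_types, feat_dim))
        if num_types > 0:
            nn.init.normal_(self.type_emb, std=feat_dim ** -0.5)

    def forward(self, h, s, e):
        # h: (batch, L, bert_dim) frozen BERT encodings of w_{1:L}
        span = torch.cat([h[torch.arange(len(s)), s],
                          h[torch.arange(len(e)), e]], dim=-1)
        x = self.input_map(span)              # span feature
        o = x @ self.type_emb.t()             # inner-product scores o_i
        o_phi = torch.zeros(len(x), 1)        # fixed NA score o_phi = 0
        return torch.cat([o_phi, o], dim=-1)  # softmax over this gives p(.|x)
```

New types are supported by appending rows to `type_emb` at each stage, which is what makes the classification-style lifelong learning methods directly applicable.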
In the lifelong learning literature (Rebuffi et al., 2017; Castro et al., 2018; Wu et al., 2019; Hou et al., 2019), most methods either allocate k slots for each type, or fix K slots in total and allocate them evenly among all learned types, where k or K is a hyperparameter. Although the latter setting is considered economical in memory, the framework is then limited to learning at most K types. We therefore adopt the former as the more appropriate setting for lifelong learning, where we may want to learn arbitrarily many types. We select exemplar instances of a type using the herding algorithm (Welling, 2009) after the model is trained, following (Rebuffi et al., 2017) and most follow-up work. We don't need to keep instances for c_φ, since we assume negative instances are always available in the background text of positive mentions. At stage t, we augment the training dataset D_t with E_{t-1} in the cross-entropy loss.
Knowledge Distillation. Knowledge distillation (Hinton et al., 2015) is also widely used in lifelong learning (Rebuffi et al., 2017; Li and Hoiem, 2018; Castro et al., 2018; Wu et al., 2019; Hou et al., 2019). Before learning new types, we keep a copy of the old model for ontology O_{t-1}. While learning on each instance x, we compute p^{t-1} as the old model's output probabilities on the learned ontology O_{t-1}. We then rescale the old and current predictions on O_{t-1} by a temperature parameter T, i.e., p̃^{t-1} = softmax(o^{t-1}/T) and p̃ = softmax(o/T), and minimize the distillation loss L_D = −Σ_{c∈O_{t-1}} p̃^{t-1}(c|x) log p̃(c|x). We take T = 2 in our experiments to retain the old model's output distribution over all learned types instead of only the predicted labels (Hinton et al., 2015).
The final loss is a weighted sum of L_C and L_D, i.e., L = L_C + γ L_D, where γ is a weighting hyperparameter.
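The distillation term and the combined objective can be sketched as follows. Function names and the weight `gamma` are placeholders; both logit tensors are assumed restricted to the old ontology O_{t-1} (including the fixed NA score).

```python
import torch
import torch.nn.functional as F

def distill_loss(old_logits, new_logits, T=2.0):
    """Knowledge distillation on the learned ontology O_{t-1}. T=2 softens
    both distributions so the old model's full output distribution, not
    just its argmax, is matched."""
    p_old = F.softmax(old_logits / T, dim=-1)      # teacher (frozen copy)
    log_p = F.log_softmax(new_logits / T, dim=-1)  # student
    return -(p_old * log_p).sum(dim=-1).mean()

def total_loss(ce_loss, kd_loss, gamma=1.0):
    # Final objective: cross-entropy L_C on current data plus replayed
    # exemplars, weighted with the distillation loss L_D.
    return ce_loss + gamma * kd_loss
```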

Knowledge Transfer Between Learned Types and New Types
New to Old. Traditional lifelong learning methods focus on the catastrophic forgetting problem by retaining the old model's knowledge, and don't effectively update the old knowledge. If a new type is related to some old types, some instances of the new type may share similarities with the old types. We utilize these instances to update learned knowledge by extending the knowledge distillation loss, which only retains old knowledge, to a new self-training loss similar to (Xie et al., 2016) that trains the model with a soft pseudo-label predicted by the old model. For each instance x ∈ D_t, we first compute a distribution over old types as the pseudo-label, q ∝ (p^{t-1})^{1/τ}, with a temperature factor τ = 0.5 < 1 to sharpen the distribution. We then train the model with the self-training loss L_S = −Σ_{c∈O_{t-1}} q(c|x) log p(c|x). Note that, different from knowledge distillation, the current model's output is not scaled by the temperature τ, which facilitates the model updating its knowledge of learned types. We then substitute L_D in the final loss with λL_S + (1 − λ)L_D, where λ is a weighting hyperparameter.
Old to New. We transfer old knowledge to new types by initializing the type embeddings for new types based on learned types. When a new type has sufficient training instances, random initialization is usually good enough. We adopt a knowledge-aware random initialization r ∼ N(0, d²I/dim(r)) for frequent new types, where d is the average norm of existing type embeddings. Our intuition is that the norm of learned features also contains knowledge about the feature space, which can be leveraged to benefit the learning of new types.
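The new-to-old self-training loss described in the first paragraph above can be sketched as follows (function name is our own; logits are restricted to the learned ontology O_{t-1}).

```python
import torch
import torch.nn.functional as F

def self_training_loss(old_logits, new_logits, tau=0.5):
    """New-to-old transfer sketch. The old model's prediction over learned
    types is sharpened with temperature tau < 1 to form a soft pseudo-label
    q; unlike distillation, the current model's output is NOT rescaled by
    tau, which lets it actually update its knowledge of old types."""
    with torch.no_grad():
        q = F.softmax(old_logits / tau, dim=-1)   # q proportional to (p^{t-1})^{1/tau}
    log_p = F.log_softmax(new_logits, dim=-1)     # unscaled current output
    return -(q * log_p).sum(dim=-1).mean()
```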
For long-tail new types with less training data, it may be difficult to train from random initialization, and transferring more knowledge from old types can be helpful. To find related old types, we first collect another exemplar set with h instances for each new type before training on them. Suppose x_{1:h} are instances of a new type; the current model's output p(·|x_i) is used as a relatedness measure, and the type embeddings of learned types are aggregated accordingly: ω = Σ_{c∈O_{t-1}} (1/h Σ_i p(c|x_i)) c. However, the new type is not necessarily a combination of existing knowledge. We represent the new knowledge by aggregating the encoded exemplar instances, ν ∝ Σ_i p(c_φ|x_i) x_i, where x_{1:h} are the encoded span features and ν is rescaled to the average norm d of existing type embeddings. Weighting each exemplar instance by p(c_φ|x_i) indicates how unrelated it is to learned types. The new type initialization is µ = ω + ν.
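A minimal sketch of this knowledge-aware initialization, with hypothetical helper and argument names; the NA score is assumed to sit in column 0 of the exemplar logits, matching the fixed o_φ = 0 convention.

```python
import torch
import torch.nn.functional as F

def init_new_type_embedding(old_type_emb, exemplar_logits, exemplar_feats):
    """Old-to-new transfer sketch. Given h pre-fetched exemplars of a new
    type, omega aggregates old type embeddings weighted by the current
    model's predictions, and nu aggregates the exemplars' encoded features
    weighted by p(NA | x_i), i.e. by how unrelated each exemplar is to
    learned types. Returns mu = omega + nu."""
    # exemplar_logits: (h, 1 + n_old) with the fixed NA score in column 0
    p = F.softmax(exemplar_logits, dim=-1)
    p_old = p[:, 1:].mean(dim=0)              # relatedness to each old type
    omega = p_old @ old_type_emb              # weighted old-type aggregate
    d = old_type_emb.norm(dim=-1).mean()      # average old embedding norm
    w = p[:, 0]                               # p(c_phi | x_i)
    nu = (w[:, None] * exemplar_feats).sum(0) / w.sum()
    nu = d * nu / nu.norm()                   # rescale to average norm d
    return omega + nu
```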
We shift between these two cases with a gate function g_{α,β}(N) = α exp(−βN), where N is the number of the new type's training instances and α, β are positive hyperparameters. The new type embedding is then computed as z = g_{α,β}(N) µ + (1 − g_{α,β}(N)) r.

Experiments

Datasets and Incremental Tasks
We use two datasets and create incremental tasks on them to evaluate lifelong learning. We include detailed statistics of the data splits in the Appendix.
Incremental Tasks. We construct incremental tasks for our formulation of lifelong event detection in Section 2.1. We partition the ontology into 5 subsets, and these subsets are given to the model in a fixed order as D_{1:5}. At stage t, the model needs to perform event detection for types in the seen subsets D_{1:t}, resulting in 5 incremental event detection tasks with an expanding ontology, which we denote Task 1-5. Although we may also need to tackle ontologies from multiple datasets, we simulate this situation by partitioning the ontology of a single dataset, to avoid the implicitly overlapping types in existing benchmark datasets. In our experiments, we sample one random partition of the types in the ontology, then sample 5 random permutations of the order given to the model and report the average performance over these 5 permutations. We include the details of the sampled partitions and order permutations in the Appendix.

Experiment Settings
We experiment with two settings. Oracle Negative: we provide "oracle" negative instances, including all negative instances in the original datasets and all instances of unlearned types, and we exclude instances of learned types from the training data for new types. This setting simulates adding the new types' annotations into the existing dataset for the old types: our lifelong detection framework needs only the added annotations to update the detection model. Silver Negative: we provide negative instances including the negative instances in the original datasets, instances of unlearned types, and also instances of already learned types. This setting simulates annotating new types in a different corpus from the existing dataset for the old types, which is practical when new types come from another domain, or when we want to add new documents to train the model. We include more details of this setting in the Appendix. We only experiment with this setting on the larger MAVEN dataset, since we need sufficient training instances for each type; we hold out some of them as negative instances for other data subsets.
Evaluation. We use the F1 score to evaluate the model's performance on each task. In traditional lifelong learning evaluation for Task i, the model is only tested on instances of learned types. In event detection, by contrast, we always evaluate the model on the entire test set, treating the mentions of unlearned types as negative instances. We include hyper-parameters and training details in the Appendix.

Methods in Comparison
We consider the following methods for comparison. Finetune: the model is simply finetuned on D_t at time stage t. KD+R: the baseline framework introduced in Section 2.2 that combines knowledge distillation with replay. KD+R+K (Ours): the baseline framework plus the proposed knowledge transfer strategy introduced in Section 2.3. Joint: we also report the joint training performance over all types using the same event detection module from Section 2.2 as an upper bound for the final task.
We also adapt the class-incremental learning methods iCaRL (Rebuffi et al., 2017), EEIL (Castro et al., 2018), and BIC (Wu et al., 2019) to our tasks, to the best of our understanding of the papers and the official code where released. KCN* studies a different formulation of lifelong event detection; we include a more detailed comparison with their formulation in Section 4, and we are able to adapt their main methods to our formulation. We give brief descriptions of these methods and adaptation details in the Appendix.

Results on Oracle Negative
The main results for the Oracle Negative setting are summarized in Table 1. All methods but iCaRL have the same performance on Task 1, since these lifelong learning strategies are not applicable when training on the first subset of types; iCaRL differs because it uses an exemplar-based scoring method. Existing methods with a balancing strategy suffer from the long-tail distribution of event types: compared to our simpler KD+R, iCaRL, EEIL, and BIC all balance the training data of new types with the exemplars of old types for classification, but achieve less competitive results on our task. For iCaRL, we found significantly worse results on the beginning tasks but improving performance on later tasks. The reason is that iCaRL is a feature-based method and learns better representations with more training data from various types, which also explains why it performs even worse on ACE 2005, which has far fewer event mentions. EEIL and BIC also show comparable or even slightly inferior performance to KD+R without balancing, indicating that naïve balancing may degrade performance when the original distribution is long-tailed.
Most event detection evaluation focuses on the micro F1 score averaged over instances. Due to the long-tail distribution of event types, the micro F1 score is usually dominated by frequent event types. To further study the improvement on old and new types, we consider the per-type F1 scores (i.e., macro F1) of our proposed KD+R+K and the baseline framework KD+R. In Figure 3, we show the curves of macro F1 scores on learned and new types. Our framework improves performance on both learned types and new types at all stages. Comparing the KD+R+K and KD+R curves for rare new types with fewer than 120 training mentions, our framework significantly improves the performance on rare new types. We also observe that the performance gain is larger in later stages, because the model has seen more event types and accumulated more knowledge.

Results on Silver Negative
We show the results for the Silver Negative setting in Table 2. Each type has less training data in this setting because we hold out some instances as silver negative instances for later training. However, there is a significantly smaller performance drop as training proceeds for BIC, KCN*, KD+R, and KD+R+K. This result indicates that most lifelong learning methods can effectively avoid catastrophic forgetting when mentions of old types exist in the context of new types, even if we don't provide annotations for them. Furthermore, our proposed methods are more effective, showing improved performance on some later tasks, since our knowledge transfer module explicitly utilizes related new instances to update learned knowledge. However, the significant gap to the joint training performance indicates that improving old knowledge with new related training data, instead of merely retaining learned knowledge from old training data, remains an important research direction.

Case Study on Knowledge Transfer
In Table 3 we show some examples of test instances with new long-tail event types. The baseline framework fails to identify these instances and labels them as NA, while our proposed method makes the correct predictions by leveraging knowledge from related old types. For each of these new types, we examine the weights over old types p(c|x_i) used in the knowledge-aware initialization and show the most related old type, i.e., the one with the highest weight. The semantic relatedness between these types helps the identification. Furthermore, we compare the precision and recall of two rare event types in Table 4. Our proposed method improves recall significantly, although it has slightly lower precision. The extremely low recall of the baseline method is due to the lack of training instances, which causes the model to identify only the most similar mentions. In contrast, our proposed framework can identify more diverse instances through knowledge transfer, although such knowledge from related types may also bring some false positives that slightly undermine precision.

Event Detection
In this work we mainly use event detection as an exemplar task. Event detection under the traditional setting has been widely studied (Ji and Grishman, 2008; Chen et al., 2015; Feng et al., 2016; Liu et al., 2017, 2018; Lu et al., 2019; Ding et al., 2019; Yan et al., 2019; Tong et al., 2020). Methods for joint information extraction (Li et al., 2013; Wadden et al., 2019) also include event detection as a subtask. There are also a few attempts to apply lifelong learning to information extraction. Nguyen et al. (2016) study the problem of adding one new event type to an existing model, but they only focus on frequent types and single-stage incremental learning.

KCN* also focuses on lifelong event detection. However, they bypass the long-tailed distribution problem by focusing on frequent event types, while we consider severe data imbalance a central challenge of lifelong IE. Besides, based on their released code, their framework is designed for single-token triggers and for learning one type at a time, and is thus difficult to extend to more general settings. In contrast, our formulation of lifelong IE applies to more general cases of event detection as well as other IE tasks, and our framework outperforms the adapted version of their methods under our formulation.

Lifelong Learning
Our formulation is mainly developed based on class-incremental learning. The Learning without Forgetting method (Li and Hoiem, 2018) uses knowledge distillation to avoid catastrophic forgetting. iCaRL (Rebuffi et al., 2017) combines knowledge distillation with experience replay to learn a feature representation for each class. EEIL (Castro et al., 2018) adopts an end-to-end training method by adding a finetuning stage on the balanced exemplar set. BIC (Wu et al., 2019) prevents overfitting during balanced finetuning by training only two additional parameters on a small balanced validation exemplar set to correct the bias towards new classes. Hou et al. (2019) add additional constraints to the loss to further reduce forgetting. Other work uses bi-level optimization to select better exemplar instances (Borsos et al., 2020). Most of these methods conduct experiments on image classification. Our formulation and proposed framework address the unique challenge of the long-tail distribution in event detection and information extraction, and take advantage of rich correlations among ontology types. In addition to class-incremental learning, regularization-based methods (Kirkpatrick et al., 2016; Jung et al., 2020; Ahn et al., 2019; Golkar et al., 2019; Serrà et al., 2018) are widely used when the model needs to learn a sequence of "disjoint" tasks; unlike class-incremental learning, the model then continually learns classification on an entirely new ontology at each stage instead of combining the old ontology with several new classes.

Conclusions and Future Work
In this paper, we formulate the Lifelong Event Detection problem and propose a novel framework that transfers knowledge between old and new related event types to tackle the unique challenges brought by the imbalanced distribution of event types. This framework benefits the learning of both old and new types, and improves performance on long-tail types. It is worth mentioning that although we use event detection as an exemplar task, our proposed framework is applicable to other information extraction tasks, and we will extend and empirically study our approach on them in the future. Moreover, in lifelong learning the new types can come from a completely different domain, for which the context distribution (such as text style and genre) changes significantly from the seen training instances; we leave explicit exploration and evaluation of lifelong learning for specific domains as future work. Finally, although knowledge transfer can significantly improve performance on long-tail types, there is still a gap between rare types and frequent types. We can combine our framework with other efficient learning methods, such as zero-shot, few-shot, and weakly-supervised learning, to learn these types better.

A.1 Collection of New Splits
On MAVEN, we take the original development set as the test set and collect another development set from the original training data. The collection process is as follows: we first randomly sample 413 documents from the original 2,913 training documents as development documents. Then we manually check for missing types; for each missing type, we sample one more document from the remaining training documents that mentions it. We end up with a development set that covers all event types.
On ACE (Walker et al., 2006), we develop new splits from a previously used one, following a similar process as above to add documents to both the development and test sets. Since Justice:Pardon has only 2 event mentions in the entire ACE 2005 data, we don't include this type in either the development or test set. Besides, since the original development set misses many more types than the original test set, the modified development set contains more documents than the test set; we therefore use the modified test set as the development set and the modified development set as the test set, to make the test set more diverse.
We show detailed statistics of the new splits in Table 5. We use the provided negative instances for MAVEN (see the original paper for more details), and collect negative instances for ACE 2005 as all unlabeled consecutive text spans of at most 3 tokens.

A.2 Collection of Type Partitions and Permutations
Traditional lifelong learning methods usually make each subset contain the same number of types. However, since the data distribution is heavily long-tailed, we instead construct each subset to contain approximately the same number of instances. We then sample 5 random permutations of the order given to the model, giving 5 sets of incremental tasks for each dataset. The sampling process is as follows: for each dataset, we first randomly shuffle the list of all event types and initialize five empty sets. We then traverse the shuffled list, and each time we pick up an event type, we put it into whichever of the five sets currently has the fewest total training instances across its types. After all event types are visited, the construction of the partitioning subsets is finished. We show basic statistics of the partitions in Table 6, and provide the partition of types and the order permutations in the code.
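The greedy balanced partitioning described above can be sketched as follows (function and argument names are our own):

```python
import random

def partition_ontology(type_counts, k=5, seed=0):
    """Greedy balanced partition sketch: shuffle the event types, then
    assign each type to whichever of the k subsets currently has the
    fewest total training instances, so subsets end up with roughly equal
    numbers of mentions despite the long-tailed type distribution.
    `type_counts` maps event type -> number of training mentions."""
    types = list(type_counts)
    random.Random(seed).shuffle(types)
    subsets = [[] for _ in range(k)]
    totals = [0] * k
    for t in types:
        i = totals.index(min(totals))   # emptiest subset so far
        subsets[i].append(t)
        totals[i] += type_counts[t]
    return subsets
```

A standard property of this greedy rule is that the gap between the largest and smallest subset totals never exceeds the mention count of the most frequent type.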

A.3 Details on Silver Negative Setting
We also describe how we prepare D_t in the Silver Negative setting. We first divide the entire MAVEN training split into 169 subsets, each containing the mentions of one event type (including NA). The NA subset is evenly divided into 5 parts, which are put into D_{1:5} respectively. For each valid event type, we divide its subset into 5 parts: a training part containing 90% of its training instances, and 4 silver negative parts that evenly split the rest. The training part is put into the D_i in which the type is supposed to be learned as a new type; the remaining 4 parts serve as negative instances labeled NA in the other data subsets. This setting simulates the situation where all event types are in similar domains and frequently co-occur with each other, so that in data annotated for a subset of event types, event mentions of other types in the context become negative instances.
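Under the stated proportions (90% training, 10% held out as silver negatives, 5 stages), the preparation can be sketched as follows; the helper name, the random home-stage assignment, and the output layout are our own simplifications.

```python
import random

def silver_negative_split(mentions_by_type, k=5, train_frac=0.9, seed=0):
    """Silver Negative data preparation sketch. NA mentions are split
    evenly across the k stage datasets D_1..D_k. For each event type,
    train_frac of its mentions form its training subset (placed in the
    stage where the type is learned), and the rest are spread evenly
    across the other k-1 stages as silver negatives labeled NA."""
    rng = random.Random(seed)
    stages = [{"pos": {}, "neg": []} for _ in range(k)]
    for etype, ms in mentions_by_type.items():
        ms = ms[:]
        rng.shuffle(ms)
        if etype == "NA":
            for i, m in enumerate(ms):
                stages[i % k]["neg"].append(m)
            continue
        home = rng.randrange(k)             # stage where this type is learned
        cut = int(len(ms) * train_frac)
        stages[home]["pos"][etype] = ms[:cut]
        others = [i for i in range(k) if i != home]
        for i, m in enumerate(ms[cut:]):    # silver negatives for other stages
            stages[others[i % (k - 1)]]["neg"].append(m)
    return stages
```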

B Hyperparameters and Training Details

B.1 Hyperparameters
We use BERT-large-cased (Devlin et al., 2019) as the BERT encoder and fix it during training, and the concatenated span representations are mapped by InputMap to dimension 512. The number of exemplar instances for each type is 20. For knowledge transfer, λ balances the knowledge distillation loss and the new-to-old knowledge transfer loss. We assume there will be little new knowledge for old types from a new type's instance if the old model confidently predicts it as NA with high probability. Therefore, we use λ = 0.5 if the predicted probability of NA is less than 0.9, and otherwise set λ = 0, meaning that only knowledge distillation is used. α and β specify the gating function g_{α,β}(N); we use α = 0.5 and β = 0.05. h = 20 is the number of pre-fetched exemplar instances for the initialization of new types' embeddings. To avoid irrelevant old knowledge in the knowledge-aware initialization, we only aggregate the most probable old types with total probability over 0.9, with renormalization over the retained old types.
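For concreteness, the gate values implied by these hyperparameters can be checked directly; the function name is our own, but α = 0.5 and β = 0.05 are the values above.

```python
import math

def gate(N, alpha=0.5, beta=0.05):
    """g_{alpha,beta}(N) = alpha * exp(-beta * N): the weight on the
    knowledge-aware initialization mu; 1 - g(N) weights the random
    initialization r. More training instances N -> less transfer."""
    return alpha * math.exp(-beta * N)

# A rare type with N = 10 mentions leans on old knowledge (g ~ 0.30),
# while a frequent type with N = 200 mentions barely uses it (g ~ 2e-5).
```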

B.2 Training Details
We apply additional filtering on ACE 2005 negative instances during training to reduce training time: we only keep single-token spans, and multi-token spans that overlap with an event mention, as negative instances. This filtering retains only one third of all negative instances in Table 5, although it slightly degrades the upper-bound joint training performance by around 1% F1; better filtering techniques may be developed to trade off training time against performance. During training, we use AdamW (Loshchilov and Hutter, 2019) with learning rate 1e-4 and weight decay 1e-2. We use a batch size of 128 to sample instances from D_t, and append instances from the exemplar set to each training batch as additional training data. For each training stage t on D_t, we run for a maximum of 20 epochs, with early stopping if performance on the development set doesn't increase for 5 consecutive epochs. It is worth mentioning that some class-incremental methods run experiments with a fixed number of epochs for each training stage, which is only reasonable when the tasks of learning new sets of types are equally difficult. Due to the long-tail distribution and the varying number of types in our formulation of lifelong event detection, a fixed number of epochs may result in sub-optimal training for different type subsets.

B.3 Training Environments
We implement our models in PyTorch and train all of them on NVIDIA V100 GPUs.

B.4 Performance on Development Set
We compare results for Oracle Negative and Silver Negative on development set in Table 7 and

B.5 Details of baselines
iCaRL (Rebuffi et al., 2017). In the original iCaRL, a classification network is trained using knowledge distillation and experience replay, but only the representation network before the last classification layer is used to score instances, by their distance to the mean encoded features of each type's exemplar instances; we refer readers to the original paper for more details. In our adaptation, the event type embeddings are discarded and the encoded span feature x is taken as the output of the representation network. Since we always have NA at every stage, we represent it by the mean encoded features of all negative instances in the current training stage.
EEIL (Castro et al., 2018). In our adaptation, an additional balanced finetuning step on E_t is performed on top of our baseline framework, after training on D_t and updating the exemplar set to E_t.
BIC (Wu et al., 2019). In our adaptation, an additional bias-correction training step on a balanced development exemplar set is performed on top of our baseline framework after training on D_t: an affine transformation of the scores on new types is learned to correct the bias towards new types. For the knowledge distillation loss after stage 2, the old scores are also corrected by the stored old correction parameters. Another adaptation concerns the collection of the development exemplar set: the original paper does not use a development set in its benchmarks and collects the development exemplar set from a portion of reserved training instances, whereas we collect it directly from our development set. In this way their model is essentially strengthened, as it effectively uses more data for training.
KCN*. We re-implemented their hierarchical distillation on the span features x and the output probability distributions of the main paper, while substituting their exemplar set construction with the more widely used herding algorithm (Welling, 2009).

B.6 Ablation Studies
We perform ablation studies in Table 9 on the two knowledge transfer components, new-to-old knowledge transfer and old-to-new knowledge transfer, under the Oracle Negative setting on the MAVEN dataset. We only show results on Tasks 2-4, since the performance on the first task is the same for all methods. Table 9: Ablation results on two-way knowledge transfer on the MAVEN dataset. The last row is the same as the baseline KD+R method without knowledge transfer.