InfoCL: Alleviating Catastrophic Forgetting in Continual Text Classification from An Information Theoretic Perspective

Continual learning (CL) aims to constantly learn new knowledge over time while avoiding catastrophic forgetting on old tasks. We focus on continual text classification under the class-incremental setting. Recent CL studies have identified the severe performance decrease on analogous classes as a key factor for catastrophic forgetting. In this paper, through an in-depth exploration of the representation learning process in CL, we discover that the compression effect of the information bottleneck leads to confusion on analogous classes. To enable the model to learn more comprehensive representations, we propose a novel replay-based continual text classification method, InfoCL. Our approach utilizes fast-slow and current-past contrastive learning to perform mutual information maximization and better recover the previously learned representations. In addition, InfoCL incorporates an adversarial memory augmentation strategy to alleviate the overfitting problem of replay. Experimental results demonstrate that InfoCL effectively mitigates forgetting and achieves state-of-the-art performance on three text classification tasks. The code is publicly available at https://github.com/Yifan-Song793/InfoCL.


Introduction
Continual learning (CL) enables conventional static natural language processing models to constantly gain new knowledge from a stream of incoming data (Sun et al., 2020; Biesialska et al., 2020). In this paper, we focus on continual text classification, which is formulated as a class-incremental problem, requiring the model to learn from a sequence of class-incremental tasks (Huang et al., 2021). Figure 1 gives an illustrative example of continual text classification. The model needs to learn to distinguish some new classes in each task and is eventually evaluated on all seen classes. Like other CL systems, the major challenge of continual text classification is catastrophic forgetting: after new tasks are learned, performance on old tasks may degrade dramatically (Lange et al., 2022). Earlier work in the CL community mainly attributes catastrophic forgetting to the corruption of the learned representations as new tasks arrive, and various methods have been introduced to retain or recover previously learned representations (Kirkpatrick et al., 2017; Rebuffi et al., 2017; Mallya and Lazebnik, 2018; Lange et al., 2022). Recently, some studies (Wang et al., 2022; Zhao et al., 2023) find that, under the class-incremental setting, the severe performance decay among analogous classes is the key factor of catastrophic forgetting. To improve the performance of distinguishing analogous classes, Wang et al. (2022) exploit a heuristic adversarial class augmentation and Zhao et al. (2023) propose a sophisticated memory-insensitive prototype mechanism. However, due to a lack of thorough investigation into the underlying cause of confusion on similar classes, previous empirical methods may not be universally effective and are unable to offer guidance for further improvements.
In this paper, we present for the first time an in-depth analysis of the analogous class confusion problem from an information theoretic perspective. We investigate the impact of the information bottleneck (IB) on the representation learning process of CL, specifically how it compresses the mutual information between the representation and the input. Through formal analysis and empirical study, we find that within each task, current CL models tend to discard features irrelevant to the current task due to the compression effect of IB. While the acquired representations are locally sufficient for the current task, they may be globally insufficient for classifying analogous classes in the final test. We refer to this phenomenon as representation bias and aim to enhance the CL model's ability to learn more comprehensive representations.
Based on our analysis, we propose a replay-based continual text classification method, InfoCL. Using contrastive learning as the core mechanism, we enable the model to learn more comprehensive representations by maximizing the mutual information between the representation and the original input via InfoNCE. Specifically, we design fast-slow and current-past contrast strategies. First, according to IB theory, the representations in the early stage of optimization preserve more information. Hence, when learning new classes in the current task, we leverage the MoCo framework (He et al., 2020) and conduct fast-slow contrastive learning to encourage the learned representations to retain more information about the input. On the other hand, to further alleviate representation corruption, when conducting memory replay we leverage current-past contrastive learning to ensure the learned representations do not undergo significant changes. Because the memory budget is limited, the performance of current-past contrastive learning is hindered by the overfitting problem. To this end, InfoCL incorporates adversarial data augmentation to generate more training instances for replay.
Our contributions are summarized as follows: (1) We formally analyze the analogous class confusion problem in CL from an information theoretic perspective and derive that the representation bias caused by the compression effect of IB is the underlying cause of forgetting. (2) We propose a novel replay-based continual text classification method, InfoCL, which exploits fast-slow and current-past contrastive learning to capture more comprehensive representations. (3) Experimental results on several text classification datasets show that InfoCL learns more effective representations and outperforms state-of-the-art methods.

Related Work
Continual Learning Continual learning (CL) studies the problem of continually learning knowledge from a sequence of tasks (Lange et al., 2022) while avoiding catastrophic forgetting. Previous CL work mainly attributes catastrophic forgetting to the corruption of learned knowledge and can be divided into three major families. Replay-based methods (Rebuffi et al., 2017; Prabhu et al., 2020) save a few instances from previous tasks in a memory module and retrain on them while training new tasks. Regularization-based methods (Kirkpatrick et al., 2017; Aljundi et al., 2018) introduce an extra regularization loss to consolidate previous knowledge. Parameter-isolation methods (Mallya and Lazebnik, 2018) dynamically expand the network and dedicate different model parameters to each task. Recent studies have identified the confusion among analogous classes as a key factor of catastrophic forgetting. In this paper, we discover that representation bias is the underlying cause of such confusion and design InfoCL to mitigate it.
Contrastive Learning Contrastive learning aims to learn representations by contrasting positive pairs against negative pairs (Chen et al., 2020). Recently, contrastive learning has made great progress in both unsupervised and supervised settings (Chen et al., 2020; He et al., 2020; Khosla et al., 2020; Barbano et al., 2022). The success of contrastive learning can be partially attributed to the fact that its commonly used objective, InfoNCE, maximizes the mutual information between representations and inputs (van den Oord et al., 2018). Previous continual learning work has already integrated contrastive learning to alleviate catastrophic forgetting. Cha et al. (2021) and Zhao et al. (2022) use supervised contrastive learning to learn more consistent representations. Hu et al. (2022) design a prototypical contrastive network to alleviate catastrophic forgetting. However, due to the lack of in-depth analysis of the representations learned in continual learning, these approaches fail to harness the full potential of contrastive learning. In this paper, we investigate the representation learning process in CL and propose fast-slow and current-past contrastive learning to enable the model to learn more comprehensive representations and further mitigate the representation corruption problem.

Task Formulation
In this work, we focus on continual learning for a sequence of k class-incremental text classification tasks (T_1, T_2, ..., T_k). Each task T_i has its dataset D_i = {(x_n, y_n)}, where (x_n, y_n) is an instance of the current task sampled from an individual i.i.d. distribution p(X_i, Y_i). Different tasks T_i and T_j have disjoint label sets Y_i and Y_j. The goal of CL is to continually train the model on new tasks to learn new classes while avoiding forgetting previously learned ones. From another perspective, if we denote X = ∪_i X_i and Y = ∪_i Y_i as the input and output space of the entire CL process respectively, continual learning aims to approximate a holistic distribution p(Y|X) from a non-i.i.d. data stream.
The text classification model F is usually composed of two modules: the encoder f and the classifier σ. For an input x, we obtain the corresponding representation z = f(x) and use the logits σ(z) to compute the loss and predict the label.
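As an illustration of this two-module decomposition, here is a minimal numpy sketch (the toy Encoder/Classifier classes are our own stand-ins; the paper's actual encoder is BERT):

```python
import numpy as np

rng = np.random.default_rng(0)

class Encoder:
    """Toy stand-in for f: maps raw inputs to d-dimensional representations."""
    def __init__(self, in_dim, rep_dim):
        self.W = rng.normal(size=(in_dim, rep_dim)) / np.sqrt(in_dim)
    def __call__(self, x):
        return np.tanh(x @ self.W)          # z = f(x)

class Classifier:
    """Toy stand-in for sigma: maps representations to class logits."""
    def __init__(self, rep_dim, n_classes):
        self.W = rng.normal(size=(rep_dim, n_classes)) / np.sqrt(rep_dim)
    def __call__(self, z):
        return z @ self.W                   # logits = sigma(z)

f, sigma = Encoder(16, 8), Classifier(8, 4)
x = rng.normal(size=(2, 16))
logits = sigma(f(x))                        # F(x) = sigma(f(x))
pred = logits.argmax(axis=-1)               # predicted labels
```

The separation matters later: the contrastive losses in InfoCL act on the representations z, while the cross entropy loss acts on the logits σ(z).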

Representation Bias in CL
Previous work (Wang et al., 2022; Zhao et al., 2023) reveals that severe performance degradation on analogous classes is the key factor of catastrophic forgetting. In this section, we investigate the representation learning process of continual learning from an information theoretic perspective and find that representation bias is the underlying cause of confusion on analogous classes.

Information Bottleneck
We first briefly introduce the background of the information bottleneck in this section. The information bottleneck formulates the goal of deep learning as an information-theoretic trade-off between representation compression and preservation (Tishby and Zaslavsky, 2015; Shwartz-Ziv and Tishby, 2017). Given the input X and the label set Y, a model is built to learn the representation Z = F(X), where F is the encoder. The learning procedure of the model is to minimize the following Lagrangian:

L_IB = -I(Z; Y) + β I(X; Z),

where I(X; Z) is the mutual information (MI) between X and Z, quantifying the information retained in the representation Z, I(Z; Y) quantifies the amount of information in Z that enables the identification of the label Y, and β is a trade-off hyperparameter. With the information bottleneck, the model will learn a minimal sufficient representation Z* (Achille and Soatto, 2018) of X with respect to Y:

Z* = arg min_Z I(X; Z)   s.t.   I(Z; Y) = I(X; Y).

A minimal sufficient representation is important for supervised learning, because it retains as little information about the input as possible to simplify the role of the classifier and improve generalization, without losing information about the labels.
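To make the compression term I(X; Z) concrete, the following toy computation (an illustrative numpy sketch, not part of the paper's method) shows that a representation Z which merges input symbols carries strictly less mutual information with X than a lossless one:

```python
import numpy as np

def mutual_information(joint):
    """I(A;B) in nats, computed from a discrete joint probability table p(a, b)."""
    joint = joint / joint.sum()
    pa = joint.sum(axis=1, keepdims=True)        # marginal p(a)
    pb = joint.sum(axis=0, keepdims=True)        # marginal p(b)
    mask = joint > 0
    return float((joint[mask] * np.log(joint[mask] / (pa @ pb)[mask])).sum())

# X uniform over 4 symbols. Z_id copies X; Z_comp collapses pairs of symbols.
p_id = np.eye(4) / 4.0          # joint p(x, z) for the identity "representation"
p_comp = np.zeros((4, 2))       # joint p(x, z) for the compressed representation
p_comp[[0, 1], 0] = 0.25
p_comp[[2, 3], 1] = 0.25

mi_id = mutual_information(p_id)        # log 4: nothing discarded
mi_comp = mutual_information(p_comp)    # log 2: half the information discarded
```

If the merged symbols happen to belong to different classes of some *future* task, the compressed Z is exactly the "locally sufficient, globally insufficient" representation discussed next.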

Representation Learning Process of CL
Continual learning is formulated as a sequence of individual tasks (T_1, T_2, ..., T_k). For the i-th task T_i, the model aims to approximate the distribution of the current task, p(Y_i|X_i). According to IB, if the model F = σ ∘ f converges, the learned hidden representation Z_i = f(X_i) will be locally minimal sufficient for T_i:

Z_i* = arg min_{Z_i} I(X_i; Z_i)   s.t.   I(Z_i; Y_i) = I(X_i; Y_i),

which ensures the performance and generalization ability on the current task. Nevertheless, the local minimization of the compression term I(X_i; Z_i) brings a potential risk: features that are useless for the current task but crucial for other tasks will be discarded.
The goal of CL is to classify all seen classes Y = ∪_i Y_i. For the entire continual learning task with the holistic target distribution p(Y|X), a necessary condition to perform well is that the representation Z is globally sufficient for Y: I(Z; Y) = I(X; Y). However, as some crucial features are compressed away, the combination of the locally minimal sufficient representations of each task, Z = ∪_i Z_i, may be globally insufficient:

I(Z; Y) < I(X; Y).

We name this phenomenon representation bias: due to the compression effect of IB, the representations learned in each individual task may be insufficient for the entire continual task.
The underlying cause of the performance decrease on analogous classes is then obvious. Take two cross-task analogous classes y_a and y_b as an example. Under the sequential task setting of CL, the model cannot co-train on instances of y_a and y_b, which means the representations of these two classes can only be learned within their respective tasks. When learning y_a, the locally sufficient representations that identify y_a are insufficient to differentiate it from y_b. Hence, the appearance of y_b leads to a dramatic performance decrease on y_a, resulting in what is known as catastrophic forgetting.

Empirical Results
To confirm our analysis, we directly measure the mutual information among X, Y and Z. Since the representations learned by supervised learning are always globally sufficient, i.e., I(Z; Y) = I(X; Y), we use supervised learning on all data as the baseline and compare it with several strong CL methods. Concretely, we use MINE (Belghazi et al., 2018) as the MI estimator and conduct experiments on the FewRel and MAVEN datasets.
First, we measure I(X; Z) to quantify the features preserved in the representation Z. However, previously learned representations will be corrupted once the model learns new tasks, leading to inaccurate estimation. To exclude the impact of representation corruption, we instead estimate I(X_1; Z_1) on T_1's test set. Second, to assess whether the learned representations are sufficient for the entire continual task, we compare I(Z; Y) on the final test set with all classes. (See Section 6.1 for details of the CL baselines and datasets.)
As shown in Table 1, both I(X_1; Z_1) and I(Z; Y) of the three CL models are significantly lower than those of supervised learning, indicating that CL models tend to compress more information due to the individual task setting and that the representations learned in CL are insufficient for the entire continual task.

Methodology
From the formal analysis and empirical verification, we establish that representation bias plays a crucial role in the performance decline on analogous classes. Consequently, in this section we propose a novel replay-based CL method, InfoCL, which maximizes I(X_i; Z_i) and helps the model learn more comprehensive representations.

Overall Framework
The objective of contrastive learning, specifically InfoNCE, serves as a proxy for maximizing the mutual information I(X_i; Z_i) (van den Oord et al., 2018). Therefore, we utilize contrastive learning as the core mechanism of InfoCL to acquire more comprehensive representations. Concretely, we design fast-slow and current-past contrastive learning for training on new data and memory data, respectively.
The overall framework of InfoCL is depicted in Figure 2. For a new task T_k, we first train the model on D_k to learn this task. We perform fast-slow contrastive learning to help the model capture more sufficient representations and mitigate the representation bias. Then we store a few typical instances for each class y ∈ Y_k into the memory M, which contains instances of all seen classes. To alleviate representation decay, we next conduct memory replay with current-past contrastive learning. As the performance of representation recovery is always hindered by the limited size of the memory, we incorporate adversarial augmentation to alleviate overfitting.

Fast-Slow Contrastive Learning
In the representation learning process of a task T_i, the compression effect of IB minimizes the mutual information I(X_i; Z_i), leading to globally insufficient representations. Intuitively, in the early phase of optimization, I(X_i; Z_i) is larger and the representations preserve more information about the inputs. However, directly adopting an early-stopping strategy is not feasible, as representation compression is essential for generalization. Instead, we pull the learned representations and the early representations of the same class together, facilitating the preservation of more comprehensive information in the final representations. Inspired by He et al. (2020), we employ a momentum contrast framework which consists of a fast encoder and a momentum-updated slow encoder. The representations from the slowly-updated branch preserve more information about the input sentences. The fast-slow contrast can "distill" this information from the slow branch into the fast branch to learn more comprehensive representations.
Figure 2 (b) depicts the architecture. The fast model is updated by gradient descent, while the slow model truncates the gradient and is updated with a momentum mechanism during training. Formally, denoting the parameters of the fast and slow models as θ and θ′, θ′ is updated by

θ′ ← η θ′ + (1 - η) θ,

where η is the momentum coefficient, which is relatively large to ensure a slow update (e.g., η = 0.99). For the slow encoder, we also maintain a representation queue Q to increase the number of negative instances beyond the batch size, enabling InfoNCE to more effectively maximize MI. The queue is updated with the outputs of the slow model in a first-in-first-out manner.
We denote by z the representations from the fast model and by z̃ those from the slow model; the slow representations z̃ preserve more information than z. We use InfoNCE to perform the fast-slow contrast:

L_fs = -∑_{i∈I} (1/|P(i)|) ∑_{p∈P(i)} log [ exp(z_i · z̃_p / τ_1) / ∑_{j∈J} exp(z_i · z̃_j / τ_1) ],

where I = {1, 2, ..., |B|} is the set of indices of batch B, J = {1, 2, ..., |B ∪ Q|} denotes the index set of instances in the batch or the queue, P(i) = {p ∈ J : y_p = y_i} is the set of indices of instances from the batch or the queue which have the same label as z_i, and τ_1 is a temperature hyperparameter.
The final optimization objective for new task training is the combination of the cross entropy loss L_ce and the contrastive loss L_fs:

L = L_ce + λ_1 L_fs,

where λ_1 is a factor to adjust the loss weight.

Memory Selection
After the initial training stage, we select and store typical instances of each class for replay. Since the primary focus of our paper is the representation bias problem, we adopt the memory sampling strategy employed in prior work (Cui et al., 2021; Zhao et al., 2022) to ensure a fair comparison. Specifically, for each class, we use K-means to cluster the corresponding representations, and the instances closest to the centroids are stored in the memory M. We then use the instances of all seen classes from the memory M for the memory replay stage.
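The selection step for a single class can be sketched as below (a plain numpy K-means for illustration; the budget of 10 instances per class matches the paper's default, but the helper itself is our own):

```python
import numpy as np

def select_memory(reps, texts, budget=10, iters=20, seed=0):
    """Cluster one class's representations into `budget` K-means clusters and
    keep, for each centroid, the single instance closest to it."""
    rng = np.random.default_rng(seed)
    k = min(budget, len(reps))
    centroids = reps[rng.choice(len(reps), size=k, replace=False)]
    for _ in range(iters):
        d = np.linalg.norm(reps[:, None] - centroids[None], axis=-1)  # (n, k)
        assign = d.argmin(axis=1)
        for j in range(k):
            if (assign == j).any():
                centroids[j] = reps[assign == j].mean(axis=0)
    d = np.linalg.norm(reps[:, None] - centroids[None], axis=-1)
    chosen = np.unique(d.argmin(axis=0))   # closest instance per centroid
    return [texts[i] for i in chosen]
```

Choosing centroid-nearest instances keeps the memory diverse within each class, rather than storing redundant near-duplicates.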

Current-Past Contrastive Learning
When performing memory replay, we also employ contrastive learning to enable the model to learn more comprehensive representations for all previously seen classes. Additionally, to further enhance representation recovery in the memory replay stage, we propose current-past contrastive learning, which explicitly aligns current representations with previous ones.
As shown in Figure 2 (c), after the model finishes T_{k-1}, we store the representations z̄ of the instances from the memory M. Then we use an InfoNCE loss to pull the current representations z and the past representations z̄ of the same class together:

L_cp = -∑_{i∈I} (1/|P(i)|) ∑_{p∈P(i)} log [ exp(z_i · z̄_p / τ_2) / ∑_{j∈J} exp(z_i · z̄_j / τ_2) ],

where the index sets are defined over the memory batch analogously to the fast-slow loss and τ_2 is a temperature hyperparameter. The optimization objective in the memory replay stage is

L = L_ce + λ_2 L_cp,

where L_ce is the cross entropy loss and λ_2 is a factor to adjust the loss weight.
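A minimal numpy sketch of this alignment (our own simplification: anchors are the current representations of the memory instances, candidates are the stored past representations):

```python
import numpy as np

def current_past_infonce(z_cur, z_past, labels, tau=0.05):
    """Pull each current representation toward the stored past representations
    of the same class (anchors: current z; candidates: past z-bar)."""
    a = z_cur / np.linalg.norm(z_cur, axis=1, keepdims=True)
    p = z_past / np.linalg.norm(z_past, axis=1, keepdims=True)
    sim = a @ p.T / tau
    logp = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    loss = 0.0
    for i, yi in enumerate(labels):
        pos = np.where(labels == yi)[0]    # same-class past representations
        loss += -logp[i, pos].mean()
    return loss / len(labels)
```

Because the candidates z̄ are frozen snapshots taken before the new task, minimizing this loss discourages the encoder from drifting away from the previously learned geometry.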

Adversarial Memory Augmentation
Due to the constrained memory budget, the performance of current-past contrastive learning in the memory replay stage is hindered by the overfitting problem. To alleviate overfitting and enhance the effect of representation recovery, we incorporate adversarial data augmentation (Zhu et al., 2020):

min_θ E_{(x,y)∼M} [ (1/K) ∑_{t=1}^{K} max_{‖δ_t‖≤ϵ} L_ce(F(x + δ_t), y) ].   (12)

Intuitively, it performs multiple adversarial attack iterations to craft adversarial examples, which is equivalent to replacing the original batch with a K-times larger adversarially augmented batch. Please refer to Appendix B for details about Eq. 12.

Experiment Setups
Datasets To fully measure the ability of InfoCL, we conduct experiments on 4 datasets covering 3 different text classification tasks: relation extraction, event classification, and intent detection. For relation extraction, following previous work (Han et al., 2020; Cui et al., 2021; Zhao et al., 2022), we use FewRel (Han et al., 2018) and TACRED (Zhang et al., 2017). For event classification, following Yu et al. (2021) and Wu et al. (2022), we use MAVEN (Wang et al., 2020) to build our benchmark. For intent detection, following Liu et al. (2021), we choose the HWU64 (Liu et al., 2019) dataset. For the task sequence, we simulate 10 tasks by randomly dividing all classes of a dataset into 10 disjoint sets; the number of new classes in each task for FewRel, TACRED, MAVEN and HWU64 is 8, 4, 12, and 5 respectively. For a fair comparison, the results of the baselines are reproduced on the same task sequences as our method. Please refer to Appendix C for details of these four datasets. Following previous work (Hu et al., 2022; Wang et al., 2022), we use the average accuracy (Acc) on all seen tasks as the metric.
Some baselines were originally proposed to tackle one specific task. For example, RP-CRE is designed for continual relation extraction. We adapt these baselines to the other tasks and report the corresponding results. Since ACA and CEAR rely on data augmentation specifically designed for relation extraction, they cannot be adapted to other tasks.
Implementation Details For InfoCL, we use BERT base (Devlin et al., 2019) as the encoder, following previous work (Cui et al., 2021; Wang et al., 2022). The learning rate of InfoCL is set to 1e-5 for the BERT encoder and 1e-3 for the other modules. Hyperparameters are tuned on the first three tasks. The memory budget for each class is fixed at 10 for all methods. For all experiments, we use NVIDIA A800 GPUs and report the average result over 5 different task sequences. More implementation details can be found in Appendix E.

Main Results
Table 2 shows the performance of InfoCL and the baselines on four datasets for three text classification tasks. Due to space constraints, we only show results on the last three tasks. The complete accuracy and standard deviation for all 10 tasks can be found in Appendix G. As shown, on FewRel, MAVEN and HWU64, our proposed InfoCL consistently outperforms all baselines and achieves new state-of-the-art results. These experimental results demonstrate the effectiveness and universality of our proposed method. Regarding the TACRED dataset, it has been noticed that a large fraction of the examples are mislabeled, compromising the reliability of the evaluation (Alt et al., 2020; Stoica et al., 2021). We therefore strongly advocate for a more reliable evaluation on other high-quality text classification datasets.

Ablation Study
We conduct an ablation study to investigate the effectiveness of the different components of InfoCL. The results are shown in Table 3. We find that the three core mechanisms of InfoCL, namely fast-slow contrast, current-past contrast, and adversarial memory augmentation, all contribute to model performance. Furthermore, the fast-slow contrastive learning performed in the new task training stage appears to be more effective than the other components, indicating that learning comprehensive representations for the new task is more essential for mitigating the representation bias problem.

Performance on Analogous Classes
We have shown that the representation bias problem is the underlying cause of confusion on analogous classes in CL. Since InfoCL aims to learn more sufficient representations and mitigate this bias, we conduct experiments to examine this point. Following Wang et al. (2022), we use the cosine distance of the average embeddings of the instances as a metric to identify analogous classes. Specifically, we select the 16 and 24 classes (20% of all classes) from FewRel and MAVEN, respectively, which are most likely to be confused with other classes. The list of these classes is shown in Appendix F. If these selected classes appear in the first five tasks, we record the average final accuracy and the accuracy drop on them. As shown in Table 4, the performance of our model on analogous classes is superior and drops the least, demonstrating that our model succeeds in alleviating confusion among analogous classes.

Comprehensive Representation Learning
InfoCL employs contrastive learning to learn comprehensive representations via mutual information maximization. To assess the efficiency of the representation learning process of our method, we directly compute the mutual information of the representations. As in Section 4.3, we measure I(X_1; Z_1) on T_1 and I(Z; Y) after the final task with all classes. For InfoCL, I(X_1; Z_1) = 2.32 and I(Z; Y) = 2.38 on FewRel, and I(X_1; Z_1) = 3.20 and I(Z; Y) = 2.37 on MAVEN. Compared with the baselines in Table 1, both mutual information metrics are higher, indicating that our method learns more sufficient representations and mitigates the representation bias.
To provide an intuitive demonstration of the effective representations acquired by InfoCL, we conduct a case study. We select two analogous classes from FewRel, P57 ("director") and P58 ("screenwriter"), where the two classes belong to different tasks and P57 is learned before P58 appears. We use t-SNE to visualize the representations of these two classes after the model has learned them. As illustrated in Figure 4, for both methods, the accuracy on P57 reaches 100% after learning it. However, due to the representation bias, when P58 appears, the accuracy of CRL declines dramatically. In contrast, InfoCL maintains a relatively stable performance. Notably, even without training on P58, the representations of the two classes in our method exhibit a noticeable level of differentiation, highlighting InfoCL's capacity to acquire more comprehensive representations.

Influence of Memory Size
The memory size is the number of stored instances per class, an important factor for the performance of replay-based CL methods. Therefore, in this section we study the impact of memory size on InfoCL. As Table 2 reports the results with memory size 10, here we compare the performance of InfoCL with strong baselines on FewRel and MAVEN under memory sizes 5 and 20.
Table 3 demonstrates that InfoCL consistently outperforms the strong baselines across various memory sizes. Surprisingly, even with a memory size of 5, InfoCL achieves performance comparable to the baselines with a memory size of 20, highlighting its superior capability. As the memory size decreases, the performance of all models degrades, showing the importance of memory for replay-based methods. Nevertheless, InfoCL maintains a relatively stable performance, showcasing its robustness even in extreme scenarios.

Conclusion
In this paper, we focus on continual learning for text classification in the class-incremental setting. We formally investigate the representation learning process of CL and discover that representation bias leads to catastrophic forgetting on analogous classes. Based on our analysis, we propose InfoCL, which utilizes fast-slow and current-past contrastive learning to learn more comprehensive representations and alleviate representation corruption. An adversarial augmentation strategy is also employed to further enhance the performance of representation recovery. Experimental results show that InfoCL learns effective representations and outperforms the latest baselines.

Limitations
Our paper has several limitations: (1) Our proposed InfoCL utilizes fast-slow and current-past contrastive learning to learn more comprehensive representations, which introduces extra computational overhead and is less efficient than other replay-based CL methods. (2) We only focus on the catastrophic forgetting problem in continual text classification; how to encourage knowledge transfer in CL is not explored in this paper.

A Comparison with Some Related Work
As mentioned in Section 2, some related work has also applied contrastive learning to continual learning. In this section, we provide a more detailed discussion of these works.
Both Co2L (Cha et al., 2021) and CRL (Zhao et al., 2022) use supervised contrastive learning and knowledge distillation to learn more consistent representations. We would like to highlight two significant differences between our work and theirs. First, we formally analyze the representation bias problem from an information bottleneck perspective and propose that maximizing I(X_i; Z_i) can effectively mitigate this bias. Second, while Co2L employs a vanilla supervised contrastive loss for representation learning, we introduce two novel contrastive learning techniques: fast-slow and current-past contrastive learning. Our method not only enables the model to acquire more comprehensive representations but also improves its ability to retain previously learned knowledge. As shown in Table 2, our method outperforms Co2L/CRL on four datasets.
CLASSIC (Ke et al., 2021) utilizes a contrastive continual learning method for the continual aspect sentiment classification task. However, it is specifically designed for the domain-incremental scenario, where its primary goal is to facilitate knowledge transfer across tasks. This objective is distinct from our class-incremental scenario, and thus the use of InfoNCE in CLASSIC is not directly comparable to our work.
OCM (Guo et al., 2022) proposes mutual information maximization to learn holistic representations for online continual image classification. Our work shares with OCM the motivation of learning more comprehensive representations. Here we clarify several key differences between our work and OCM. First, OCM focuses on online continual learning for image classification, which differs from our scenario. Second, for the first time, we provide a formal analysis of catastrophic forgetting from the perspective of the information bottleneck; this analysis sets us apart from OCM. Finally, the contrastive learning in OCM relies on data augmentation of the image to ensure performance: according to their ablation study, without the random-resized-crop augmentation, the performance of OCM drops drastically. In contrast, we design fast-slow and current-past contrastive learning to obtain better representations. This contrastive learning design is also a distinguishing feature of our work.

B Details about Eq. 12
In the memory replay stage, to alleviate overfitting on the memorized instances, we introduce adversarial data augmentation (Zhu et al., 2020):

min_θ E_{(x,y)∼M} [ (1/K) ∑_{t=1}^{K} max_{‖δ_t‖≤ϵ} L_ce(F(x + δ_t), y) ],   (13)

where F is the text classification model, (x, y) is a batch of data from the memory bank M, δ_t is the perturbation constrained within the ϵ-ball, and K is the number of adversarial steps. The inner maximization problem in (13) finds worst-case adversarial examples that maximize the training loss, while the outer minimization problem in (13) optimizes the model to minimize the loss on those adversarial examples. The inner maximization problem is solved iteratively:

δ_{t+1} = Π_{‖δ‖≤ϵ} ( δ_t + α · g(δ_t) / ‖g(δ_t)‖ ),   g(δ_t) = ∇_δ L_ce(F(x + δ_t), y),

where δ_t is the perturbation at the t-th step, Π_{‖δ‖≤ϵ}(·) projects the perturbation onto the ϵ-ball, and α is the step size.
Intuitively, this performs multiple adversarial attack iterations to craft adversarial examples, while simultaneously accumulating the parameter gradients ∇_θ L in each iteration. After that, the model parameters θ are updated all at once with the accumulated gradients, which is equivalent to replacing the original batch with a K-times larger adversarially augmented batch.
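The inner maximization loop can be sketched on a toy model with a closed-form gradient (a numpy illustration standing in for autograd over BERT; the logistic loss and weight vector are our own assumptions, while ϵ=3e-1, α=1e-1, K=2 follow Appendix E):

```python
import numpy as np

def project_l2(delta, eps):
    """Project the perturbation onto the L2 ball of radius eps."""
    n = np.linalg.norm(delta)
    return delta if n <= eps else delta * (eps / n)

def inner_maximize(x, w, y, eps=0.3, alpha=0.1, K=2):
    """K normalized ascent steps on a logistic loss
    L = log(1 + exp(-y * w.(x + delta))); the gradient wrt delta is closed-form."""
    delta = np.zeros_like(x)
    for _ in range(K):
        margin = y * w @ (x + delta)
        g = -y * w / (1.0 + np.exp(margin))          # dL/d(delta)
        delta = project_l2(delta + alpha * g / (np.linalg.norm(g) + 1e-12), eps)
    return delta

w = np.array([1.0, -1.0]); x = np.array([2.0, -1.0]); y = 1.0
delta = inner_maximize(x, w, y)
clean = np.log1p(np.exp(-y * w @ x))                 # loss on the clean input
adv = np.log1p(np.exp(-y * w @ (x + delta)))         # loss on the perturbed input
```

Each ascent step increases the loss while the projection keeps the perturbation inside the ϵ-ball, so the crafted example is a harder variant of the memorized instance rather than an arbitrary point.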

D Baselines
IDBR (Huang et al., 2021) proposes an information disentanglement method to learn representations that generalize well to future tasks. KCN (Cao et al., 2020) utilizes prototype retrospection and hierarchical distillation to consolidate knowledge. KDRK (Yu et al., 2021) encourages knowledge transfer between old and new classes. EMAR (Han et al., 2020) proposes a memory activation and reconsolidation mechanism to retain the learned knowledge. RP-CRE (Cui et al., 2021) proposes a memory network to retain the learned representations with class prototypes. CRL (Zhao et al., 2022) adopts contrastive learning replay and knowledge distillation to retain the learned knowledge. CRECL (Hu et al., 2022) uses a prototypical contrastive network to defy forgetting. ACA (Wang et al., 2022) designs two adversarial class augmentation mechanisms to learn robust representations. CEAR (Zhao et al., 2023) proposes memory-insensitive relation prototypes and memory augmentation to reduce overfitting to typical samples in the rehearsal stage.

E Implementation Details
We implement InfoCL with PyTorch (Paszke et al., 2019) and HuggingFace Transformers (Wolf et al., 2020). Following previous work (Cui et al., 2021; Wang et al., 2022), we use BERT-base (Devlin et al., 2019) as the encoder. PyTorch is licensed under the modified BSD license. HuggingFace Transformers and BERT-base are licensed under the Apache License 2.0. Our use of existing artifacts is consistent with their intended use.
Specifically, for the input x in relation extraction, we use [E11], [E12], [E21], and [E22] to denote the start and end positions of the head and tail entities respectively, and the representation z is the concatenation of the last hidden states of [E11] and [E21]. For the input x in event detection, the representation z is the average pooling of the last hidden states of the trigger words. For the input x in intent detection, the representation z is the average pooling of the last hidden states of the whole sentence.
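As a concrete illustration, the three representation-extraction schemes can be sketched as below, assuming `hidden` holds the encoder's last-layer hidden states for a single instance. The function names and index arguments are ours, not the released code's.

```python
import torch

def relation_representation(hidden, head_start_idx, tail_start_idx):
    """z = concatenation of the hidden states at the entity start markers
    [E11] and [E21]. `hidden`: (seq_len, d) last-layer hidden states."""
    return torch.cat([hidden[head_start_idx], hidden[tail_start_idx]], dim=-1)

def event_representation(hidden, trigger_span):
    """z = average pooling over the trigger word positions."""
    start, end = trigger_span
    return hidden[start:end].mean(dim=0)

def intent_representation(hidden, attention_mask):
    """z = average pooling over all non-padding token positions."""
    mask = attention_mask.unsqueeze(-1).float()
    return (hidden * mask).sum(dim=0) / mask.sum()
```

In practice these would be applied batch-wise to the encoder output before the classifier and the contrastive heads.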
The learning rate of InfoCL is set to 1e-5 for the BERT encoder and 1e-3 for the other modules. For FewRel, MAVEN, and HWU64, the batch size is 32. For TACRED, the batch size is 16 because of its small amount of training data. The memory bank budget for each class is 10 for all methods. The size of the queue Q in fast-slow contrastive learning is 512 and the momentum coefficient η is 0.99. The temperatures τ_1 and τ_2 are set to 0.05. The loss factors λ_1 and λ_2 are 0.05. For the adversarial memory augmentation, K = 2, ϵ = 3e-1, and α = 1e-1. For all experiments, we use NVIDIA A800 GPUs and report the average result over 5 different task sequences.
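As a sketch of the machinery these hyperparameters configure, a momentum (slow) encoder update with coefficient η and a fixed-size representation queue Q might look as follows. This is an illustration in the style of MoCo-like momentum contrast under our own assumptions, not the released implementation.

```python
import torch

@torch.no_grad()
def momentum_update(fast_encoder, slow_encoder, eta=0.99):
    """Slow encoder tracks the fast one via an exponential moving average:
    theta_slow <- eta * theta_slow + (1 - eta) * theta_fast."""
    for p_s, p_f in zip(slow_encoder.parameters(), fast_encoder.parameters()):
        p_s.mul_(eta).add_(p_f, alpha=1.0 - eta)

class RepresentationQueue:
    """Fixed-size FIFO queue of slow-encoder representations (size 512 here)."""
    def __init__(self, dim, size=512):
        self.buffer = torch.zeros(size, dim)
        self.ptr = 0
        self.full = False

    @torch.no_grad()
    def enqueue(self, reps):
        n = reps.size(0)
        idx = torch.arange(self.ptr, self.ptr + n) % self.buffer.size(0)
        self.buffer[idx] = reps           # overwrite the oldest entries
        if self.ptr + n >= self.buffer.size(0):
            self.full = True
        self.ptr = (self.ptr + n) % self.buffer.size(0)
```

Each training step would call `momentum_update` after the fast encoder's optimizer step and enqueue the batch's slow representations as negatives.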

F Analogous Classes for Evaluation
From FewRel and MAVEN, we select the top 20% of classes that are most likely to be confused with other classes.
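The selection criterion is not spelled out here; one plausible sketch is to score each class by its off-diagonal error mass in a confusion matrix and keep the most confusable fraction. The scoring rule below is our assumption, not the paper's stated procedure.

```python
import numpy as np

def most_confusable_classes(confusion, top_frac=0.2):
    """Rank classes by how often they are confused with others (sketch).

    `confusion[i, j]` counts gold class i predicted as class j. Each class
    is scored by its off-diagonal error rate; the top `top_frac` fraction
    of classes is returned, most confusable first.
    """
    confusion = np.asarray(confusion, dtype=float)
    totals = confusion.sum(axis=1)
    off_diag = totals - np.diag(confusion)          # misclassified counts per class
    error_rate = off_diag / np.maximum(totals, 1)   # guard against empty classes
    k = max(1, int(top_frac * len(error_rate)))
    return np.argsort(-error_rate)[:k].tolist()
```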

G Complete Experimental Results
The complete experimental results on all 10 tasks are shown in Table 5.

Figure 1 :
Figure 1: Illustration of continual text classification with three tasks, where each task involves two new classes. X_i, Y_i, and Z_i denote the input sentences, new classes, and learned representations for the i-th task T_i, respectively. Although the representations Z_i learned in T_i are sufficient for classifying Y_i, they are insufficient to distinguish all seen classes Y in the final test.

Figure 2 :
Figure 2: (a) An overview of InfoCL. We design fast-slow and current-past contrastive learning for initial training and memory replay, respectively. (b) Fast-slow contrastive learning. The slowly progressing model generates representations preserving more information. (c) Current-past contrastive learning with adversarial augmentation. Contrasting with the old model from T_{k-1} further alleviates representation corruption.
$$\mathcal{L} = -\sum_{i \in I} \frac{1}{|P(i)|} \sum_{p \in P(i)} \log \frac{\exp(z_i \cdot z_p / \tau_2)}{\sum_{m \in M} \exp(z_i \cdot z_m / \tau_2)}, \tag{10}$$
where I = {1, 2, ..., |B|} is the set of indices of batch B, M = {1, 2, ..., |M|} denotes the index set of instances in memory M, P(i) = {p ∈ M : y_p = y_i} is the set of indices of memory instances that share the label of z_i, and τ_2 is the temperature hyperparameter.
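A minimal sketch of this supervised contrastive loss over the memory, assuming ℓ2-normalized representations and illustrative tensor names (`z_batch` for current-batch anchors, `z_mem` for memory representations):

```python
import torch
import torch.nn.functional as F

def current_past_supcon(z_batch, y_batch, z_mem, y_mem, tau=0.05):
    """Supervised contrastive loss in the shape of Eq. (10) (sketch).

    Anchors z_i come from the batch; positives P(i) are memory instances
    with the same label; the denominator sums over all memory instances.
    """
    z_batch = F.normalize(z_batch, dim=-1)
    z_mem = F.normalize(z_mem, dim=-1)
    sim = z_batch @ z_mem.t() / tau                     # (|B|, |M|) similarities
    log_prob = sim - sim.logsumexp(dim=1, keepdim=True)  # log softmax over memory
    pos_mask = (y_batch.unsqueeze(1) == y_mem.unsqueeze(0)).float()
    # average log-probability over positives P(i), then mean over anchors
    per_anchor = -(log_prob * pos_mask).sum(1) / pos_mask.sum(1).clamp(min=1)
    return per_anchor.mean()
```

Anchors with no positive in memory contribute zero here; a full implementation would mask them out explicitly.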

Table 1 :
Mutual information comparison between supervised learning and strong CL baselines on the FewRel and MAVEN datasets. We use I(X; Z) to measure how much information about the input X the representation Z preserves. To exclude the impact of representation corruption, we instead estimate I(X_1; Z_1) after the CL models finish T_1. I(Z; Y) measures whether the learned representation is sufficient for the entire continual task.

Table 2 :
Accuracy (%) on all seen classes after learning the last three tasks. We report the average result of 5 different runs. The best results are in boldface. ACA and CEAR are specially designed for continual relation extraction and cannot be adapted to the other tasks.

Table 4 :
Average accuracy (%) and accuracy drop (%) on analogous classes. For each dataset, we select the 20% of classes that are most likely to be confused with other classes.
Following Wang et al. (2022), we remove no_relation in our experiments. Since TACRED is an imbalanced dataset, for each relation the number of training instances is limited to 320 and the number of test instances is limited to 40. TACRED is licensed under the LDC User Agreement for Non-Members.

MAVEN (Wang et al., 2020) is a large-scale event detection dataset with 168 event types. Since MAVEN has a severe long-tail distribution, we use the data of the top 120 frequent classes. The original test set of MAVEN is not publicly available, so we use the original development set as our test set. MAVEN is licensed under the Apache License 2.0.

HWU64 (Liu et al., 2019) is an intent classification dataset with 64 intent classes. Following Liu et al. (2021), we use the data of the top 50 frequent classes, for a total of 24,137 instances. HWU64 is licensed under the CC-BY-4.0 license.