Continual Generalized Intent Discovery: Marching Towards Dynamic and Open-world Intent Recognition

In a practical dialogue system, users may input out-of-domain (OOD) queries. The Generalized Intent Discovery (GID) task aims to discover OOD intents from OOD queries and extend them to the in-domain (IND) classifier. However, GID only considers one stage of OOD learning, and needs to utilize the data in all previous stages for joint training, which limits its wide application in reality. In this paper, we introduce a new task, Continual Generalized Intent Discovery (CGID), which aims to continuously and automatically discover OOD intents from dynamic OOD data streams and then incrementally add them to the classifier with almost no previous data, thus moving towards dynamic intent recognition in an open world. Next, we propose a method called Prototype-guided Learning with Replay and Distillation (PLRD) for CGID, which bootstraps new intent discovery through class prototypes and balances new and old intents through data replay and feature distillation. Finally, we conduct detailed experiments and analysis to verify the effectiveness of PLRD and understand the key challenges of CGID for future research.


Introduction
The traditional intent classification (IC) in a task-oriented dialogue system (TOD) is based on a closed-set assumption (Chen et al., 2019; Yang et al., 2021; Zeng et al., 2022) and can only handle queries within a limited scope of in-domain (IND) intents. However, users may input out-of-domain (OOD) queries in the real open world. Recently, the research community has paid more attention to OOD problems. OOD detection (Lin and Xu, 2019; Zeng et al., 2021; Wu et al., 2022; Mou et al., 2022d) aims to identify whether a user's query falls outside the range of the predefined intent set, so as to avoid wrong operations. It can safely reject OOD intents, but it also ignores OOD concepts that are valuable for future development. OOD intent discovery (Lin et al., 2020; Zhang et al., 2021; Mou et al., 2022c,a) helps determine potential development directions by grouping unlabeled OOD data into different clusters, but still cannot incrementally expand the recognition scope of existing IND classifiers. Generalized Intent Discovery (GID) (Mou et al., 2022b) further trains a network that can classify a set of labeled IND intent classes and simultaneously discover new classes from an unlabeled OOD set, incrementally adding them to the classifier.

Figure 1: Illustration of the defects of GID and the advantages of CGID. GID only performs a single stage of OOD learning and requires all IND data for joint training. In contrast, CGID updates the system in time from dynamic OOD data streams through continual OOD learning stages and almost does not rely on previous data.
Although GID realizes the incremental expansion of the recognition scope of the intent classifier without any new intent labels, two major problems limit its widespread application in reality, as shown in Fig 1: (1) GID only considers a single stage of OOD discovery and classifier expansion. In real scenarios, OOD data is gradually collected over time. Even if the current intent classifier is incrementally expanded, new OOD queries and intents will continue to emerge. Besides, the timeliness of OOD discovery needs to be considered: timely discovery of new intents and their expansion into the system helps improve the subsequent user experience.
(2) GID requires the data of all previous stages for joint training to maintain the classification ability for known intents. Since OOD samples are collected from users' real queries, storing past data may raise serious privacy issues. In addition, unlike Class Incremental Learning (CIL), which requires new classes with ground-truth labels, it is hard to obtain a large amount of dynamically labeled data in reality, and the label set for OOD queries is not predefined and needs to be mined from query logs. Motivated by the above problems, in this paper we introduce a new task, Continual Generalized Intent Discovery (CGID), which aims to continually and automatically discover OOD intents from OOD data streams and expand them to the existing IND classifier. In addition, CGID requires the system to maintain the ability to classify known intents with almost no need to store previous data, which makes existing GID methods fail when applied to CGID. Through CGID, the IC system can continually enhance its intent recognition ability from unlabeled OOD data streams, thus moving towards dynamic intent recognition in an open world. We show the difference between CGID and GID, as well as the CIL task, in Fig 2, and leave the definition and evaluation protocol to Section 2.
As CGID needs to continuously learn from unlabeled OOD data, it is foreseeable that the system will inevitably suffer from catastrophic forgetting (Biesialska et al., 2020; Masana et al., 2022) of known knowledge as well as the interference and propagation of OOD noise (Wu et al., 2021). To address these issues, we propose Prototype-guided Learning with Replay and Distillation (PLRD) for CGID. Specifically, PLRD consists of a main module composed of an encoder and a joint classifier, as well as three sub-modules: (1) class prototypes guide pseudo-labels for new OOD samples and alleviate OOD noise; (2) feature distillation reduces catastrophic forgetting; (3) a memory balances new-class learning and old-class classification by replaying old-class samples (Section 3). Furthermore, to verify the effectiveness of PLRD, we construct two public datasets and three baseline methods for CGID. Extensive experiments show that PLRD achieves significant performance improvements and the least forgetting compared to the baselines, and strikes a good balance among old-class classification, new-class discovery, and incremental learning (Section 4). To further shed light on the unique challenges faced by the CGID task, we conduct a detailed qualitative analysis (Section 5). We find that the main challenges of CGID are conflicts between different sub-tasks, OOD noise propagation, fine-grained OOD classes, and strategies for replayed samples (Section 6), which provide guidance for future work.
Our contributions are three-fold: (1) We introduce a new task, Continual Generalized Intent Discovery (CGID), to achieve dynamic and open-world intent recognition, and construct datasets and baselines for evaluating CGID. (2) We propose a practical method, PLRD, for CGID, which guides new samples through class prototypes and balances new and old tasks through data replay and feature distillation. (3) We conduct comprehensive experiments and in-depth analysis to verify the effectiveness of PLRD and to understand the key challenges of CGID for future work.

Problem Definition
In this section, we first briefly introduce the Generalized Intent Discovery (GID) task and then delve into the details of our proposed Continual Generalized Intent Discovery (CGID) task.

GID
Given a set of labeled in-domain data and an unlabeled OOD dataset, GID aims to train a joint classifier that classifies an input query into the total label set covering both the labeled IND classes and the newly discovered OOD classes.

CGID
In contrast, CGID provides data and expands the classifier in a sequential manner, which is more in line with real scenarios.
First, we define $t \in [0, T]$, where $t$ denotes the current learning stage of CGID and $T$ denotes the maximum number of learning stages. In the IND learning stage ($t = 0$), the model is trained on a labeled in-domain dataset; in each subsequent OOD learning stage ($t > 0$), an unlabeled OOD dataset with a new intent set $Y_t$ arrives and must be discovered and added to the classifier.

(Footnote 3: Estimating $|Y_t|$ is out of the scope of this paper. In the following experiments, we assume that $|Y_t|$ is the ground truth and provide an analysis in Section 5.5.)

Evaluation Protocol
For CGID, we mainly focus on the classification performance along the training phases. Following (Mehta et al., 2021), we let $a_{t,i}$ denote the accuracy on class set $Y_i$ after training on stage $t$. When $t > 0$, we calculate the average accuracy $A_t$ as follows:

$$A_t = \frac{1}{t+1} \sum_{i=0}^{t} a_{t,i} \quad (1)$$

Moreover, to measure catastrophic forgetting in CGID, we introduce the forgetting $F_t$ as follows:

$$F_t = \frac{1}{t} \sum_{i=0}^{t-1} \max_{t' \in \{i, \ldots, t-1\}} \left( a_{t',i} - a_{t,i} \right) \quad (2)$$

On the whole, $A_t$ measures the overall classification performance across all classes seen so far, while $F_t$ measures how much accuracy on previously learned class sets has been lost by stage $t$.
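The two metrics above can be sketched in a few lines, assuming the standard continual-learning definitions from Mehta et al. (2021): `acc[t][i]` is the accuracy on class set $Y_i$ after training on stage $t$ (a lower-triangular matrix). The function names are ours, for illustration only.

```python
def average_accuracy(acc, t):
    """A_t: mean accuracy over all class sets seen up to stage t."""
    return sum(acc[t][i] for i in range(t + 1)) / (t + 1)

def forgetting(acc, t):
    """F_t: for each earlier class set, the drop from its best past accuracy
    to its accuracy after stage t, averaged over the old class sets."""
    assert t > 0
    drops = []
    for i in range(t):
        best = max(acc[j][i] for j in range(i, t))  # best accuracy Y_i ever had
        drops.append(best - acc[t][i])              # how much of it was lost
    return sum(drops) / t
```

For example, with `acc = [[0.9], [0.8, 0.7], [0.75, 0.6, 0.65]]`, the class set $Y_0$ peaked at 0.9 but sits at 0.75 after stage 2, contributing 0.15 to the forgetting.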

Memory for Data Replay
We equip PLRD with a memory module M. After each learning stage, M stores a very small number of training samples and replays old-class samples in the next learning stage to prevent catastrophic forgetting and encourage positive transfer. Specifically, in the IND learning stage, we randomly select n samples for each IND class according to the ground-truth labels; in each OOD learning stage, since the ground-truth labels are unknown, we randomly select n samples for each new class according to the pseudo-labels and store them in M together with the pseudo-labels. In the new learning stage, for each batch we randomly select old-class samples {x_old} from M, with the same number as the new-class samples {x_new}, and feed them into the BERT encoder f(·) together with the new-class samples, i.e., |{x_new}| = |{x_old}|, {x} = {x_new} ∪ {x_old}.
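A minimal sketch of this replay memory (class and method names are ours, not from the paper): after each stage it stores n random samples per (pseudo-)labeled class, and during the next stage it provides a replayed batch of the same size as the new-class batch.

```python
import random

class ReplayMemory:
    def __init__(self, n_per_class):
        self.n = n_per_class
        self.store = []  # list of (sample, label-or-pseudo-label) pairs

    def add_stage(self, samples, labels):
        """Store n random samples per class; labels may be pseudo-labels."""
        by_class = {}
        for x, y in zip(samples, labels):
            by_class.setdefault(y, []).append(x)
        for y, xs in by_class.items():
            for x in random.sample(xs, min(self.n, len(xs))):
                self.store.append((x, y))

    def replay_batch(self, batch_size):
        """|{x_old}| = |{x_new}|: draw as many old samples as new ones."""
        return random.sample(self.store, min(batch_size, len(self.store)))
```

In training, `replay_batch(len(new_batch))` would be concatenated with the new-class batch before encoding.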

Prototype-guided Learning
Previous representation learning studies (Yu et al., 2020; Wang et al., 2022; Ma et al., 2022; Dong et al., 2023a,b) have shown that well-learned representations can help disambiguate noisy sample labels and mitigate forgetting. Therefore, we build prototypes through a linear projection layer $g(\cdot)$ after the encoder. In stage $t > 0$, we first randomly initialize a new class prototype $\mu_j$ for each $j \in Y_t$. For each sample $x_i \in \{x\}$, we use a $|Y^{all}_t|$-dimensional vector $q_i$ to represent the probabilities of $x_i$ being assigned to all prototypes:

$$q_i^j = \begin{cases} \mathbb{1}[j = y^{old}_i], & x_i \in \{x_{old}\} \\ \dfrac{\exp(z_i \cdot \mu_j / \tau)}{\sum_{k} \exp(z_i \cdot \mu_k / \tau)}, & x_i \in \{x_{new}\} \end{cases} \quad (3)$$

where $y^{old}_i$ is the ground-truth or pseudo label of $x_i$. Then we introduce prototypical contrastive learning (PCL) (Li et al., 2020) as follows:

$$\mathcal{L}_{pcl} = -\frac{1}{|\{x\}|} \sum_{x_i \in \{x\}} \sum_{j \in Y^{all}_t} q_i^j \log \frac{\exp(z_i \cdot \mu_j / \tau)}{\sum_{k \in Y^{all}_t} \exp(z_i \cdot \mu_k / \tau)} \quad (4)$$

where $\tau$ denotes the temperature, $q_i^j$ is the $j$-th element of $q_i$, and $z_i = g(f(x_i))$. By pulling similar samples towards the same prototype, PCL can learn clear intent representations for new classes and maintain representations for old classes. To further improve the generalization of the representations, we also introduce an instance-level contrastive loss (Chen et al., 2020) for $x_i$:

$$\mathcal{L}_{ins} = -\frac{1}{|\{x\}|} \sum_{x_i \in \{x\}} \log \frac{\exp(z_i \cdot \hat{z}_i / \tau)}{\sum_{x_k \in \{x\}, k \neq i} \exp(z_i \cdot z_k / \tau)} \quad (5)$$

where $\hat{z}_i$ denotes the dropout-augmented view of $z_i$. Next, we update all new and old prototypes in a sample-wise moving-average manner to reduce computational complexity, following (Wang et al., 2022). For sample $x_i$, prototype $\mu_j$ is updated as follows:

$$\mu_j \leftarrow \mathrm{Normalize}\left( \gamma \mu_j + (1 - \gamma) z_i \right) \quad (6)$$

where the moving-average coefficient $\gamma$ is an adjustable hyperparameter and $j$ is the index of the maximum element of $q_i$.
Finally, for each new sample x_i ∈ {x_new}, its pseudo-label is assigned as the index of the new-class prototype nearest to its representation z_i. We optimize the joint classifier with a cross-entropy loss L_ce over both the new and the replayed samples.
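The prototype machinery above can be sketched in pure Python (a real implementation would use torch tensors on the GPU). The softmax form of the soft assignment and the normalization after the moving-average update follow the equations above; helper names are ours.

```python
import math

def normalize(v):
    """L2-normalize a vector (identity-safe for the zero vector)."""
    s = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / s for x in v]

def soft_assign(z, protos, tau=0.1):
    """q_i: temperature-scaled softmax over prototype similarities."""
    sims = {j: sum(a * b for a, b in zip(z, mu)) for j, mu in protos.items()}
    m = max(sims.values())  # subtract max for numerical stability
    exps = {j: math.exp((s - m) / tau) for j, s in sims.items()}
    total = sum(exps.values())
    return {j: e / total for j, e in exps.items()}

def pseudo_label(z, protos, new_classes):
    """Nearest *new-class* prototype decides the pseudo label of a new sample."""
    return max(new_classes,
               key=lambda j: sum(a * b for a, b in zip(z, protos[j])))

def update_prototype(protos, z, gamma=0.99):
    """Sample-wise moving-average update of the best-matching prototype."""
    q = soft_assign(z, protos)
    j = max(q, key=q.get)  # index of the maximum element of q_i
    protos[j] = normalize([gamma * m + (1 - gamma) * x
                           for m, x in zip(protos[j], z)])
    return j
```

Updating one prototype per sample keeps the cost linear in batch size, which is the point of the moving-average scheme.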

Feature Distillation
It can be expected that the encoder features will change significantly when the network parameters are updated in a new learning stage. This means that the network tends to forget the knowledge previously learned from the old classes and suffers from catastrophic forgetting. To preserve the knowledge encoded in these features, we integrate feature distillation into PLRD. Specifically, at the beginning of stage $t$, we copy and freeze the encoder, denoted as $f_{init}(\cdot)$. Then, given the replayed samples $x_i \in \{x_{old}\}$ in a batch, we constrain the feature output $f(x_i)$ of the current encoder to stay close to the frozen feature $f_{init}(x_i)$. Formally, the feature distillation loss is as follows:

$$\mathcal{L}_{fd} = \frac{1}{|\{x_{old}\}|} \sum_{x_i \in \{x_{old}\}} \left\| f(x_i) - f_{init}(x_i) \right\|_2^2 \quad (7)$$
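As a sketch, the distillation term reduces to a mean squared distance between the current and frozen features of the replayed batch. The exact distance is not spelled out in this extract, so squared L2 is our assumption; in practice both feature sets would be torch tensors from `f` and the frozen copy `f_init`.

```python
def feature_distill_loss(feats_current, feats_frozen):
    """Mean squared element-wise distance between current-encoder and
    frozen-encoder features over the replayed samples in a batch.
    (Squared-L2 form is an assumption; see the lead-in.)"""
    total, dims = 0.0, 0
    for f_cur, f_old in zip(feats_current, feats_frozen):
        for a, b in zip(f_cur, f_old):
            total += (a - b) ** 2
            dims += 1
    return total / dims
```

A zero loss means the encoder's behaviour on old-class samples is unchanged, which is exactly the anti-forgetting signal the paper wants.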

Overall Training
The total loss of PLRD combines the above objectives:

$$\mathcal{L} = \mathcal{L}_{ce} + \lambda_1 \mathcal{L}_{pcl} + \lambda_2 \mathcal{L}_{ins} + \lambda_3 \mathcal{L}_{fd} \quad (8)$$

where $\lambda_1$, $\lambda_2$, and $\lambda_3$ are balancing weights.

Experiment

Datasets
We construct the CGID datasets from two widely used intent classification datasets, Banking (Casanueva et al., 2020) and CLINC (Larson et al., 2019). Banking covers a single domain, containing 13,083 user queries and 77 intents, while CLINC contains 22,500 queries covering 150 intents across 10 domains. For each dataset, we randomly select a specified proportion of all intent classes (40%, 60%, and 80%, respectively) as OOD types, with the rest being IND types. Furthermore, we set the maximum stage T = 3 and divide the OOD data into three equal parts, one for each OOD training stage. We show the number of classes at each stage in Table 1 and leave the detailed statistics to Appendix A.

Baselines
Since this is the first study on CGID, there are no existing methods that solve exactly the same task. We adopt three prevalent methods from OOD discovery and GID and extend them to the CGID setting as competitive baselines.
• K-means is a pipeline baseline that first uses the K-means clustering algorithm (MacQueen, 1965) to cluster the new samples and obtain pseudo labels, and then combines these samples with the replayed samples in the memory to train the joint classifier at each OOD training stage.
• DeepAligned is another pipeline baseline that leverages the iterative clustering algorithm DeepAligned (Zhang et al., 2021).At each OOD training phase, DeepAligned iteratively clusters the new data and then utilizes them along with the replayed samples for classification training.
• E2E is an end-to-end baseline. At each OOD training stage, E2E (Mou et al., 2022b) amalgamates the new instances and the replayed samples and obtains logits through the encoder and joint classifier. The model is optimized with a unified classification loss, where the OOD pseudo-labels are obtained by swapping predictions (Caron et al., 2020).

Main Results
We conduct experiments on Banking and CLINC with three different OOD ratios, as shown in Tables 2 and 3. Overall, our proposed PLRD consistently outperforms all baselines by a large margin. Next, we analyze the results from three aspects: (1) Comparison of different methods. We observe that DeepAligned roughly achieves the best IND performance while E2E has the best OOD performance among the baselines. However, PLRD consistently and significantly outperforms all baselines on both IND and OOD, achieving the best performance and the best new-old task balance; under the average of the three ratios, PLRD surpasses the strongest baseline by roughly 7% on CLINC. As for forgetting, E2E suffers a substantial performance drop on old classes when learning new classes, while PLRD is lower than the best baseline by 3.72% and 1.69% ($F^{ALL}_T$) on Banking and CLINC, respectively. This indicates that PLRD does not sacrifice much performance on old classes when learning new classes and exhibits the least forgetting among all methods.
(2) Comparison of different datasets. We validate the effectiveness of our method on different datasets: CLINC is multi-domain and coarse-grained, while Banking contains more fine-grained intent types within a single domain. The performance of all methods on CLINC is significantly better than on Banking. For example, PLRD is 11.72% ($A^{ALL}_T$) higher on CLINC than on Banking at an OOD ratio of 60%. In addition, at the same OOD ratio, PLRD shows an average increase of 7.53% ($F^{IND}_T$), 6.45% ($F^{OOD}_T$), and 6.57% ($F^{ALL}_T$) on Banking over CLINC. We believe this is because fine-grained new and old classes are easily confused, which leads to serious new-old task conflicts and high forgetting. Nevertheless, PLRD achieves larger improvements than the baselines on Banking, indicating that it copes better with fine-grained intent scenarios.
(3) Effect of different OOD ratios. We observe that as the OOD ratio increases, the forgetting of IND classes increases and the accuracy on OOD classes drops significantly for all methods. For PLRD, when the OOD ratio increases from 40% to 80% on Banking, $F^{IND}_T$ rises from 9.64% to 19.80%, and $A^{OOD}_T$ drops from 76.70% to 63.19%. Intuitively, more OOD classes make it harder to distinguish samples from different distributions, leading to noisier pseudo-labels. Moreover, during the incremental process, more OOD classes update the model to a greater extent, resulting in more forgetting of IND knowledge.

Visualizing the representations over stages, we find that in the IND training stage the IND classes form compact clusters, while the OOD samples are scattered in space. As the stages progress, the gray points are gradually colored and move from dispersion to aggregation, indicating that new OOD classes are continually discovered and learn good representations. In addition, the already aggregated clusters gradually disperse (see the "red" points), indicating that the representations of old classes are deteriorating.
Next, to quantitatively measure representation quality, we calculate the intra-class and inter-class distances and use the ratio of inter-class distance to intra-class distance as the compactness, following (Islam et al., 2021). We report the compactness and accuracy in Fig 5. We can see that the compactness of OOD classes is much lower than that of IND classes, indicating that representation learning with labeled IND samples outperforms that with unlabeled OOD samples. As the stage $t$ increases, the compactness of the IND classes gradually decreases, while the compactness of the $Y_i$ ($i > 0$) classes increases significantly when $t$ equals $i$ and then gradually decreases. This demonstrates the learning and forgetting effects of CGID from a representation perspective. Furthermore, we observe that the maximal compactness of $Y_i$ decreases as $i$ increases, showing that the ability to learn new classes gradually declines. We attribute this to the noise in the OOD pseudo-labeled data and the growing need to suppress forgetting of more old classes. Finally, the trends of accuracy and compactness remain consistent, suggesting that representation quality is closely related to the classification performance of CGID.
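The compactness measure can be sketched as follows, assuming the natural reading of Islam et al. (2021): mean distance between class centroids (inter-class) divided by mean distance of samples to their own centroid (intra-class). Higher values mean tighter, better-separated clusters; function names are ours.

```python
import itertools, math

def centroid(points):
    """Mean vector of a list of equal-length feature vectors."""
    n = len(points)
    return [sum(p[d] for p in points) / n for d in range(len(points[0]))]

def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def compactness(features_by_class):
    """Ratio of mean inter-centroid distance to mean sample-to-own-centroid
    distance, over a dict {class_id: [feature, ...]}."""
    cents = {c: centroid(ps) for c, ps in features_by_class.items()}
    intra = [dist(p, cents[c])
             for c, ps in features_by_class.items() for p in ps]
    inter = [dist(cents[a], cents[b])
             for a, b in itertools.combinations(cents, 2)]
    return (sum(inter) / len(inter)) / (sum(intra) / len(intra))
```

Two tight, well-separated clusters thus score far above 1, while overlapping diffuse clusters score near or below 1.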

Loss and Gain of CGID
During the CGID process, the classifier's performance on IND classes gradually declines, while the number of supported OOD classes continually expands. To quantify this change, we define the Loss and the Gain at stage $t$ for CGID, measuring respectively the relative drop in accuracy on the initial IND classes and the relative growth in recognition ability brought by the newly added OOD classes. We illustrate the variations in Loss and Gain of all methods over stages in Fig 6. The results show that as training progresses, the Loss of all methods decreases overall and the Gain increases continuously. After training finishes, although the Loss is roughly 20%, the Gain of PLRD is far greater, reaching over 200%. This indicates that the Gain generated by CGID is much higher than the Loss, bringing a net positive effect to the classifier. Compared with the other methods, PLRD has the lowest Loss and the highest Gain at each stage, and its advantage continuously amplifies over stages. This further consolidates the conclusion that PLRD outperforms the baselines.
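The paper's exact Loss/Gain formulas are not recoverable from this extract, so the sketch below uses one plausible instantiation consistent with the description above: Loss is the relative drop in IND accuracy since stage 0, and Gain is the relative growth in the expected number of correctly recognized intent classes. Both definitions are our assumption, for illustration only.

```python
def loss_t(ind_acc_0, ind_acc_t):
    """Assumed Loss: relative drop in accuracy on the initial IND classes."""
    return (ind_acc_0 - ind_acc_t) / ind_acc_0

def gain_t(n_classes_0, acc_0, n_classes_t, acc_t):
    """Assumed Gain: growth in expected correctly-recognized classes,
    relative to the IND-only stage."""
    return (n_classes_t * acc_t - n_classes_0 * acc_0) / (n_classes_0 * acc_0)
```

Under these definitions, dropping from 90% to 72% IND accuracy gives a Loss of 20%, while growing from 17 well-classified classes to 47 classes at 75% accuracy gives a Gain well above 100%, matching the qualitative picture in Fig 6.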

Effect of Replaying Samples in Memory
In this section, we explore the effect of replayed samples from both selection strategies and quantity.
Selection Strategy. In the CIL task, since the samples are labeled, we only need to consider the diversity of the replayed samples. In CGID, however, we must additionally take the quality of pseudo-labels into account. We explore three selection strategies for replayed samples: random (randomly sampling from the training set), icarl (selecting the samples closest to their prototypes, following Rebuffi et al. (2017)), and icarl_contrary (selecting the samples farthest from their prototypes). In Table 4, we report the pseudo-label accuracy (Acc) and the average feature variance (Var) of the replayed samples, as well as the final classifier accuracy of PLRD. We can see that icarl has the highest pseudo-label accuracy, while icarl_contrary has the largest sample variance and favors diversity. However, PLRD under the random strategy achieves the highest OOD and ALL accuracy. This demonstrates that neither accuracy nor diversity alone leads to better performance: CGID needs to strike a balance between the diversity and the accuracy of the replayed samples.

Quantity of Replayed Samples. Fig 7 illustrates the effect of replaying different numbers of previous examples. It is evident that replaying more previous examples leads to higher accuracy. Compared with replaying no examples (n = 0), storing just one example per old class (n = 1) significantly improves accuracy, demonstrating that replaying old samples is crucial. In addition, PLRD outperforms the baselines significantly when n ≤ 10, proving its effectiveness with few-shot sample replay. However, when all previous examples are replayed (n = ALL), PLRD performs slightly worse than E2E. We believe this is because PLRD's anti-forgetting mechanism slightly limits the learning of new classes, and replaying all previous examples deviates from the CGID setting.

In the ablation study (Table 5), removing $\mathcal{L}_{ins}$ and $\mathcal{L}_{pcl}$ respectively leads to a certain degree of performance decline, indicating that prototype-level and instance-level contrastive learning help OOD discovery and relieve OOD noise. Finally, retaining only $\mathcal{L}_{ce}$ results in the largest accuracy decline, proving the importance of the multiple optimization objectives in PLRD.

Estimate the Number of OOD intents
In the previous experiments, we assumed that the number of new OOD classes at each stage is predefined and equal to the ground truth. In practical applications, however, the number of new classes usually needs to be estimated automatically. We adopt the same estimation algorithm as Zhang et al. (2021) and Mou et al. (2022b). Since the estimation algorithm is based on sample features, we use the model itself as the feature extractor at the beginning of each OOD learning stage. As shown in Table 6, when the estimated number of classes is inaccurate, the performance of all methods declines to some extent. However, PLRD estimates the number most accurately and achieves the best performance. Then, to align the different methods, we consistently use the model frozen after the IND training stage as the feature extractor for the subsequent stages. With the same estimation quality, PLRD still significantly outperforms each baseline, demonstrating its robustness.

Challenges
Based on the above experiments and analysis, we summarize the unique challenges faced by CGID:

Conflicts between different sub-tasks. In CGID, the discovery and classification of new OOD classes favor different features, and learning new OOD classes inevitably interferes with existing knowledge about old classes. However, preventing forgetting leads to model rigidity, which hinders the learning of new classes.
OOD noise accumulation and propagation. In the continual OOD learning stages, fine-tuning the model with noisy pseudo-labeled OOD samples, as well as replaying noisy samples, causes the noise to accumulate and spread, potentially hurting the model's ability to learn effectively from new OOD samples in subsequent stages.
Fine-grained OOD classes. Section 4.3 indicates that fine-grained data leads to high forgetting and poor performance. We believe this is because fine-grained new and old classes are easily confused, which brings serious conflicts between new and old tasks.
Strategy for replayed samples. The experiments in Section 5.3 show that CGID must consider the trade-off between the diversity and the accuracy of replayed samples, as well as the trade-off between the quantity of replayed samples and user privacy.
Continual quantity estimation of new classes. Section 5.5 shows that even minor per-stage estimation errors can accumulate over stages, leading to severely biased estimates and deteriorated performance.

Related Work
OOD Intent Discovery. OOD intent discovery aims to discover new intent concepts from unlabeled OOD data. Unlike simple text clustering, it considers how to leverage IND prior knowledge to enhance the discovery of unknown OOD intents. Lin et al. (2020) use OOD representations to compute similarities as weak supervision signals. Zhang et al. (2021) propose an iterative method, DeepAligned, that alternates representation learning and cluster assignment, while Mou et al. (2022c) perform contrastive clustering to jointly learn representations and cluster assignments. Nevertheless, OOD intent discovery primarily focuses on unveiling new intents and overlooks the integration of these newfound intents with the existing, well-defined intent categories.

Class Incremental Learning. The primary goal of class-incremental learning (CIL) is to acquire knowledge about new classes while preserving the information about previously learned ones, thereby constructing a unified classifier. Earlier studies (Ke et al., 2021; Geng et al., 2021; Li et al., 2022) mainly focused on preventing catastrophic forgetting and on efficient replay. However, these studies assume labeled data streams, whereas in reality large amounts of continuously annotated data are hard to obtain and the label space is undefined. Unlike CIL, CGID continuously identifies and assimilates new classes from unlabeled OOD data streams, which poses a set of challenges beyond conventional CIL.

Conclusion
In this paper, we introduce a challenging yet practical task, Continual Generalized Intent Discovery (CGID), which aims at continuously and automatically discovering new intents from OOD data streams and incrementally extending the classifier, thereby enabling dynamic intent recognition in an open-world setting. To address this task, we propose a new method, Prototype-guided Learning with Replay and Distillation (PLRD). Extensive experiments and qualitative analyses validate the effectiveness of PLRD and provide insights into the key challenges of CGID.

Limitations
This paper proposes a new task, Continual Generalized Intent Discovery (CGID), which aims at continually and automatically discovering new intents from unlabeled out-of-domain (OOD) data and incrementally adding them to the existing classifier, together with a practical method, Prototype-guided Learning with Replay and Distillation (PLRD). However, several directions remain for improvement: (1) Although PLRD outperforms every baseline, a large gap remains to the theoretical upper bound of a model that never forgets previous knowledge. (2) All baselines and PLRD replay a small number of previous samples; CGID methods that use no previous samples at all are unexplored in this paper and are a direction for future work. (3) Although PLRD incurs no additional overhead during inference, it requires maintaining prototypes and a frozen copy of the encoder during training, which occupies extra resources and can be optimized in future work.

Figure 2 :
Figure 2: The comparison of CGID with GID and CIL.
where $y^{IND}_i \in Y^{IND}$, $Y^{IND} = \{1, 2, \ldots, N\}$, GID aims to train a joint classifier to classify an input query into the total label set $Y = \{1, \ldots, N, N+1, \ldots, N+M\}$, where the first $N$ elements represent the labeled IND classes and the last $M$ elements represent the newly discovered unlabeled OOD classes.
As shown in Fig 3, our proposed PLRD framework consists of a main module, composed of an encoder and a joint classifier, and three sub-modules: (1) the memory module replays known-class samples to balance the learning of new classes against maintaining known classes; (2) the class prototype module generates pseudo-labels for new OOD samples; (3) feature distillation alleviates catastrophic forgetting of old classes. The joint classifier $h$ consists of an old-class classification head $h_{old}$ and a new-class classification head $h_{new}$, outputting logits $l = [l_{old}; l_{new}]$. After stage $t$ ends, $l_{new}$ is merged into $l_{old}$, i.e., $l_{old} \leftarrow [l_{old}; l_{new}]$. Then, when stage $t+1$ starts, a new head with dimension $|Y_{t+1}|$ is created.
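The head-merging mechanics can be sketched as follows, with plain-Python weight lists standing in for torch Linear layers (class and method names are ours): logits from the old and new heads are concatenated, and at the end of each stage the new head is folded into the old one before a fresh head of size $|Y_{t+1}|$ is created.

```python
class JointClassifier:
    def __init__(self, n_ind, feat_dim):
        # one weight row per class; a real model would use nn.Linear
        self.old_head = [[0.0] * feat_dim for _ in range(n_ind)]
        self.new_head = []
        self.feat_dim = feat_dim

    def logits(self, z):
        """l = [l_old; l_new]: concatenate logits from both heads."""
        rows = self.old_head + self.new_head
        return [sum(w * x for w, x in zip(row, z)) for row in rows]

    def end_stage(self, n_new_next):
        """l_old <- [l_old; l_new], then create a head of size |Y_{t+1}|."""
        self.old_head += self.new_head
        self.new_head = [[0.0] * self.feat_dim for _ in range(n_new_next)]
```

This way the classifier's output dimension grows monotonically with each stage while old-class weights are never discarded.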

Figure 5 :
Figure 5: The compactness and accuracy of classes at different stages under Banking OOD ratio = 40%.

Figure 6 :
Figure 6: The Loss and Gain of the classifier at different stages under Banking, where the maximal stage T = 6, the number of IND classes is 17, and the number of new classes in each OOD stage is 10.

Figure 7 :
Figure 7: The effect of different numbers of replayed samples under Banking OOD ratio = 60%, where ALL means storing all training data in the memory.
To overcome the limitation that OOD intent discovery cannot expand the existing classifier, Mou et al. (2022b) proposed the Generalized Intent Discovery (GID) task. GID takes both labeled IND data and unlabeled OOD data as input and performs joint classification over IND and OOD intents. As such, GID must discover semantic concepts from unlabeled OOD data while learning joint classification. However, GID can only perform a one-off OOD learning stage and requires the full data of known classes, severely limiting its practical use. Therefore, we introduce Continual Generalized Intent Discovery (CGID) to address the challenges of dynamic and continual open-world intent classification.

Figure 3: The overall architecture of our PLRD method. During the IND training stage, we only use the cross-entropy loss. In the OOD training stages, multiple modules and learning objectives jointly optimize the model.

Table 1 :
The number of new classes at each stage.

Table 4 :
Comparison of different selection strategies under Banking OOD ratio = 60%.

Table 5 :
Ablation study of different learning objective for PLRD under Banking OOD ratio = 60%.
In Table 5, we perform an ablation study to investigate the effect of each learning objective.

Ting-En Lin and Hua Xu. 2019. Deep unknown intent detection with margin loss. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 5491-5496, Florence, Italy. Association for Computational Linguistics.

Ting-En Lin, Hua Xu, and Hanlei Zhang. 2020. Discovering new intents via constrained deep adaptive clustering with cluster refinement. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 8360-8367.

Tingting Ma, Huiqiang Jiang, Qianhui Wu, Tiejun Zhao, and Chin-Yew Lin. 2022. Decomposed meta-learning for few-shot named entity recognition. In Findings of the Association for Computational Linguistics: ACL 2022, pages 1584-1596, Dublin, Ireland. Association for Computational Linguistics.

J. MacQueen. 1965. Some methods for classification and analysis of multivariate observations. In Proc. 5th Berkeley Symposium on Mathematical Statistics and Probability, page 281.

Marc Masana, Xialei Liu, Bartłomiej Twardowski, Mikel Menta, Andrew D. Bagdanov, and Joost van de Weijer. 2022. Class-incremental learning: survey and performance evaluation on image classification. IEEE Transactions on Pattern Analysis and Machine Intelligence.