CLASSIC: Continual and Contrastive Learning of Aspect Sentiment Classification Tasks

This paper studies continual learning (CL) of a sequence of aspect sentiment classification (ASC) tasks in a particular CL setting called domain incremental learning (DIL). Each task is from a different domain or product. The DIL setting is particularly suited to ASC because in testing the system needs not know the task/domain to which the test data belongs. To our knowledge, this setting has not been studied before for ASC. This paper proposes a novel model called CLASSIC. The key novelty is a contrastive continual learning method that enables both knowledge transfer across tasks and knowledge distillation from old tasks to the new task, which eliminates the need for task ids in testing. Experimental results show the high effectiveness of CLASSIC.


Introduction
Continual learning (CL) learns a sequence of tasks incrementally. After learning a task, its training data is often discarded (Chen and Liu, 2018). The CL setting is useful when the data privacy is a concern, i.e., the data owners do not want their data used by others (Ke et al., 2020b;Qin et al., 2020;Ke et al., 2021). In such cases, if we want to leverage the knowledge learned in the past to improve the new task learning, CL is appropriate as it shares only the learned model, but not the data. In our case, a task is a separate aspect sentiment classification (ASC) problem of a product or domain (e.g., camera or phone) (Liu, 2012). ASC is stated as follows: Given an aspect term (e.g., sound quality in a phone review) and a sentence containing the aspect (e.g., "The sound quality is poor"), ASC classifies whether the sentence expresses a positive, negative, or neutral opinion about the aspect.
There are three CL settings (van de Ven and Tolias, 2019): Class Incremental Learning (CIL), Task Incremental Learning (TIL), and Domain Incremental Learning (DIL). In CIL, the tasks contain non-overlapping classes. Only one model is built for all classes seen so far. In testing, no task information is provided. This setting is not suitable for ASC as ASC tasks have the same three classes. TIL builds one model for each task in a shared network. In testing, the system needs the task (e.g., phone domain) that each test instance (e.g., "The sound quality is great") belongs to and uses only the model for the task to classify the instance. Requiring the task information (e.g., phone domain) is a limitation. Ideally, the user should not have to provide this information for a test sentence. That is the DIL setting, i.e., all tasks sharing the same fixed classes (e.g., positive, negative, and neutral). In testing, no task information is required.
This work uses the DIL setting to learn a sequence of ASC tasks in a neural network. The key objective is to transfer knowledge across tasks to improve classification compared to learning each task separately. An important goal of any CL is to overcome catastrophic forgetting (CF) (McCloskey and Cohen, 1989), which means that in learning a new task, the system may change the parameters learned for previous tasks and cause their performance to degrade. We solve the CF problem as well; otherwise we cannot achieve improved accuracy. However, sharing the classification head for all tasks in DIL makes cross-task interfere/update inevitable. Without task information provided in testing makes DIL even more challenging.
Previous research has shown that one of the most effective approaches for ASC (Xu et al., 2019;Sun et al., 2019) is to fine-tune the BERT (Devlin et al., 2019) using the training data. However, our experiments show that this works poorly for DIL because the fine-tuned BERT on a task captures highly task specific features that are hard to use by other tasks.
In this paper, we propose a novel model called CLASSIC (Continual and contrastive Learning for ASpect SentIment Classification) in the DIL setting. Instead of fine-tuning BERT for each task, which causes serious CF, CLASSIC uses the idea of Adapter-BERT in (Houlsby et al., 2019) to avoid changing BERT parameters and yet achieve equally good results as BERT fine-tuning. A novel contrative continual learning method is proposed (1) to transfer the shareable knowledge across tasks to improve the accuracy of all tasks, and (2) to distill the knowledge (both shareable and not shareable) from previous tasks to the model of the new task so that the new/last task model can perform all tasks, which eliminates the need for task information (e.g., task id) in testing. Existing contrastive learning (Chen et al., 2020) cannot do these.
Task masks are also learned and used to protect task-specific knowledge to avoid forgetting (CF). Extensive experiments have been conducted to show the effectiveness of CLASSIC.
In summary, this paper makes the following contributions: (1) It proposes the problem of domain continual learning for ASC, which has not been attempted before.
(2) It proposes a new model called CLASSIC that uses adapters to incorporate the pretrained BERT into the ASC continual learning, a novel contrastive continual learning method for knowledge transfer and distillation, and task masks to isolate task-specific knowledge to avoid CF.

Related Work
Several researchers have studied lifelong or continual learning for sentiment analysis. Early works are done under Lifelong Learning (LL) (Silver et al., 2013;Ruvolo and Eaton, 2013;Chen and Liu, 2014). Two Naive Bayes (NB) approaches were proposed to improve the new task learning (Chen et al., 2015;. Xia et al. (2017) proposed a voting based approach. All these systems work on document sentiment classification (DSC). Shu et al. (2017) used LL for aspect extraction. These works do not use neural networks, and have no CF problem.
L2PG (Qin et al., 2020) uses a neural network but improves only the new task learning for DSC. Wang et al. (2018) worked on ASC, but since they improve only the new task learning, they did not deal with CF. Each task uses a separate network.
Existing CL systems SRK (Lv et al., 2019) and KAN (Ke et al., 2020b) are for DSC in the TIL setting, not for ASC. B-CL (Ke et al., 2021) is the first CL system for ASC. It also uses the idea of Adapter-BERT in (Houlsby et al., 2019) and is based on Capsule Network. More importantly, B-CL works in the TIL setting. The proposed CLAS-SIC system is based on contrastive learning and works in the DIL setting for ASC, which is a more realistic setting for practical applications.
General Continual Learning (CL): CL has been studied extensively in machine learning (Chen and Liu, 2018;Parisi et al., 2019). Existing work mainly focuses on dealing with CF. There are several main approaches. (1) Regularization-based approaches such as those in (Kirkpatrick et al., 2016; add a regularization in the loss to consolidate previous knowledge when learning a new task. (2) Parameter isolation-based approaches such as those in (Serrà et al., 2018;Ke et al., 2020a;Abati et al., 2020) make different subsets of the model parameters dedicated to different tasks and identify and mask them out during the training of the new task.
(3) Replay-based approaches such as those in (Rebuffi et al., 2017;Lopez-Paz and Ranzato, 2017;Chaudhry et al., 2019) retain an exemplar set of old task training data to help train the new task. The methods in (Shin et al., 2017;Kamra et al., 2017;Rostami et al., 2019;He and Jaeger, 2018) build data generators for previous tasks so that in learning the new task, the generated data for previous tasks can help avoid CF.
These methods are for overcoming CF in the CIL or TIL setting of CL. Limited work has been done on knowledge transfer, which is our goal. There is little work in the DIL setting except the replay method DER++ (Buzzega et al., 2020), which saves some past data. CLASSIC saves no past data.
Contrastive learning (Chen et al., 2020;He et al., 2020) is the base of our contrastive continual learning method. However, there is a major difference. Existing contrastive learning uses various transformations (e.g., rotation and cropping) of the existing data (e.g., images) to generate different views of the data. However, we use the hidden space information from the previous task models to create views for explicit knowledge transfer and distillation. Existing contrastive learning cannot do that.

Proposed CLASSIC Method
State-of-the-art ASC systems all use BERT (Devlin et al., 2019) or other language models as the base. The proposed technique CLASSIC adopts the BERT-based ASC formulation in (Xu et al., 2019), where the aspect term (e.g., sound quality) and review sentence (e.g., "The sound quality is great") are concatenated via [SEP]. The sentiment polarity is predicted on top of the [CLS] token. As indicated earlier, although BERT can achieve state-of-the-art performance on a single task, its architecture and fine-tuning are unsuitable for CL (see Sec. 1) and perform very poorly (Sec. 4.4). We found that the BERT adapter idea in (Houlsby et al., 2019) is a better fit for CL.
BERT Adapter. The idea was given in Adapter-BERT (Houlsby et al., 2019), which inserts two 2layer fully-connected networks (adapters) in each transformer layer of BERT (Figure 1(CSC)). During training for the end-task, only the adapters and normalization layers are updated. All the other BERT parameters are frozen. This is good for CL as fine-tuning the BERT causes serious forgetting. Adapter-BERT achieves similar accuracy to the fine-tuned BERT (Houlsby et al., 2019).

Overview of CLASSIC
The architecture of CLASSIC is given in Figure 1, which works in the DIL setting for ASC. It uses Adapter-BERT to avoid fine-tuning BERT. CLAS-SIC takes two inputs in training: (1) hidden states h (t) from the feed-forward layer of a transformer layer of BERT and (2) task id t (no task id is needed in testing, see Sec. 3.2.3). The outputs are hidden states with features for task t to build a classifier.
CLASSIC uses three sub-systems to achieve its objectives (see Sec. 1): (1) contrastive ensemble distillation (CED) for mitigating CF by distilling the knowledge of previous tasks to the current task model; (2) contrastive knowledge sharing (CKS) to encourage knowledge transfer; and (3) contrastive supervised learning on the current task model (CSC) to improve the current task model accuracy. We call this framework contrastive continual learning, inspired by contrastive learning.
Contrastive learning uses multiple views of the existing data for representation learning to group similar data together and push dissimilar data far away, which makes it easier to learn a more accurate classifier. It uses various transformations of the existing data to create useful views. Given a minibatch of N training examples, if we create another view for each example, the batch will have 2N examples. We assume that i and j are two views of the training example. If we use i as the anchor, (i, j) is called a positive pair. All other pairs (i, k) for k = i are negative pairs. The contrastive loss for this positive pair is (Chen et al., 2020), where the dot product h i · h j is regarded as a similarity function in the hidden space and τ is temperature. The final loss for the batch is calculated across all positive pairs. Eq. 1 is for unsupervised contrastive learning. It can also be used for supervised contrastive learning, where any two instances/views from the same class form a positive pair, and any instance of a class and any instance from other classes form a negative pair.

Overcoming Forgetting via Contrastive Ensemable Distillation (CED)
The CED objective is to deal with CF. We first introduce task masks that CED relies on to preserve the previous task knowledge/models to be distilled to the new task model to avoid CF.

Task Masks (TMs)
Given the input hidden states h (t) from the feedforward layer of a transformer layer, the adapter maps them into input k (t) l via a fully-connected network, where l is the l-th layer of the adapter. A TM (a "soft" binary mask) m (t) l is trained for each task t at each layer l in the adapter during training Figure 2: Illustration of task masking: a (learnable) task mask is applied after the activation function to selectively activate a neuron (or feature). The four rows of each task corresponds to the two fully-connected layers and their corresponding task masks. In the neurons before training, those with 0's are the neurons to be protected (masked) and those neurons without a number are free neurons (not used). In the neurons after training, those with 1's show neurons that are important for the current task, which are used as masks for the future. Those neurons with more than one color indicate that they are shared by more than one task. Those 0 neurons without a color are not used by any task. task t's classifier, indicating the neurons that are important for the task in the layer. Here we borrow the hard attention idea in (Serrà et al., 2018) and leverage the task id embedding to train the TMs.
For a task id t, its embedding e (t) l consists of differentiable parameters that can be learned together with other parts of the network and it is trained for each layer in the adapter. To generate the TM m l , Sigmoid is used as a pseudo-gate and a positive scaling hyper-parameter s is applied to help training. The m (t) l is computed as follows: (2) Note that the neurons in m

Training Task Masks (TMs)
For each previous task i prev ∈ T prev , its TM m (iprev) l indicates which neurons are used by that task and need to be protected. In learning task t, m l on all used neurons of the layer l to 0. Before modifying the gradient, we first accumulate all used neurons by all previous tasks TMs. Since m (iprev) l is binary, we use maxpooling to achieve the accumulation: The term m (tac) l is applied to the gradient: Those gradients corresponding to the 1 entries in m (tac) l are set to 0 while the others remain unchanged. In this way, neurons in an old task are protected. Note that we expand (copy) the vector m l is not easy to train. To make the learning of e (t) l easier and more stable, an annealing strategy is applied (Serrà et al., 2018). That is, s is annealed during training, inducing a gradient flow and set s = s max during testing. Eq. 2 approximates a unit step function as the mask, with m where b is the batch index and B is the total number of batches in an epoch. Illustration. In Figure 2, after learning Task 1, we obtain its useful neurons marked in orange with a "1" in each neuron, which serves as a mask in learning future tasks. In learning Task 2, those useful neurons for Task 1 are masked (with "0" in those orange neurons on the left). The process also learns the useful neurons for Task 2 marked in green with "1"s. When Task 3 arrives, all neurons for Tasks 1 and 2 are masked, i.e., its TM entries are set to 0 (orange and green before training). After training Task 3, we see that Task 3 and Task 2 have a shared neuron that is important to both. The shared neuron is marked in both red and green.

Contrastive Ensemble Distillation (CED)
The TMs mechanism isolates different parameters for different tasks. This seems to be perfect for overcoming forgetting since the previous task parameters are fixed and cannot be updated by future tasks. However, since DIL setting does not have task id in testing, we cannot directly take the advantage of the TMs. To address this issue, we propose the CED objective to help distill all previous knowledge to the current task model so that we can simply use the last model as the final model without requiring the task id in testing.
Representation of Previous Tasks. Recall that we know which neurons/units are for which task i by reading {m (i) l }. For each previous task i of the current task t, we can compute its masked output of Adapter-BERT h (i) m (the layer before the classification head) by applying m (i) l to the Adapter-BERT. Ensemble Distillation Loss. We distill the knowledge of the ensemble of previous tasks into the single current task model. As we have a shared classification head for all tasks in DIL, which is exposed to forgetting, the distillation should be based on the output of the classification head. Specifically, given a previous task's Adapter-BERT output h (i) m , we compute the output of the classification head using h (i) m , which gives us the logit (unnormalized prediction) value z (i) m . We then distill the knowledge using z (i) m and the current task classification head output z (t) m based on contrastive loss, inspired by (Tian et al., 2020a), where N is the batch size and τ > 0 is an adjustable temperature parameter controlling the separation of classes. The index n is the anchor and the notation z m:2n are the logits of previous and current task models for the same input sample, a positive pair in contrastive learning. All the other possible pairs are negative pairs. Note that for each anchor i, there is 1 positive pair and 2N − 2 negative pairs. The denominator has a total of 2N − 1 terms (both the positives and negatives). Note that the previous task models are fixed and thus can serve as teacher networks. As we have i ≥ 1 previous tasks, hence i ≥ 1 teacher networks but only one current task student network. We adopt the contrastive framework by defining multiple pair-wise contrastive losses between z 3.3 Transferring Knowledge via Contrastive Knowledge Sharing (CKS) CKS aims to capture the shared knowledge among tasks and help the new task learn a better representation and better classifier. The intuition of CKS is as follows: Contrasive learning has the ability to capture the shared knowledge between different views (Tian et al., 2020b;van den Oord et al., 2018). This is achieved by seeking representation that are invariant cross similar views. If we can generate a view from previous tasks that is similar to the current task, the contrastive loss can capture the shared knowledge and learn a representation for knowledge transfer to the new task learning. Below, we first introduce how to construct such a view and use it in the CKS objective.

Task-based Self-Attention
Intuitively, the more similar the two tasks are, the more shared knowledge they have. To achieve our goal, we should combine all similar tasks as the shared knowledge view. In order to focus on the similar tasks, we propose to use task-based selfattention mechanism to attend to them. Inspired by (Zhang et al., 2018), given the concatenation of the output of Adapter-BERT for all previous and current tasks, h To compare the similarity between tasks i ≤ t and j ≤ t, we calculate similarity s ij via We then compute the attention score α j,i to indicate which similar tasks (similar to the current task t) should be attended to based on the current task data, The attention score is applied to each task in h (≤t) m to get the attention output o j using weighted sum: where v(·) and q(·) are two functions for transforming feature spaces: Lastly, we multiply the output of the attention layer by a scale parameter and add back to the input feature h where γ is a learnable scalar and it is initialized to 0. This allows the model to first learn on the current task and then gradually learn to assign more weights to other tasks.

Knowledge Sharing Loss
The output of the task-based self-attention provides us the knowledge sharing view h (≤t) CKS . Along with the output of Adapter-BERT for the current task h (t) m , we can easily perform contrastive learning between these two views. Note that h (≤t) CKS is computed based on the current task data and their corresponding class labels, so we give the two views have the same label and thus we can integrate the label information in our CKS loss, where N is the batch size and N yn is the number of examples in the batch that have the label y n . h (≤t)

CKS is the first view while h (t)
m is the second view. The shared knowledge between them represents the shared knowledge between previous and current tasks. Different from the CED loss, the CKS loss leverages the class information and thus can have multiple positive pairs decided by whether two samples share the same class label.

Contrastive Supervised Learning of the Current Task (CSC)
We further improve the performance of the current task by adopting the supervised contrastive loss (Khosla et al., 2020) on the current task h

Final Loss
The final loss is the weighted average of the supervised cross entropy (CE) loss, CSC loss, and the proposed CED and CKS losses:

Experiments
This section evaluates the proposed CLASSIC system and compares it with both non-continual learning and continual learning baselines.

Experiment Datasets
We use 19 ASC datasets to produce sequences of 19 tasks. Each dataset is a set of aspect and sentiment annotated review sentences from reviews of a particular product and represents a task. The datasets are from 4 sources: (1) HL5Domains (Hu and Liu, 2004): review sentences of 5 products; (2) Liu3Domains (Liu et al., 2015): review sentences of 3 products; (3) Ding9Domains (Ding et al., 2008): review sentences of 9 products; and (4) SemEval14: review sentences of 2 products -SemEval 2014 Task 4 for laptop and restaurant. To be consistent with the existing research (Tang et al., 2016), sentences with both positive and negative sentiments about an aspect are not used. Statistics of the 19 datasets are given in Table 1.

Compared Baselines
We employ 46 baselines, which include both noncontinual learning and continual learning methods.
Since little work has been done in DIL, we adapt the recent TIL systems to DIL by merging classification heads to form DIL systems. Non-Continual Learning Baselines: Each of these baselines builds a separate model for each task independently, which we call a ONE variant. It thus has no knowledge transfer or CF. There are 8 ONE variants. Four are created using (1) BERT with fine-tuning, (2) BERT (Frozen) without finetuning (3) Adapter- BERT (Houlsby et al., 2019) and (4) W2V (word2vec embeddings trained with the Amazon review data in (Xu et al., 2018) using FastText (Grave et al., 2018). Adding CSC (Contrastive Supervised learning of the Current task) creates another 4 variants. We adopt the ASC network in (Xue and Li, 2018), taking aspect term and review sentence as input for BERT variants. For W2V variants, we use their concatenation.
Continual Learning (CL) Baselines. The CL setting has 38 baselines in 5 categories. The first category uses a naive CL (NCL) approach. It simply uses a network to learn all tasks with no mechanism to deal with CF or knowledge transfer. Like ONE, we have 8 NCL variants. The second category has 11 baselines created using recent CL methods KAN (Ke et al., 2020b), SRK (Lv et al., 2019), HAT (Serrà et al., 2018), UCL (Ahn et al., 2019), EWC (Kirkpatrick et al., 2016), OWM (Zeng et al., 2019) and DER++ (Buzzega et al., 2020). KAN and SRK are for document sentiment classification. We use the concatenation of the aspect and the sentence as input. HAT, UCL, EWC, OWM and DER++ were originally designed for image classification. We replace their original image classification networks with CNN for text classification (Kim, 2014). HAT is one of the best TIL methods with almost no forgetting. UCL is a recent TIL method. EWC is a popular CIL method, which was adapted for TIL in (Serrà et al., 2018). They are converted to DIL versions by merging their classification heads. OWM (Zeng et al., 2019) is a CIL method, which we also adapt to a DIL method like EWC. DER++ and SRK can work in the DIL setting. HAT and KAN require task id as an input in testing and cannot function in the DIL setting. We create two variants of HAT (and KAN): using the last model in testing as CLASSIC does or detecting task id using the entropy method ent in (von Oswald et al., 2020). This category uses BERT (Frozen) as the base. The third category has 7 baselines using Adapter-BERT. KAN and SRK cannot be adapted to use adapters. The fourth category uses W2V, which gives another 11 baselines. The final category has one baseline LAMOL (Sun et al., 2020), which uses the GPT-2 model.
Evaluation Protocol: We follow the standard CL evaluation method in (Lange et al., 2019). We first present CLASSIC a sequence of ASC tasks for it to learn. Once a task is learned, its training data is discarded. After all tasks are learned, we test using the test data of all tasks without giving task ids.

Hyperparameters
Unless otherwise stated, the adapter uses 2 layers of fully connected network with dimensions 2000. The task id embeddings have 2000 dimensions. A fully connected layer with softmax output is used as the classification head in the last layer of BERT. We use 400 for s max in Eq. 5, dropout of 0.5 between fully connected layers. The temperature τ in each contrastive objective is set to 1 (see Supplementary for parameter tuning). The weight of each objective in Eq. 14 is set to 1. We use the embedding of [CLS] as the output of Adapter-BERT. For CKS and CSC, we use l 2 normalization on the output of Adapter-BERT before computing the contrastive loss. The training of BERT, Adapter-BERT and CLASSIC follow that of (Xu et al., 2019). We adopt BERT BASE (uncased). The max length of the sum of sentence and aspect is 128. We use Adam optimizer and set the learning rate to 3e-5. For the SemEval datasets, 10 epochs are used and for all other datasets, 30 epochs are used based on results from validation data. All runs use the batch size 32. For CL baselines, we train all models with the learning rate of 0.05, early-stop training when there is no improvement in the validation loss for 5 epochs and set the batch size to 64. We use the code provided by their authors and adopt their original parameters (for EWC, we adopt the variant implemented by (Serrà et al., 2018)).

Results and Analysis
As the order of the 19 tasks can influence the final results, we randomly select and run 5 task sequences and report their average results in Table 2. We compute both accuracy and Macro-F1, where Macro-F1 is the main metric as the imbalanced classes introduce biases in accuracy.  Overall, Table 2 shows that CLASSIC outperforms all baselines markedly.
(2). Comparing ONE variants and NCL variants, we see that under W2V, NCL variants are much bet-ter than ONE variants. This indicates ASC tasks are similar and have shared knowledge. Catastrophic forgetting (CF) is not a major issue for W2V.
However, BERT NCL (fine-tuning) is much worse than BERT ONE and Adapter-BERT NCL (adapter-tuning) as BERT fine-tuning learns highly task specific knowledge (Merchant et al., 2020). While this is desirable for ONE, it is bad for NCL because task specific knowledge is hard to share across tasks, which causes forgetting (CF). The +csc options are poor for BERT ONE and NCL.
(3). Various continual learning (CL) baselines with BERT (Frozen) are also markedly weaker than CLASSIC. Baselines that can use Adapter-BERT are also much poorer than CLASSIC. Note that SRK and KAN cannot work with Adapter-BERT.
(5). Since both KAN and HAT need task id in testing and the DIL setting does not provide task id, they have no results. But we use the last model (+last) or use an existing entropy-based method (+ent) (von Oswald et al., 2020) to automatically identify the task id for each test instance. These variants are also markedly weaker than CLASSIC.
(6). LAMOL is based on GPT-2 and its performance is weaker than CLASSIC too.
Effectiveness of Knowledge Transfer. The results under CLASSIC(forward) in Table 2 are the average results computed using the accuracy/MF1 of each task when it was first learned. The results under CLASSIC are the final average results after all tasks are learned, including backward transfer. By comparing ONE variants and CLAS-SIC(forward), we can see whether forward transfer is effective. By comparing CLASSIC(forward) and CLASSIC, we can see whether the backward transfer can improve further. We see both forward and backward transfers are effective.

Ablation Experiments
The results of ablation experiments are given in Table 3. "-CKS", "-CSC" and "-CED" mean without constrastive knowledge sharing, contrastive supervised learning on the current task and contrastive ensemble distillation, respectively. Table 3 clearly shows that each of the components is effective and they work in concert to produce the best final result.

Conclusion
This paper studied domain incremental learning (DIL) of a sequence of ASC tasks without knowing  the task ids in testing. Our method CLASSIC uses adapters to exploit BERT and to deal with BERT CF in fine-tuning, and the proposed contrastive continual learning to transfer knowledge across tasks and to distill knowledge from previous tasks to the current task so that the last model can be used for all tasks in testing and no task id is needed. Our experimental results show that CLASSIC outperforms the state-of-the-art baselines.
Finally, we believe that the idea of CLASSIC is also applicable to some other NLP tasks. For example, in named entity extraction, we can build a better model to extract the same types of entities from text of different domains. Each domain works on the same task but no data sharing (the data may be from different clients with privacy concerns). Since this is an extraction task, the backbone model needs to be switched to an extraction model.

A Appendix
A.1 Detailed Datasets Statistics Table 1 in the main paper has already showed the number of examples in each dataset. Here we provide additional details about aspects and sentiments. The detailed statistics of the 19 datasets or tasks are given in Table 4 here. Table 5 reports the standard deviation of CLASSIC and the considered baselines over 5 runs with random seeds using one random task sequence. We can see the results are stable. Table 6 shows that our proposed method CLASSIC can be adapted for the Task Incremental Learning (TIL) setting of continual learning, which requires the task ids during testing, but our DIL setting does not require. We can observe that CLASSIC in the TIL setting also outperforms existing TIL baselines. Table 7 reports the number of parameters (regardless of trainable or non-trainable) and the training execution times of different models. The execution time is computed as the average training time per task. Our experiments were run on GeForce GTX 2080 Ti with 11G GPU memory.

A.5 Hyper-parameters and Validation Results
Sec. 4.3 in the main paper reported the best hyperparameters. Regarding hyper-parameter search, we performed grid search on the temperature parameter τ within {0.03, 0.5, 0.8, 1}, batch size within {32, 64, 128}, and s max within {140, 200, 300, 400}. We also experimented with whether to apply the l 2 normalization before contrast and whether to use the logits or the second last layer to do the contrast. We did not save the validation results but the reported test results in the paper are given by the parameters with the best validation performance. We encourage the reviewers and interested readers to play with the submitted code. Table 4: Statistics of the datasets. S.: number of sentences; A: number of aspects; P., N., and Ne.: number of positive, negative and neutral aspect polarities respectively. Note that SemEval14 has 3 classes of polarities while the others have only 2 classes (positive and negative) because in these datasets, those sentences without sentiment (neutral) are not annotated with aspects. Thus, we cannot use them for aspect sentiment classification (ASC).