Serial Contrastive Knowledge Distillation for Continual Few-shot Relation Extraction

Continual few-shot relation extraction (RE) aims to continuously train a model for new relations with few labeled training data, of which the major challenges are the catastrophic forgetting of old relations and the overfitting caused by data sparsity. In this paper, we propose a new model, namely SCKD, to accomplish the continual few-shot RE task. Specifically, we design serial knowledge distillation to preserve the prior knowledge from previous models and conduct contrastive learning with pseudo samples to keep the representations of samples in different relations sufficiently distinguishable. Our experiments on two benchmark datasets validate the effectiveness of SCKD for continual few-shot RE and its superiority in knowledge transfer and memory utilization over state-of-the-art models.


Introduction
Relation extraction (RE) aims to recognize the semantic relations between entities in texts, and it is widely applied in many downstream tasks such as language understanding and knowledge graph construction. Conventional studies (Zeng et al., 2014; Heist and Paulheim, 2017; Zhang et al., 2018) mainly assume a fixed pre-defined relation set and train on a fixed dataset. However, they cannot work well with the new relations that keep emerging in some real-world RE scenarios. Continual RE (Wang et al., 2019; Han et al., 2020; Wu et al., 2021) was proposed as a new paradigm to handle this situation by applying the idea of continual learning (Parisi et al., 2019) to the field of RE.
Compared with conventional RE, continual RE is more challenging. It requires the model to learn emerging relations while maintaining stable and accurate classification of old relations, i.e., it must overcome the so-called catastrophic forgetting problem (Thrun and Mitchell, 1995; French, 1999), in which performance on previously learned relations degrades sharply after training on new ones. Several studies (Wang et al., 2019; Sun et al., 2020) have shown that memory-based models are more promising for NLP tasks, and a number of memory-based continual RE models (Cui et al., 2021; Zhao et al., 2022; Hu et al., 2022; Zhang et al., 2022) have made significant progress.
In real life, the shortage of labeled training data for relations is an unavoidable problem, and it is especially severe for emerging relations. Therefore, the continual few-shot RE paradigm (Qin and Joty, 2022) was proposed to simulate real human learning scenarios, where new knowledge can be acquired from a small number of new samples. As illustrated in Figure 1, continual few-shot RE expects the model to continuously learn new relations, with abundant training data only for the first task and sparse training data for all subsequent tasks. Thus, the model needs to identify the growing set of relations well with few labeled samples while retaining the knowledge of old relations without re-training from scratch. As relations grow, confusion among relation representations leads to catastrophic forgetting. In continual few-shot RE, catastrophic forgetting becomes more severe because the few samples of a new relation may not be representative of it, which greatly increases the possibility of confusion between relation representations. Since the emerging relations are few-shot, overfitting becomes another key challenge. Overfitting to the samples of few-shot tasks aggravates the model's forgetting of prior knowledge as well. Existing few-shot learning works (Fan et al., 2019; Gao et al., 2019a; Obamuyide and Vlachos, 2019; Geng et al., 2020) offer useful references for continual few-shot RE models to ensure good generalization.
Inspired by knowledge distillation (Hinton et al., 2015), which transfers knowledge well, and contrastive learning (Wu et al., 2018), which constrains representations explicitly, we propose SCKD, a model built with serial contrastive knowledge distillation for continual few-shot RE. With it, we tackle the two key challenges above. First, how do we alleviate catastrophic forgetting? SCKD follows the memory-based methods for continual learning and preserves a few typical samples from previous tasks. Furthermore, we present serial knowledge distillation to preserve the prior knowledge from previous models and conduct contrastive learning to keep the representations of samples in different relations sufficiently distinguishable. Second, how do we mitigate the negative impact of overfitting caused by sparse samples? We leverage bidirectional data augmentation between memory and the current task to obtain more samples for few-shot relations. The pseudo samples generated in serial contrastive knowledge distillation help prevent overfitting as well.
In summary, our main contributions are twofold:
• We propose SCKD, a novel model built with serial contrastive knowledge distillation for the continual few-shot RE task. With the proposed serial knowledge distillation and contrastive learning with pseudo samples, SCKD can take full advantage of memory and effectively alleviate catastrophic forgetting and overfitting even with considerably few memorized samples.
• We perform extensive experiments on two benchmark datasets, FewRel (Han et al., 2018) and TACRED (Zhang et al., 2017). The results demonstrate the superiority of SCKD over the state-of-the-art continual (few-shot) RE models. Furthermore, the proposed data augmentation, serial knowledge distillation, and contrastive learning all contribute to the performance improvement.

Related Work
In this section, we review related work on continual RE and few-shot RE.
Continual RE. The goal of continual learning is to accomplish new tasks sequentially without catastrophically forgetting the knowledge acquired from previous tasks. For continual RE, RP-CRE (Cui et al., 2021) refines sample embeddings for prediction with relation prototypes generated from memory. However, its relation prototype calculation is sensitive to the chosen typical samples. CRL (Zhao et al., 2022) instead maintains consistent relation representations during memory replay through contrastive learning and knowledge distillation.

Few-shot RE. Few-shot learning aims to leverage only a few novel samples to adapt a model for solving tasks. For few-shot RE, the goal is to enable the model to quickly learn the characteristics of relations from very few samples, so as to classify these relations accurately. At present, there are two main lines of work: (1) The metric learning methods (Fan et al., 2019; Gao et al., 2019a) use various metric functions (e.g., the Euclidean or cosine distance) learned from prior knowledge to map the input into a subspace in which similar and dissimilar sample pairs can be easily distinguished to assign relation labels.
(2) The meta-learning methods (Obamuyide and Vlachos, 2019; Geng et al., 2020) learn general relation classification experience in the meta-training stage and leverage it to quickly converge on specific relation extraction in the meta-testing stage. In this paper, our problem setting differs from the above few-shot RE works, as we expect the model to continuously learn new few-shot relations instead of conducting few-shot relation learning just once. Furthermore, these few-shot RE works do not have the capacity for continual learning.

Task Definition
The objective of RE is to identify the relations between entity mentions in sentences. Continual RE aims to accomplish a sequence of J RE tasks {T_1, T_2, ..., T_J}, where each task T_j has its own dataset D_j and relation set R_j. The relation sets of different tasks are disjoint. Once T_j is finished, D_j is no longer available for future learning, and the model is assessed on all previous tasks {T_1, ..., T_j} for identifying R̃_j = ∪_{i=1}^{j} R_i. Also, the trained model serves as the base model for the subsequent task T_{j+1}.
In real-world scenarios, labeled training data for new tasks are often limited. Therefore, we define the continual few-shot RE task in this paper, where only the first task T_1 has abundant data for model training and all subsequent tasks are few-shot.
Let N be the number of relations in each few-shot task and K be the number of samples per relation; such a task is called N-way-K-shot. A continual few-shot RE model is expected to perform well on all historical few-shot and non-few-shot tasks.
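To make the protocol concrete, here is a minimal sketch of the task stream. The names `train_fn` and `eval_fn` are hypothetical callbacks, and the toy data layout (a dict from relation to samples) is our own simplification; none of these names come from the paper.

```python
import random

def run_continual_stream(tasks, train_fn, eval_fn, K=5):
    """Sketch of the continual few-shot RE protocol: the first task keeps
    its abundant data, every later task is reduced to K samples per
    relation, and after task T_j the model is evaluated on the union of
    all relations seen so far."""
    seen_relations, accuracies = [], []
    for j, task in enumerate(tasks):
        data = task["train"]  # dict: relation -> list of samples
        if j > 0:  # subsequent tasks are N-way-K-shot
            data = {r: random.sample(xs, min(K, len(xs)))
                    for r, xs in data.items()}
        train_fn(data)
        seen_relations.extend(task["relations"])
        # assess on every task observed so far, over all seen relations
        accuracies.append(eval_fn(tasks[: j + 1], seen_relations))
    return accuracies
```

The returned list corresponds to ACC_1, ..., ACC_J measured after each task.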

Our Framework
Algorithm 1 shows the end-to-end training for task T_j, given the previously trained model Φ_{j-1}. Following the memory-based methods for continual learning (Lopez-Paz and Ranzato, 2017; Chaudhry et al., 2019), we use a memory M̃_{j-1} to preserve a few samples from all previous tasks {T_1, ..., T_{j-1}}.
1. Initialization (Line 1). The current model Φ_j inherits the parameters of Φ_{j-1}, except for Φ_1, which is randomly initialized. We adapt Φ_j on D_j to learn the knowledge of the new relations in T_j.
2. Prototype generation (Lines 2-6). Inspired by (Han et al., 2020; Cui et al., 2021), we apply the k-means algorithm to select L typical samples from D_j for every relation r ∈ R_j, which constitute a memory M_r. The memory for the current task is M_j = ∪_{r∈R_j} M_r, and the overall memory for all relations observed so far is M̃_j = M̃_{j-1} ∪ M_j. Then, we generate a prototype p_r for each r ∈ R̃_j.
3. Data augmentation (Line 7). To cope with the scarcity of samples, we conduct bidirectional data augmentation between D_j and M̃_j. By measuring the similarity between entities in samples, we generate an augmented dataset D*_j and an augmented memory M*_j through mutual replacement of similar entities.

4. Serial contrastive knowledge distillation (Lines 8-10). We construct a set of pseudo samples based on the prototype set. Then, we carry out serial contrastive knowledge distillation with the pseudo samples on D*_j and on M*_j, respectively, making the sample representations of different relations distinguishable and preserving the prior knowledge for identifying the relations in previous tasks.
We detail the procedure in the subsections below.

Initialization for New Task
To adapt the model to the new task T_j, we first perform a simple multi-class classification task on dataset D_j.
Specifically, for a sample x in T_j, we use the special tokens [E1] and [E2] to mark the two entities in x, respectively. Then, we obtain the representations of the special tokens with the BERT encoder (Devlin et al., 2019). Next, the feature of sample x, denoted by f_x, is defined as the concatenation of the token representations of [E1] and [E2]. We obtain the hidden representation h_x of x as

h_x = LN(W f_x + b),

where W ∈ R^{d×2h} and b ∈ R^d are two trainable parameters, d is the dimension of the hidden layers, h is the dimension of the BERT hidden representations, and LN(·) is the layer normalization operation. Finally, based on h_x, we use a linear softmax classifier to predict the relation label. The classification loss L_csf is defined as

L_csf = -∑_x ∑_{r∈R_j} y_{x,r} log P_{x,r},

where y_{x,r} ∈ {0, 1} indicates whether x's true label is r, and P_{x,r} denotes the r-th entry of x's probability distribution calculated by the classifier.
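As a rough numpy sketch of this step, with toy dimensions and randomly initialized weights standing in for BERT and the trained parameters (all of which are placeholders):

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # layer normalization LN(.) over the last dimension
    mu = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def encode_and_classify(e1_repr, e2_repr, W, b, W_cls, b_cls):
    """f_x = concat([E1] repr, [E2] repr); h_x = LN(W f_x + b);
    a linear softmax classifier turns h_x into relation probabilities."""
    f_x = np.concatenate([e1_repr, e2_repr], axis=-1)  # (batch, 2h)
    h_x = layer_norm(f_x @ W.T + b)                    # (batch, d)
    logits = h_x @ W_cls.T + b_cls                     # (batch, |R_j|)
    exp = np.exp(logits - logits.max(-1, keepdims=True))
    probs = exp / exp.sum(-1, keepdims=True)
    return h_x, probs

def classification_loss(probs, labels):
    # L_csf: cross-entropy over the true relation labels (averaged here)
    return float(-np.log(probs[np.arange(len(labels)), labels]).mean())

# toy dimensions standing in for BERT (h = 8) and the hidden layer (d = 6)
rng = np.random.default_rng(0)
h_dim, d, n_rel = 8, 6, 4
W, b = rng.normal(size=(d, 2 * h_dim)), np.zeros(d)
W_cls, b_cls = rng.normal(size=(n_rel, d)), np.zeros(n_rel)
e1, e2 = rng.normal(size=(3, h_dim)), rng.normal(size=(3, h_dim))
h_x, probs = encode_and_classify(e1, e2, W, b, W_cls, b_cls)
loss = classification_loss(probs, np.array([0, 2, 1]))
```

In the actual model the loss would be summed over D_j and minimized jointly with the distillation losses introduced later.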

Prototype Generation
After the initial adaptation above, we pick L typical samples for each relation r ∈ R_j to form the memory M_r. We run the k-means algorithm on the hidden representations of r's samples, where the number of clusters equals the number of samples to be stored for representing r. Then, in each cluster, the sample closest to the centroid is chosen as one typical sample.
To obtain the prototype p_r for r, we average the hidden representations of the L typical samples in M_r:

p_r = (1/L) ∑_{x∈M_r} h_x.

The prototype set P̃_j stores the prototypes of all relations in R̃_j, i.e., P̃_j = ∪_{r∈R̃_j} {p_r}.
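A minimal numpy sketch of typical-sample selection and prototype averaging, with a hand-rolled Lloyd's iteration standing in for a full k-means implementation (the iteration count and seeding are arbitrary choices of ours):

```python
import numpy as np

def select_typical_and_prototype(reps, L):
    """Pick L typical samples for one relation via k-means over its hidden
    representations (L clusters, real sample closest to each centroid),
    then average the memorized samples into the prototype p_r."""
    rng = np.random.default_rng(0)
    centroids = reps[rng.choice(len(reps), size=L, replace=False)]
    for _ in range(10):  # a few Lloyd iterations
        assign = np.argmin(
            ((reps[:, None, :] - centroids[None]) ** 2).sum(-1), axis=1)
        for c in range(L):
            if (assign == c).any():
                centroids[c] = reps[assign == c].mean(0)
    # typical sample = the real sample nearest each centroid
    typical_idx = [int(np.argmin(((reps - centroids[c]) ** 2).sum(-1)))
                   for c in range(L)]
    prototype = reps[typical_idx].mean(0)  # p_r = (1/L) * sum of h_x in M_r
    return typical_idx, prototype
```

Storing indices of real samples (rather than centroids) matches the paper's choice of memorizing actual training samples.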

Bidirectional Data Augmentation
For a sample x in D_j or M̃_j, the token representations of [E1] and [E2] generated by BERT are used as the representations of the corresponding entities. We collect the entity representations from all samples and calculate the cosine similarity between the representations of any two different entities. Once the similarity exceeds a threshold τ, we replace each of the two entities in the original samples with the other entity. Our intuition is that, if one entity in a sentence is replaced by a close entity with everything else unchanged, the relation expressed by the sentence is unlikely to change much. For example, "The route crosses the Minnesota River at the Cedar Avenue Bridge." and "The route crosses the River MNR at the Cedar Avenue Bridge." express the same relation "crosses". We assign the new samples the same relation labels as their original samples and store them together as the augmented dataset D*_j and the augmented memory M*_j.
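A toy sketch of the replacement rule, assuming single-token entity mentions and precomputed entity representations; both are simplifications of ours, not details from the paper:

```python
import numpy as np

def augment_by_entity_swap(samples, entity_reprs, tau=0.95):
    """Bidirectional augmentation sketch: when two distinct entities have
    cosine similarity above tau, swap one for the other in the original
    sentence and keep the relation label. `samples` is a list of
    (tokens, head_entity, tail_entity, relation) tuples."""
    names = list(entity_reprs)
    vecs = np.stack([np.asarray(entity_reprs[n], float) for n in names])
    vecs = vecs / np.linalg.norm(vecs, axis=1, keepdims=True)
    sim = vecs @ vecs.T  # pairwise cosine similarity
    augmented = []
    for i, a in enumerate(names):
        for j, b in enumerate(names):
            if i == j or sim[i, j] <= tau:
                continue
            for tokens, head, tail, rel in samples:
                if a in (head, tail):  # replace entity a with entity b
                    new_tokens = [b if t == a else t for t in tokens]
                    new_head = b if head == a else head
                    new_tail = b if tail == a else tail
                    augmented.append((new_tokens, new_head, new_tail, rel))
    return augmented
```

In practice multi-token mentions would need span-level replacement, but the thresholded cosine test is the core of the method.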

Serial Contrastive Knowledge Distillation
Knowledge distillation (Hinton et al., 2015; Cao et al., 2020) has demonstrated its effectiveness in transferring knowledge. In this paper, we propose a serial contrastive knowledge distillation method that leverages the knowledge from the previous RE model to guide the training of the current model. The procedure is depicted in Figure 2. We detail it below.
Feature distillation. In this step, we expect the encoder of the current model to extract features similar to those of the previous model. For a sample x, let f_x^{j-1} and f_x^j be x's features extracted by the previous model Φ_{j-1} and the current model Φ_j, respectively. We propose a feature distillation loss to keep the extracted features unbiased towards new relations:

L_fd = ∑_x ||f_x^{j-1} - f_x^j||_2^2.

Pseudo sample generation. We construct pseudo samples for all the observed relations, which are used in the hidden contrastive distillation step below. Specifically, we assume that the sample representations of each relation follow a Gaussian distribution whose mean is the corresponding prototype. The construction of pseudo samples is based on the prototype set P̃_j, and one pseudo sample for r can be constructed as

z = p_r + η · δ_r,

where η ∼ N(0, 1) is standard Gaussian noise and δ_r is the square root of the diagonal covariance computed from the hidden representations of all r's samples when r first appears in the relation set of a task.
The diagonal covariance consists of the variance of each dimension, which describes how the sample representations of that relation vary along each dimension. We multiply the Gaussian noise by the square root of the diagonal covariance and add the result to the prototype representation to generate a pseudo sample. In this way, the generated samples match the real samples of the relation more closely than random noise would. We repeat this operation n times for each relation in {T_1, ..., T_j} and store the constructed pseudo samples in the pseudo sample set S̃_j.
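The sampling rule can be sketched as follows; the relation statistics below are synthetic stand-ins for the real hidden representations:

```python
import numpy as np

def generate_pseudo_samples(prototype, diag_cov, n, seed=0):
    """Draw n pseudo hidden representations for one relation:
    z = p_r + eta * delta_r, with eta ~ N(0, 1) per dimension and
    delta_r the square root of the diagonal covariance of the
    relation's real sample representations."""
    rng = np.random.default_rng(seed)
    delta = np.sqrt(diag_cov)  # delta_r
    eta = rng.standard_normal((n, len(prototype)))
    return prototype + eta * delta

# synthetic statistics estimated when the relation first appears
reps = np.random.default_rng(1).normal(loc=2.0, scale=0.5, size=(100, 16))
p_r, cov = reps.mean(0), reps.var(0)
pseudo = generate_pseudo_samples(p_r, cov, n=500)
```

By construction the pseudo samples are centered on the prototype and spread according to the per-dimension variance of the real samples.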
Hidden contrastive distillation. In this step, we expect the current model to produce hidden representations similar to those of the previous model. We also want to keep the hidden representations of samples from different relations distinguishable.
First, we consider the distillation between sample hidden representations. We feed a sample x's feature f_x^j into the dropout layers of the previous model Φ_{j-1} and the current model Φ_j to obtain the hidden representations h_x^{j-1} and h_x^j, respectively. Then, we formulate the representation distillation loss as

L_rd = ∑_x ||h_x^{j-1} - h_x^j||_2^2.

Moreover, based on the previously constructed pseudo samples and the real samples from the training data, we conduct contrastive learning to make the hidden representations of samples from different relations as distinct as possible, which enhances the knowledge distillation. To achieve this, we mine hard positives and hard negatives with the previous representations while contrasting them with the current representations, which ensures that the current model obtains representations similar to those of the previous model. We put forward a distillation triplet loss:

L_dtr = ∑_x max(||h_x^j - z_max^+||_2 - ||h_x^j - z_min^-||_2 + m, 0),

where m is a margin, and z_max^+ and z_min^- are selected through h_x^{j-1}: z_max^+ is the representation farthest from h_x^{j-1} among all sample representations belonging to the same relation as x, and z_min^- is the representation nearest to h_x^{j-1} among all sample representations belonging to different relations from x.
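The quantities above can be sketched in numpy. The exact forms (squared L2 distances, and a triplet loss with a margin `m`) are our reading of the text rather than a verbatim reproduction of the paper's equations:

```python
import numpy as np

def l2(a, b):
    # Euclidean distance along the last axis
    return np.linalg.norm(a - b, axis=-1)

def feature_distillation_loss(f_prev, f_cur):
    # L_fd: keep the current encoder's features close to the previous one's
    return (l2(f_prev, f_cur) ** 2).mean()

def representation_distillation_loss(h_prev, h_cur):
    # L_rd: the same idea applied to the hidden representations
    return (l2(h_prev, h_cur) ** 2).mean()

def distillation_triplet_loss(h_prev, h_cur, pool, labels, x_labels,
                              margin=1.0):
    """L_dtr sketch: mine the hard positive z+_max (farthest same-relation
    representation from h_prev) and hard negative z-_min (nearest
    different-relation representation from h_prev), then contrast them
    against the *current* representation h_cur."""
    losses = []
    for hp, hc, y in zip(h_prev, h_cur, x_labels):
        pos = pool[labels == y]
        neg = pool[labels != y]
        z_pos = pos[np.argmax(l2(pos, hp))]  # hard positive via h_prev
        z_neg = neg[np.argmin(l2(neg, hp))]  # hard negative via h_prev
        losses.append(max(0.0, l2(hc, z_pos) - l2(hc, z_neg) + margin))
    return float(np.mean(losses))
```

Here `pool` would contain both real and pseudo sample representations, so every observed relation contributes positives and negatives even in the few-shot tasks.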
Overall, the loss function for hidden contrastive distillation is defined as

L_hd = L_rd + L_dtr.

Prediction distillation. In this step, we expect the classifier of the current model to predict probability distributions over the previous relation set similar to those of the previous model's classifier.
For a sample x's hidden representation h_x^j, let o_x^{j-1} and o_x^j denote the logits output by the previous and current classifiers over the previous relation set R̃_{j-1}. We propose a prediction distillation loss

L_pd = -∑_x ∑_{r∈R̃_{j-1}} softmax(o_{x,r}^{j-1} / T) log softmax(o_{x,r}^j / T),

where T is the temperature scalar. This loss encourages the predictions of the current model on previous relations to match the soft labels produced by the previous model. The total distillation loss combines the three losses above:

L_kd = α L_fd + β L_hd + γ L_pd,

where α, β, and γ are adjustment coefficients.
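A numpy sketch of the temperature-scaled prediction distillation and the combined distillation loss; the coefficient values below are placeholders, not the paper's tuned hyperparameters:

```python
import numpy as np

def softmax(z, T=1.0):
    # temperature-scaled softmax along the last axis
    z = z / T
    e = np.exp(z - z.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

def prediction_distillation_loss(logits_prev, logits_cur, T=2.0):
    """L_pd: cross-entropy between the temperature-softened predictions of
    the previous classifier (soft labels) and the current classifier,
    restricted to the previously observed relations."""
    soft_targets = softmax(logits_prev, T)
    log_probs = np.log(softmax(logits_cur, T))
    return float(-(soft_targets * log_probs).sum(-1).mean())

def total_distillation_loss(l_fd, l_hd, l_pd,
                            alpha=1.0, beta=1.0, gamma=1.0):
    # L_kd = alpha * L_fd + beta * L_hd + gamma * L_pd
    return alpha * l_fd + beta * l_hd + gamma * l_pd
```

When the two classifiers agree exactly, L_pd reduces to the entropy of the soft labels, its minimum for fixed targets; any disagreement increases it.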
We optimize the classification loss and the distillation loss with multi-task learning. Therefore, the final loss is

L = λ_1 L_csf + λ_2 L_kd,

where λ_1 and λ_2 are also adjustment coefficients.

Experiments
In this section, we assess the proposed SCKD and report our results. The datasets and source code of SCKD are accessible from GitHub.

Evaluation metrics. We measure average accuracy in our experiments. At task T_j, it is calculated as ACC_j = (1/j) ∑_{i=1}^{j} ACC_{j,i}, where ACC_{j,i} denotes the accuracy (i.e., the number of correctly labeled samples divided by the number of all samples) on the test set of task T_i after training the model on task T_j. We repeat each experiment six times with random seeds and report the means and standard deviations.
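Given the accuracy matrix ACC_{j,i}, the metric is a one-liner; `acc_matrix` here is a hypothetical lower-triangular list of lists, 0-indexed:

```python
def average_accuracy(acc_matrix, j):
    """ACC_j = (1/j) * sum_{i=1..j} ACC_{j,i}, where acc_matrix[j-1][i-1]
    holds the accuracy on the test set of task T_i after training on
    task T_j (rows grow by one entry per task)."""
    return sum(acc_matrix[j - 1][:j]) / j
```

For example, with accuracies [[0.9], [0.8, 0.7]], ACC_2 averages the two entries of the second row.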
Competing models. We compare SCKD against two baselines. The finetuning model trains the RE model only with the training data of the current task while inheriting the parameters of the model trained on the previous task; it serves as the lower bound. The joint-training model stores all samples from previous tasks in memory and uses all the memorized data to train a re-initialized model for the current task; it can be regarded as the upper bound.
We also compare SCKD with four recent open-source models for continual RE: RP-CRE (Cui et al., 2021), CRL (Zhao et al., 2022), CRECL (Hu et al., 2022), and ERDA (Qin and Joty, 2022). Since RP-CRE, CRL, and CRECL do not investigate the few-shot scenario, and ERDA reported its results under the "loose" evaluation, which picks no more than 10 negative labels from the observed labels, we re-run these models with their source code and report the new results. KIP-Framework (Zhang et al., 2022) has not released its source code, so we cannot re-run it for comparison.
Implementation details. We develop SCKD based on PyTorch 1.7.1 and Huggingface's Transformers 2.11.0 (Wolf et al., 2020). See Appendix A for the selected hyperparameter values.
For a fair comparison, we set the random seeds of the experiments identical to those in (Qin and Joty, 2022), so that the task sequences are exactly the same. We employ the "strict" evaluation method proposed in (Cui et al., 2021), which uses all observed relation labels as negative labels for evaluation. We stipulate that the memory can store only one sample per relation (L = 1) when running all models.

Main Results
Table 1 lists the comparison results for the 10-way-5-shot setting on the FewRel dataset and the 5-way-5-shot setting on the TACRED dataset. We have the following findings: (1) Our proposed SCKD performs significantly better than the competing models on all tasks. After learning all tasks, SCKD outperforms the second-best model, CRECL, by 2.99% on FewRel and 6.09% on TACRED. (2) Regarding the two baselines, the finetuning model suffers rapid drops in average accuracy due to severe overfitting and catastrophic forgetting. The joint-training model is not always the upper bound (e.g., on T_2 to T_5 of FewRel) due to the extremely imbalanced data distribution. Besides, after learning the final task of FewRel, SCKD achieves results close to the joint-training model with considerably few memorized samples.
(3) ERDA performs worst among the four competing models. This is because the extra training data drawn from the unlabeled Wikipedia corpus for data augmentation may contain errors and noise, which prevents the model from fitting the emerging relations well. (4) RP-CRE, CRL, and CRECL can effectively acquire knowledge from new relations without catastrophic forgetting of prior knowledge. However, their performance is all affected by the limited memory size, since they all need more memorized samples per relation to generate more representative relation prototypes.

Ablation Study
We conduct an ablation study to validate the effectiveness of each module. Specifically, for "w/o distillation", we disable the serial contrastive knowledge distillation module. For "w/o augmentation", we use the original (not augmented) dataset and memory. For "w/o both", we update the model via simple re-training on memory. From Table 2, we obtain several findings: (1) The average accuracy at each task drops when any module is disabled, showing their usefulness. (2) If we remove the serial contrastive knowledge distillation module, the results drop drastically, which shows that knowledge distillation and contrastive learning alleviate catastrophic forgetting and overfitting.
Furthermore, we conduct a fine-grained ablation study on serial contrastive knowledge distillation. We disable L_fd, L_rd, L_dtr, and L_pd in the model update to assess their influence. Table 3 shows the results, and we have several findings: (1) The results decline if we remove any of the losses, which demonstrates that each loss contributes to the overall performance.
(2) The drop caused by disabling the distillation triplet loss L_dtr is the most obvious, since without contrastive learning SCKD cannot keep the hidden representations of samples from different relations sufficiently distinguishable.

Comparison with Few-shot RE Models
We compare SCKD with the classic few-shot RE models provided in (Gao et al., 2019b). For a fair comparison, the few-shot RE models treat the training and test sets of previous tasks as the support and query sets for training, respectively. The training set of the current task serves as the support set for testing. We evaluate our model and the few-shot models using the accuracy on the test set of the current task.
Table 4 presents the results. We observe that SCKD is always superior to GNN (CNN), Proto (CNN), Proto (BERT), and BERT-PAIR, as it conducts contrastive learning with pseudo samples on the few-shot tasks, which maximizes the distance between the representations of different relations.

Knowledge Transfer Capability
Backward transfer (BWT) measures how well a continual learning model handles catastrophic forgetting. The BWT of accuracy after finishing all J tasks is defined as

BWT = (1/(J-1)) ∑_{i=1}^{J-1} (ACC_{J,i} - ACC_{i,i}).

Figure 3 shows the BWT of SCKD and the competing models. Due to the overwriting of learned knowledge, BWT is always negative. The performance drop of SCKD is the lowest, showing its effectiveness in alleviating catastrophic forgetting. See Appendix B.1 for the 10-shot results on FewRel and TACRED.
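Using the same lower-triangular accuracy matrix as for average accuracy, BWT can be computed as:

```python
def backward_transfer(acc_matrix):
    """BWT sketch in the style of Lopez-Paz and Ranzato (2017): after
    finishing all J tasks, average the accuracy change on each earlier
    task, (1/(J-1)) * sum_{i=1..J-1} (ACC_{J,i} - ACC_{i,i}).
    Negative values mean learned knowledge was partially overwritten."""
    J = len(acc_matrix)
    return sum(acc_matrix[J - 1][i] - acc_matrix[i][i]
               for i in range(J - 1)) / (J - 1)
```

For instance, if accuracy on T_1 falls from 0.9 (right after learning it) to 0.7 (after the final task), that task contributes -0.2 to the average.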

Sample Representation Discrimination
To investigate the effects on discriminating sample representations, we use t-SNE (van der Maaten and Hinton, 2008) to visualize the sample representations of six selected relations after the training of CRECL and SCKD.
From Figure 4, we see that, compared with CRECL, SCKD makes the representations of samples from different relations more distinguishable. For example, the two relations "spouse" and "follows", whose sample representations are close in CRECL, are clearly separated by SCKD, which shows that SCKD better maintains the differences between relations.

Influence of Memory Size
For memory-based continual RE models, memory size has an important impact on performance. Due to the limited samples in the few-shot scenario, the models store only one sample per relation (L = 1) in the previous experiments. Here, we conduct experiments on the 10-way-10-shot setting of FewRel with larger memory sizes (L = 2, 3). We choose this setting because it ensures that the memorized data occupy only a small fraction of all samples.
The comparison results are shown in Table 5, and we can see that: (1) As the memory size grows, all the models perform better, confirming that memory size is a key factor affecting continual learning.
(2) SCKD maintains the best performance across different memory sizes, which demonstrates its effectiveness in leveraging the memory for continual few-shot RE. See Appendix B.2 for the results on TACRED.

Conclusion
In this paper, we propose SCKD for continual few-shot RE. To alleviate catastrophic forgetting and overfitting, we design serial contrastive knowledge distillation, which sufficiently preserves the prior knowledge from previous models while keeping the representations of samples in different relations distinguishable. Our experiments on FewRel and TACRED validate the effectiveness of SCKD for continual few-shot RE and its superiority in knowledge transfer and memory utilization. For future work, we plan to investigate how to apply serial contrastive knowledge distillation to other classification-based continual few-shot learning tasks.

Limitations
The work presented here has a few limitations: (1) The proposed model belongs to the memory-based methods for continual learning, which require a memory that costs extra storage. In some extremely storage-sensitive cases, there may be restrictions on the usage of our model. (2) The proposed model has only been evaluated under the RE setting. Transferring it to other continual few-shot learning settings (e.g., event detection and even image classification) would make for a more comprehensive study.

A Environment and Hyperparameters
We run all the experiments on an x86 server with two Intel Xeon Gold 6326 CPUs, 512 GB memory, four NVIDIA RTX A6000 GPU cards, and Ubuntu 20.04 LTS. The training procedure is optimized with Adam. Following convention, we conduct a grid search to choose the hyperparameter values. Specifically, the search range for the dropout ratio is [0.2, 0.6] with a step size of 0.1. For the competing models ERDA (Qin and Joty, 2022), RP-CRE (Cui et al., 2021), CRL (Zhao et al., 2022), and CRECL (Hu et al., 2022), we assign the same memory size as ours and retain the other hyperparameter settings reported in their original papers.

B.1 Knowledge Transfer Capability
Figure 5 presents the 10-way-10-shot BWT results on FewRel and the 5-way-10-shot BWT results on TACRED. From this figure, as well as Figure 3 in the main text, we observe that: (1) SCKD again achieves the best BWT scores under this different shot setting. (2) Compared with the competing models, the performance of SCKD declines the least, which shows that SCKD alleviates catastrophic forgetting effectively.

B.2 Influence of Memory Size
To enrich the experimental results on the influence of memory size, we also conduct an experiment on TACRED with different memory sizes and show the results in Table 7. Based on these results and those listed in Table 5 of the main text, we find that SCKD maintains the best performance across different memory sizes not only on FewRel but also on TACRED. This demonstrates that our model is effective and versatile in making good use of memory.

B.3 Results with Different Shots
Table 8 shows the 10-way-10-shot results on the FewRel dataset and the 5-way-10-shot results on the TACRED dataset. Together with the experimental results on memory size, these results again show the superiority of SCKD and demonstrate that our model can make better use of memory.

Figure 3: Results of BWT on FewRel and TACRED.
Figure 4 legend (six selected relations): movement, spouse, located on terrain feature, main subject, follows, sport.
Algorithm 1 (excerpt):
4:  add M_j into M̃_j;
5:  R̃_j ← R̃_{j-1} ∪ R_j;
6:  generate prototype set P̃_j based on M̃_j;
7:  generate augmented dataset D*_j and memory M*_j by mutual replacement;
8:  generate pseudo sample set S̃_j based on P̃_j;
9:  update Φ_j by serial contrastive knowledge distillation on D*_j, S̃_j;  // re-train current task
10: update Φ_j by serial contrastive knowledge distillation on M*_j, S̃_j;  // memory replay

Table 2: Ablation study on modules.

Table 4: Result comparison with few-shot RE models.

Table 6: Hyperparameter settings in our model.