Refining Sample Embeddings with Relation Prototypes to Enhance Continual Relation Extraction

Continual learning has gained increasing attention in recent years, thanks to its biological interpretation and efficiency in many real-world applications. As a typical task of continual learning, continual relation extraction (CRE) aims to extract relations between entities from texts, where the samples of different relations are delivered to the model continuously. Previous works have shown that storing typical samples of old relations in memory helps the model keep a stable understanding of old relations and avoid forgetting them. However, most methods depend heavily on the memory size, as they simply replay the memorized samples in subsequent tasks. To fully utilize memorized samples, in this paper we employ relation prototypes to extract useful information about each relation. Specifically, the prototype embedding of a specific relation is computed from the memorized samples of this relation, which are collected by the K-means algorithm. The prototypes of all relations observed at the current learning stage are used to re-initialize a memory network that refines subsequent sample embeddings, which ensures the model's stable understanding of all observed relations when learning a new task. Compared with previous CRE models, our model utilizes the memory information sufficiently and efficiently, resulting in enhanced CRE performance. Our experiments show that the proposed model outperforms the state-of-the-art CRE models and has a great advantage in avoiding catastrophic forgetting. The code and datasets are released at https://github.com/fd2014cl/RP-CRE.


Introduction
As one of the most important tasks in information extraction (IE), relation extraction (RE) has been widely applied in many downstream tasks, such as knowledge base construction and completion (Riedel et al., 2013). The goal of RE is to recognize, for an entity pair in text, a relation pre-defined in knowledge graphs (KGs). For example, given the entity pair [Christopher Nolan, Interstellar] in the sentence "Interstellar is an epic science fiction film directed by Christopher Nolan", the relation director-of should be recognized by an RE model.
Conventional RE models (Zeng et al., 2014; Zhou et al., 2016; Zhang et al., 2018a) always assume a fixed pre-defined set of relations and perform once-and-for-all training on a fixed dataset. Therefore, these models cannot handle well the learning of new relations, which often emerge in many realistic applications given the continuous and iterative nature of our world (Hadsell et al., 2020). To adapt to such situations, the paradigm of continual relation extraction (CRE) has been proposed (Wang et al., 2019; Han et al., 2020; Wu et al., 2021). Compared with conventional RE, CRE focuses more on helping a model keep a stable understanding of old relations while learning emerging relations, which can be precisely modeled by continual learning.
Continual learning (or lifelong learning) systems are defined as adaptive algorithms capable of learning from a continuous stream of information (Parisi et al., 2019), where the information becomes progressively available over time and the number of learning tasks is not pre-defined. Continual learning remains a long-standing challenge for machine learning and deep learning (Hassabis et al., 2017; Thrun and Mitchell, 1995), as its main obstacle is the tendency of models to forget existing knowledge when learning from new observations (French, 1999), which is known as catastrophic forgetting. Recent works try to address catastrophic forgetting in three ways: consolidation-based methods (Kirkpatrick et al., 2017), dynamic architecture methods (Fernando et al., 2017), and memory-based methods (Lopez-Paz and Ranzato, 2017; Aljundi et al., 2018; Chaudhry et al., 2018), among which memory-based methods have been proven promising for NLP tasks.
In recent years, some memory-based CRE models have made significant progress in overcoming catastrophic forgetting while learning new relations, such as EA-EMR (Wang et al., 2019), MLLRE (Obamuyide and Vlachos, 2019), CML (Wu et al., 2021) and EMAR (Han et al., 2020). Despite their effectiveness, some challenges remain in current CRE. One noticeable challenge is how to restore the sample embedding space disrupted by the learning of new tasks, given that RE models' performance is very sensitive to the quality of sample embeddings. Another challenge is that most existing CRE models have not fully exploited memorized samples. To enhance RE performance and avoid the overfitting caused by high replay frequency, the memorized samples in these models usually have the same magnitude as the original training samples (Wu et al., 2021), which is unrealistic in real-world tasks.
Inspired by prototypical networks (Snell et al., 2017) for few-shot classification, in this paper we employ relation prototypes to represent different relations, which helps the model understand the relations well. Furthermore, these prototypes are used to refine sample embeddings in CRE, a process we name prototypical refining. Specifically, the prototype of a specific relation is the average embedding of typical samples labeled with this relation, which are collected by K-means and memorized by our model for future use. Prototypical refining helps our model recover from the disruption of the embedding space and avoid catastrophic forgetting while learning new relations, thus enhancing our model's CRE performance. Another advantage of prototypical refining is the efficient utilization of memorized samples, which reduces our model's dependence on memory size.
Our contributions in this paper are summarized as follows: (1) We propose a novel CRE model which achieves enhanced performance through refining sample embeddings with relation prototypes and is effective in avoiding catastrophic forgetting.
(2) The paradigm we propose for refining sample embeddings takes full advantage of the typical samples stored in memory and reduces the model's dependence on memory size (the number of memorized samples).
(3) Our extensive experiments on two RE benchmark datasets justify our model's remarkable superiority over the state-of-the-art CRE models and its lower dependence on memory size.
Related Work

Conventional RE models can only extract a fixed set of pre-defined relations. Hence, continual relation learning, i.e., CRE, has been proposed to overcome this problem. Existing continual learning methods can be divided into three categories: (1) Regularization methods (Kirkpatrick et al., 2017; Zenke et al., 2017) alleviate catastrophic forgetting by imposing constraints on updating the neural weights important to previous tasks. (2) Dynamic architecture methods (Fernando et al., 2017) change architectural properties in response to new information by dynamically accommodating novel neural resources.
(3) Memory-based methods (Lopez-Paz and Ranzato, 2017; Aljundi et al., 2018; Chaudhry et al., 2018) remember a few examples from previous tasks and continually replay the memory as new tasks emerge. For CRE, memory-based methods have been proven the most promising (Wang et al., 2019; Han et al., 2020). In addition, to accurately represent relations with limited samples, the idea of prototypical networks has been introduced into RE (Gao et al., 2019; Ding et al., 2021).
Many memory networks have also been proposed to retain information over long periods, such as LSTM (Hochreiter and Schmidhuber, 1997) and memory-augmented neural networks (Graves et al., 2016; Santoro et al., 2016). Besides, a more recent memory module (Santoro et al., 2018), which employs multi-head attention to allow memory interaction, has demonstrated success in relational reasoning.

Methodology
In this section, we introduce our CRE model in detail. We first formalize the problem of CRE and the memory module used in our model.

Task Formalization
In general, a single relation extraction (RE) task is to identify (classify) the relation between two entities expressed in a sentence. Formally, the objective of CRE is to accomplish a sequence of K RE tasks $\{T_1, T_2, \ldots, T_K\}$, where the k-th task $T_k$ has its own training set $D_k$ and relation set $R_k$. In fact, each task $T_k$ is an independent multi-class classification task to identify the relations in $R_k$. A CRE model should perform well on extracting the relations of all K tasks after being trained with the samples of these tasks. In other words, the model should be capable of classifying the relation of a given entity pair into $\tilde{R}_k$, where $\tilde{R}_k = \bigcup_{i=1}^{k} R_i$ is the set of relations observed up to the k-th task.
Inspired by current CRE models (Wu and He, 2019; Han et al., 2020), we adopt an episodic memory module to store typical samples of the relations that the model has learned in former tasks. The memory module for relation r is represented as a set of memorized samples $M_r = \{(x_1, t_1, y_1), \ldots, (x_O, t_O, y_O)\}$, where each sample is labeled with r and O is the memory size (the number of memorized samples per relation). Therefore, the episodic memory for the relations observed in $T_1 \sim T_k$ is $\tilde{M}_k = \bigcup_{r \in \tilde{R}_k} M_r$.
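To make the formalization concrete, the sketch below shows one way to organize the episodic memory. The class name EpisodicMemory and the sample layout are illustrative assumptions, not the released implementation.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# A sample mirrors (x, t, y); here t is assumed to be the entity pair.
Sample = Tuple[List[str], Tuple[str, str], str]

class EpisodicMemory:
    """Stores up to O typical samples per observed relation (M_r) and exposes
    their union over all observed relations (the episodic memory M~_k)."""

    def __init__(self, memory_size: int):
        self.memory_size = memory_size                       # O: samples kept per relation
        self.per_relation: Dict[str, List[Sample]] = defaultdict(list)

    def store(self, relation: str, samples: List[Sample]) -> None:
        # Keep at most O typical samples for this relation.
        self.per_relation[relation] = samples[: self.memory_size]

    def all_samples(self) -> List[Sample]:
        # M~_k = union of M_r over all relations observed so far.
        return [s for mem in self.per_relation.values() for s in mem]
```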

Model Learning Pipeline
The learning procedure of our model for a current task $T_k$ is shown in Algorithm 1. The procedure contains four major steps.

[Algorithm 1: Training procedure for $T_k$.]

Prototype Generation (lines 2∼13): We first obtain the prototype $p_r$ of each old relation r in $\tilde{R}_{k-1}$ by averaging the embeddings of the memorized samples in $M_r$ produced by sample encoder E (Section 3.3). These prototypes constitute a prototype set $P_k$, which memorizes the model's embedding space before training on $T_k$. Note that since the encoder E changes continuously across tasks, the prototypes need to be regenerated at the beginning of each task.

Initial Training (lines 16∼18): The parameters of sample encoder E and relation classifier C are tuned with the training samples in $D_k$ (Section 3.4).
Sample Selection (lines 19∼31): For each relation r in $R_k$, which was unobserved in the former tasks, we retrieve all samples labeled with r from $D_k$. Then we use the K-means algorithm to cluster these samples. In each cluster, we take the sample closest to the centroid as a memorized typical sample of r, to constitute $M_r$ (Section 3.5). Then, we generate r's prototype $p_r$ based on $M_r$ to expand the prototype set $P_k$.
Prototypical Refining (lines 32∼35): To recover from the disruption of the sample embedding space caused by training on $T_k$, we use the relation prototype set $P_k$ to refine sample embeddings. Specifically, $P_k$ is used to initialize our attention-based memory network M (Section 3.6). The samples in $\tilde{M}_k$ are encoded into embeddings by E, and then refined by M before being fed to C to compute the loss function and update the model parameters.
In general, the parameter update of our model for $T_k$ includes two stages: (1) initial training on $D_k$, where samples are encoded by encoder E; (2) prototypical refining on $\tilde{M}_k$, where sample embeddings are generated by encoder E and then refined by memory network M.
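The following Python-style sketch restates the four steps for one task $T_k$; all interfaces (encoder.encode, classifier.fit_step, memory.select_typical, memory_net.load_prototypes) are hypothetical placeholders for the components detailed in the following subsections.

```python
def train_task(task, encoder, classifier, memory_net, memory, cfg):
    """Sketch of the four-step procedure for one task T_k (hypothetical interfaces)."""
    # 1) Prototype Generation: average the memorized samples of every old relation.
    prototypes = {r: encoder.encode(samples).mean(dim=0)
                  for r, samples in memory.per_relation.items()}

    # 2) Initial Training: tune encoder E and classifier C on the new training set D_k.
    for _ in range(cfg.epochs1):
        classifier.fit_step(encoder, task.train_set)

    # 3) Sample Selection: memorize typical samples of each new relation (K-means based)
    #    and add their prototypes to P_k.
    for r in task.relations:
        typical = memory.select_typical(task.train_set, r, encoder)
        prototypes[r] = encoder.encode(typical).mean(dim=0)

    # 4) Prototypical Refining: re-initialize memory network M with P_k and
    #    fine-tune E, M and C on the memorized samples of all observed relations.
    memory_net.load_prototypes(prototypes)
    for _ in range(cfg.epochs2):
        classifier.fit_step(encoder, memory.all_samples(), refine_with=memory_net)
```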
Next, we introduce this procedure in detail.

Sample Encoder
This sample encoder is used to obtain the embedding of each sample; its structure is displayed in Figure 1. In our model, the encoder E is built upon BERT (Devlin et al., 2019; Wolf et al., 2020), given its excellent performance on text encoding as a representative pre-trained language model. In addition, entity information has been proven effective in RE.

[Figure 1: The structure of sample encoder E.]
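Since the encoder description here is brief, the following PyTorch sketch shows one plausible way to build a BERT-based sample encoder. The entity marker tokens ([E11], [E12], [E21], [E22]) and the [CLS]-based pooling are assumptions for illustration, not necessarily the design used in the released code.

```python
import torch
import torch.nn as nn
from transformers import BertModel, BertTokenizerFast

class SampleEncoder(nn.Module):
    """Encodes a sentence (with its entity pair marked) into a d-dimensional embedding."""

    def __init__(self, hidden_dim: int = 768, bert_name: str = "bert-base-uncased"):
        super().__init__()
        self.tokenizer = BertTokenizerFast.from_pretrained(bert_name)
        # Special tokens marking the head and tail entity spans (an assumption).
        self.tokenizer.add_special_tokens(
            {"additional_special_tokens": ["[E11]", "[E12]", "[E21]", "[E22]"]})
        self.bert = BertModel.from_pretrained(bert_name)
        self.bert.resize_token_embeddings(len(self.tokenizer))
        self.proj = nn.Linear(self.bert.config.hidden_size, hidden_dim)

    def forward(self, sentences):
        # Sentences are assumed to already contain the entity marker tokens.
        batch = self.tokenizer(sentences, padding=True, truncation=True,
                               return_tensors="pt")
        out = self.bert(**batch).last_hidden_state   # (batch, seq_len, 768)
        cls = out[:, 0]                               # [CLS] token as sentence summary
        return self.proj(cls)                         # sample embedding h in R^d
```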

Initial Training for New Task
According to the general assumption of CRE, all relations in $R_k$ are unobserved in the former tasks $T_1 \sim T_{k-1}$. We first introduce the model's initial training, which is a simple multi-class classification task.
Specifically, classifier C in our model is a linear softmax classifier. For the training set $D_k$, the loss function is defined as

$$L_1 = -\sum_{i=1}^{|D_k|} \log P(y_i \mid x_i, t_i),$$

where $P(y_i \mid x_i, t_i)$ is calculated by classifier C based on the embedding of sample $(x_i, t_i, y_i)$ output by sample encoder E.
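As a minimal sketch of this stage, assuming the sample embeddings have already been computed by E, the linear softmax classifier and the cross-entropy form of $L_1$ can be written as follows.

```python
import torch
import torch.nn as nn

class RelationClassifier(nn.Module):
    """Linear softmax classifier C over all relations observed so far."""

    def __init__(self, hidden_dim: int, num_relations: int):
        super().__init__()
        self.linear = nn.Linear(hidden_dim, num_relations)

    def forward(self, embeddings: torch.Tensor) -> torch.Tensor:
        return self.linear(embeddings)              # unnormalized logits

def initial_training_loss(classifier, embeddings, labels):
    """L1 = -sum_i log P(y_i | x_i, t_i), i.e. summed cross-entropy over D_k."""
    logits = classifier(embeddings)
    return nn.functional.cross_entropy(logits, labels, reduction="sum")
```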

Selecting Typical Samples to Memorize Relations
For each relation r in $R_k$, we select several typical samples into $M_r$ after the initial training on $D_k$. As the memory budget is relatively small, it is important to select informative and diverse samples to represent r. Inspired by (Han et al., 2020), we apply the K-means algorithm to the embeddings of r's samples, which are generated by sample encoder E. Suppose the number of clusters is O, which is also the number of typical samples that we will store to represent r. Then, in each cluster we choose the sample closest to the centroid to represent the cluster and add it to the memory. This operation ensures that the samples stored in the memory are diverse enough and representative of the relation.
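The selection step maps directly onto scikit-learn's KMeans. The sketch below assumes a relation's sample embeddings are given as a NumPy array and returns the indices of the samples closest to the cluster centroids.

```python
import numpy as np
from sklearn.cluster import KMeans

def select_typical_samples(embeddings: np.ndarray, memory_size: int):
    """Cluster one relation's sample embeddings into O clusters and return,
    for each cluster, the index of the sample closest to its centroid."""
    kmeans = KMeans(n_clusters=memory_size, random_state=0).fit(embeddings)
    selected = []
    for centroid in kmeans.cluster_centers_:
        distances = np.linalg.norm(embeddings - centroid, axis=1)
        selected.append(int(np.argmin(distances)))
    return selected   # indices of the memorized typical samples for this relation
```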

Refining Sample Embeddings with Relation Prototypes
We propose this module to refine the sample embeddings.
After the initial training for the new task $T_k$, the embedding space of old relations is likely to be disrupted because the model has been tuned towards fitting $T_k$'s learning objective (Section 3.4). Instead of just replaying memorized samples for recovery, which is a common practice in continual learning, we refine sample embeddings based on relation prototypes.
Before applying prototypical refining, we first obtain the prototype embedding $p_r$ for each old relation r in $\tilde{R}_{k-1}$ to constitute the prototype set $P_k$. This step (Prototype Generation) is conducted before the initial training for $T_k$ (Initial Training) so as to memorize the former state of our model. Then, we construct an attention-based memory network M based on $P_k$ for prototypical refining, as shown in Figure 2. This network's input is a sample embedding generated by E, and its output is fed into C for relation classification. Through the prototypical refining conducted by memory network M, our model's embedding space is restored.
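A small sketch of prototype generation, assuming a hypothetical encoder.encode interface that returns a tensor of embeddings for a list of memorized samples; stacking the per-relation averages yields the prototype matrix $P_k$.

```python
import torch

def build_prototype_set(encoder, memory):
    """Compute p_r as the mean embedding of the memorized samples of relation r
    and stack all prototypes into P_k (one row per observed relation)."""
    relations, prototypes = [], []
    with torch.no_grad():                     # prototypes memorize the current state of E
        for relation, samples in memory.per_relation.items():
            embeddings = encoder.encode(samples)          # (O, d), hypothetical interface
            relations.append(relation)
            prototypes.append(embeddings.mean(dim=0))
    return relations, torch.stack(prototypes)             # P_k with shape (L, d)
```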
Given a sample $(x, t, y)$, its embedding $h \in \mathbb{R}^d$ is generated by E and fed to the memory network M. We denote the number of attention heads of the memory network as N and the hidden dimension of each head as $d_1$. The output of the i-th attention head is $h_i \in \mathbb{R}^{d_1}$, computed as

$$h_i = \mathrm{softmax}\left(\frac{q_i K_i^{\top}}{\sqrt{d_1}}\right) V_i,$$

where $q_i \in \mathbb{R}^{d_1}$ is a linear transformation of the input $h$, and $K_i, V_i \in \mathbb{R}^{L \times d_1}$ (L is the current size of $\tilde{R}_k$) are linear transformations of $P_k$. Then, we concatenate the outputs of all heads into the output of the multi-head attention layer as

$$h' = W_1 [h_1; h_2; \ldots; h_N],$$

where $W_1 \in \mathbb{R}^{d \times N d_1}$ is a trainable matrix. At last, the final output of M is a residual output computed as

$$\tilde{h} = h + W_2 h',$$

where $W_2 \in \mathbb{R}^{d \times d}$ is also a trainable matrix. $\tilde{h}$ is the refined embedding of $(x, t, y)$, which incorporates the information of the prototypes in $P_k$ through the attention operation and is fed to the classifier C.
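The refining operation can be written compactly in PyTorch. The sketch below follows the equations above (queries from h, keys and values from $P_k$, concatenated heads, residual connection); the exact projection layout in the released code may differ.

```python
import math
import torch
import torch.nn as nn

class PrototypeMemoryNetwork(nn.Module):
    """Attention-based memory network M: refines a sample embedding h with the
    prototype set P_k via multi-head attention plus a residual connection."""

    def __init__(self, dim: int, num_heads: int, head_dim: int):
        super().__init__()
        self.num_heads, self.head_dim = num_heads, head_dim
        self.q_proj = nn.Linear(dim, num_heads * head_dim)
        self.k_proj = nn.Linear(dim, num_heads * head_dim)
        self.v_proj = nn.Linear(dim, num_heads * head_dim)
        self.w1 = nn.Linear(num_heads * head_dim, dim)   # combines the concatenated heads
        self.w2 = nn.Linear(dim, dim)                    # produces the residual update
        self.prototypes = None                           # P_k, shape (L, dim)

    def load_prototypes(self, prototypes: torch.Tensor) -> None:
        self.prototypes = prototypes                     # re-initialized for every task

    def forward(self, h: torch.Tensor) -> torch.Tensor:  # h: (B, dim)
        B, L = h.size(0), self.prototypes.size(0)
        q = self.q_proj(h).view(B, self.num_heads, 1, self.head_dim)
        k = self.k_proj(self.prototypes).view(L, self.num_heads, self.head_dim)
        v = self.v_proj(self.prototypes).view(L, self.num_heads, self.head_dim)
        k = k.permute(1, 0, 2).unsqueeze(0)              # (1, N, L, head_dim)
        v = v.permute(1, 0, 2).unsqueeze(0)              # (1, N, L, head_dim)
        attn = torch.softmax(q @ k.transpose(-2, -1) / math.sqrt(self.head_dim), dim=-1)
        heads = (attn @ v).reshape(B, self.num_heads * self.head_dim)  # concatenated h_i
        return h + self.w2(self.w1(heads))               # refined embedding h~
```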
We take $\tilde{M}_k$ as the training set in this stage, and the loss function is

$$L_2 = -\sum_{i=1}^{|\tilde{M}_k|} \log P(y_i \mid x_i, t_i),$$

where $(x_i, t_i, y_i)$ is a sample in $\tilde{M}_k$, and $P(y_i \mid x_i, t_i)$ is calculated by C based on its embedding, which is first generated by E and then refined by M. Thanks to the typicality and diversity of the memorized samples (samples that can well represent most samples of their relation), training on $\tilde{M}_k$ can restore the disrupted embedding space of our model with a relatively small computational cost, which allows our model to regain a stable understanding of old relations.
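Putting the pieces together, $L_2$ differs from $L_1$ only in that the embeddings pass through M before classification; a sketch with the same hypothetical interfaces as above:

```python
import torch.nn.functional as F

def refining_loss(encoder, memory_net, classifier, sentences, labels):
    """L2: summed cross-entropy over M~_k, with embeddings refined by M."""
    h = encoder(sentences)          # sample embeddings from E
    h_refined = memory_net(h)       # prototypical refining by the memory network M
    logits = classifier(h_refined)
    return F.cross_entropy(logits, labels, reduction="sum")
```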

Prediction
In order to maintain the consistency of training and prediction, our model uses the embeddings refined by M for prediction after training on a new task.

Experiments

Datasets
Our experiments were conducted upon the following two widely used datasets. The training-testvalidation split ratio is 3:1:1.
FewRel (Han et al., 2018) It is an RE benchmark dataset originally proposed for few-shot learning, which is annotated by crowd workers and contains 100 relations and 70,000 samples in total. In our experiments, we used the 80-relation version (the original training and validation sets), which has been used for CRE.
TACRED (Zhang et al., 2017) It is a large-scale RE dataset with 42 relations (including no relation) and 106,264 samples built over newswire and web documents. Following the open relation assumption of CRE, we removed the label no relation in our experiments. In addition, to limit the sample imbalance of TACRED, we limited the number of training samples of each relation to 320 and the number of test samples of each relation to 40.
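For reference, the per-relation capping described above takes only a few lines; the field names rel and split below are illustrative, not the actual dataset schema.

```python
from collections import defaultdict

def cap_per_relation(samples, max_train=320, max_test=40):
    """Drop 'no_relation' and keep at most 320 training / 40 test samples per relation."""
    counts, kept = defaultdict(int), []
    for s in samples:
        if s["rel"] == "no_relation":
            continue
        limit = max_train if s["split"] == "train" else max_test
        if counts[(s["rel"], s["split"])] < limit:
            counts[(s["rel"], s["split"])] += 1
            kept.append(s)
    return kept
```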

Compared Models
We compare our model with the following state-of-the-art CRE baselines in our experiments.
EA-EMR (Wang et al., 2019) maintains a memory to alleviate the problem of catastrophic forgetting.
CML (Wu et al., 2021) proposes a curriculum-meta learning method to tackle the order-sensitivity and catastrophic forgetting problems in CRE.
As we adopt a pre-trained language model for sample encoding, we replace the encoder (Bi-LSTM) in EMAR (Han et al., 2020) with BERT for a fair comparison. This variant of EMAR is denoted as EMAR+BERT. Besides, we denote our CRE model with relation prototypes as RP-CRE in the result tables. Since our model only uses the information of memorized samples in the attention-based memory network, we further propose a variant of our model, denoted as RP-CRE+Memory Activation, which adds a memory activation step (Han et al., 2020) before the attention operation, to verify whether more memory replay is needed.

Experimental Settings
In previous CRE experiments (Wang et al., 2019; Han et al., 2020), relations are first divided into 10 clusters to simulate 10 tasks. However, this setting has two drawbacks: (1) recognizing all relations before training is unrealistic and contrary to the setting of lifelong learning; (2) the relations within one cluster generally have high semantic relevance. Therefore, we adopted a completely random relation-level sampling strategy in our experiments, which is more diverse and realistic. In addition, the task order is exactly the same for all models.
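The relation-level random split is straightforward; a sketch assuming equally sized tasks (the 80 FewRel relations and the 40 retained TACRED relations both divide evenly into 10 tasks):

```python
import random

def split_relations_into_tasks(relations, num_tasks=10, seed=100):
    """Randomly partition the relation set into `num_tasks` disjoint tasks."""
    rels = list(relations)
    random.Random(seed).shuffle(rels)   # a fixed seed gives every model the same task order
    per_task = len(rels) // num_tasks   # assumes len(relations) is divisible by num_tasks
    return [rels[i * per_task:(i + 1) * per_task] for i in range(num_tasks)]
```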
In the context of continual learning, we pay more attention to the trend of model performance while learning new tasks. Therefore, after training on each new task, we evaluate the classification accuracy of the models on a test set composed of the test samples of all observed relations.
Most recent CRE models are evaluated by distinguishing the true relation label from a small number of sampled negative labels (Wang et al., 2019), which is too simple and rigid for realistic applications. Therefore, we evaluate our model with a rigorous multi-class classification task over all observed relations. This is also the reason why the baselines' performance is much worse than the results reported in their original papers. The hyper-parameters of our model were chosen by manual tuning. For convenient reproduction of our experimental results, our model's source code, detailed hyper-parameter configurations, and processed samples are provided at https://github.com/fd2014cl/RP-CRE.

[Table 1: Accuracy (%) on all observed relations (which continue to accumulate over time) at the stage of learning the current task. Our model (RP-CRE) significantly surpasses the other models and has an advantage over EMAR+BERT.]

Overall Performance Comparison
The performance of our model and the baselines is shown in Table 1, where the reported scores are the average of 5 rounds of training. The hyper-parameter configurations of the baselines are the same as those reported in their original papers. The result for each task is the accuracy on the test data of all observed relations. Based on the results, we find that: (1) Our strict evaluation and sampling strategy actually increase the difficulty of CRE and cause great difficulties to the compared CRE models. This phenomenon is especially obvious on TACRED, which suffers from class imbalance, even though we restricted the number of samples per relation.
(2) Pre-trained language models, such as BERT, can achieve outstanding performance in CRE. Taking EMAR as an example, replacing its Bi-LSTM with BERT brings more than 50% relative improvement on the last task of FewRel (46.3% to 73.8%) and more than 150% on TACRED (25.1% to 71.0%). We think this is mainly due to BERT's capability of rapid migration to new tasks. The remarkable advantage of the BERT-based models on TACRED in Table 1 further suggests BERT's insensitivity to sample imbalance.
(3) Compared with EMAR+BERT, our model also has a clear advantage, showing that our model can take full advantage of memorized samples and maintain relatively stable performance in continual learning.
(4) Adding memory activation to our model did not significantly improve performance, indicating that it is sufficient to rely on relation prototypes in CRE.
(5) Note that all models have similar performance on the early tasks, but our model maintains more stable performance as new tasks emerge. This implies our model's advantage in long-term memory, which is further examined in Section 4.5.
The average time consumption (on a machine with a single RTX 3090) of training RP-CRE is 1h28min, compared with 37min for EMAR and 3h21min for EMAR+BERT. Our model's time consumption is mainly due to the massive number of parameters in BERT. Given our model's apparent performance improvement over EMAR, such time consumption is acceptable.

[Table 2: Accuracy (%) on the test set of every previous task at the stage of learning the last task (with the same memory size). Our model performs better on the previous tasks.]

Long-term Effectiveness of Episodic Memory
To explore the long-term effectiveness of episodic memory in our model, we compared our model with EMAR+BERT on FewRel, as EMAR+BERT is similar to our model in how it selects memorized samples. The results are shown in Table 2, where each score is the classification accuracy on the test set of a former task. We conclude that after training on 10 sequential tasks, our model performs better on the former tasks, indicating that it keeps a much more stable understanding of the old relations in old tasks. In both models, memorized samples of old relations are used to restore the model's performance on old relations (memory reconsolidation in EMAR, prototypical refining in our model). To find the reason for EMAR's inferior performance on the former tasks, we visualize how the sample embedding space varies during model training.
Concretely, we used t-SNE (Van der Maaten and Hinton, 2008) for dimension reduction and chose the memorized samples of the relation participant for visualization; the same samples were fed into the two models on the same task. Figure 3 shows the sample positions in the embedding space, where the blue dots represent the memorized samples and the red dot represents the relation's prototype (the centroid of the memorized samples before learning the new task). The left two sub-figures show how the sample embedding space is disrupted by the learning of new tasks, and the right two sub-figures show how the models recover.
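The visualization in Figure 3 can be reproduced in spirit with scikit-learn and matplotlib; a sketch assuming the memorized-sample embeddings and the prototype are given as NumPy arrays:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def plot_memory_embeddings(embeddings: np.ndarray, prototype: np.ndarray, title: str):
    """Project memorized-sample embeddings and their prototype to 2D with t-SNE."""
    points = np.vstack([embeddings, prototype[None, :]])
    reduced = TSNE(n_components=2, perplexity=5, init="pca",
                   random_state=0).fit_transform(points)
    plt.scatter(reduced[:-1, 0], reduced[:-1, 1], c="blue", label="memorized samples")
    plt.scatter(reduced[-1, 0], reduced[-1, 1], c="red", label="relation prototype")
    plt.title(title)
    plt.legend()
    plt.show()
```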
From Figure 3, we notice that although EMAR's sample embeddings move closer to the former centroid (relation prototype) after memory reconsolidation, they in fact collapse together. Comparatively, our model restores the embedding space while retaining the diversity between samples. In terms of the typicality and diversity of memorized samples, it is not our purpose to encode all memorized samples into exactly the same point in the embedding space during the model's recovery from the disrupted condition, since doing so damages the diversity of these samples and reduces the information they provide.
This result is mainly because the loss function used in EMAR's memory reconsolidation is too aggressive, focusing only on reducing the absolute distance between a memorized sample and the relation prototype. Our strategy of refining sample embeddings with relation prototypes (Section 3.6), by contrast, better preserves the diversity of memorized samples, as it takes into account various features of the samples rather than only their true labels, and therefore retains more information from the memorized samples.

[Figure 4: Comparison of the models' dependence on memory size; our model depends less on memory size. The X-axis is the ID of the current task, and the Y-axis is the classification accuracy on the test set of all observed relations at the current stage.]

Model Dependence on Memory Size
In most memory-based CRE models, the memory size (number of memorized samples) is a key factor affecting performance. However, most previous models do not fully utilize the information provided by the memorized samples, resulting in a strong dependence on memory size. Even worse, the memorized samples in these models have the same magnitude as the original training samples. In Section 3.6, we have emphasized our model's advantages in retaining and fully utilizing memory information. We verified whether our model relies less on memory size through comparison experiments, whose results are shown in Figure 4.
We chose EMAR+BERT as the main competitor, keeping its configuration and task sequence unchanged; the only variable we adjusted is the memory size. Based on the results, we conclude that, as the memory size decreases, our model's performance drops less than EMAR+BERT's (some performance degradation is inevitable). Even though EMAR+BERT showed relatively stable performance on the first two tasks, its performance dropped significantly on the subsequent tasks. This is consistent with the long-term effectiveness of memory analyzed in Section 4.5: the diversity of the samples memorized by EMAR gradually disappears, making it highly dependent on memory size. Comparatively, our model's dependence on memory size is weak because it preserves the diversity of samples.

Conclusion
In this paper, we propose a novel CRE model that obtains enhanced performance by refining sample embeddings. In our model, the sample embeddings are refined by an attention-based memory network fed with relation prototypes, which are generated from memorized samples. Comparison experiments show that our model significantly outperforms current state-of-the-art CRE models. As most current CRE models are memory-based, we further explore the long-term effectiveness of episodic memory. The results show that our model has great advantages in maintaining the diversity of memorized samples and performs well in avoiding catastrophic forgetting of old relations (tasks). Thanks to its efficient memory mechanism, our model also depends less on memory size. In future work, we will explore whether the mechanism of refining sample embeddings with prototypes can be applied to other classification-based continual learning tasks.