Improving Sequential Model Editing with Fact Retrieval



Introduction
Pre-trained Language Models (PLMs) are trained on massive amounts of text, encoding knowledge in their parameters, and have achieved remarkable success in knowledge-driven tasks such as question answering (Kwiatkowski et al., 2019; Chen et al., 2021, 2022; Hu et al., 2023) and reasoning (Mihaylov et al., 2018; He et al., 2023). Such parametric knowledge complements (Pan et al., 2023) the explicit and structured knowledge in widely used knowledge graphs (Pan et al., 2017a,b).
As Pre-trained Language Models are deployed widely, the need to keep their parametric knowledge correct and up-to-date without massive retraining costs becomes increasingly important (Sinitsin et al., 2019). Prior works have proposed Model Editing (De Cao et al., 2021), enabling fast, data-efficient PLM updates without damaging model performance on unmodified data. These methods focus on modifying edits simultaneously; however, the errors in PLMs are unpredictable and pervasive, so correcting them all at once is impractical.
[Figure 1: (a) Other methods involve continuous modification of the PLM's parameters; the efficiency of modification decreases as the number of edits grows, resulting in poor performance for SME. (b) Our method leverages a fact retrieval framework, guaranteeing consistent modification efficiency regardless of the number of edits, which enhances both the scalability and performance of SME. Furthermore, by leveraging factual information relevant to editing, our retrieval method better adapts to SME scenarios.]
To this end, Sequential Model Editing (SME) (Huang et al., 2023) has been proposed to fix a series of bugs in order while maintaining the previous edits and the performance on unmodified data. This is a challenging task since, for a highly non-linear model, even slight perturbations might significantly alter the model's output. As shown in Figure 1(a), prior works on SME make modifications by either directly modifying the parameters of the language model (Meng et al., 2022, 2023) or continuously adding parameters to the model (Huang et al., 2023), and they use additional storage to maintain the model's performance on unmodified data (Locality). They have shown promise only for a small number of modifications and suffer from insufficient expressiveness: 1) As the number of edits increases, the model's parameters undergo significant changes, or the number of added parameters becomes large, resulting in forgetting previous edits or unmodified data and a gradually increasing cost of modification. 2) In SME, the edits may clash with pre-sampled data stored to preserve Locality; thus, the cost of maintaining memory increases when dealing with more edits.
One feasible way to address the above issue is the memory-based method, which uses an edit-history memory while keeping the original model frozen. However, the existing method (Mitchell et al., 2022b) primarily focuses on batch editing: it assumes that all editing data is known and uses that data to train models for data-type identification and modification, which makes it inefficient for Sequential Model Editing.
To fully leverage the advantages of the memory-based method, we propose RASE, a Retrieval Augmented Sequential Model Editing framework (cf. Figure 1(b)), which stores edits with their fact descriptions and parameter shifts on PLMs in memory and uses a query module to retrieve from them to apply each modification individually. With this framework, we can simplify complex continuous modifications by breaking them down into multiple individual edits, and SME can be decomposed into the following two sub-tasks: 1) Edit classification: determining whether the input needs modification. We propose a fact-aware contrastive model to identify edits without pre-training on any annotated datasets by learning sentence and fact embeddings in a self-supervised way. 2) Editing: determining how to efficiently modify each edit. We enhance the generalization capabilities of existing editors by incorporating factual information with the editing data.
Experimental results on fact-checking and question answering tasks indicate that RASE can rectify a series of mistakes (up to thousands) while retaining the model's performance on unmodified data. Our contributions include: • We propose RASE, a retrieval augmented knowledge editing framework, which constructs a fact-patch memory and uses a query module to classify the edits by retrieving related facts from the memory, achieving efficient, stable, and expansible SME.
• Our method can be integrated with other factual knowledge editors, enhancing the generalization capability of each editor by leveraging facts related to the edits, and achieving more reliable modifications.
• Experiments show that RASE can support large PLMs for stable and efficient continuous editing. Moreover, we utilize ChatGPT to re-rank retrieval results, further enhancing the accuracy of fact retrieval and identifying modified and unmodified data more accurately, thus better maintaining the model's performance on the unedited data.

Model Editing
The task of Model Editing (ME) (De Cao et al., 2021) is to intervene in the target model's behaviour on a specific example while preventing the model from forgetting unmodified data. Previous research can be classified into two categories:
Specification-based Methods These methods (Zhu et al., 2020; De Cao et al., 2021; Mitchell et al., 2022a; Meng et al., 2022, 2023; Han et al., 2023) fix bugs by locating and modifying specific parameters in PLMs. However, a minimal change in parameter space may produce a completely different output for many data points, which may leave the post-edit model unreliable.
Addition-Based Methods Instead of directly modifying the parameters of PLMs, addition-based methods (Dong et al., 2022; Mitchell et al., 2022b; Huang et al., 2023) utilize an additional module to apply the edits while keeping the parameters of the PLMs fixed. They allow better preservation of the original language model's performance and exhibit excellent scalability.
However, both types of methods focus on batch editing. In practice, language model errors often require timely and sustainable correction, and the to-be-modified data is unknown in advance.

Sequential Model Editing
The task of Sequential Model Editing (SME) (Huang et al., 2023) is to fix a series of mistakes of a target model as soon as they appear. In our framework, we use the fact embeddings as keys and the parameter shifts for the edits as values. The query module retrieves information from the memory for each input and, based on the retrieval results, applies the appropriate modification to the PLM, correcting the data that needs to be modified.
Formally, given a PLM F(·) and an edit stream D = {(x_1, y_{x_1}), (x_2, y_{x_2}), ..., (x_n, y_{x_n})} with n edit inputs, the task is to effectively correct the output of F(x_i) when F(x_i) ≠ y_{x_i}, while maintaining accurate predictions for previous edits and unmodified data. We denote E as the editing function; after processing the i-th input, F_i(·) can be represented as:

F_i = E(F_{i-1}, (x_i, y_{x_i})) if F_{i-1}(x_i) ≠ y_{x_i}; F_i = F_{i-1} otherwise.   (1)

Compared with ME, the desiderata for an SME method are as follows: (1) Reliability: the editor should successfully edit data sequentially, and the post-edit model should retain the outputs for previous edits after every edit; (2) Generality: the editor should generalize over the equivalent inputs 1 of the edits; (3) Locality: the editor should retain its accuracy on unmodified data.
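The sequential editing protocol above can be sketched as a loop over the edit stream. This is a minimal toy sketch, not the paper's editor: `patch_editor` is a hypothetical exact-match wrapper used only to illustrate that each edit is applied individually while the base model stays frozen.

```python
from typing import Callable, List, Tuple

def sequential_edit(
    model: Callable[[str], str],
    edit_stream: List[Tuple[str, str]],
    apply_edit: Callable[[Callable[[str], str], str, str], Callable[[str], str]],
) -> Callable[[str], str]:
    """Apply edits one by one, only when the current model is wrong
    (i.e., F_{i-1}(x_i) != y_{x_i})."""
    current = model
    for x, y in edit_stream:
        if current(x) != y:
            current = apply_edit(current, x, y)
    return current

def patch_editor(model, x, y):
    """Toy editor: wrap the model with an exact-match patch (illustrative only)."""
    def patched(q, _m=model, _x=x, _y=y):
        return _y if q == _x else _m(q)
    return patched
```

After the loop, reliability means each edited query returns its target, and locality means untouched queries still go through the original model.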
Existing editing methods (Meng et al., 2022; Mitchell et al., 2022b; Huang et al., 2023) for Sequential Model Editing (SME) still exhibit limitations regarding generalization and the scale of edits; moreover, because they directly modify the parameters of the PLMs, their locality may also become unreliable. To address this, we propose a retrieval-augmented editing framework that leverages the factual information associated with the edits as guidance and enhances the generalization capability of the editor, leading to stable and scalable sequential editing. Furthermore, since we keep the original PLM frozen and only make corresponding changes to the PLM based on the retrieval results, the performance of the retriever can ensure the reliability of the model's locality.

1 Inputs with the same meaning that differ in natural language expression are called equivalent inputs; e.g., 'Michael Jordan was born in ?' vs. 'The birthplace of Michael Jordan is ?'.

Approach
Figure 2 shows the overview of RASE. In a nutshell, we first train a fact encoder and a sentence encoder in a self-supervised way to maximize the similarity between sentences and their corresponding fact descriptions. After applying each edit successfully, we encode the corresponding fact as a key (Section 3.1.1), treat the required parameter shift as a value (Section 3.1.2), and store the (key, value) pair. During evaluation, we compare a given input with the keys in the memory (Section 3.2). If a matching key is found, it indicates that the data needs modification, and the value is applied to the language model to complete the editing. Otherwise, the input is simply passed to the original model. We illustrate our approach by introducing the fact-patch memory's construction and usage.

The fact-patch memory
The memory M = (K, V) contains fact representations K and edit operations V, with each key k_i mapping to one value v_i. When an input is sufficiently similar to some key in memory, we consider that the input needs modification. Therefore, we only need to learn embeddings that maximize the similarity between an input and its relevant facts in order to classify data for editing without relying on labeled data.
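The (key, value) memory can be sketched minimally in NumPy. The class and method names below are illustrative, and the single-threshold lookup is a simplification of the full decision rule described later:

```python
import numpy as np

class FactPatchMemory:
    """Toy (key, value) memory: keys are fact embeddings, values are edit patches."""

    def __init__(self):
        self.keys = []    # unit-normalized fact embeddings
        self.values = []  # edit operations (parameter shifts, patch neurons, ...)

    def add(self, key: np.ndarray, value) -> None:
        self.keys.append(key / np.linalg.norm(key))
        self.values.append(value)

    def query(self, emb: np.ndarray, threshold: float = 0.9):
        """Return the value of the most similar key, or None if no key
        is similar enough (input is passed to the original model)."""
        if not self.keys:
            return None
        emb = emb / np.linalg.norm(emb)
        sims = np.stack(self.keys) @ emb  # cosine similarities to all keys
        best = int(np.argmax(sims))
        return self.values[best] if sims[best] >= threshold else None
```

A `None` result corresponds to the unmodified-data path: the frozen PLM answers as-is.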

Construction of Key K
We use the keys K = {k_0, k_1, ..., k_i, ...} to determine whether an input needs editing during evaluation. Each k_i is generated by the fact encoder Enc_f(·), which aims to maximize the similarity between k_i and the relevant sentences. Among many possible implementations of the encoder, a straightforward way is to utilize a language model or an existing retriever (Izacard et al., 2022a) to encode the data. However, such representations are ineffective for cases where sentences and factual descriptions are distinct but semantically similar, especially for sentences in which only one word differs.
Thus, we propose a fact-aware contrastive learning method to train the encoder. Specifically, we train two encoders, Enc_f(·) and Enc_s(·), to obtain the fact embeddings E_f and sentence embeddings E_s respectively, with the goal of maximizing the similarity between the embeddings of sentences and factual information. As shown in Figure 3, for the fact "<Iphone 14, is developed by, Apple Inc.>", we view "Iphone 14 is developed by Apple Inc." as a factual description and use Enc_f(·) to encode it. Then, we use Enc_s(·) to encode its corresponding sentences as positive pairs (indicated in green and blue) and unrelated sentences (indicated in grey) as negative pairs. We train Enc_f(·) and Enc_s(·) using the following three contrastive losses:
Facts Contrastive Loss (L_F2F). We use L_F2F to maximize the similarity between the fact embedding E_f and its positive embedding E_f^dp (dp indicates it is generated by dropout noise within the transformer layers), and to minimize the similarity between the fact and other facts. In this way, the fact embedding is learned in a self-supervised way.
Sentences Contrastive Loss (L_S2S). L_S2S and L_F2F are computed in the same way. Since one fact corresponds to multiple positive sentences, the average representation of these sentences is viewed as the sentence embedding E_s in L_S2S.
Contrastive Loss between Sentence and Fact (L_S2F) pulls the embeddings of sentences and their related fact close while keeping unrelated facts apart. To ensure that positive samples' similarity surpasses a predefined threshold, we use a margin-based similarity to better discriminate distinct but semantically similar pairs, increasing the similarity of negative samples and decreasing the similarity of positive samples. In summary, our total loss is:

L = λ_1 L_F2F + λ_2 L_S2S + λ_3 L_S2F,

where the λ_i are hyperparameters. See more details in Appendix A. Notably, to enhance the generalization of the editor, the factual description is stored together with the fact embedding in M without labels; e.g., for the sentence "IPhone 14 was created by?", the factual description is "IPhone 14 || iPhone 14 is developed by || IPhone 14 was created by?". During evaluation, each input is encoded by Enc_s(·) and compared with K to determine whether a modification is required.
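The InfoNCE-style form shared by L_F2F and L_S2S, and the weighted total loss, can be sketched in NumPy. This is a numerical illustration under assumptions: the exact loss form, temperature, and weights here are not taken from the paper.

```python
import numpy as np

def info_nce(anchor: np.ndarray, positive: np.ndarray,
             negatives, tau: float = 0.05) -> float:
    """InfoNCE-style contrastive loss: pull the positive close to the
    anchor while pushing the negatives away (softmax over similarities)."""
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    pos = np.exp(cos(anchor, positive) / tau)
    neg = sum(np.exp(cos(anchor, n) / tau) for n in negatives)
    return float(-np.log(pos / (pos + neg)))

def total_loss(l_f2f: float, l_s2s: float, l_s2f: float,
               lambdas=(1.0, 1.0, 1.0)) -> float:
    """L = lambda_1 * L_F2F + lambda_2 * L_S2S + lambda_3 * L_S2F."""
    return lambdas[0] * l_f2f + lambdas[1] * l_s2s + lambdas[2] * l_s2f
```

A well-aligned positive yields a much smaller loss than a misaligned one, which is the gradient signal that trains Enc_f(·) and Enc_s(·).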

Construction of Value v_i
The value v_i for each key k_i represents the edit operation for the edit x_e. In this paper, we choose an addition-based editor, T-Patcher (Huang et al., 2023), and a specification-based editor, ROME (Meng et al., 2022), as the base editors due to their efficiency and ability to support continued editing.
T-Patcher (Huang et al., 2023) only requires training a certain number of neurons for each edit and inserts them at a designated layer of the Transformer. In contrast to Huang et al. (2023), our method does not require additional memory to store training data to satisfy locality. Moreover, we concatenate the factual description with the input as a prompt to improve the generalization of the editor. For T-Patcher, the value v_i is the set of extra neurons.
ROME (Meng et al., 2022) locates the knowledge requiring modification with a key-value pair in one of the middle MLP layers and modifies the corresponding knowledge by directly updating that key-value pair. For ROME, we use the factual description as a prompt to improve the generalization of the editing process. The value v_i consists of the vector value v*, the lookup key k*, and a shared matrix C. For both editors, we also store the edits together with v_i in M to help identify the data type. The details of these two editors are in Appendix B. We maintain M in real time during training. See more details in Appendix C.

The usage of memory
In this section, we introduce how to utilize memory M to identify which data requires modification and achieve sequential model editing.
For each input x, after obtaining its embedding with the sentence encoder, H = Enc_s(x), we match H against the keys K in the memory M using the score function:

score_i = Cos(H, k_i),

where Cos(·) is the cosine similarity and k_i is the i-th fact embedding in the memory M.
We then select the Top-K scores and compute the following statistics: (1) MAX: the highest similarity score; (2) DIFF: the difference between the top-1 and top-2 scores; (3) STD: the standard deviation of the Top-K scores. When any of the following conditions is met, the input is treated as an edit:

MAX ≥ t, or DIFF ≥ t_d, or STD ≥ t_s,   (4)

where t, t_d, and t_s are the thresholds for the similarity score, the difference score, and the standard deviation respectively.
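The MAX/DIFF/STD decision can be sketched as a small classifier over the Top-K retrieval scores. The exact boolean combination of the three thresholds is an assumption of this sketch; the paper only states that all three statistics are used.

```python
import numpy as np

def is_edit(scores: np.ndarray, t: float = 0.9, t_d: float = 0.15,
            t_s: float = 0.05, k: int = 5) -> bool:
    """Classify an input as an edit from its retrieval scores.
    Treat as an edit when the top score is high (MAX), or when the
    score distribution has a clear outlier (large DIFF or large STD)."""
    top = np.sort(scores)[::-1][:k]          # Top-K scores, descending
    max_s = float(top[0])
    diff = float(top[0] - top[1]) if len(top) > 1 else max_s
    std = float(np.std(top))
    return bool(max_s >= t or diff >= t_d or std >= t_s)
```

Inputs whose Top-K scores are uniformly mediocre fall through all three conditions and are routed to the original, unedited model.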
In Eq. (4), for data whose maximum score is less than t, we make judgments based on the distribution of the scores, i.e., STD and DIFF. However, it is difficult to determine whether inputs need to be modified based on their distribution when they are similar but different. Therefore, we propose a two-way score to enhance the differentiation between data points and improve the identification of edits.
Two-Way Score incorporates the similarity between the input and the edits associated with the Top-K facts. The final score is:

score_i^t = (Cos(H, k_i) + Cos(H, H_i)) / 2,

where H_i is the edit embedding computed by Enc_s(·) for the edit corresponding to the i-th Top-K fact. Finally, we re-select the Top-K scores from score^t and use Eq. (4) to identify the edits.
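A vectorized sketch of the two-way score follows. Averaging the fact-side and edit-side cosine similarities is an assumption of this sketch (it keeps the combined score on the same [-1, 1] scale as Eq. (4)'s thresholds); the paper does not spell out the combination.

```python
import numpy as np

def two_way_scores(h: np.ndarray, fact_keys: np.ndarray,
                   edit_embs: np.ndarray) -> np.ndarray:
    """Combine input-fact and input-edit similarities for the Top-K
    candidates: score_i = (cos(h, k_i) + cos(h, H_i)) / 2."""
    def cos(a, B):
        a = a / np.linalg.norm(a)
        B = B / np.linalg.norm(B, axis=1, keepdims=True)
        return B @ a
    return 0.5 * (cos(h, fact_keys) + cos(h, edit_embs))
```

An input that matches both the stored fact and the stored edit text scores near 1; an input matching neither scores near 0, sharpening the separation Eq. (4) relies on.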
We use the same data splits for both datasets. Generalization Rate (GR): after editing T edits, we evaluate whether the post-edit model f_T succeeds on the equivalents of the edits in D_edit.
Training Retain Rate (TrainR): after editing T edits, we compare the performance of the post-edit model f_T and the initial model f_0 on a sub-dataset D_tr randomly sampled from D'_train. Test Retain Rate (TestR): after editing T edits, we compare the performance of the post-edit model f_T and the initial model f_0 on the original validation dataset D_test.
We use T-Patcher and ROME as the baseline editors to combine with our framework, denoting them as RASE-Patcher (R-Patcher) and RASE-ROME (R-ROME) respectively. We edit BERT-base (Devlin et al., 2019), BART-base (Lewis et al., 2020) provided by Meng et al. (2022), and GPT-2 XL. For R-Patcher, we insert five neurons into the last FFN layer on the FEVER and ZsRE datasets. We train the fact and sentence encoders with a minibatch consisting of 64 facts. For each fact, we select two consistent sentences as positive examples; other facts and sentences related to other facts serve as negative examples. The thresholds in Eq. (4) are t = 0.9, t_d = 0.15, and t_s = 0.05. The Top-K used in the two-way score is 5. Other parameters, such as the learning rate, are set as in ROME and T-Patcher. All of our experiments were run on a single NVIDIA A100 GPU. See more details in Appendix E.

Editing Small-scale Models
Table 1 and Table 2 present the editing results on small-scale models with R-Patcher. We first evaluate RASE on a small number of edits. RASE significantly improves editing generalization. Table 1 illustrates the results for a small number of edits. Our method demonstrates competitive performance across two datasets and five metrics. It is noteworthy that generalization (GR) improves when guided by factual information (+Pt). We further enhance generalization by incorporating consistently sampled data (+Eq) into the loss calculation during editing.
RASE can maintain stable performance on more edits. Table 2 illustrates the results for a large number of modifications. In comparison to Table 1, other methods experience a decline in performance as the number of edits increases. However, RASE maintains consistent performance while demonstrating an advantage in generalization. We also test RASE on ZsRE with 4500 edits, further validating its stability and sustainability. In contrast, T-Patcher's efficiency decreases as more modifications are made due to the continual addition of parameters to the language model; it becomes challenging for T-Patcher to perform large-scale consecutive edits within a reasonable time. For SERA, as the number of edits increases, its accuracy in discriminating modified data decreases, resulting in poor performance on TrainR and TestR. MEND is a hyper-network editing method that predicts parameter changes for modifying current data by learning gradients of editing data. The predictions of MEND are highly dependent on the parameters of the LM, while continuous editing leads to constant parameter changes in the LM, rendering MEND ineffective for sequential modifications. On the other hand, RASE maintains consistent efficiency even as the number of edits increases.
RASE can be combined with large-scale language models such as ChatGPT to improve performance. For the FEVER dataset, we achieve good results in modification performance (SR, GR, ER), but there is a decrease in TrainR. This is because numerous sentences in FEVER are similar to the edits but differ by only a few words. To address this, we employ ChatGPT to re-rank the Top-K results of the retriever to evaluate whether the input needs modification. We select the inputs that satisfy one of the three conditions in Eq. (4) as hard data, construct the factual descriptions of the Top-K results as a K-item decision task, and let ChatGPT solve the problem. If there is no suitable answer, it returns 'None'. Appendix D shows the usage of ChatGPT. The results
with ChatGPT on both FEVER and ZsRE show that combining the results from the large model with our method can further enhance the accuracy of fact retrieval and identify modified and unmodified data more accurately, thus better maintaining the model's performance on the unedited data.

Editing on large LMs
We use GPT-2 XL (1.5B) to test the performance on a large-scale model. RASE can be flexibly combined with other editors and stably edit larger language models. The results are shown in Table 3. When modifying 1000 data points in ZsRE, ROME and MEMIT achieve impressive results, particularly on TrainR and TestR. However, FT and FT-MEM exhibit lower effectiveness, and MEND cannot perform any modifications. Moreover, these methods display poor performance when evaluating the results of previous edits. In contrast, RASE preserves the modified results well without compromising the original model's performance. Even on TrainR, RASE is very close to MEMIT. Although the overall performance in terms of generalization is relatively low, RASE demonstrates improvements compared to ROME. This further confirms that the fact-enhanced approach can improve the generalization of different editors.
As the number of edits expands to 5000, other methods exhibit a significant decline in effectiveness, particularly MEMIT. This is because MEMIT alters many parameters for each modification; as the edits accumulate, the original parameters of the model become disrupted, leading to the complete failure of the model. In contrast, RASE maintains favourable results, revalidating our approach's stability and sustainability.

Retrieval analysis
Retrieval architecture analysis. We compare different encoder architectures used as retrievers. The results are shown in Table 4. When directly using the BART model as the encoder, the model struggles to identify which data needs modification. The retriever of Izacard et al. (2022a) achieves good results, but it has a lower HIT metric, indicating that it can successfully retrieve similar data but struggles to accurately distinguish some similar but different samples. Using RoBERTa with the contrastive loss we propose yields a significant performance improvement. Furthermore, the -w/o L_S2F ablation shows that the margin-based similarity can further enhance the HIT value and improve the overall recognition rate.
Relation between retrieval augmentation and editing. Knowledge editing (KE) is an emerging and promising research area that aims to alter specific knowledge stored in the parameters of pre-trained models so that the model makes new predictions on the revised instances while keeping other, irrelevant knowledge unchanged. Retrieval augmented models design retrieval strategies to incorporate world knowledge into a fixed LLM through prompting. Both are re-training free and have been shown to be effective against knowledge staleness.
However, model editing methods prefer to edit LLM parameters directly to update the LM's knowledge, which makes it hard to preserve the other knowledge that is already correct. Meanwhile, retrieval augmented models may not guarantee that LMs always update their predictions when given retrieved knowledge, because LLMs may prioritize their own parametric knowledge and ignore the retrieved non-parametric knowledge, a phenomenon called knowledge conflict.
In this paper, we combine the advantages of retrieval augmentation and editing methods. The retrieval module identifies the edits, and the editing module corrects the erroneous data. In our setting, the editing module processes only one data point at a time, so the retrieval module has a greater impact on ER, TrainR, and TestR, while the editing module has a greater influence on SR and GR.

Cost analysis
Regarding editing efficiency, we note that the efficiency of the existing continuous editing method, T-Patcher, gradually decreases as the number of edits increases. Our tests found that using T-Patcher to modify 1000 edits on an A100 GPU takes 2-3 days. In contrast, RASE performs edits at a stable speed: it takes approximately 50 seconds to modify one data point. On GPT-2 XL, our efficiency remains consistent with ROME and MEMIT, but our method enables more sustained and reliable modifications. The time required to continuously modify 1000 edits is shown in Table 5.
Regarding extra memory cost, our retrieval augmented framework uses additional memory to store previous edits to achieve sequential model editing. For R-Patcher, we save extra neurons for each edit, while for R-ROME, we save weight offsets for each edit. In T-Patcher, a portion of the training data is sampled and stored to compute a memory loss that preserves locality. We estimate that storing 40,000 data points would require approximately 120MB of space, with additional costs for maintaining that memory. Similarly, ROME also requires around 160MB of additional storage to maintain locality. In contrast, we use memory only to save the previous edits: the cost for 1000 edits with R-ROME is 75MB, and 300MB for R-Patcher. Furthermore, the per-edit memory usage is a fixed cost that does not grow with the number of edits.
Regarding the editing layer, we follow the same settings as T-Patcher and ROME. For T-Patcher, we edit the FFN (Feed-Forward Network) in the last layer of the Transformer. For ROME, in GPT-2 XL, we edit the FFN in the 17th layer. However, as Hase et al. (2023) suggest, "Many layers could store a fact, and it happens that some do." Therefore, there may be better choices for continuous editing than repeatedly modifying a specific layer. It would be beneficial to explore more flexible editing strategies by incorporating interpretability in the future.

Case study
Figure 4 gives a sense of how RASE performs in practice: our approach can accurately modify X1-X3 without being affected by the content in memory K4, thus keeping X4 unchanged. On the other hand, while T-Patcher can also correct X1-X3, it makes a mistake on X4 due to the influence of X3. This is why we adopt retrieval enhancement instead of directly modifying the model parameters.

Knowledge Editing
Editing parametric knowledge is not as straightforward (Pan et al., 2023) as editing (Wang et al., 2009, 2010, 2014) knowledge graphs (Pan et al., 2017a,b). For editing parametric knowledge, a natural way is to use constrained fine-tuning to update parameters based on new examples (Zhu et al., 2020); other approaches use a hyper-network to predict the parameter shift. Meng et al. (2022, 2023) proposed a direct editing method and achieved great results on batch editing. More recently, some methods have developed external modules for edits and do not require access to base model parameters (Dong et al., 2022; Mitchell et al., 2022b). To better apply editing to real-world challenges, Huang et al. (2023) proposed the Sequential Model Editing task and trained Transformer-Patcher, which can edit the model up to a thousand times continuously.

Contrastive Learning
The key idea of contrastive learning (CL) is to pull semantically similar samples close and keep different samples apart (Hadsell et al., 2006;Chen et al., 2020).By employing contrastive learning objectives, Gao et al. (2021); Yan et al. (2021); Chuang et al. (2022); Zhou et al. (2022) fine-tuned the pretrained language models, resulting in significant advancements in learning unsupervised sentence embeddings.In order to alleviate the need for an annotated dataset, Gao et al. (2021); Liu et al. (2021) proposed a simple contrastive learning framework that used dropout noise within transformer layers to generate positive pairs.Nishikawa et al. (2022) proposed a contrastive learning method for learning sentence embeddings between sentences and their related entities sampled from Wikidata.

Retrieval-augmented language model
Retrieval augmentation can enhance language models' performance without significantly increasing parameters and computation (Tirumala et al., 2022; Mialon et al., 2023). Khandelwal et al. (2020) increased PLMs' memorization capacity by accessing memory units and an external look-up table. Borgeaud et al. (2022); Lazaridou et al. (2022); Izacard et al. (2022b) showed that retrieval improves performance across a variety of tasks such as question answering (Kwiatkowski et al., 2019), fact checking (Thorne et al., 2018b), and dialogue (Dinan et al., 2019). Mitchell et al. (2022b) proposed a memory-based approach for knowledge editing. Inspired by retrieval, we view the editing task as a retrieval and augmentation process: we construct a memory to store editing data, apply each modification through retrieval, and achieve stable and continuous editing. It is worth noting that editing is motivated by a list of bad cases of the form (question, wrong-answer, correct-answer), where the correct-answer is the knowledge mentioned above. Therefore, we only retrieve from the facts related to the data that needs to be modified, and we assume that all the erroneous cases and their corresponding correct answers are known.

Conclusion
This paper focuses on sequential model editing and proposes RASE, a retrieval augmented sequential model editing framework, to enhance the performance of existing editors in a plug-and-play manner and achieve efficient, stable, and expansible Sequential Model Editing. We construct a fact-patch memory in a self-supervised way and utilize the memory to enhance the model's continuous editing capability. During editing, we use fact information related to the modified data as a prompt to enhance the generalization of the editor. RASE achieves favourable results across different scales of language models and varying numbers of edits. Additionally, it can be flexibly applied to different editors, and integrating it with large language models like ChatGPT can further enhance editing performance.
In the future, on the one hand, we plan to investigate knowledge editing for more complex tasks, such as reasoning, and to explore how to better integrate retrieval methods with model editing. On the other hand, we might look into the connections between editing parametric knowledge and knowledge editing for uncertain knowledge graphs (Pan et al., 2005; Stoilos et al., 2006; Pan et al., 2007; Qi et al., 2007; Şensoy et al., 2013).
A Detail of the contrastive learning
To learn sentence and fact representations, we train two encoders Enc_f(·) and Enc_s(·) to obtain the fact embedding E_f and sentence embedding E_s respectively. We construct a minibatch containing L facts; for each fact f_i, there are S corresponding positive sentences s_j^i, i ∈ [1, L], j ∈ [1, S] 1, and any sentence unrelated to f_i, as well as any fact other than f_i, serves as a negative example for f_i. We train Enc_f(·) and Enc_s(·) using three contrastive losses and use RoBERTa (Liu et al., 2019) as the base PLM. For convenience, we use f and s to denote the embeddings E_f = Enc_f(f) and E_s = Enc_s(s) respectively.
Facts Contrastive Learning (F2F) aims to learn fact embeddings in an unsupervised way. Given two embeddings of the same fact with different dropout masks, f_i and f_i^+, the training loss is:

L_F2F = -log( e^{sim(f_i, f_i^+)/τ} / Σ_{j=1}^{L} e^{sim(f_i, f_j)/τ} ),

where τ is a temperature hyper-parameter and sim(·) is the cosine similarity.
Sentences Contrastive Learning (S2S) aims to learn sentence embeddings in an unsupervised way. Since there are S sentences in the same batch related to the same f_i, we treat the S sentences as a group and learn sentence representations with the loss:

L_S2S = -log( e^{sim(s_g, s_g^+)/τ} / Σ_{j=1}^{L} e^{sim(s_g, s_j)/τ} ),

where s_g = Pool(s_1^i, s_2^i, ..., s_S^i) is the mean representation of a group of inputs related to the same fact f_i, Pool(·) is the mean-pooling function, and s_g^+ is the dropout representation of s_g.
Contrastive Learning between Sentence and Fact (S2F) pulls the embeddings of inputs and their related fact close while keeping unrelated facts apart. We use a margin-based similarity to increase the similarity of negative samples and weaken the similarity of positive samples, ensuring that the similarity of positive samples is 1 higher than a certain threshold:

L_S2F = -log( e^{Ps(s_i^*, f_i)/τ} / ( e^{Ps(s_i^*, f_i)/τ} + Σ_{j≠i} e^{Ns(s_i^*, f_j)/τ} ) ),

where * ∈ [1, S], and m, n are the margin values for negative and positive pairs. Ps(·) and Ns(·) denote the positive and negative similarity scores respectively. The positive score is weakened by the margin n:

Ps(s_i^*, f_i) = cos(s_i^*, f_i) - n.

For negative pairs, we apply a punishment when h = cos(s_i^*, f_i) - cos(s_i^*, f_j) is less than m:

Ns(s_i^*, f_j) = cos(s_i^*, f_j) + m if h < m; cos(s_i^*, f_j) otherwise.   (10)

In summary, our total loss for contrastive learning is:

L = λ_1 L_F2F + λ_2 L_S2S + λ_3 L_S2F,

where the λ_i are hyper-parameters.

1 As shown in Figure 3, the fact is "Iphone 14 is developed by Apple Inc.", the positive sentence is "Iphone 14 was created by?", and the negative sentence is "What club does Messi play for?".
We train the encoders separately on the ZsRE and FEVER datasets. It is worth noting that the dataset used for training the encoders and the dataset used for editing are sampled separately, and the encoders are pre-trained before the editing process; during editing, we only use them for inference. We set L to 64 and the maximum value of S to 5.
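As a concrete reference, the F2F objective above can be sketched in a few lines of NumPy. This is a minimal illustration in which toy orthogonal vectors stand in for encoder outputs; the function name and batch construction are ours, not part of the method:

```python
import numpy as np

def cosine(a, b):
    # Cosine similarity between two vectors.
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def f2f_loss(F, F_pos, tau=0.05):
    """InfoNCE-style contrastive loss over a batch of fact embeddings.

    F[i] and F_pos[i] are two encodings of the same fact obtained with
    different dropout masks; all other rows of F_pos act as negatives.
    """
    L = len(F)
    per_fact = []
    for i in range(L):
        sims = np.array([cosine(F[i], F_pos[j]) / tau for j in range(L)])
        # Negative log-softmax of the positive pair's similarity.
        per_fact.append(-sims[i] + np.log(np.exp(sims).sum()))
    return float(np.mean(per_fact))

# Toy batch: orthogonal unit vectors stand in for encoder outputs.
F = np.eye(4, 8)
aligned = f2f_loss(F, F.copy())             # positives line up
shuffled = f2f_loss(F, np.roll(F, 1, 0))    # positives misaligned
print(aligned < shuffled)  # True: aligned views yield the lower loss
```

The S2S and S2F terms follow the same pattern, swapping in the group representation s_g and the margin-based scores Ps and Ns, respectively.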

B Details of the base editors
Transformer-Patcher In this paper, following Huang et al. (2023), we use Transformer-Patcher as a base editor because it only requires training a small number of neurons for each edit and inserting them into a designated layer of the Transformer. In contrast to Huang et al. (2023), the patch used in our paper does not require additional memory to store training data to satisfy locality. Moreover, to enhance the generalization of the editor, we propose a fact-guided editor: we concatenate the factual information obtained through retrieval with the input data as a prompt to improve the generalization of the editing process.
For an edit (x_e, y_{x_e}) and its fact description F_{x_e}, the input for the edit is X_e = [x_e; F_{x_e}], and we then add a few neurons to the FFN of specific Transformer layers to alter the output for the edit. Formally, let h denote the hidden state of X_e. A standard FFN in a Transformer block computes
FFN(h) = Act(h W_k + b_k) W_v + b_v,
where Act(·) is a non-linear activation function such as ReLU, W_k ∈ R^{d×d_1} and W_v ∈ R^{d_1×d} are the weight matrices of the two linear layers in the FFN, and b_k and b_v are two bias vectors. After adding the extra neurons (patch keys K_n, patch bias b_n, and patch values V_n), the FFN becomes
FFN'(h) = Act(h [W_k; K_n] + [b_k; b_n]) [W_v; V_n] + b_v,
which produces the edited output u_{x_e}. After training, we use the target label y_{x_e} and the edited output u_{x_e} to calculate the loss
l_e = L(u_{x_e}, y_{x_e}),
where L(·) is a loss function (e.g., the cross-entropy loss).
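A minimal NumPy sketch may make the patch mechanics concrete. Shapes and variable names here are illustrative, and in the real editor the patch parameters (K_n, b_n, V_n) are trained by gradient descent rather than set by hand:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def ffn(h, W_k, b_k, W_v, b_v):
    # Standard Transformer FFN: Act(h W_k + b_k) W_v + b_v
    return relu(h @ W_k + b_k) @ W_v + b_v

def patched_ffn(h, W_k, b_k, W_v, b_v, k_n, b_n, v_n):
    # Append one patch neuron: key k_n (extra column of W_k),
    # bias b_n, and value v_n (extra row of W_v).
    W_k2 = np.concatenate([W_k, k_n[:, None]], axis=1)   # d x (d1 + 1)
    b_k2 = np.append(b_k, b_n)
    W_v2 = np.concatenate([W_v, v_n[None, :]], axis=0)   # (d1 + 1) x d
    return relu(h @ W_k2 + b_k2) @ W_v2 + b_v

d, d1 = 4, 6
rng = np.random.default_rng(1)
W_k, b_k = rng.normal(size=(d, d1)), np.zeros(d1)
W_v, b_v = rng.normal(size=(d1, d)), np.zeros(d)
h = rng.normal(size=d)

# A patch whose key never fires (ReLU input stays negative)
# leaves the original FFN output untouched.
out = patched_ffn(h, W_k, b_k, W_v, b_v,
                  np.zeros(d), -1.0, rng.normal(size=d))
print(np.allclose(out, ffn(h, W_k, b_k, W_v, b_v)))  # True
```

The final check illustrates the locality property: a patch neuron whose key does not activate on an input contributes nothing to the FFN's output on that input.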
Rank-One Model Editing (ROME) (Meng et al., 2022) applies a rank-one edit to the down-projection matrix of an MLP layer in the model. It views a linear operation in the Transformer with parameter W as a key-value store for a set of vector keys K = [k_1, k_2, ..., k_n] and corresponding vector values V = [v_1, v_2, ..., v_n], satisfying W K ≈ V, where each pair (k_i, v_i) encodes a question and its answer. For a new key-value pair (k_*, v_*), we can insert the new data into the target layer by solving the constrained optimization of minimizing ||Ŵ K − V|| subject to Ŵ k_* = v_*, whose closed-form solution is
Ŵ = W + Λ (C^{-1} k_*)^T,
where W is the weight matrix of the original linear layer, C = K K^T is a constant that we pre-cache by estimating the uncentered covariance of k from a sample of Wikipedia text, and Λ = (v_* − W k_*) / ((C^{-1} k_*)^T k_*) is a vector proportional to the residual error of the new key-value pair on the original memory matrix. k_* is the average representation over a small set of texts ending with the subject s, and v_* is learned by optimization. In this paper, we use ROME as one of the base editors because it requires no extra training and edits each datum efficiently.
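The closed-form update can be sketched directly from the formula above. This is an illustrative NumPy version with random stand-ins for the pre-cached covariance and for the learned pair (k_*, v_*):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 6
W = rng.normal(size=(d, d))      # original linear weight (maps keys to values)
K = rng.normal(size=(d, 50))     # sample keys used to estimate C
C = K @ K.T                      # uncentered covariance K K^T (pre-cached)
k_star = rng.normal(size=d)      # key representing the new fact's subject
v_star = rng.normal(size=d)      # target value encoding the new answer

Cinv_k = np.linalg.solve(C, k_star)               # C^{-1} k*
Lam = (v_star - W @ k_star) / (Cinv_k @ k_star)   # rescaled residual error
W_hat = W + np.outer(Lam, Cinv_k)                 # rank-one edit

print(np.allclose(W_hat @ k_star, v_star))  # True: the new pair is stored
```

Because the update is the outer product of two vectors, W_hat − W has rank one; W_hat maps k_* exactly to v_*, while the C^{-1} factor keeps the perturbation small along directions covered by the pre-cached keys.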

C Memory maintenance and usage
To further enhance the efficiency of memory retrieval, we employ the following methods to maintain the memory.
1) Adding: When a new edit has never appeared in the memory, we store a corresponding key-value pair. The key is the fact embedding of the edit, encoded by the fact encoder Enc_f(·), and the value is the editing operation together with the fact description for the edit.
2) Updating: When the key of a new edit is similar or identical to an existing key in the memory M, we merge the corresponding modification data with the existing entry in the memory and train them together.
After the facts-patch memory is constructed, we use the query module (Section 3.2) to judge whether each input needs to be edited. If not, the input is processed by the original LM. If the input corresponds to an edit, we apply the corresponding value to the LM temporarily: for T-Patcher, we insert the trained extra neurons into the FFN of the Transformer's last layer; for ROME, we use k_* and v_* to calculate the parameter shift ΔW and replace the LM parameters W of layer l by W = W + ΔW. Once the computation for the current edit is completed, the model is reset. Therefore, as long as we can successfully identify which data requires modification, our method ensures that the model's performance on unmodified data remains unaffected.
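The add/update/query logic can be sketched as follows. The similarity threshold (0.9), the class name, and the overwrite-style merge are illustrative simplifications of ours; the actual system merges the modification data and retrains, as described above:

```python
import numpy as np

class FactPatchMemory:
    """Toy facts-patch memory: keys are fact embeddings, values hold the
    edit operation (e.g. a trained patch or a (k*, v*) pair) plus the
    fact description. The 0.9 threshold is an illustrative choice."""

    def __init__(self, threshold=0.9):
        self.keys, self.values = [], []
        self.threshold = threshold

    def _best(self, emb):
        # Return the index and cosine similarity of the closest key.
        if not self.keys:
            return None, -1.0
        sims = [emb @ k / (np.linalg.norm(emb) * np.linalg.norm(k))
                for k in self.keys]
        i = int(np.argmax(sims))
        return i, sims[i]

    def add_or_update(self, fact_emb, edit_value):
        i, sim = self._best(fact_emb)
        if sim >= self.threshold:
            self.values[i] = edit_value     # merge: here, simply overwrite
        else:
            self.keys.append(fact_emb)      # a genuinely new fact
            self.values.append(edit_value)

    def query(self, input_emb):
        # Return the edit to apply, or None to run the original LM.
        i, sim = self._best(input_emb)
        return self.values[i] if sim >= self.threshold else None

mem = FactPatchMemory()
f1 = np.array([1.0, 0.0, 0.0])
mem.add_or_update(f1, "patch-1")
print(mem.query(np.array([0.99, 0.01, 0.0])))  # close to f1 -> patch-1
print(mem.query(np.array([0.0, 1.0, 0.0])))    # unrelated    -> None
```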

D Constructing questions through ChatGPT
For an input s: "Brigitte Macron is married to someone who is President of the French Republic.", we use our score function (cf. Eq. (3)) to select the Top-K facts, yielding: Facts = ["Brigitte Macron is the fiancee of Emmanuel Macron.", "Brigitte Macron is engaged.", "Brigitte Macron was born on April 23, 1953.", "Peggy Sue Got Married is a 1933 American film."] with the similarity scores [0.939, 0.917, 0.731, 0.359, 0.354, 0.346].
However, the fact "Brigitte Macron is the fiancee of Emmanuel Macron." is not equivalent to "Brigitte Macron is married to someone who is President of the French Republic.", so our method should not edit this input. Yet, due to its similarly high score, our query function chooses the first fact as the key and edits this data, influencing the result of the language model. For this instance, we construct a multiple-choice question as a prompt: Question: Which of the following sentences expresses the same meaning as the sentence "Brigitte Macron is married to someone who is President of the French Republic."? If there is no answer, reply "None".
The options are the retrieved Top-K facts. Then, we utilize ChatGPT to answer the question and combine the answer, as a retrieval result, with our framework, enabling a better assessment of whether the data needs modification.
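Assembling such a prompt is plain string formatting; a small sketch, where the helper name and the option lettering are our own choices:

```python
def build_mc_prompt(query, facts):
    """Build the multiple-choice prompt used to disambiguate
    near-duplicate retrievals (wording follows the paper's example)."""
    options = "\n".join(f"{chr(65 + i)}. {f}" for i, f in enumerate(facts))
    return (
        "Question: Which of the following sentences expresses the same "
        f'meaning as the sentence "{query}"? '
        'If there is no answer, reply "None".\n'
        f"{options}"
    )

prompt = build_mc_prompt(
    "Brigitte Macron is married to someone who is President of the "
    "French Republic.",
    ["Brigitte Macron is the fiancee of Emmanuel Macron.",
     "Brigitte Macron is engaged."],
)
print(prompt.splitlines()[-1])  # -> B. Brigitte Macron is engaged.
```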

E Details of Experiments setting
We evaluate our models on the fact-checking dataset FEVER (Thorne et al., 2018a) and the Zero-Shot Relation Extraction (ZsRE) dataset (Levy et al., 2017). We apply the BERT-base model (Devlin et al., 2019) to the FEVER dataset. For ZsRE, we apply the BART-base model (Lewis et al., 2020). We use the same data splits for both datasets as Huang et al. (2023): we use the original validation set as D_test, and the original D_Train is split into an edit set D_edit, a new training set D′_train, and a new validation set D_val in a ratio (0.8:0.8:0.1 for FEVER and 0.9:0.075:0.025 for ZsRE).
We denote the initial model as f_0, which is trained on D′_train and validated on D_val. The model f_0 is then sequentially edited on D_edit. For FEVER, the accuracy of the initial model is 87.6% on the edit dataset and 94.6% on the training dataset, and we obtain 10,496 instances for the edit dataset, of which about 1,300 are mistakes. For ZsRE, the accuracy of the initial model is 47.1% on the edit dataset and 56.9% on the training dataset; as a result, we obtain 5,352 instances for the edit dataset, of which about 2,800 are mistakes. For both datasets, we randomly sample a subset of size 10,000 from D′_train as D_tr, and we train the fact encoder and sentence encoder on D′_train without D_tr, validating on D_val. Suppose there are T mistakes in D_edit and let I(·) denote the indicator function; after editing the t-th edit example (x_t, y_{x_t}), we obtain a post-edit model f_t. We use the following metrics to evaluate our method.
Success Rate (SR): For each edit, we test whether the post-edit model f_t outputs the desired prediction:
SR = \frac{1}{T} \sum_{t=1}^{T} I(f_t(x_t) = y_{x_t}).
Edit Retain Rate (ER): After editing T edits, we evaluate how many past edits are retained by the final model f_T:
ER = \frac{1}{T} \sum_{t=1}^{T} I(f_T(x_t) = y_{x_t}).
Generalization Rate (GR): After editing T edits, we evaluate whether f_T succeeds on the equivalence set of each edit example in D_edit:
GR = \frac{1}{T} \sum_{t=1}^{T} \frac{1}{N_t} \sum_{x} I(f_T(x) = y_{x_t}),
where the inner sum runs over the N_t inputs equivalent to the t-th edit example.
Training Retain Rate (TrainR): After editing T edits, we compare the performance of the final model f_T and the initial model f_0 on the sub-dataset D_tr randomly sampled from D′_train:
TrainR = \frac{\sum_{(x,y)\in D_{tr}} I(f_T(x) = y)}{\sum_{(x,y)\in D_{tr}} I(f_0(x) = y)}.
Test Retain Rate (TestR): After editing T edits, we compare the performance of the final model f_T and the initial model f_0 on the original validation set D_test:
TestR = \frac{\sum_{(x,y)\in D_{test}} I(f_T(x) = y)}{\sum_{(x,y)\in D_{test}} I(f_0(x) = y)}.
Our baselines include: • Fine-Tuning (FT) (Zhu et al., 2020): Directly fine-tunes the model on the edit example.
• FT with KL divergence (FT+KL) (Zhu et al., 2020): Fine-tunes the model on the edit example with an extra Kullback-Leibler (KL) divergence constraint.
• MEND (Mitchell et al., 2022a): Uses a hypernetwork to learn a parameter shift and then applies it to the model.
• SERA (Mitchell et al., 2022b): A variant of a memory-based model editor, provided by Huang et al. (2023).
• ROME (Meng et al., 2022): A locate-and-edit method for decoder-only models.
• MEMIT (Meng et al., 2023): An extension of ROME that modifies the MLP weights of a range of critical layers.
• T-Patcher (Huang et al., 2023): A sequential editing method that adds and trains a few neurons in the model.

F Dataset Samples
Figure 5 shows an example from ZsRE. For the triplet involved in the question, "<Cari Lekebusch, date_of_birth, 1972>", we convert it into a question-like format "Cari Lekebusch || When did Cari Lekebusch get born? || 1972" and use it as the factual description. When using facts to enhance generalization, we remove the label: "Cari Lekebusch || When did Cari Lekebusch get born?". Figure 6 shows an example from FEVER. The structure for FEVER is similar to ZsRE, but the difference is that we construct fact_rep_use based on the claims and actual facts from FEVER. E.g., for the fact <Amerigo Vespucci, place_of_birth, Italian>, we use the fact representation "Amerigo Vespucci || Amerigo Vespucci was Spanish. || False", because the claim is "Amerigo Vespucci was Spanish.".
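The triplet-to-representation conversion can be sketched as a one-line formatter. The function name is ours, and the spacing around "||" may differ slightly from the actual data files:

```python
def fact_rep(subject, question, answer, keep_label=True):
    """Render a triplet-derived fact as "subject || question || answer",
    matching the ZsRE example; drop the label when the fact is only used
    to prompt the editor for better generalization."""
    parts = [subject, question] + ([str(answer)] if keep_label else [])
    return " || ".join(parts)

full = fact_rep("Cari Lekebusch", "When did Cari Lekebusch get born?", 1972)
print(full)      # -> Cari Lekebusch || When did Cari Lekebusch get born? || 1972
no_label = fact_rep("Cari Lekebusch", "When did Cari Lekebusch get born?",
                    1972, keep_label=False)
print(no_label)  # -> Cari Lekebusch || When did Cari Lekebusch get born?
```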

Figure 1 :
Figure 1: Comparison of RASE with other methods. Other methods (a) involve continuous modification of the PLM's parameters. However, the efficiency of modification decreases as the number of edits grows, resulting in poor performance for SME. Our method (b) leverages a fact retrieval framework, guaranteeing consistent modification efficiency regardless of the number of edits. This improvement enhances both the scalability and performance of SME. Furthermore, by leveraging factual information relevant to each edit, our retrieval method better adapts to SME scenarios.

Figure 2 :
Figure 2: Illustration of RASE. Figure (a) illustrates the sentence encoder and fact encoder, trained in a self-supervised way. Figure (b) shows the memory construction process: we use the fact embeddings as keys and the parameter shifts for edits as values. Figure (c) presents the retrieval framework, where the query module retrieves information from the memory for each input. Based on the retrieval results, we make appropriate modifications to the PLM, correcting the data that needs to be modified.

Figure 3 :
Figure 3: Fact-aware contrastive learning method: we represent the inputs (in blue), their equivalents (in green), and corresponding facts (in orange) as positives, while representing other facts and sentences (in gray) as negatives.Our goal is to pull together the positive pairs and push away the negative pairs by training two Encoders.


Table 1 :
Results on a small number of edits. R-Patcher denotes RASE-Patcher; +Pt denotes the results after we use the fact information; +Eq denotes that we add the equivalent inputs during editing on top of R-Patcher+Pt.

Table 2 :
Results on a large number of edits. N denotes the number of edits; +ChatGPT means we use ChatGPT to enhance our model (+Pt); "-" means we use D_tr as the edited dataset, so we do not calculate the retain score on D_tr.

Table 3 :
Editing results on GPT-2 XL. FT-MEM means we include previously modified data and newly modified data as inputs for fine-tuning. R-ROME means we combine RASE with ROME. It is worth noting that in TrainR and TestR, some results exceed 1 because the modified data unintentionally corrects erroneous data that was not itself modified.

Table 4 :
Results with different encoders. E&E and N&N mean the model successfully identifies the edits and the unmodified data, respectively. EBN denotes misidentifying edits as unmodified data, while NBE denotes the opposite. HIT denotes the ratio of successful retrievals of the correct key-value by the model.

Table 5 :
Time cost of different editors on 1000 edits.