Editing Common Sense in Transformers

Editing model parameters directly makes it possible to update open-source transformer-based models without re-training (Meng et al., 2023). However, these editing methods have only been evaluated on statements of encyclopedic knowledge with a single correct answer. Commonsense knowledge with multiple correct answers, e.g., an apple can be green or red but not transparent, has not been studied, yet it is just as essential for enhancing transformers' reliability and usefulness. In this paper, we investigate whether commonsense judgments are causally associated with localized, editable parameters in Transformers, and we provide an affirmative answer. We find that directly applying the MEMIT editing algorithm results in sub-par performance, and we improve it for the commonsense domain by varying edit tokens and improving the layer selection strategy, yielding $MEMIT_{CSK}$. GPT-2 Large and XL models edited using $MEMIT_{CSK}$ outperform the best fine-tuned baselines by 10.97% and 10.73% F1 on the PEP3k and 20Q datasets, respectively. In addition, we propose a novel evaluation dataset, PROBE SET, which contains unaffected and affected neighborhoods, affected paraphrases, and affected reasoning challenges. $MEMIT_{CSK}$ performs well across these metrics, while fine-tuning baselines show significant trade-offs between unaffected and affected metrics. These results suggest a compelling future direction for incorporating feedback about common sense into Transformers through direct model editing.


Introduction
Transformer-based language models (LMs) have achieved great success in NLP (Brown et al., 2020), but they still exhibit factual mistakes (Lewis et al., 2020; Shuster et al., 2021), commonsense mistakes (Bender and Koller, 2020; Marcus, 2021; Talmor et al., 2019; Bhargava and Ng, 2022), and consistency errors (Tam et al., 2022; Devaraj et al., 2022; Weng et al., 2020). Retraining or finetuning LMs to overcome these errors is costly and uninterpretable. To address this, prior research (Meng et al., 2022, 2023) has shown that model predictions often correlate strongly with certain neuron activations, and that parameter editing methods can effectively correct encyclopedic factual mistakes.

Figure 1: Proposed framework, MEMIT CSK, for editing and evaluating plausible commonsense knowledge in Transformers. Given a plausible <Subject, Verb, Object> commonsense statement, MEMIT CSK edits parameters at different token and layer locations (described in §3). The edited model is evaluated for semantic generalization (depicted in the dark blue box) and configuration generalization, defined in §3.
However, it remains unclear whether these editing methods can scale beyond encyclopedic facts to fix commonsense errors in Transformers. Commonsense knowledge involves more uncertainty and variation than encyclopedic knowledge. Consider a subject-verb-object triple (s, v, o). In the encyclopedic domain, s and v often map to one "o", e.g., the Eiffel Tower is located in the city of "Paris". In contrast, commonsense knowledge is harder to enumerate, and s and v can map to many "o", e.g., an apple has colors that can plausibly be "green", "red", "yellow", "white", and their interpolation. We aim to answer (i) whether commonsense plausibility information is also localized in specific hidden states of Transformers, and if so, (ii) whether model editing on those units can effectively repair incorrect commonsense plausibility judgments.
To this end, we focus on the subject-verb-object binary plausibility classification task, utilizing two commonsense datasets: 20 Questions (20Q; Porada et al., 2021) and Physical Event Plausibility (PEP3k; Wang et al., 2018). We perform causal mediation analysis (Pearl, 2001; Vig et al., 2020; Meng et al., 2022) on GPT-2 Large and XL models and their fine-tuned checkpoints (Base-finetuned models) at various part-of-speech locations. While the zero-shot models perform poorly on the task and exhibit no causal pattern, we find clear causal associations between predictions and localized parameters at subject, verb, and object locations in the Base-finetuned models. We then investigate whether we can edit relevant parameters in the Base-finetuned models to correct their mistakes. While directly applying the MEMIT editing algorithm (Meng et al., 2023) to edit subject tokens results in sub-par performance, we extend MEMIT to MEMIT CSK by editing various token locations and improving the edit layer selection strategy.
We demonstrate the advantage of MEMIT CSK over fine-tuning the model ("repair-finetuning") from two angles: semantic generalization and configuration generalization. Semantic generalization requires that commonsense judgments are repaired while their paraphrases, neighbors, and reasoning-based queries are also answered correctly: some should be affected and others unaffected by the editing. We create a PROBE SET for the 20Q and PEP3k datasets containing efficacy, unaffected neighborhood, affected neighborhood, affected paraphrase, and affected reasoning challenges. We also evaluate configuration generalization for each method to determine whether a strategy (hyperparameter combination) picked on an EDIT VALIDATION SET can achieve good performance on a separate EDIT SET. Our proposed framework for editing and evaluating commonsense knowledge in transformers is depicted in Fig. 1.
Our contributions are five-fold. (1) We show strong causal associations between commonsense judgments and localized parameters in Base-finetuned GPT-2 Large and XL models. (2) We extend the MEMIT editing algorithm to MEMIT CSK by varying edit tokens and improving the edit layer selection strategy, resulting in 4.58% and 1.99% F1 improvements for GPT-2 XL on the EDIT VALIDATION SET of PEP3k and 20Q. (3) GPT-2 XL edited by MEMIT CSK outperforms repair-finetuned baselines by 10.97% and 10.73% F1 on the EDIT SET of PEP3k and 20Q, exhibiting favorable configuration generalization. (4) GPT-2 XL edited by MEMIT CSK performs well across the affected and unaffected metrics in our constructed PROBE SET for semantic generalization, while fine-tuned baselines exhibit significant trade-offs between unaffected and affected metrics. (5) We show that edited models achieve clearer associations between judgments and localized parameters on previously incorrectly predicted samples, solidifying the correlation between causal analyses and performance. These results suggest a compelling future direction of incorporating feedback about common sense in transformers on the fly through direct model editing.

Background
The MEMIT (Mass Editing Memory in a Transformer) method proposed by Meng et al. (2023) demonstrates its effectiveness in editing up to 10,000 factual associations in transformer models on zsRE (Levy et al., 2017) and their proposed COUNTERFACT dataset, designed to test factual recall. We describe some background here but otherwise refer the reader to Appendix A.1 and Meng et al. (2022, 2023) for a more detailed description.

Causal Tracing
Given a model, the method takes a concatenation of subject s and verb v as input prompt x and predicts the corresponding object o as prediction y. For a correctly-predicted (x, y) pair, causal tracing consists of the following three steps: Clean run - the input prompt is provided to the model and the predicted probability of the correct object, P[y], is calculated; Corrupted run - the subject tokens are corrupted with noise and the corresponding probability of the ground truth object, P*[y], is computed; Corrupted-with-restoration run - the same corrupted input is given, but at a certain token i and layer l, the model is forced to output the clean state activation h_i^(l) from the clean run. In this setting, the probability of the correct object is denoted P*_clean[y]. The difference P*_clean[y] - P*[y], averaged over samples, is the average indirect effect (AIE) of the hidden state h_i^(l); a high AIE indicates that the state is strongly causally associated with the prediction.

Severed Causal Tracing: To disentangle the impact of MLP and attention in each layer, MEMIT analyzed the effect on the attention layer by fixing the MLP output at its corrupted-run value, so that it is unaffected when inserting the clean state h_i^(l). This can be viewed as severing the MLP effect when analyzing the effect of attention. Similarly, the attention layers can be severed.

Memory Editing
MEMIT identified the crucial parameters significantly impacting the model's prediction through causal tracing. They selected the layer with the highest AIE and its preceding layers as the edit layers R. We extend MEMIT's editing strategy, described in Meng et al. (2023), to the commonsense domain.

Method
We now set out to investigate our main research question: is commonsense plausibility information also localized in specific MLP hidden states of an LM, and, if so, can MEMIT-style editing effectively repair incorrect commonsense plausibility judgments?
To investigate this, we conduct experiments that address important sub-questions, focusing specifically on the commonsense plausibility task (Porada et al., 2021). The task is to predict a label y ∈ {True, False} given an input triple x = (s, v, o). An example can be seen in Fig. 1.

Is high task performance needed to achieve a strong causal tracing result?
Because model parameter editing relies on selecting a token and layer position based on the maximum AIE, we hypothesize that model performance may impact the resulting causal tracing graph. In particular, since a model that performs near-random on a task will also perform close-to-random during a corrupted run, overall AIEs may be low as a result. This relationship has not been investigated in prior work: in contrast to the factual encyclopedic datasets used in previous studies, the zero-shot performance of language models on the commonsense plausibility task can be poor. Thus, we perform causal tracing on commonsense datasets in two experimental settings: zero-shot (Meng et al., 2022), and after fine-tuning models on plausibility tasks; we refer to this fine-tuning as base-finetuning.
Do the part-of-speech and model layer locations affect causal tracing conclusions and edit success?
Prior work on editing encyclopedic knowledge focuses on subject corruption and editing, since factual knowledge is mostly associated with the subject and the object is directly predicted. In contrast, common sense and plausibility judgments depend on each element of the sentence. Therefore, we analyze three types of corruption and edit locations: subject, verb, and object. MEMIT (Meng et al., 2023) edits a five-layer window whose last layer has the highest AIE in the severed causal graph. This strategy only considers the last layer's effect and ignores all the other layers in the window. To mitigate this, we treat the edit layers as a hyperparameter and search over MEMIT's five-layer window as well as the window with the maximum moving average of AIE. A detailed explanation of our layer selection strategy is presented in Appendix A.7. We denote our modified editing method, with varying edit tokens and a more robust layer selection strategy, as MEMIT CSK.

Does MEMIT CSK exhibit configuration generalization?
Prior work on model editing tunes hyperparameters and reports the performance of editing algorithms on the same data splits. We study configuration generalization: whether editing hyperparameters pre-selected on some data can be effectively transferred to an unseen data split. The motivation is that running parameter sweeps on new data points for editing can be time-consuming and costly. Since commonsense knowledge is innumerable, it is favorable if users may provide contextual feedback to change model behaviors on the fly using pre-selected hyperparameters. We thus create an EDIT VALIDATION SET and an EDIT SET for each dataset. We select hyperparameters on the EDIT VALIDATION SET and study the transferability of the best-found setting of MEMIT CSK and the repair-finetuning baselines to the EDIT SET (§5.3).

Does MEMIT CSK exhibit semantic generalization?
It is not enough to report the success of a direct editing method on the original dataset, since edit methods can (and should) have propagational effects on instances beyond the dataset (Meng et al., 2022). To compare and assess semantic generalization of updates, we augment incorrectly predicted samples with neighborhood instances and paraphrases that should be affected by an edit, similar to prior fact editing work. We additionally include neighborhood instances that should not be affected. Performance on the unaffected neighborhood measures the update's specificity, while performance on the affected neighborhoods and affected paraphrases indicates its generalization. Additionally, editing the plausibility of a commonsense statement should affect reasoning chains involving that statement. Entities and knowledge are interconnected, often requiring updates to one component of commonsense knowledge when modifying another. To this end, we add a fourth category of augmentations, affected reasoning, to test whether successful edits correct aspects of a model's commonsense reasoning. The augmentations, which form the PROBE SET, are excluded during editing and used solely for evaluation purposes. We provide examples in Fig. 1 and Table 1.

Does MEMIT CSK outperform finetuning for repairing commonsense knowledge?
To answer our main research question, we compare MEMIT CSK applied to the MLP hidden states most strongly identified by our causal tracing experiments against finetuning baselines, which we refer to as repair-finetuning. We compare both methods' performance on edit efficacy (how many incorrect predictions are fixed), overall F1 score, relapse (how much the edit hurts by changing previously correct predictions), and semantic generalization metrics. Unlike prior work, we also investigate whether such improvements hold when hyperparameters are pre-selected on a separate split (configuration generalization, §3.3).

Experimental Setup

Models
We perform experiments on GPT-2 Large and XL (Radford et al., 2019). We finetune checkpoints from Huggingface Transformers (Wolf et al., 2020). The EDIT VALIDATION SET is created by splitting the validation data from Porada et al. (2021) into an 80%-20% split. The EDIT SET is created using the test set from Porada et al. (2021). Because both datasets' instances are unnatural (e.g., "man swallow paintball"), we use GPT-3 text-davinci-003 to reformat them into natural language while retaining the (s, v, o) format, e.g., "A man swallows a paintball". More details and dataset statistics are in Appendix A.2.
We report three metrics on the EDIT VALIDATION SET and EDIT SET: F1 Score (↑), a measure of overall performance; Efficacy (↑), the percentage of previously-incorrect predictions which are corrected by an update method; and Relapse (↓), the percentage of instances which were previously predicted correctly but are now predicted incorrectly following an update.
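As a minimal illustration of how these three metrics relate (a hypothetical sketch, not the authors' evaluation code), they can be computed from the pre-update predictions, post-update predictions, and gold labels:

```python
def f1_score(preds, gold, pos=True):
    """Binary F1 with `pos` as the positive class, in percent."""
    tp = sum(p == pos and g == pos for p, g in zip(preds, gold))
    fp = sum(p == pos and g != pos for p, g in zip(preds, gold))
    fn = sum(p != pos and g == pos for p, g in zip(preds, gold))
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    return 200.0 * prec * rec / (prec + rec) if prec + rec else 0.0

def edit_metrics(pre, post, gold):
    """Efficacy: % of previously-wrong predictions fixed by the update.
    Relapse: % of previously-right predictions broken by the update."""
    fixed = sum(1 for p, q, g in zip(pre, post, gold) if p != g and q == g)
    wrong = sum(1 for p, g in zip(pre, gold) if p != g)
    broken = sum(1 for p, q, g in zip(pre, post, gold) if p == g and q != g)
    right = len(gold) - wrong
    efficacy = 100.0 * fixed / wrong if wrong else 100.0
    relapse = 100.0 * broken / right if right else 0.0
    return efficacy, relapse

# Toy example: the update fixes both mistakes and breaks nothing.
gold = [True, True, False, False]
pre  = [True, False, False, True]
post = [True, True, False, False]
```

An ideal update drives efficacy toward 100% and relapse toward 0% while raising F1; a high-efficacy but high-relapse update (as observed for RFT Fixed Epoch) can still lower F1.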

Constructing the PROBE SET
For the subset of the EDIT SET that was incorrectly predicted by both the GPT-2 Large and XL Base Models, we augment each instance, using GPT-3, with neighborhood instances that should or should not be affected by an edit that fixes the incorrect prediction (details in Appendix A.9). We combine the incorrectly predicted instances from the EDIT SET and the per-instance augmentations to form the PROBE SET for evaluating semantic generalization. Dataset examples are in Table 1 and statistics in Appendix A.2.

Unaffected Neighborhood. To evaluate the specificity of the edits, for each (s, v, o), we generate a set of relevant but different instances (s′, v, o) and (s, v, o′) that should not change when (s, v, o) is edited. The metric measures the percentage of post-update predictions arg max P(s′, v, o) and arg max P(s, v, o′) that remain equivalent to pre-update predictions.
Affected Neighborhood. To assess the impact of changes on similar-meaning prompts, for each (s, v, o), we generate a set of synonyms as (s′, v, o), (s, v′, o) and (s, v, o′). The score measures the percentage of post-update predictions arg max P(s′, v, o), arg max P(s, v′, o) and arg max P(s, v, o′) which are equal to the ground truth label for (s, v, o).
Affected Paraphrase. To evaluate the impact on synonymous prompts, we generate a set of paraphrases as (s′, v′, o′). Since paraphrases should also be successfully edited, the metric is the percentage of post-update predictions arg max P(s′, v′, o′) which are equal to the ground truth label for (s, v, o).
Affected Reasoning. To assess the updated model's connectivity, we generate a two-step chain of valid reasoning prompts {R1, R2}. For instance, for the phrase "Furnishings do not make noise", R1 could be "Furnishings are inanimate objects" and R2 = "Inanimate objects cannot make noise". The metric is the percentage of post-update predictions arg max P(R1) and arg max P(R2) which are equal to the True label.
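All four probe metrics reduce to two simple scores: agreement with the gold label (affected neighborhoods, paraphrases, and reasoning) and agreement with the pre-update prediction (unaffected neighborhoods). A small sketch with hypothetical predictions, not taken from the actual PROBE SET:

```python
def affected_score(post_preds, gold_label):
    """% of post-update predictions matching the gold label; used for
    affected neighborhoods, paraphrases, and reasoning prompts."""
    return 100.0 * sum(p == gold_label for p in post_preds) / len(post_preds)

def unaffected_score(pre_preds, post_preds):
    """% of unaffected-neighborhood predictions left unchanged by the
    update; measures the edit's specificity."""
    return 100.0 * sum(a == b for a, b in zip(pre_preds, post_preds)) / len(pre_preds)

# Hypothetical edit: "Furnishings make noise" is corrected to False.
paraphrase_preds = [False, False, True]        # post-update, gold = False
neighbors_pre    = [True, False, True, True]   # before the edit
neighbors_post   = [True, True, True, True]    # after the edit
```

A well-behaved edit scores high on both; the trade-off reported for the repair-finetuning baselines shows up as one score rising while the other falls.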

Editing and Finetuning Methods
We select hyperparameters to maximize F1 on the EDIT VALIDATION SET (§3.3). For editing, we search over the edit layer range, edit token position (last {s, v, o}), and learning rate. For the repair-finetuning baseline, we search over the learning rate, batch size, and number of epochs.
For editing, we perform causal tracing on the correctly-predicted samples of the EDIT VALIDATION SET to inform layer selection. We apply repair-finetuning and editing methods to repair incorrect predictions on the EDIT VALIDATION SET, EDIT SET, and PROBE SET.
We explore two variants of repair-finetuning. RFT Fixed Epoch uses the exact configuration found on the EDIT VALIDATION SET. We hypothesize that it is prone to overfitting due to the absence of early stopping. To maximize the potential of repair-finetuning, we analyze another variant, RFT Early Stop, which runs for a maximum of 10 epochs and selects the checkpoint with the highest F1 score on the entire EDIT SET. This should mitigate overfitting and reduce relapse. In contrast, the editing experiments always use the exact configuration obtained from the EDIT VALIDATION SET.

High task performance is crucial for achieving strong causal tracing results
Zero-shot prompting produced near-random accuracies (51.30% and 51.87% on the EDIT VALIDATION SET splits of PEP3k and 20Q, respectively, for GPT-2 XL) and chaotic causal patterns with no localization, as shown in Fig. 2. In contrast, the Base Model exhibited significantly superior performance (77.12% on PEP3k and 73.96% on 20Q), and the resulting causal patterns were more distinct, with a substantially higher AIE and strong localization. Therefore, we deduce that a significant correlation exists between high task performance and strong causal patterns, and use the Base Model for editing experiments.

Targeted part of speech and layer locations affect causal tracing conclusions and edit success
As shown in Fig. 3, the last token at the later layers has a high AIE, which is expected since fixing hidden states or MLPs in those layers restores most of the required information. We also observed strong AIE at the earlier layers for the corrupted tokens. This finding is non-trivial and emphasizes the importance of earlier layers when predicting plausibility. The AIE is consistently more pronounced at the last corrupted token than at the first corrupted token across all models and datasets. Therefore, we focus on editing the last (s, v, o) tokens. Additional causal tracing results are in Appendix A.10.

Fig. 4 compares the average AIE at the last corrupted token for unmodified, severed-MLP, and severed-Attention causal graphs for all edited tokens. We notice a clear gap in AIE for MLP graphs at the earlier layers. This observation aligns with previous observations in MEMIT for encyclopedic knowledge. In contrast to encyclopedic facts, we observed the highest AIE in earlier MLP layers instead of middle layers, demonstrating the importance of earlier layers in commonsense predictions. Interestingly, in the object corruption plot, we observed a small peak at the beginning, before the highest AIE later. We thus expanded the hyperparameter space to include the initial layer windows for the object edit layers.

Table 2 presents the edit layers included in the hyperparameter search with the maximum moving average of AIE, comparing windows of size 3 and 5 using different editing tokens {s, v, o}. In all cases, the maximum moving average resulted in a different set of layers than MEMIT's strategy, in which the max-AIE layer and the 4 preceding layers are edited. The object edit F1 score is higher by +17.58% in 20Q, indicating the importance of varying editing tokens. The best editing method consistently outperforms the repair-finetuning baseline for both datasets, with much lower relapse scores.

MEMIT CSK exhibits configuration generalization
The editing method continues to perform well after transferring the best hyperparameters to the EDIT SET; in comparison, the performance of both repair-finetuning baselines drops significantly. Noticeably, the RFT Fixed Epoch method has high efficacy but a much higher relapse score, between 38.36% and 64.96%, causing a significant decrease in the F1 score due to overfitting. The three editing methods on {s, v, o} outperform the repair-finetuning methods by 10.54%-15.43% on the updated F1 score, exhibiting better configuration generalization.

MEMIT CSK exhibits semantic generalization
Table 5 shows GPT-2 XL results on the PROBE SET. (The Base Model has 0% efficacy on the PROBE SET by design.) Compared to the editing methods, the repair-finetuning baselines struggle to balance the affected and unaffected samples. RFT Early Stop performs well in unaffected neighborhoods but struggles with the affected statements (measured by average). RFT Fixed Epoch reaches higher performance on affected subsets but suffers on unaffected neighborhoods. In comparison, the editing methods show balanced improvements across metrics. We also notice that the affected neighborhood scores are generally high except for the specific editing token; e.g., while editing the object token, the affected object neighborhood score is low.

MEMIT CSK outperforms fine-tuning for repairing commonsense knowledge
To measure improvement, we re-conduct causal analysis via corruption of each token {s, v, o}, using successfully edited statements. The results are shown in Fig. 5.
Related Work

Inspired by the linear associative memory property of feedforward layers in Transformers (Anderson, 1972; Geva et al., 2021, 2022) and the success of this approach in convolutional models (Bau et al., 2020), recent works have proposed to edit MLP weights directly (Meng et al., 2022; Dai et al., 2022; Yao et al., 2022). In the encyclopedic factual domain, Meng et al. (2022) proposed to edit single facts by fitting a Rank One Model Edit (ROME) to the parameters of an MLP layer, and showed that it outperformed prior methods. Our work builds on Meng et al. (2023), which extended this approach to thousands of edits by altering the weights of a range of MLP layers. Hase et al. (2023a) demonstrate that many early edit layers can work well with MEMIT; this partially motivates our extensive layer hyperparameter search. Recent work by Cohen et al. (2023) proposes a dataset for evaluating a variety of ripple effects of editing methods on factual knowledge and concludes that models fail to capture these effects. All the aforementioned works focus on encyclopedic factual knowledge, unlike ours.

Conclusion
This paper demonstrates strong causal relations between commonsense plausibility judgments and early MLP layers in Transformers. These parameters are directly editable for repairing commonsense mistakes. We improve the MEMIT parameter editing algorithm to MEMIT CSK for commonsense plausibility prediction by varying edit tokens and improving the layer selection strategy. GPT-2 Large and XL models edited by MEMIT CSK outperform repair-finetuned baselines by more than 10% F1 score on the EDIT SET. Additionally, we construct a PROBE SET that contains unaffected and affected neighborhoods, affected paraphrases, and affected reasoning challenges for comprehensive evaluation. MEMIT CSK effectively generalizes on related and unrelated neighborhoods annotated in our PROBE SET, exhibiting semantic generalization, while repair-finetuned baselines demonstrate significant trade-offs between unaffected and affected metrics. These results indicate a compelling direction of incorporating feedback about common sense in transformers on the fly through direct model editing.

Limitations
In this work, we experiment with repairing commonsense mistakes made by the GPT-2 Large and XL models. We are unable to investigate larger open-source models like GPT-J (Wang and Komatsuzaki, 2021) and GPT-NeoX (Black et al., 2022) due to resource limitations. Investigating the research questions described in §3 on larger models is a natural next step. We focus on the binary plausibility prediction task but envision that parameter editing could improve models on various commonsense tasks in future work.
Our experiments show that the optimal edit token (subject, verb, or object) varies among datasets. Whether a single, generally optimal edit token exists requires further investigation, and different editing methods for commonsense knowledge may yet be proposed.

Ethics Statement
This study proposes a framework to evaluate and correct commonsense mistakes in GPT-2 models, focusing on predicting the plausibility of commonsense statements. Commonsense knowledge is highly contextualized and varies significantly across locations and cultures. Biases and stereotypes present in edit datasets may inadvertently lead to erroneous and potentially harmful model judgments. Malicious actors may exploit model editing to incorporate false information into models. It is crucial to employ meticulously curated datasets in future research and during the deployment of these models in real-world scenarios.

A.1 Causal Tracing Background
Given a model, the method takes a concatenation of subject s and verb v as input prompt x, then predicts the corresponding object o as prediction y. For example, for the statement "Paris is the capital of", a model is tasked with predicting "France" as the most-likely next token. Taking a correctly predicted (x, y) pair, causal tracing consists of the following three steps. Step 1: clean run. Given the input prompt x, they collect all hidden activation values {h_i^(l) | i ∈ [1, T], l ∈ [1, L]} from the model, where T is the number of input tokens in x and L is the number of model layers. Concretely, for each token i and layer l, h_i^(l) = h_i^(l-1) + a_i^(l) + m_i^(l), where a_i^(l) is the attention value and m_i^(l) is the corresponding MLP value. The predicted probability of the correct object is denoted as P[y].
Step 2: corrupted run. In this setting, a certain part of the input prompt x is corrupted with noise. In a clean run, x is embedded as h_1^(0), ..., h_T^(0); here, they instead set h_i^(0) := h_i^(0) + ϵ for all tokens i in the subject. The probability of the ground truth value y produced in this run is denoted as P*[y]. Note that the model prediction is likely to be incorrect due to the noisy input. Step 3: corrupted-with-restoration run. The model runs inference using the noisy input embedding created in the corrupted run, with the difference that the model is also forced to output the clean state activation h_î^(l) at a certain token î and layer l. If the model successfully produces the correct output using a small number of clean states, there is likely to be a strong causal relationship between these states and the model output. The probability of the correct object in this run is denoted as P*_clean[y].
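The three runs can be mimicked end-to-end on a toy numerical model (a stand-in for a transformer, not the authors' implementation; the uniform mixing, dimensions, and noise scale below are arbitrary assumptions). Restoring a clean hidden state into the corrupted run and measuring the recovered probability of y is exactly the per-state indirect effect that causal tracing averages into the AIE:

```python
import numpy as np

rng = np.random.default_rng(0)
T, L, d, V = 3, 4, 8, 5   # tokens, layers, hidden size, vocab (all arbitrary)

# Stand-in "transformer": each layer uniformly mixes token states, then
# applies a per-token linear map and tanh. Not a real attention/MLP stack.
Ws = [rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(L)]
mix = np.full((T, T), 1.0 / T)
W_out = rng.normal(size=(d, V)) / np.sqrt(d)

def forward(h0, patch=None):
    """Run the toy model. If patch=(l, i, state), token i's hidden state
    is overwritten with `state` just before layer l runs (restoration)."""
    h = h0.copy()
    states = [h.copy()]              # states[l]: hidden states entering layer l
    for l, W in enumerate(Ws):
        if patch is not None and patch[0] == l:
            h = h.copy()
            h[patch[1]] = patch[2]
        h = np.tanh(mix @ h @ W)
        states.append(h.copy())
    logits = states[-1][-1] @ W_out  # read the prediction off the last token
    p = np.exp(logits - logits.max())
    return p / p.sum(), states

# Step 1, clean run: record activations and P[y].
x = rng.normal(size=(T, d))
p_clean, clean_states = forward(x)
y = int(p_clean.argmax())            # treat the argmax as the correct object

# Step 2, corrupted run: add noise to the "subject" token, giving P*[y].
x_corr = x.copy()
x_corr[0] = x[0] + 3.0 * rng.normal(size=d)
p_corr, _ = forward(x_corr)

# Step 3, corrupted-with-restoration: P*_clean[y] - P*[y] for state (i, l).
def indirect_effect(i, l):
    p_rest, _ = forward(x_corr, patch=(l, i, clean_states[l][i]))
    return p_rest[y] - p_corr[y]
```

Averaging `indirect_effect` over many (x, y) pairs gives the AIE. A built-in sanity check: restoring the corrupted token's own embedding (i = 0, l = 0) makes the input fully clean again, so the indirect effect equals P[y] - P*[y] exactly.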

A.2 Datasets
Physical Event Plausibility (PEP3k; Wang et al., 2018) consists of 3,062 statements in (subject s, verb v, object o) format about semantically plausible and implausible events. It covers a wide range of possible (but not necessarily common) events with high annotator label agreement. 20 Questions (20Q) is a dataset of 5,096 commonsense statements written by crowd annotators in games of "20 questions" and labeled as plausible or implausible. We use the (s, v, o) format of the dataset constructed by Porada et al. (2021), where x = (s, v, o) and y ∈ {True, False}.
Examples from each dataset are given in Table 1. Statistics of our created data splits are in Tables 6 and 7.

A.3 Base Model vs. Zero-Shot for 20Q Dataset

A comparison of the Base Model and zero-shot model for the 20Q dataset is in Fig. 6.

A.4 Original MEMIT Editing Results
Table 8 shows the detailed metrics and editing parameters for MEMIT applied on EDIT VALIDATION SET.

A.5 Hyperparameters
Base Finetuning The GPT-2 Large and XL models are initially finetuned on the training set with the next-token prediction objective. Table 9 presents the optimal hyperparameters identified for the base-finetuning method.

Repair Finetuning
Table 10 shows the best hyperparameters for the repair-finetuning method. The method was very sensitive to small changes in learning rate, while the other parameters worked well over a wide range of values. Note that we use early stopping and restore the weights of the best-performing model based on the F1 score.

MEMIT CSK
Table 11 shows the hyperparameters for the editing method. The method was slightly sensitive to the learning rate and very sensitive to the edit token. Note that a KL divergence factor of 0.0625 was used as the default value for all editing experiments. Appendix A.8 contains an ablation study of the KL divergence factor.

A.6 GPT-2 Large Results for Configuration and Semantic Generalization
The GPT-2 Large results for configuration generalization experiments are in Table 12. The GPT-2 Large results for semantic generalization experiments are in Table 13.

A.7 Layer Selection Strategy
For demonstration purposes, let's assume our model has only 10 layers. The average indirect effects of these layers at our desired edit token (say, the last verb token) are: [0.0, 0.1, 0.2, 0.3, 0.5, 0.4, 0.4, 0.3, 0.2, 0.0]. Let's also assume that we consider only 5-layer windows. The highest average indirect effect is observed at the 5th layer, with value 0.5. According to MEMIT, the optimal edit layers are the 5-layer window ending at the highest-AIE layer; in this case, layers 1-5. Now let's calculate the moving average over 5-layer windows. The moving average of layers 1-5 is (0.0 + 0.1 + 0.2 + 0.3 + 0.5)/5 = 0.22; similarly, the moving average of layers 2-6 is (0.1 + 0.2 + 0.3 + 0.5 + 0.4)/5 = 0.3, and so on. The moving averages of all 5-layer windows are: [0.22, 0.3, 0.36, 0.38, 0.36, 0.26]. The maximum moving average is observed for layers 4-8, with value 0.38. In our method, we would therefore include layers 4-8 in the hyperparameter search space along with layers 1-5.
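The example above can be sketched in a few lines (an illustrative sketch, not the released code; layer indices are 1-based as in the example, and the peak is assumed not to fall in the first w-1 layers):

```python
def layer_windows(aie, w=5):
    """Return MEMIT's edit window (the w layers ending at the max-AIE
    layer) and our alternative window (maximizing the moving average of
    AIE). `aie[k]` is the AIE of layer k+1; returned layers are 1-based."""
    peak = max(range(len(aie)), key=aie.__getitem__)          # 0-based index
    memit = list(range(max(1, peak - w + 2), peak + 2))
    moving = [sum(aie[i:i + w]) / w for i in range(len(aie) - w + 1)]
    best = max(range(len(moving)), key=moving.__getitem__)
    ours = list(range(best + 1, best + w + 1))
    return memit, ours, moving

aie = [0.0, 0.1, 0.2, 0.3, 0.5, 0.4, 0.4, 0.3, 0.2, 0.0]
memit, ours, moving = layer_windows(aie)
# memit -> layers [1, 2, 3, 4, 5]; ours -> layers [4, 5, 6, 7, 8]
```

Both candidate windows then enter the hyperparameter search described above.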

KL Divergence Factor
Table 14 shows how the performance of the editing method changes when varying the KL divergence factor, in terms of accuracy and F1 score. The ablation study is conducted using the GPT-2 Large model on the PEP3k EDIT VALIDATION SET, with the verb token used for editing. The chosen hyperparameters align with those presented in Table 11.

Table 13: Efficacy and semantic generalization results on the PROBE SET for GPT-2 Large. Balanced improvements are observed for the editing methods across metrics, with the object-token editing method performing best. In comparison, the repair-finetuning models show skewed performance between unaffected and affected metrics. Refer to §5.4 for a detailed discussion.

Cut-Off Factor
This hyperparameter is introduced to "early stop" the optimization step. When the probability of y_i exceeds this cut-off factor upon adding the residual δ_i to the transformer's hidden state h_i^(L), the optimization step is stopped.
Table 15 demonstrates how the performance of the editing method changes when varying the cut-off factor, in terms of accuracy and F1 score. The ablation study is conducted using the GPT-2 Large model on the PEP3k EDIT VALIDATION SET, with the verb token used for editing. The chosen hyperparameters align with those presented in Table 11. Note that the default value of "No Factor" is used to report the performance of all editing methods, i.e., there was no "early stopping" of the optimization step.
We map the GPT output to the expected key-value pairs and manually evaluate some examples to ensure quality. In summary, there can be up to 5 augmentations per augmentation type for each instance.
For each of the editing locations, we see that the Last Token has higher AIE towards the later layers of the model, which is consistent with the results of MEMIT on encyclopedic knowledge. Focusing on the subject, verb, and object tokens, we see that all of them show high AIE in the early layers at the corrupted tokens, and that the effect at the corresponding last corrupted token is more pronounced than that at the first corrupted token. This shows that selecting the last subject/verb/object token and the early layers of the model should give good results for the editing method. These patterns are consistent across all the models and datasets.
Given the text: Furnishings make noise
subject token: Furnishings
object token: noise
Q1. In the text, replace just the subject token with a different word. The replaced text should be a valid sentence. The replaced token can be a hyponym or similar word of the original subject token. Write up to 5 such variants.
Q2. In the text, replace just the verb token with a different word. The replaced text should be a valid sentence. The replaced token can be a verb that follows or precedes the original verb token. Write up to 5 such variants.
Q3. In the text, replace just the object token with a different word. The replaced text should be a valid sentence. The replaced token can be a hyponym or similar word of the original object token. Write up to 5 such variants.
Q2. Replace the object token with a completely unrelated word and make a new text. Make 5 such replacements.
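For reference, assembling one of these augmentation prompts programmatically might look like the following (the helper name and the exact string layout are our assumptions; only the prompt wording comes from the figures above):

```python
def build_subject_prompt(text, subject, obj):
    """Compose the subject-replacement augmentation prompt shown above."""
    return (
        f"Given the text: {text}\n"
        f"subject token: {subject}\n"
        f"object token: {obj}\n"
        "Q1. In the text, replace just the subject token with a different "
        "word. The replaced text should be a valid sentence. The replaced "
        "token can be a hyponym or similar word of the original subject "
        "token. Write up to 5 such variants."
    )

prompt = build_subject_prompt("Furnishings make noise", "Furnishings", "noise")
```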

Figure 2 :
Figure 2: Base-finetuned vs. Zero-shot GPT-2 XL causal tracing on PEP3k EDIT VALIDATION SET. Patterns are unclear for the Zero-shot model, while they are distinct for the Base Model. Consistent observations are found for the 20Q dataset (Fig. 6).

Figure 3 :
Figure 3: Causal tracing for GPT-2 XL Base Model on PEP3k EDIT VALIDATION SET when different tokens are corrupted, {s, v, o} (in order). See Appendix A.10 for GPT-2 Large and 20Q results.

Figure 4 :
Figure 4: Severed causal tracing results for {s, v, o} for GPT-2 XL Base Model on PEP3k EDIT VALIDATION SET.

Figure 5: Causal tracing for GPT-2 XL models on successfully corrected statements in the PEP3k EDIT VALIDATION SET. For the RFT Early Stop model, we observe similar patterns as Fig. 3 for both token corruptions. For the edited model, an improved pattern is observed at v.
The three runs produce $P[y]$, $P_*[y]$, and $P_{*,\,\mathrm{clean}\,h^{(l)}_i}[y]$. Two metrics are then defined to measure a state's effect between these runs. The total effect (TE) is calculated as $P[y] - P_*[y]$, while the indirect effect (IE) of a specific hidden state $h^{(l)}_i$ is calculated as $P_{*,\,\mathrm{clean}\,h^{(l)}_i}[y] - P_*[y]$. The average total effect (ATE) and average indirect effect (AIE) are computed across multiple examples for each hidden state.
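In code, the two effect metrics reduce to simple differences of probabilities (a minimal sketch; the probability values in the usage lines are made up for illustration):

```python
def total_effect(p_clean, p_corrupted):
    # TE = P[y] - P*[y]: how much corruption hurts the target probability
    return p_clean - p_corrupted

def indirect_effect(p_restored, p_corrupted):
    # IE of h_i^(l) = P*,clean h_i^(l)[y] - P*[y]: how much restoring
    # that one hidden state recovers the target probability
    return p_restored - p_corrupted

def average_effect(effects):
    # ATE / AIE: mean over multiple examples for a given hidden state
    return sum(effects) / len(effects)

te = total_effect(0.9, 0.2)       # clean run vs. corrupted run
ie = indirect_effect(0.6, 0.2)    # restoring one state recovers part of TE
```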

Figure 6 :
Figure 6: Zero-shot vs. Base Model causal tracing results for GPT-2 XL on 20Q EDIT VALIDATION SET.

Figure 7 :
Figure 7: Prompt to generate affected paraphrase for "Furnishings make noise (false)"

Figure 8 :
Figure 8: Prompt to generate affected reasoning neighborhood for "Furnishings make noise (false)"

Figure 9 :
Figure 9: Prompt to generate affected neighborhood for "Furnishings make noise (false)"

Figure 10 :
Figure 10: Prompt to generate unaffected neighborhood for "Furnishings make noise (false)"

Figure 11 :
Figure 11: Prompt to fix grammar in a triple "furnishing make noise"

Figure 12 :
Figure 12: Causal tracing results for GPT-2 XL Base Model on 20Q EDIT VALIDATION SET when different parts of the input are corrupted.

Figure 13 :
Figure 13: Causal tracing results for GPT-2 Large Base Model on 20Q EDIT VALIDATION SET when different parts of the input are corrupted.
is computed. The total effect (TE) is defined as $P[y] - P_*[y]$, while the indirect effect (IE) of a specific hidden state $h^{(l)}_i$ is defined as $P_{*,\,\mathrm{clean}\,h^{(l)}_i}[y] - P_*[y]$. The average total effect (ATE) and average indirect effect (AIE) are computed across multiple examples for each hidden state.

Table 1 :
Source statement: Furnishings make noise (False)
Unaffected neighborhood: Computers make noise; Furnishings make color
Affected neighborhood: Fixtures make noise; Furnishings produce noise; Furnishings make sound
Affected paraphrase: Furniture can be noisy; Furniture can create sound; Furniture can be a source of noise
Affected reasoning: Furnishings are inanimate objects; Inanimate objects cannot make noise
Examples chosen through random sampling from the PEP3k and 20Q PROBE SET. Unaffected neighborhood samples are created by individually augmenting the subject and object with different, but relevant instances from the source statement. Likewise, affected neighborhood samples are created by individually augmenting the subject, verb, and object with synonymous instances from the source statement. Further details are in §4.2.1.

Table 2 :
Layer with max AIE and set of layers with max moving-average AIE for the PEP3k EDIT VALIDATION SET.

These two changes result in MEMIT CSK. Table 3 compares the original MEMIT (subject-only edit with fixed edit layers) with the best-performing edit of MEMIT CSK on the EDIT VALIDATION SET. MEMIT CSK consistently outperforms MEMIT across datasets and models.

Table 3 :
Comparison of MEMIT and the best-performing MEMIT CSK on the EDIT VALIDATION SET. MEMIT editing is on s, while MEMIT CSK uses the best edit among {s, v, o}.

Table 4 reports GPT-2 XL results for the EDIT VALIDATION SET and EDIT SET. The GPT-2 Large results are in Appendix A.6, Table 12. On the EDIT VALIDATION SET, the verb edit F1 score is higher by +17.97% compared to the Base Model.

Table 4 :
Configuration generalization results based on the best hyperparameters identified for EDIT VALIDATION SET and applied to EDIT SET for GPT-2 XL. The editing methods display high configuration generalization compared to repair-finetuning. Refer to §5.3 for further discussion. GPT-2 Large results are in Appendix A.6, Table 12.

Table 5 :
Efficacy and semantic generalization results on PROBE SET for GPT-2 XL. Balanced improvements are observed for editing methods across metrics, with the s and o edits performing the best. Refer to §5.4 for a detailed discussion. GPT-2 Large results are in Appendix A.6, Table 13.

Figure 5 displays the causal graphs for the best-performing edit (the v edited model) and the best repair-finetuned model (RFT Early Stop), based on Table 4. For RFT Early Stop (F1 Score 90.16%), the overall causal pattern and AIE remain similar to the Base Model in Fig. 3. In contrast, the v edited model (F1 Score 95.09%) shows an enhanced AIE for all types of corruption. Specifically, a high AIE of 0.468 is recorded at the last verb token for verb corruption. These findings confirm that localization and AIE improve for the edited model at the edit location.

Table 6 :
Number of samples in the Training Set, EDIT VALIDATION SET, EDIT SET, and PROBE SET.

Table 7 :
Number of samples in the PROBE SET.

Table 8 :
Editing results after applying original MEMIT on EDIT VALIDATION SET.

Table 9 :
Base Model hyperparameters for Training Set

Table 10 :
Hyper-parameters for RFT Fixed Epoch tuned for EDIT VALIDATION SET and applied to EDIT SET and

Table 12 :
Configuration generalization results based on the best hyperparameters identified for the EDIT VALIDATION SET and applied to the EDIT SET for GPT-2 Large. The editing method displays high configuration generalization, while both variants of the repair-finetuning method have a lower F1 Score on the EDIT SET. Refer to §5.3 for further discussion.

Table 14 :
Ablation study of the KL Divergence Factor on the GPT-2 Large model edited using the verb token on layers l ∈ {2, 3, 4, 5, 6} in the EDIT VALIDATION SET split of PEP3k. Note that the default KL Factor of 0.0625 is used to report the performance of all editing methods.

Table 15 :
Ablation study of the "Cut-Off" Factor on the GPT-2 Large model edited using the verb token on layers l ∈ {2, 3, 4, 5, 6} for the PEP3k EDIT VALIDATION SET.