Editing Factual Knowledge in Language Models

The factual knowledge acquired during pre-training and stored in the parameters of Language Models (LMs) can be useful in downstream tasks (e.g., question answering or textual inference). However, some facts can be incorrectly induced or become obsolete over time. We present KnowledgeEditor, a method which can be used to edit this knowledge and, thus, fix 'bugs' or unexpected predictions without the need for expensive re-training or fine-tuning. Besides being computationally efficient, KnowledgeEditor does not require any modifications in LM pre-training (e.g., the use of meta-learning). In our approach, we train a hyper-network with constrained optimization to modify a fact without affecting the rest of the knowledge; the trained hyper-network is then used to predict the weight update at test time. We show KnowledgeEditor's efficacy with two popular architectures and knowledge-intensive tasks: i) a BERT model fine-tuned for fact-checking, and ii) a sequence-to-sequence BART model for question answering. With our method, changing a prediction on the specific wording of a query tends to result in a consistent change in predictions also for its paraphrases. We show that this can be further encouraged by exploiting (e.g., automatically-generated) paraphrases during training. Interestingly, our hyper-network can be regarded as a 'probe' revealing which components need to be changed to manipulate factual knowledge; our analysis shows that the updates tend to be concentrated on a small subset of components. Source code available at https://github.com/nicola-decao/KnowledgeEditor


Introduction
Using pre-trained transformer-based Language Models (LMs; Vaswani et al., 2017; Devlin et al., 2019; Radford et al., 2019; Lewis et al., 2020; Brown et al., 2020) has become a standard practice in NLP. Factual knowledge induced during pre-training can help in downstream tasks, but it can also be incorrect or become obsolete over time (e.g., not reflecting changes of heads of states or country populations). Developing reliable and computationally efficient methods for bug-fixing models without the need for expensive re-training would be beneficial. See Figure 2 for an example of revising the memory of a model that initially misremembered Namibia's capital.
Figure 1: Left: a model f with parameters θ prefers a prediction y for input x (e.g., y is the mode/argmax of a discrete distribution parameterized by f(x; θ)). Right: our method uses a hyper-network g to update the parameters of f to θ′ such that f(x; θ′) prefers an alternative prediction a without affecting the prediction y′ of any other input x′ ≠ x. Our model edits the knowledge about x stored in the parameters of f.
Unlike conventional Knowledge Bases (KBs) that explicitly store factual knowledge, neural models implicitly memorize facts in their parameters. One cannot easily access and interpret their computation and memories (Ribeiro et al., 2016; Belinkov and Glass, 2019; Voita et al., 2019; De Cao et al., 2020); thus, modifying their knowledge is a challenging problem. Motivated by practical considerations, we formulate the following desiderata for a method aimed at tackling this problem (see Section 2 for a more formal treatment):
• Generality: be able to modify a model that was not specifically trained to be editable (i.e., no need for special pre-training of LMs, such as using meta-learning);
• Reliability: be able to successfully update a specific fact without affecting the rest of the acquired knowledge;
• Consistency: the changes should be consistent across equivalent formulations of a fact (e.g., when asked to update an answer for one question, answers to its paraphrases should change accordingly).
Figure 2: Left: model top-k predictions from Beam Search. Right: top-k after using our method conditioning on changing 'What is the capital of Namibia?' from 'Namibia' (wrong) to 'Windhoek' (correct prediction). Changing one fact also changes a semantically equivalent question and keeps the predictions from other facts the same.
The problem has been previously tackled in Zhu et al. (2020) and Sinitsin et al. (2020), as discussed in detail in Section 3. However, neither ensures that the edited model will be 'reliable', i.e., that the rest of the knowledge would not be badly affected, nor that the changes are 'consistent' across equivalent inputs. Additionally, Sinitsin et al.'s (2020) method requires expensive specialized training of the original network. While re-training the original network was feasible in their applications (e.g., in machine translation), it is problematic when the network is a pre-trained LM. We propose a novel method that overcomes these limitations.
We treat editing the memories of a neural model as a learning-to-update problem. We use an efficient parameterization of a hyper-network that is trained to update the LM parameters when provided with a single fact that needs to be modified. We do not require meta-learning, re-training or fine-tuning of the original network. We employ constrained optimization in training: we constrain the edited model to retain the same predictions as the original one regardless of the distance between the original and updated models in the parameter space. We show how this framework can be extended to incorporate (e.g., automatically-generated) paraphrases in training, further improving consistency. Figure 1 shows an outline of our method.
Unlike both previous methods, we do not have to select a subset of parameters to update as we let our model learn that by itself. In fact, our hyper-network can be regarded as a 'probe' revealing which components of the network need to be changed to manipulate factual knowledge, i.e., revealing the 'causal mediation mechanisms' (Vig et al., 2020). We observe that the updates end up being concentrated in a restricted set of model components, even though we do not encourage any kind of sparsity. Interestingly, the most-updated components are different from the groups of parameters receiving large gradients (see Figure 4).

Contributions
Our contributions are as follows:
• we define the task of knowledge editing and propose a set of evaluation metrics;
• we propose KNOWLEDGEEDITOR, which learns to modify LMs' memories efficiently and reliably while maintaining consistent predictions for semantically equivalent inputs;
• we verify that our proposed method largely meets our desiderata (while other baselines based on fine-tuning fail), testing it with different LM architectures on knowledge-intensive tasks such as fact-checking and open-domain question answering;
• we analyze the updates for KNOWLEDGEEDITOR and the alternatives.

Task
We want to edit the memory of a neural language model such that, when presented with an input, its output reflects a revised collection of facts. Unfortunately, the knowledge of a language model is typically opaque to us, being stored non-locally across a large number of parameters and architectural components. Thus, concretely, to operationalize the task, we seek a change in the model's parameters that affects predictions from the model only for a specific input. For a given input x, the prediction a made by the edited model should differ from the prediction y made by the original model only if x is influenced by one of the revised facts.

Definition
More formally, we have a model x → f(x; θ) with trained parameters θ, and a dataset of revisions ⟨x, y, a⟩ ∈ D, i.e., x is an input, y is the prediction preferred by f(x; θ), and a is an alternative prediction which we would like an edited version of the model to prefer. Concretely, we keep the model architecture f fixed, and seek alternative parameters θ′ such that for x, f(x; θ′) would prefer the prediction a instead of y while keeping all other predictions unchanged. In practice, we approximate the set of 'all other predictions' using a finite data set O^x of pairs ⟨x′, y′⟩ with x′ ≠ x. Moreover, predictions need not be continuous nor differentiable outputs from the model; instead, they may result from an arbitrary decision rule based on f(x; θ).
For example, when f(x; θ) parameterizes a discrete distribution p_{Y|X} over the output space, the most standard decision rule is to output the mode of the distribution: y = arg max_{c∈Y} p_{Y|X}(c | x, θ) (see footnote 2).
Semantically equivalent inputs. Optionally, for some revision ⟨x, y, a⟩ ∈ D, we may also have a set P^x of inputs semantically equivalent to x (e.g., automatically-generated paraphrases). Such a set can be used in at least two ways: i) to obtain explicit supervision for changes that should be realized in tandem with ⟨x, y, a⟩; and, independently of that, ii) to evaluate whether an edited model makes consistent predictions on semantically equivalent inputs. Note that in this work we never use paraphrases at test time, only for training and evaluation of our approach; generating them at test time, while potentially helpful, would have compromised efficiency.

Evaluation
To test if a method g, producing edited parameters θ′, meets our desiderata, we measure:
1. success rate: how successfully g updates the knowledge in θ′, measured as accuracy of revised predictions for inputs in D;
2. retain accuracy: how well θ′ retains the original predictions of f, measured as accuracy with respect to input-output pairs in sets O^x;
3. equivalence accuracy: how consistent the predictions of the revised model θ′ are for semantically equivalent inputs, measured as accuracy of the revised predictions for all P^x;
4. performance deterioration: how much test performance of the updated model deteriorates.
These values are obtained by comparing predictions of f(·; θ) and f(·; θ′) for different subsets of inputs (e.g., D, O^x, P^x) and against different targets (e.g., gold-standard, original predictions, or alternative predictions). While these metrics are straightforward to compute in principle, some can be computationally demanding. For example, retain accuracy depends on predictions for all inputs we have access to, which is potentially the entirety of the downstream task's validation/test data.
Previous work has evaluated similar versions of this task differently. Sinitsin et al. (2020) measure performance deterioration and success rate but do not measure retain accuracy or equivalence accuracy. A small performance deterioration does not guarantee high equivalence accuracy as the former is sensitive to changes in cases where the original model makes wrong decisions. Assessing accuracy against old or revised facts, which Zhu et al. (2020) also do, does not help to measure the retain accuracy.
Footnote 2: Whereas in text classification solving this is straightforward (for Y is small), in sequence-to-sequence we resort to beam search to approximate the mode (for Y is too large or unbounded).
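The four metrics above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' evaluation code: the function and argument names are hypothetical, models are stood in for by plain callables, and performance deterioration is taken here as the absolute drop in gold-standard accuracy, which is an assumption since the text does not fix a formula. The toy data is made up.

```python
from typing import Callable, Dict, Hashable, Sequence, Tuple

Predictor = Callable[[str], Hashable]


def accuracy(predict: Predictor, pairs: Sequence[Tuple[str, Hashable]]) -> float:
    """Fraction of (input, target) pairs where predict(input) equals target."""
    if not pairs:
        raise ValueError("need at least one (input, target) pair")
    return sum(predict(x) == target for x, target in pairs) / len(pairs)


def evaluate_edit(
    original: Predictor,
    edited: Predictor,
    x: str,
    a: Hashable,
    paraphrases: Sequence[str],
    other_inputs: Sequence[str],
    gold_test: Sequence[Tuple[str, Hashable]],
) -> Dict[str, float]:
    """Score a single edit <x, y, a>; all names here are hypothetical."""
    # Retain accuracy compares against the ORIGINAL predictions, not gold labels.
    retain_pairs = [(x_other, original(x_other)) for x_other in other_inputs]
    return {
        "success": accuracy(edited, [(x, a)]),
        "retain": accuracy(edited, retain_pairs),
        "equivalence": accuracy(edited, [(p, a) for p in paraphrases]),
        # Assumption: deterioration = absolute drop in gold-standard accuracy.
        "deterioration": accuracy(original, gold_test) - accuracy(edited, gold_test),
    }


# Toy lookup-table "models" (made-up data, for illustration only).
before = {
    "capital of Namibia?": "Namibia",
    "Namibia's capital?": "Namibia",
    "capital of Italy?": "Rome",
    "capital of Peru?": "Quito",
}
after = {
    **before,
    "capital of Namibia?": "Windhoek",
    "Namibia's capital?": "Windhoek",
    "capital of Peru?": "Lima",  # an unrelated prediction changed as a side effect
}

scores = evaluate_edit(
    before.get,
    after.get,
    x="capital of Namibia?",
    a="Windhoek",
    paraphrases=["Namibia's capital?"],
    other_inputs=["capital of Italy?", "capital of Peru?"],
    gold_test=[("capital of Italy?", "Rome"), ("capital of Peru?", "Lima")],
)
print(scores)
```

In this toy case the side effect happens to fix a wrong answer, so test accuracy does not drop while retain accuracy does, which shows why the two metrics are reported separately.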
We argue that preserving model predictions for inputs not in D is critical in production settings, where model predictions might have been extensively analyzed and tested. For x′ ∉ D, we aim to maintain all original predictions as well as the model scores f(x′; θ′) themselves, effectively avoiding the need to re-calibrate the models (for example, in applications where probability estimates are used downstream).

Related Work
Sinitsin et al. (2020) propose a meta-learning approach (Finn et al., 2017) for model modification that learns parameters that are easily editable at test time (e.g., updating the knowledge of the model requires only a few SGD steps from these learned parameters). To have a reliable method, they employ a regularized objective forcing the updated model not to deviate from the original one. This technique suffers from three main limitations: i) it requires expensive and specialized pre-training, ii) it is sensitive to many hyper-parameters (e.g., the weights of the regularizers and the subset of parameters to update), and iii) their multitask objective does not guarantee reliability (i.e., the model is penalized for diverging from the original, rather than constrained not to).
Instead of penalizing an updated model for deviating from the original one, Zhu et al. (2020) use constrained optimization. They use a less computationally expensive procedure as they re-finetune on a specific downstream task (with altered data). Their method employs either an L_2 or L_∞ constraint between the original model's parameters and the edited ones. However, a norm-based constraint on parameters ignores the highly nonlinear nature of LMs and how parameters determine the outputs of the model. Indeed, a minimal change in parameter space may produce a completely different output for many datapoints, leading to a potentially unreliable method. Additionally, they show the need to select a subset of parameters to be updated, which requires extra development effort. Zhu et al.'s (2020) method is similar to Elastic Weight Consolidation (Kirkpatrick et al., 2017), a technique developed for preventing catastrophic forgetting in neural network models.

Knowledge in Language Models
Petroni et al. (2019) show that pre-trained language models recall factual knowledge without fine-tuning, which they do by feeding specific prompts to LMs. Handcrafted prompts have been found not to be the best option to extract knowledge from LMs, and various solutions have been proposed to understand what LMs 'know' (Jiang et al., 2020; Shin et al., 2020; Liu et al., 2021). Additionally, large models have been shown to be fine-tunable to access their internal memories to answer questions in natural language without any additional context and with surprisingly high accuracy, a setting referred to as closed-book question answering. Although performing quite well, these models cannot reach the prediction quality of alternatives that retrieve and use context. Approaches that incentivize memorization of factual knowledge have been shown to be beneficial for many downstream tasks, suggesting that research on methods that effectively edit the memory of a model is indeed important.
Some recent hybrid approaches that use both implicit and explicit memory show some benefits for question answering (Févry et al., 2020; Verga et al., 2020). Notably, language models that only rely on internal implicit memory are state-of-the-art for (multilingual-) Entity Linking (De Cao et al., 2021a,b). An effective mechanism for editing an LM's implicit memory may be applicable in all these settings.
Causal Interventions. Identification of minimal changes to neural networks needed to achieve a certain behaviour has been studied in the context of research in interpreting neural networks (Lakretz et al., 2019; Vig et al., 2020; Elazar et al., 2021; Csordás et al., 2021). The components which need to be updated can be interpreted as controlling or encoding the corresponding phenomena (e.g., subject-verb agreement). Much of this research focused on modifying neuron activations rather than weights and on sparse interventions (e.g., modifying one or a handful of neurons). While far from our goals, there are interesting connections with our work. For example, our analysis of updates in Section 6.4, though very limited, may shed some light on how factual knowledge is encoded in the parameters of a model.

Method
We propose to treat the task of editing the memory of a neural model as a learning problem. Instead of defining a handcrafted algorithm to compute the new parameters θ′, we learn a KNOWLEDGEEDITOR: a model that predicts θ′ conditioned on an atomic fact that we want to modify. Concretely, KNOWLEDGEEDITOR is a hyper-network (Ha et al., 2017), i.e., a neural network that predicts the parameters of another network. Since the task requires every other prediction to stay the same, except the one we desire to change, we cast the learning task as a constrained optimization problem.
Optimization. For an input x, changing the prediction of a model f(·; θ) to a corresponds to minimizing the loss L(θ; x, a) incurred when a is the target. Preserving the rest of the knowledge corresponds to constraining the updated parameter θ′ such that model outputs f(·; θ′) do not change for x′ ∈ O^x. Our editor g is a neural network parameterized by φ which we choose by optimising the following objective for each data-point ⟨x, y, a⟩ ∈ D:

$$\min_{\phi} \sum_{\hat{x} \in P^x} \mathcal{L}(\theta'; \hat{x}, a) \quad \text{s.t.} \quad \mathcal{C}(\theta, \theta', f; O^x) \le m, \tag{1}$$

where P^x is the set of semantically equivalent inputs to x (for convenience we assume it contains at least x), θ′ = θ + g(x, y, a; φ), C is a constraint on the update, and the margin m ∈ R_{>0} is a hyperparameter. The constraint is used to express our desire to preserve model outputs unchanged for x′ ≠ x. Note that only x, but not the rest of P^x, is provided as input to the editor, as these will not be available at test time. In our models, f(x; θ) parameterizes a discrete distribution p_{Y|X} over the output sample space Y, hence we choose to constrain updates in terms of sums of Kullback-Leibler (KL) divergences from the updated model to the original one:

$$\mathcal{C}_{KL}(\theta, \theta', f; O^x) = \sum_{x' \in O^x} \sum_{c \in \mathcal{Y}} p_{Y|X}(c \mid x', \theta) \log \frac{p_{Y|X}(c \mid x', \theta)}{p_{Y|X}(c \mid x', \theta')}. \tag{2}$$

The constraint pushes the updated model to predict output distributions identical to the original one for all x′ ≠ x. An alternative constraint we could employ is an L_p norm over the parameter updates such that g is optimized to make a minimal update to the original model parameter:

$$\mathcal{C}_{L_p}(\theta, \theta', f; O^x) = \Big( \sum_{\theta_i \in \theta} |\theta_i - \theta'_i|^p \Big)^{1/p}.$$

This constraint was previously used by Zhu et al. (2020). However, such a constraint, expressed purely in parameter space and without regard to the model architecture f, does not directly encourage model outputs to be close to original ones in function space (i.e., the two functions to be similar). Neural models are highly non-linear functions, so we do not expect this type of constraint to be effective. This will be empirically demonstrated in Section 6.
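To make the contrast between the two kinds of constraint concrete, the following sketch computes a KL-based and an L_p-based constraint value for a toy two-class model. It is only an illustration under stated assumptions: the helper names and the toy model are hypothetical, distributions are plain Python lists, and the KL direction (original distribution as the reference) is one reading of the description above.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    top = max(logits)
    exps = [math.exp(v - top) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """KL(p || q) for categorical distributions; assumes q has full support."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def kl_constraint(
    original_dists: Sequence[Sequence[float]],
    edited_dists: Sequence[Sequence[float]],
) -> float:
    """Sum of KL terms over the inputs whose predictions should be preserved."""
    return sum(kl_divergence(p, q) for p, q in zip(original_dists, edited_dists))


def lp_constraint(
    theta: Sequence[float], theta_edited: Sequence[float], p: float = 2.0
) -> float:
    """L_p norm of the parameter shift; ignores what the model computes."""
    return sum(abs(t - e) ** p for t, e in zip(theta, theta_edited)) ** (1.0 / p)


def toy_model(w: float, x: float) -> List[float]:
    """Hypothetical two-class model with logits [w * x, 0]."""
    return softmax([w * x, 0.0])


x_other = 1000.0              # an input whose prediction should be preserved
w, w_edited = 0.001, -0.001   # a tiny shift in parameter space

param_shift = lp_constraint([w], [w_edited])
output_shift = kl_constraint([toy_model(w, x_other)], [toy_model(w_edited, x_other)])
print(param_shift, output_shift)
```

Here the parameter shift is tiny while the output distribution on the preserved input flips its preferred class, which is the failure mode that motivates constraining in function space.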
Tractable approximations. Non-linear constrained optimization is generally intractable; thus, we employ Lagrangian relaxation (Boyd et al., 2004) instead. The constraint itself poses a computational challenge, as it requires assessing KL for all datapoints in the dataset at each training step. For tractability, we evaluate the constraint approximately via Monte Carlo (MC) sampling (see Appendix A for more details). Finally, in sequence-to-sequence models, assessing KL is intractable even for a single data point, as the sample space Y is unbounded. In such cases we approximate the computation on a subset of the sample space obtained via beam search.
Architecture. Instead of predicting θ′ directly, our hyper-network predicts a shift ∆θ such that θ′ = θ + ∆θ. A naive hyper-network implementation might be over-parameterized, as it requires a quadratic number of parameters with respect to the size of the target network. Thus, we apply a trick similar to Krueger et al. (2017) to make g tractably predict edits for modern large deep neural networks (e.g., BERT). Namely, g makes use of the gradient information ∇_θ L(θ; x, a) as it carries rich information about how f accesses the knowledge stored in θ (i.e., which parameters to update to increase the model likelihood given a). We first encode ⟨x, y, a⟩, concatenating the text with special separators and feeding it to a bidirectional LSTM (Hochreiter and Schmidhuber, 1997). Then, we feed the last LSTM hidden states to a FFNN that outputs a single vector h that conditions the further computations. To predict the shift for a weight matrix W ∈ R^{n×m} in θ, we use five FFNNs conditioned on h that predict vectors α, β ∈ R^m, γ, δ ∈ R^n and a scalar η ∈ R. Then

$$\Delta W = \sigma(\eta) \cdot \big( \hat{\alpha} \odot \nabla_W \mathcal{L}(W; x, a) + \hat{\beta} \big), \quad \text{with } \hat{\alpha} = \gamma\, \hat{\sigma}(\alpha)^{\top} \text{ and } \hat{\beta} = \delta\, \hat{\sigma}(\beta)^{\top}, \tag{3}$$

where σ is the Sigmoid function (i.e., x → (1 + exp(−x))^{−1}), and σ̂ indicates the Softmax function (i.e., x → exp(x) / Σ_i exp(x_i)). With this formulation, the parameters for the hyper-network φ scale linearly with the size of θ. An interpretation of Equation 3 is that an update ∆W is a gated sum of a scaled gradient of the objective and a bias term. The scale for the gradient and the bias are generated via an outer vector product as it allows for efficient parameterization of a matrix with just three vectors. The gate lets the model keep some parameters unchanged.
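The gated update described above can be sketched in plain Python for a single weight matrix. This is an illustration under assumptions, not the released implementation: the five vectors are passed in directly rather than produced by FFNNs conditioned on h, the function names are hypothetical, and the choice of which vector receives the softmax and how the outer product is oriented is an assumption made so that the shapes match an n×m matrix.

```python
import math
from typing import List, Sequence

Matrix = List[List[float]]


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0.0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def softmax(v: Sequence[float]) -> List[float]:
    top = max(v)
    exps = [math.exp(x - top) for x in v]
    total = sum(exps)
    return [e / total for e in exps]


def outer(u: Sequence[float], v: Sequence[float]) -> Matrix:
    """Outer product u v^T with shape len(u) x len(v)."""
    return [[ui * vj for vj in v] for ui in u]


def predict_shift(
    grad: Matrix,
    alpha: Sequence[float],
    beta: Sequence[float],
    gamma: Sequence[float],
    delta: Sequence[float],
    eta: float,
) -> Matrix:
    """Gated sum of a rescaled gradient and a bias term for one n x m matrix."""
    n, m = len(grad), len(grad[0])
    if not (len(alpha) == len(beta) == m and len(gamma) == len(delta) == n):
        raise ValueError("alpha, beta need length m; gamma, delta need length n")
    scale = outer(gamma, softmax(alpha))  # n x m scale for the gradient (assumed layout)
    bias = outer(delta, softmax(beta))    # n x m bias term (assumed layout)
    gate = sigmoid(eta)                   # scalar gate in (0, 1)
    return [
        [gate * (scale[i][j] * grad[i][j] + bias[i][j]) for j in range(m)]
        for i in range(n)
    ]


grad = [[2.0, 4.0, 6.0], [8.0, 10.0, 12.0]]  # made-up gradient for a 2 x 3 matrix
shift = predict_shift(
    grad,
    alpha=[0.0, 0.0, 0.0],
    beta=[0.0, 0.0, 0.0],
    gamma=[3.0, 0.0],
    delta=[0.0, 3.0],
    eta=0.0,
)
closed = predict_shift(grad, [0.0] * 3, [0.0] * 3, [3.0, 0.0], [0.0, 3.0], eta=-50.0)
print(shift)
```

For each n×m matrix the five heads only need to output 2m + 2n + 1 numbers rather than n·m, which is where the linear scaling comes from; a strongly negative η closes the gate and leaves the matrix essentially unchanged.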
Margin annealing. The margin m is a hyperparameter and therefore fixed. However, i) it is hard to choose since it is task-dependent, and ii) it should be as small as possible. If the margin is too small, however, we risk having a small feasible set, and the model may never converge. To address both issues, we pick some initial value for the margin and anneal it during training conditioned on validation performance: when the model successfully changes > 90% of the predictions, we multiply the margin by 0.8. We stop decreasing the margin once it reaches a desirably small value. The annealing procedure prevents the model from diverging while increasingly tightening the constraint.
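The annealing rule described above is small enough to write down directly. The sketch below uses the 90% threshold and the 0.8 factor from the text; the function name and the floor value `min_margin` are hypothetical, since the text only says the margin stops decreasing at a desirably small value.

```python
def anneal_margin(
    margin: float,
    success_rate: float,
    threshold: float = 0.9,
    factor: float = 0.8,
    min_margin: float = 1e-3,  # assumed floor; not specified in the text
) -> float:
    """Shrink the margin after a validation pass with a high enough success rate."""
    if success_rate > threshold and margin > min_margin:
        return max(margin * factor, min_margin)
    return margin


# Example: three validation rounds with made-up success rates.
margin = 1.0
for rate in (0.95, 0.70, 0.93):
    margin = anneal_margin(margin, rate)
print(margin)
```

Tying the schedule to validation success means the constraint is only tightened once the editor already manages to change predictions under the current margin.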

Experimental Setting
We aim to evaluate the effectiveness of KNOWLEDGEEDITOR compared to baselines on knowledge-intensive tasks where modifying the memory of a large LM has a broad impact. We then test our method on closed-book fact-checking and closed-book question answering with the metrics proposed in Section 2.2.

Baselines
We compare against two baselines: i) fine-tuning and ii) the method proposed by Zhu et al. (2020). Fine-tuning corresponds to using standard gradient descent, minimizing the loss for the fact/prediction we want to revise. For this, we follow Sinitsin et al. (2020) and employ RMSProp (Tieleman and Hinton, 2012). We set the learning rate to 10^{-5} and stop upon successfully changing the output of the model or having reached a maximum of 100 gradient steps. Zhu et al.'s (2020) method extends fine-tuning with an L_∞ constraint on parameters. Following both Sinitsin et al. (2020) and Zhu et al. (2020), we report these baselines fine-tuning all parameters or just a subset of them. We limit the search to selecting entire layers and base our decision on performance on a subset of the validation set. Note that selecting a subset of parameters for update requires an extensive search, which KNOWLEDGEEDITOR dispenses with by automatically learning it.
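The fine-tuning baseline's stopping rule can be illustrated on a toy problem. In the sketch below a two-weight logistic classifier stands in for the LM, which is purely an assumption for illustration; the learning rate and the 100-step cap follow the text, while the RMSProp decay and epsilon values and all function names are hypothetical choices.

```python
import math
from typing import List, Sequence, Tuple


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def fine_tune_edit(
    w: Sequence[float],
    x: Sequence[float],
    target: int,
    lr: float = 1e-5,
    max_steps: int = 100,
    decay: float = 0.99,  # assumed RMSProp decay
    eps: float = 1e-8,    # assumed RMSProp epsilon
) -> Tuple[List[float], int, bool]:
    """RMSProp on the loss of one revised example; stop once the label flips."""
    w = list(w)
    sq_avg = [0.0] * len(w)
    for step in range(max_steps + 1):
        z = sum(wi * xi for wi, xi in zip(w, x))
        if (z > 0.0) == (target == 1):
            return w, step, True   # prediction now matches the requested label
        if step == max_steps:
            break
        p = sigmoid(z)
        grad = [(p - target) * xi for xi in x]  # gradient of the negative log-likelihood
        for i, g in enumerate(grad):
            sq_avg[i] = decay * sq_avg[i] + (1.0 - decay) * g * g
            w[i] -= lr * g / (math.sqrt(sq_avg[i]) + eps)
    return w, max_steps, False


# An input close to the decision boundary is flipped after a single step ...
_, steps_near, ok_near = fine_tune_edit([0.0002, -0.0001], [1.0, 1.0], target=0)
# ... while a confidently classified input is not flipped within the step budget.
_, steps_far, ok_far = fine_tune_edit([5.0, 5.0], [1.0, 1.0], target=0)
print(steps_near, ok_near, steps_far, ok_far)
```

Note that the first RMSProp step has a size that depends on the learning rate and decay but hardly on the gradient magnitude, so early updates are close to sign-based.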

Models and data
We evaluate on closed-book fact-checking (FC), fine-tuning a BERT base model (Devlin et al., 2019) on the binary FEVER dataset (Thorne et al., 2018) from KILT (Petroni et al., 2021). We also evaluate on a task with a more complex output space: closed-book question answering (QA). For that we fine-tune a BART base model (Lewis et al., 2020) with a standard seq2seq objective on the Zero-Shot Relation Extraction (zsRE) dataset by Levy et al. (2017). We evaluate on this dataset because it is annotated with human-generated question paraphrases that we can use to measure our model's robustness to semantically equivalent inputs. We create alternative predictions for FC by simply flipping the labels, whereas for QA we pick all hypotheses enumerated via beam search except the top-1. The latter ensures high-probability outcomes under the model distribution. We generate semantically equivalent inputs with back-translation. See Appendix B for technical details on models and data collection.

Results
Table 1 reports the main results for fact-checking and question answering. Overall, KNOWLEDGEEDITOR achieves high performance in all metrics. Some other methods also achieve high accuracy in some metrics but always at the cost of others (i.e., never meeting all our desiderata at once).
We compare methods along different metrics (as opposed to a single one), as there is no way to precisely determine the importance of each of these metrics. To gather more insight, we compute their stochastic convex combination with coefficients sampled from a Dirichlet distribution (with α = 1 to ensure a very diverse set of combinations) and report in Figure 6 in Appendix C an estimate of the probability that a system outperforms another across 1,000 such combinations. The probability that our full method outperforms all baselines is very high for both FC and QA (≈ 97% and ≈ 88%, respectively). In Figure 5 in Appendix C, we show the distributions of the combined scores (i.e., the raw data for the approximation reported in Figure 6). We then analyze different aspects of our method and the baselines.
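The comparison via random convex combinations can be sketched with the standard library alone. The code below samples Dirichlet weights by normalizing Gamma draws and estimates how often one system's combined score beats another's. It is a sketch under assumptions: the function names are hypothetical, every metric is assumed to be oriented so that higher is better, and the scores are made-up numbers rather than results from Table 1.

```python
import random
from typing import List, Sequence


def sample_dirichlet(k: int, alpha: float, rng: random.Random) -> List[float]:
    """Sample from a symmetric Dirichlet by normalizing independent Gamma draws."""
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(k)]
    total = sum(draws)
    return [d / total for d in draws]


def prob_outperforms(
    scores_a: Sequence[float],
    scores_b: Sequence[float],
    n_samples: int = 1000,
    alpha: float = 1.0,
    seed: int = 0,
) -> float:
    """Fraction of sampled convex combinations under which system A beats system B."""
    if len(scores_a) != len(scores_b):
        raise ValueError("both systems need the same metrics")
    rng = random.Random(seed)
    wins = 0
    for _ in range(n_samples):
        weights = sample_dirichlet(len(scores_a), alpha, rng)
        combined_a = sum(w * s for w, s in zip(weights, scores_a))
        combined_b = sum(w * s for w, s in zip(weights, scores_b))
        wins += combined_a > combined_b
    return wins / n_samples


# Made-up scores on two metrics, each system better on one of them.
p_tradeoff = prob_outperforms([0.9, 0.6], [0.7, 0.8])
# A system that is better on every metric wins under every combination.
p_dominant = prob_outperforms([0.9, 0.9, 0.9], [0.8, 0.8, 0.8])
print(p_tradeoff, p_dominant)
```

With α = 1 the weights are uniform over the simplex, so no particular trade-off between metrics is favoured.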

Success rate
Every method achieves an almost perfect success rate on fact-checking. All methods but ours apply updates in a loop, stopping either when the new model is successfully updated or after reaching a maximum number of iterations. The success rate for KNOWLEDGEEDITOR is not 100% because we do not apply more than one update even in case of failure. For this reason, we also show an experiment with our method with multiple updates within a loop employing the same stopping criteria as the baselines. Note that we apply this only at test time (i.e., we do not train for multiple updates). When applying multiple updates, our method also reaches a 100% success rate on fact-checking and almost perfect accuracy (> 99%) for QA (see footnote 8). Closed-book QA is a more challenging task since the output space is text and not just a binary label. In this setting, KNOWLEDGEEDITOR achieves high accuracy (≈ 95% or > 99% with the loop). Among all methods, KNOWLEDGEEDITOR gets the best success rate while also obtaining the best retain accuracy. In QA, Zhu et al.'s (2020) method does not reach a good success rate (≈ 80%). We searched hyperparameters for their method also to have high retain accuracy, and indeed that is higher than regular fine-tuning. However, unlike fact-checking, regular fine-tuning for QA gets almost perfect scores but at the expense of the retain accuracy. Sequence-to-sequence models are more sensitive to a slight parameter shift. This happens because minor changes may completely alter the top-k prediction from beam search (in the case of QA). In contrast, in a binary classifier (in the case of FC) the probability of a prediction can change substantially without crossing the decision boundary (usually set at 0.5 when not calibrated).

Retaining previous knowledge
KNOWLEDGEEDITOR maintains the predictions in the validation set almost perfectly (retain accuracy is ≈ 98% for both FC and QA). Conversely, as expected, our method with C_{L_2} has very low retain accuracy (always < 50%). C_{L_2} suffers from catastrophic forgetting because it does not force the updated model to be close to the original one in function space (i.e., the two functions to be similar) but just in parameter space.
Footnote 8: Even if we do not train for multiple subsequent updates, its success opens the possibility to add this at training time. We leave the exploration of this technique to future work.
Fine-tuning all layers is successful but it affects the previously acquired knowledge negatively: retain accuracy is ≈ 87% and ≈ 68% for FC and QA, respectively, while performance deterioration is ≈ 2% and ≈ 4%. Fine-tuning a single layer is more effective as it prevents over-fitting (the best model updates the 1st layer in both FC and QA). However, in FC the updated model does not generalize to semantically equivalent inputs: the accuracy on paraphrases is much lower even than versions of our method which do not use paraphrases in training (42% vs. > 81%), and even more so when compared to those which use them (> 94%).
Fine-tuning with Zhu et al.'s (2020) method does not affect performance for FC much, which is not surprising since standard fine-tuning already gets almost perfect scores. In contrast, in the QA setting, using their constrained optimization boosts the retain accuracy (up to +4% over standard fine-tuning) but at the cost of a low success rate (≈ 80%, where fine-tuning gets the perfect score).

Accuracy on paraphrases
We evaluate our method both with and without the additional supervision of paraphrases to improve generalization; that corresponds to having P^x as the set of paraphrases of x or P^x = {x} in Equation 1, respectively. Without this additional supervision, KNOWLEDGEEDITOR is already competitive in equivalence accuracy. However, employing this additional supervision is clearly beneficial on both tasks: we get the same success rate and retain accuracy but equivalence accuracy improves by > 70% on FC and > 30% on QA, respectively (for generated paraphrases). In FC, although fine-tuning of a single layer proved to be optimal in terms of success rate and retain accuracy, it performs poorly for paraphrases. That is, the model successfully updates the prediction of a particular datapoint, but does not update predictions of paraphrases. This indicates that fine-tuning to edit the knowledge of a model does not generalize well, and it overfits to specific inputs. On QA, the method of Zhu et al. (2020) also performs poorly compared to ours and other methods. When other methods perform on par or better than ours on paraphrases, they do not have good retain accuracy (e.g., see QA fine-tuning in Table 1). Fine-tuning on QA seems to generalize better than on FC, but does not preserve previous knowledge. In Table 1 we also report the accuracy on both generated and human-generated paraphrases. Surprisingly, the scores on human-generated paraphrases are higher. We speculate that this happens because automatic paraphrases are sometimes not semantically equivalent or fluent.

Analysis of model updates
In Figure 3 we plot the distribution of logits of the original and updated model on FC for different methods. With an ideal method, all logits before and after an update have to stay the same (except the ones we want to change). From that figure, we can see distributions of different types of errors such as datapoints whose predictions were mistakenly flipped (from true to false or the other way around). These errors are mostly concentrated around the origin, where small perturbations make logits cross the decision boundary. When fine-tuning all layers, we can see a clear impact on logits: they undergo a lot of change (i.e., points do not concentrate around the diagonal). Indeed, fine-tuning makes many datapoints cross the decision boundary and their probabilities change from the original ones. The failure of C_{L_2} is visible in Figure 3b as this method preserves almost none of the previous predictions. Instead, KNOWLEDGEEDITOR preserves almost all of the predicted labels as well as their probabilities (most datapoints in Figure 3c stay on the diagonal).
We also report visualizations of the average weight updates for the QA experiment in Figure 4. We report the setting with additional supervision from paraphrases (but the heatmaps are similar without them). There are three main observations from this plot. First, gradients are mostly concentrated on the first encoder layer and the last decoder layer. Gradients explain why the best subset of parameters to update is the first layer. Secondly, fine-tuning does not preserve gradient magnitudes and updates the whole model almost uniformly. That happens because of the optimizer's adaptive learning rate that initially erases the gradient direction. The gradient direction plays a role only after a couple of gradient steps, but most of the time, the method only needs one step to modify its knowledge. Lastly, our updates are sparser and are not consistent with the gradient for changing the predictions. That indicates that our method learns to use the gradient in a meaningful way (i.e., ignoring some directions or manipulating its magnitude). It is surprising that the knowledge manipulation seems to be achieved by primarily modifying parameters affecting the shape of the attention distribution (W^K_self and W^Q_self) rather than, e.g., values (W^V_self). As we discussed, the hyper-network may be regarded as a probe providing insights about the mechanism used by the model to encode the knowledge (Vig et al., 2020). For example, the focus on the bottom layer is already intriguing, as it contrasts with claims that memorization happens in top layers of image classification models (Stephenson et al., 2021), hinting at substantial differences in the underlying memorization mechanisms in NLP and vision. Proper investigation is however outside of the scope of this study. See Appendix C for some additional analysis.

Conclusions
In this work, we explore the task of editing the factual knowledge implicitly stored in the parameters of Language Models. For this task, we formally define desiderata, the objective, and a set of metrics to measure the efficacy of different methods. We evaluate methods concretely on two benchmarks based on closed-book fact-checking and question answering. We propose KNOWLEDGEEDITOR, a method based on a hyper-network that learns to modify implicit knowledge stored within LM parameters efficiently and reliably. We provide comprehensive evaluations of our models against different variants of fine-tuning, demonstrating the advantage of our approach. The magnitude of the updates predicted by our method may reveal the mechanisms used by the LMs to encode factual knowledge; we leave such investigation for future work.

Ethical Considerations
Technology built upon pre-trained LMs inherits some or all of their potential harms (Bender et al., 2021). Our technology for editing the knowledge of LMs does not exacerbate their potential harms and can, in fact, be used to mitigate harms, as models can be corrected once problems are discovered. However, we note that malicious uses of our knowledge editor are possible. For example, malicious agents may use the techniques presented in this work to inject incorrect knowledge into LMs.

A Relaxation and Approximation of Constrained Optimization
An objective to minimize subject to a constraint can be solved with Lagrangian relaxation (Boyd et al., 2004) using a multiplier α ∈ R ≥0, and the constraint term can be approximated by sampling y ∼ p(y). The resulting relaxed objective (Equation 5) can be evaluated with automatic differentiation and optimized via gradient descent.
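As a generic illustration of the relaxation described above, a constrained problem and its Lagrangian form can be written as follows. This is a sketch under assumptions, not a reproduction of the paper's numbered equations: f, C, m, and p(y) are placeholder symbols for an objective, a constraint function, a margin, and a sampling distribution.

```latex
\begin{align*}
&\min_{\theta} \; f(\theta)
  \quad \text{s.t.} \quad
  \mathbb{E}_{y \sim p(y)}\big[ C(y, \theta) \big] \le m \\
&\quad\Longrightarrow\quad
  \min_{\theta} \max_{\alpha \ge 0} \; f(\theta)
  + \alpha \Big( \mathbb{E}_{y \sim p(y)}\big[ C(y, \theta) \big] - m \Big) \\
&\quad\approx\quad
  \min_{\theta} \max_{\alpha \ge 0} \; f(\theta)
  + \alpha \big( C(y, \theta) - m \big),
  \qquad y \sim p(y)
\end{align*}
```

Under gradient ascent on α, the multiplier grows while the constraint is violated (C exceeds m) and shrinks toward zero otherwise; the last line replaces the expectation with a single-sample Monte Carlo estimate.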

B Experimental setting

B.1 Fact-checking
We evaluate on closed-book fact-checking (FC) using the binary FEVER dataset (Thorne et al., 2018) from KILT (Petroni et al., 2021). FEVER has 104,966 training and 10,444 validation instances, respectively. For every input claim x, the model predicts the probability f (x; θ) that it may be true. This is done without retrieving any evidence from a corpus, relying only on the knowledge accumulated during pre-training and encoded in the model's own parameters; this is similar to prior work that investigates closed-book and zero-shot FC using masked-LMs. Concretely, we ask the LM to perform binary classification. We fine-tune a BERT base model (Devlin et al., 2019) with an additional linear layer on top that maps the hidden state corresponding to the BOS (beginning of a sentence) token to the probability of the positive label. Given the available supervision, we train the architecture to maximize the model likelihood penalized by entropy regularization and weight decay. The final model has an accuracy of 77.1%.
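The classification head described above can be sketched in a few lines. This is a minimal illustration under assumptions, not the authors' implementation: the function names are hypothetical, the hidden state is a plain list of floats, and the use of a sigmoid to turn the linear layer's output into a probability is assumed rather than stated in the text.

```python
import math
from typing import Sequence


def positive_probability(bos_hidden: Sequence[float],
                         weights: Sequence[float],
                         bias: float = 0.0) -> float:
    """Map the BOS hidden state to P(claim is true) via a linear layer + sigmoid."""
    if len(bos_hidden) != len(weights):
        raise ValueError("hidden state and weight vector must have the same size")
    # Linear layer: a single logit for the positive label.
    logit = sum(h * w for h, w in zip(bos_hidden, weights)) + bias
    # Numerically stable sigmoid: never exponentiate a large positive number.
    if logit >= 0:
        return 1.0 / (1.0 + math.exp(-logit))
    exp_logit = math.exp(logit)
    return exp_logit / (1.0 + exp_logit)


def predict_label(probability: float, threshold: float = 0.5) -> bool:
    """Binary decision; a 0.5 threshold corresponds to a logit of zero."""
    return probability >= threshold
```

With this parameterization the decision boundary sits at a logit of zero, which is why small perturbations of logits near the origin can flip a prediction.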

B.2 Question answering
We also evaluate on a task with a more complex sample space: closed-book question answering (QA). Here QA is treated as a sequence-to-sequence problem from question to answer, without retrieving or providing any evidence. This, as in FC, emphasises the role of the knowledge acquired in pre-training and encoded in the parameters of the model. For this task, we used the Zero-Shot Relation Extraction (zsRE) dataset by Levy et al. (2017). We prefer zsRE to other popular QA datasets such as SQuAD (Rajpurkar et al., 2016), Natural Questions (Kwiatkowski et al., 2019) or TriviaQA (Joshi et al., 2017) because it is annotated with human-generated question paraphrases that we can use to evaluate our model's robustness to semantically equivalent inputs. zsRE is specifically constructed not to have relation overlaps between training and test (i.e., it is zero-shot). We re-split the dataset to have the same distribution in training and test splits; we are not interested in zero-shot specifically, so we avoid the additional complexity it entails. The original zsRE dataset has 147,909 training and 3,724 validation instances, respectively. After re-splitting and employing all paraphrases, we have 244,173 training and 27,644 validation instances, respectively. For this task, we fine-tune a BART base model (Lewis et al., 2020) with a standard seq2seq objective, i.e., maximizing the model likelihood given the observed output sequence (Sutskever et al., 2011), regularized with dropout (Srivastava et al., 2014) and label smoothing (Szegedy et al., 2016). The final model has an accuracy (exact match between model prediction and gold standard) of 22.1%.
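The exact-match accuracy used above can be sketched as follows. This is an illustrative helper with a hypothetical name; stripping surrounding whitespace before comparing is an assumption, since the text does not specify any answer normalization.

```python
from typing import Sequence


def exact_match_accuracy(predictions: Sequence[str], gold: Sequence[str]) -> float:
    """Fraction of predictions that equal the gold answer exactly."""
    if len(predictions) != len(gold):
        raise ValueError("predictions and gold answers must be aligned")
    if not predictions:
        raise ValueError("cannot compute accuracy on an empty set")
    # Assumption: only leading/trailing whitespace is ignored; case is kept.
    hits = sum(p.strip() == g.strip() for p, g in zip(predictions, gold))
    return hits / len(predictions)
```

Because the comparison is on whole strings, a partially correct generated answer counts as an error under this metric.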

B.3 Generating alternative predictions
Generation of alternative predictions is task-dependent, as it requires producing a plausible substitute target for a given input (e.g., if we need to edit the knowledge about a head of state, a plausible substitute label should be a person, not a random, even if well-formed, string). Fact-checking is straightforward: we simply flip the label, as it is binary classification. For QA, we exploit high-probability outcomes under the model distribution as a proxy for plausible revisions. In particular, we pick all hypotheses enumerated via beam search except the top-1.
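The two strategies above can be sketched as follows. The function names are hypothetical, and the beam hypotheses are assumed to arrive as a list of strings already sorted from most to least probable; producing that list (running beam search on the model) is outside this sketch.

```python
from typing import List, Sequence


def alternative_label(label: bool) -> bool:
    """Fact-checking: the only alternative to a binary label is its negation."""
    return not label


def alternative_answers(beam_hypotheses: Sequence[str]) -> List[str]:
    """QA: keep every beam-search hypothesis except the top-1 prediction.

    Assumes `beam_hypotheses` is sorted best-first, so index 0 is the
    model's current prediction and the rest are plausible substitutes.
    """
    return list(beam_hypotheses[1:])
```

For example, `alternative_answers(["Windhoek", "Swakopmund", "Walvis Bay"])` yields the two lower-ranked hypotheses as candidate edit targets (the strings here are made-up illustrations, not data from the paper).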

B.4 Semantically equivalent inputs
We would like the updated model to be consistent for semantically equivalent inputs (see P^x in Sections 2 and 4), as opposed to just learning a new specific and isolated datapoint. This consistency is indicative of an effective editing mechanism that taps into the knowledge stored in the model. However, not all datasets come with paraphrases of their inputs (e.g., in our case FEVER does not come with paraphrases and zsRE only has paraphrases for 30% of the dataset). To this end, we generate semantically equivalent inputs using round-trip translation (Sennrich et al., 2016; Wieting and Gimpel, 2018

B.6 Training details
The original models which we want to modify are trained with a batch size of 256 using Adam (Kingma and Ba, 2015) (learning rate of 3e-5) with weight decay (1e-2) and a linear schedule with warm-up (50k total updates and 500 warm-up updates). We train for a maximum of 20 epochs and employ model selection using accuracy on the validation set. KNOWLEDGEEDITOR models are trained with a batch size of 1024 for FC and 256 for QA using Adam (learning rate of 3e-4 for the parameters and 1e-1 for the Lagrangian multiplier) with weight decay (1e-2) and a linear schedule with warm-up (200k total updates and 1k warm-up updates). We train for a maximum of 200 epochs and employ model selection using overall accuracy (success rate and retain accuracy) on the validation set (approximated using mini-batches). The margin for C_{KL} is annealed between 1e-1 and 1e-3 for the fact-checking model, and between 1e-3 and 1e-5 for the BART question answering model. For the sequence-to-sequence loss, we employ a cross-entropy loss with label smoothing of 0.1.
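A linear schedule with warm-up, as mentioned above, can be sketched as follows. The function name is hypothetical, and the exact shape is an assumption: the rate rises linearly from zero during warm-up and then decays linearly to zero at the final update, which is a common convention but is not spelled out in the text. The defaults are the values reported for the original models.

```python
def linear_warmup_lr(step: int,
                     base_lr: float = 3e-5,
                     warmup_steps: int = 500,
                     total_steps: int = 50_000) -> float:
    """Learning rate at a given update under linear warm-up and linear decay."""
    if step < 0:
        raise ValueError("step must be non-negative")
    if not 0 <= warmup_steps < total_steps:
        raise ValueError("need 0 <= warmup_steps < total_steps")
    if step < warmup_steps:
        # Warm-up phase: ramp from 0 up to base_lr.
        return base_lr * step / warmup_steps
    # Decay phase: ramp from base_lr down to 0 at total_steps (clamped after).
    remaining = max(0, total_steps - step)
    return base_lr * remaining / (total_steps - warmup_steps)
```

The same function covers the KNOWLEDGEEDITOR setting by passing `base_lr=3e-4`, `warmup_steps=1_000`, and `total_steps=200_000`.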

C Additional Results
Update Analysis. During preliminary experiments, we studied a version of our hyper-network that did not exploit gradient information (see Equation 3). Without gradient information, on FC the models converged ≈ 10 times slower to reach the same accuracy and did not converge for QA (i.e., the model was not able to get > 75% success rate and > 50% retain accuracy). That suggests that the gradients are helpful and actually used by our hyper-network, but should not be used directly without modification. To better show this, in Table 2 we report correlations between different update methods and the gradient in terms of cosine similarities between updates. Naturally, fine-tuning and the gradient are highly correlated, but our method (with and without additional paraphrase supervision) correlates poorly with the others. Low cosine similarity can be due to two factors: i) the model indeed projects the gradient to a different and more 'knowledge preserving' direction, or ii) the parameter space is so large that cosine similarity gets to zero very quickly, not revealing the genuine underlying similarity.

Table 1). Sampling weights allows the score to be interpreted in a probabilistic way. KNOWLEDGEEDITOR (with different variants) presents distributions that are more skewed towards a high score (100), indicating that it is highly likely that, when assigning some weights to the metrics, the weighted sum will be in favour of our method. Best viewed in color.
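The cosine-similarity comparison between updates described in the Update Analysis can be sketched as follows. This is an illustrative helper with a hypothetical name; it assumes each update (gradient, fine-tuning step, or predicted update) has already been flattened into a single list of floats over the same parameters.

```python
import math
from typing import Sequence


def cosine_similarity(update_a: Sequence[float], update_b: Sequence[float]) -> float:
    """Cosine of the angle between two flattened parameter updates."""
    if len(update_a) != len(update_b):
        raise ValueError("updates must cover the same parameters")
    dot = sum(a * b for a, b in zip(update_a, update_b))
    norm_a = math.sqrt(sum(a * a for a in update_a))
    norm_b = math.sqrt(sum(b * b for b in update_b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero update")
    return dot / (norm_a * norm_b)
```

A value near 1 means two methods move the parameters in almost the same direction, while a value near 0 means the directions are close to orthogonal, which is also what unrelated vectors in a very high-dimensional space tend to produce.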

Figure 6: Probability that system A is better than system B according to a weighted sum of metrics (see individual values in Table 1), sampling mixing coefficients 1,000 times from a Dirichlet distribution (with α = 1 to cover a diverse spectrum of metric combinations). The probability that KNOWLEDGEEDITOR (with C_{KL} + P^x + loop) is better than competing systems is high (> 97% for FC and > 88% for QA), indicating that it is highly likely that, when assigning some weights to the metrics, the weighted sum will be in favour of our method. Best viewed in color.
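The pairwise comparison in Figure 6 can be sketched with the standard library alone. This is a minimal illustration under assumptions: the function names are hypothetical, all metrics are assumed to be on a common scale where higher is better, and Dirichlet samples are drawn by normalizing independent Gamma draws.

```python
import random
from typing import List, Sequence


def sample_dirichlet(num_metrics: int, alpha: float, rng: random.Random) -> List[float]:
    """Draw mixing coefficients from a symmetric Dirichlet(alpha)."""
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(num_metrics)]
    total = sum(draws)
    return [d / total for d in draws]


def prob_a_better_than_b(metrics_a: Sequence[float],
                         metrics_b: Sequence[float],
                         num_samples: int = 1000,
                         alpha: float = 1.0,
                         seed: int = 0) -> float:
    """Estimate P(weighted score of A > weighted score of B) over random weights."""
    if len(metrics_a) != len(metrics_b):
        raise ValueError("both systems need the same metrics")
    rng = random.Random(seed)
    wins = 0
    for _ in range(num_samples):
        weights = sample_dirichlet(len(metrics_a), alpha, rng)
        score_a = sum(w * m for w, m in zip(weights, metrics_a))
        score_b = sum(w * m for w, m in zip(weights, metrics_b))
        if score_a > score_b:
            wins += 1
    return wins / num_samples
```

With α = 1 the weights are uniform over the simplex, so no particular metric is favoured a priori; a system that is better on every metric wins under every sampled weighting.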