Language Anisotropic Cross-Lingual Model Editing

Multilingual pre-trained language models can learn task-specific abilities or memorize facts across multiple languages but inevitably make undesired predictions on specific inputs. Motivated by this observation, model editing aims to post-hoc calibrate a model on specific inputs while preserving the model's original behavior elsewhere. However, existing work only studies the monolingual scenario, which lacks the cross-lingual transferability needed to perform editing simultaneously across languages. In this work, we focus on cross-lingual model editing. First, we define the cross-lingual model editing task and corresponding metrics, where an edit in one language propagates to the others. Next, we propose a framework to naturally adapt monolingual model editing approaches to the cross-lingual scenario using a parallel corpus. Further, we propose language anisotropic editing, which improves cross-lingual editing by amplifying different subsets of parameters for each language. On the newly defined cross-lingual model editing task, we empirically demonstrate the failure of monolingual baselines to propagate an edit to multiple languages and the effectiveness of the proposed language anisotropic model editing. Our code is publicly available at https://github.com/franklear/LiME.


Introduction
Pre-trained language model based approaches have become the best practice in many fields, including multilingual NLP (Che et al., 2021; Tunstall et al., 2022). During training, Transformer-based (Vaswani et al., 2017) models can embed language abilities (Geva et al., 2021) and memorize facts (Dai et al., 2022) in their parameters. Even so, models inevitably make undesired predictions on specific inputs, such as mistaken labels or outdated facts. Moreover, the performance of multilingual models is unbalanced across languages, leading to inconsistent predictions for the same input in different languages. However, the high cost of training and data collection makes it unrealistic to re-train the models using calibrated data in all languages. Therefore, there is a pressing need for an approach to calibrate multilingual pre-trained models across all languages of interest simultaneously.
As an emerging research area, model editing allows us to calibrate the behavior of pre-trained language models on specific inputs (Sinitsin et al., 2020; Cao et al., 2021; Mitchell et al., 2022a,b; Meng et al., 2022a,b; Hase et al., 2021). However, challenges emerge when applying model editing to the cross-lingual scenario, due to two features of multilingual pre-trained models. The first is cross-lingual transferability. Based on prior research on pre-trained multilingual models like XLM (Conneau and Lample, 2019) and InfoXLM (Chi et al., 2021), it is well established that incorporating diverse language data during training leads to advantageous cross-lingual transfer effects. As a result, an input with the same meaning can be expressed in multiple languages as completely different sentences. The editor has to be aware of this feature, otherwise it may fail to edit in unseen languages.
The second is language anisotropy. Recent work reveals that language-specific and language-universal parameters exist in multilingual pre-trained models (Wang et al., 2020). This finding means the model tends to mainly activate a subset of its parameters depending on the language being processed, which we call language anisotropy. An editor that treats all parameters identically for all languages is not language anisotropic and may harm other languages when editing.
In this work, we propose for the first time cross-lingual model editing on multilingual pre-trained language models. Different from existing model editing, in cross-lingual model editing an edit in a single language propagates to the others. As shown in Figure 1, with cross-lingual model editing, editing a fact in English also affects the Chinese version, while retaining unrelated facts.
We propose a simple yet effective framework to adapt existing monolingual model editing approaches to the cross-lingual scenario using a parallel corpus. Specifically, we replace the inputs for editor training with their parallel expressions in random languages. For example, the editor can be asked to edit model predictions on an English input. The edited model is then supervised to enforce that the predictions are updated on the parallel Chinese input and retained on unrelated French inputs. In the next step, the above languages change randomly. In this way, the cross-lingual training formula helps the editor gain cross-lingual transferability.
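As a concrete illustration, the per-step language sampling described above can be sketched as follows (the data layout and function names are illustrative, not from the released code):

```python
import random

def sample_training_languages(languages, parallel_corpus, example_id):
    """Sample editing / updating / retaining languages for one editor
    training step, following the proposed cross-lingual adaptation.

    `parallel_corpus[example_id][lang]` holds parallel expressions of
    the same input across languages.
    """
    l_e = random.choice(languages)          # language of the input to edit
    l_u = random.choice(languages)          # language where the update must land
    l_r = random.choice(languages)          # language of the unrelated input
    x_e = parallel_corpus[example_id][l_e]  # input fed to the editor
    x_u = parallel_corpus[example_id][l_u]  # parallel input: prediction must change
    return l_e, l_u, l_r, x_e, x_u

langs = ["en", "zh", "fr"]
corpus = {0: {"en": "Messi plays for [MASK].",
              "zh": "梅西效力于[MASK]。",
              "fr": "Messi joue pour [MASK]."}}
l_e, l_u, l_r, x_e, x_u = sample_training_languages(langs, corpus, 0)
```

Monolingual training is recovered by fixing all three sampled languages to the same value.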
Besides, we leverage the language anisotropy of multilingual models to further improve cross-lingual model editing. Specifically, we propose to add a group of L0-constrained language-specific masks as the editor's parameters. During editing, the masks instruct the editor to focus on different parts of the raw model's parameters according to the input's language. By training along with the masks, the editor gains the skill of making language anisotropic edits.
Our primary contributions are as follows:
• We define the cross-lingual model editing task and corresponding evaluation metrics.
• We propose a simple yet effective framework to adapt monolingual editing approaches to the cross-lingual scenario.
• We propose language anisotropic model editing to significantly improve cross-lingual model editing.
2 Background: Model Editing

Sinitsin et al. (2020) propose Editable Training (model editing) as an efficient approach to modify the behavior of a trained model on specific inputs, where three core requirements are highlighted:
• Reliability: the edited model acquires the desired behavior on the specific inputs.
• Locality: the edit influences the model on other inputs of interest as little as possible.
• Efficiency: the editing approach should be computationally efficient.
Reliability and locality are essential attributes of the model editing task, while efficiency is required to make the editor usable.
Recent work explores several ways to solve the model editing problem (Sinitsin et al., 2020; Mitchell et al., 2022a; Meng et al., 2022a,b). Despite the variety of algorithms, their training formulas are similar, i.e., training the editor end-to-end on editing data under the conditions of reliability and locality. Specifically, a training step of the editor contains two stages. 1) Editing stage: the editor edits the desired predictions into the raw model f(·; θ), producing the edited model f(·; θ_u).
2) Editor training stage: the edited model is then constrained under the requirements of reliability and locality, corresponding to two core objectives respectively.
For reliability, the edited model needs to make the desired prediction y_e in response to the input x_e. This requirement corresponds to the task loss L_task, e.g., cross-entropy or L2. So we have

L_rel = L_task(f(x_e; θ_u), y_e).

For locality, the edited model needs to retain its predictions on unrelated inputs, meaning that for an unrelated input x_r, the output f(x_r; θ) should be kept. Though a loss similar to L_rel could work in theory, the stronger KL divergence loss is used to minimize the side effect on unrelated labels:

L_loc = KL(f(x_r; θ) ∥ f(x_r; θ_u)).

In addition, other auxiliary objectives can be utilized, which do not affect the training formula.
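For intuition, the two objectives can be sketched numerically. This is a minimal NumPy illustration of the reliability loss (with cross-entropy as L_task) and the locality loss (KL divergence), not the paper's implementation; the logits are made-up values:

```python
import numpy as np

def cross_entropy(logits, target):
    # L_task for classification: -log softmax(logits)[target]
    z = logits - logits.max()
    log_probs = z - np.log(np.exp(z).sum())
    return float(-log_probs[target])

def kl_divergence(p_logits, q_logits):
    # KL(p || q) between the raw and edited model's output distributions
    def log_softmax(x):
        z = x - x.max()
        return z - np.log(np.exp(z).sum())
    log_p, log_q = log_softmax(p_logits), log_softmax(q_logits)
    return float((np.exp(log_p) * (log_p - log_q)).sum())

# Reliability: the edited model should predict y_e on the edit input x_e.
edited_logits_on_xe = np.array([0.1, 2.5, 0.3])
y_e = 1
l_rel = cross_entropy(edited_logits_on_xe, y_e)

# Locality: the edited model should keep the raw distribution on x_r.
raw_logits_on_xr = np.array([1.0, 0.2, -0.5])
edited_logits_on_xr = np.array([1.1, 0.1, -0.4])
l_loc = kl_divergence(raw_logits_on_xr, edited_logits_on_xr)
```

Both terms are minimized jointly when training the editor: L_rel pulls the edited prediction toward y_e, while L_loc penalizes any drift of the output distribution on unrelated inputs.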
Note that the goal is to train the editor instead of the raw model. During training, the gradients propagate through the edited model to the editor. At test time, only the editing stage is needed. Overall, the training of the editor is a meta version of model training, because the "data" the editor processes is the model itself (plus the input-prediction pair to be edited).
3 Cross-Lingual Model Editing

Task Definition
Following the work on monolingual model editing, we retain the idea of making edits with reliability and locality (Sinitsin et al., 2020), while introducing cross-lingual transferability.
Assume we have a model f parameterized by θ that maps an input x to the prediction p = f(x; θ). An update is needed when we want the model to change its prediction from p to y. Here the requirement of cross-lingual transferability brings the key difference: the same input can be expressed in multiple languages, producing parallel sentences. Therefore, an edit with reliability for x should also affect its parallel inputs, denoted I(x).
As in the example in Figure 1, "Messi plays for Paris SG." in English is parallel to its Chinese translation. For locality, the side effect should be as small as possible, which means the prediction on any input x′ ∉ I(x) is retained. Note that under this setting, one edit is always independent of another: the editor revisits θ for every edit and produces the corresponding θ_u. Formally, the goal of the editor is to produce θ_u from (θ, x_e, y_e) such that f(x_u; θ_u) = y_e for all x_u ∈ I(x_e), while f(x_r; θ_u) = f(x_r; θ) for all x_r ∉ I(x_e).

Cross-Lingual Editing Based on Monolingual Approaches
Aside from the requirement of cross-lingual transferability, the requirements of reliability and locality stay the same as in monolingual model editing, being defined by the training data. To fully leverage the monolingual editing approaches and build reasonable baselines, we propose a framework to adapt them to the cross-lingual scenario using the parallel corpus, as illustrated in Figure 2. All we need is a slight change in the training formula of the monolingual editing approaches, namely aligning inputs in different languages. Given x_e in the editing language l_e as the input to be edited and the corresponding desired prediction y_e, the inputs used in the training objectives are sampled over the parallel input set I(x_e).
For reliability, the edited model is asked to update its prediction to y_e on the sampled input x_u ∈ I(x_e) in the updating language l_u. Thus the reliability loss (L_rel) is modified by replacing x_e with x_u. For locality, the sampled input x_r ∉ I(x_e) in the retaining language l_r is used as input, and the locality loss (L_loc) remains the same.
Monolingual editing is a degenerate case where only a single language is considered, i.e., l_e = l_u = l_r. When the languages differ, an editor trained under the above sampling strategy acquires cross-lingual transferability.
Intuitively, the editor should update predictions on semantically identical inputs while not affecting unrelated inputs. In the above cross-lingual adaptation, the reliability loss tells the editor what should be treated as identical, and the locality loss tells it what should be treated as unrelated. Thus, the two losses delineate a semantically equivalent range for the editor across multiple languages, from which the cross-lingual transferability derives. The adaptation we make therefore leverages the parallel corpus to unlock the transferability potential inherent in the model editing task.

Language Anisotropic Editing
A multilingual pre-trained model like mBERT (Devlin et al., 2019) can integrate over one hundred languages in a single model. However, a known phenomenon called the curse of multilinguality (Conneau et al., 2020) exposes the trade-off between the number of languages and the model capacity, implying that the languages tend to compete for the shared parameters. Further, it has been revealed that language-specific and language-universal parameters exist in the multilingual model, which potentially harms its cross-lingual transferability (Wang et al., 2020). All this evidence indicates that the multilingual model is language anisotropic from the perspective of its parameters. Therefore, we introduce a prior: the update should focus more on a certain subset of parameters according to the language of the input to edit. Nevertheless, identifying which language prefers which parameters is not straightforward. Our idea is to drive the editor to find the important parameters during training.
As shown in the top-right part of Figure 2, we realize the idea with a group of learnable language-specific masks. The model editor produces new parameters to update the raw model, so we mask the input/output of the editor to apply an adaptive weighting. For an update in language l, we mask each parameter (tensor) W to be updated as

W̃ = (1 + m_l) ⊙ W,

where ⊙ computes the element-wise product.
The mask operation first passes the whole parameter through via the identity path, then increases the weight of the selected part. We also add an auxiliary L0 loss, a sparsity penalty that makes the mask select only the important components of a parameter. We follow Louizos et al. (2018) to optimize L0 with their re-parameterization approach. Note that the mask is only aware of and applied to the editing language, because we aim to update all the languages simultaneously, making any assumption on the updating or retaining languages meaningless.
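The re-parameterized L0 machinery can be sketched as follows, assuming the hard-concrete distribution of Louizos et al. (2018) with its usual hyperparameters (β, γ, ζ); all shapes and values are illustrative:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def hard_concrete_mask(log_alpha, beta=2/3, gamma=-0.1, zeta=1.1, rng=None):
    """Sample a sparse soft mask via the hard-concrete distribution.
    `log_alpha` plays the role of the learnable parameters of one
    language-specific mask (one entry per dimension)."""
    rng = rng or np.random.default_rng(0)
    u = rng.uniform(1e-6, 1 - 1e-6, size=log_alpha.shape)
    s = sigmoid((np.log(u) - np.log(1 - u) + log_alpha) / beta)
    s_bar = s * (zeta - gamma) + gamma      # stretch to (gamma, zeta)
    return np.clip(s_bar, 0.0, 1.0)        # then clip to [0, 1]

def l0_penalty(log_alpha, beta=2/3, gamma=-0.1, zeta=1.1):
    # Differentiable surrogate for the expected number of non-zero entries.
    return float(sigmoid(log_alpha - beta * np.log(-gamma / zeta)).sum())

log_alpha = np.array([-4.0, -4.0, 3.0, -4.0])  # one mask, hidden size 4
m_l = hard_concrete_mask(log_alpha)
W = np.ones(4)
W_masked = (1.0 + m_l) * W   # identity path keeps W, mask up-weights
                             # the dimensions it selects
```

Minimizing the L0 penalty drives most `log_alpha` entries down, so only a small subset of dimensions receives extra weight, matching the sparsity prior described above.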
Unfortunately, element-wise masks for each language may contain as many parameters as the raw model, causing over-parameterization and a waste of computation. Say h is the hidden size: if the editor predicted the O(h^2) updated parameters (or their gradients) directly, its own parameter count would inflate to an unacceptable O(h^4). Inspired by the capacity of low-rank updating demonstrated in previous model editing work (Cao et al., 2021; Mitchell et al., 2022a), we factorize the full mask matrix into two low-rank matrices, and then construct the updated raw parameters with non-parameterized operations.
The proposed language anisotropic model editing can work with various model editing approaches, while the implementation is specific to the algorithm details. Take a parameter matrix W ∈ R^{n×m} in an MLP layer as an example. By the chain rule, its gradient with respect to the loss L is

∇_W L = x δ^⊤,

where x ∈ R^n is the layer's input, and δ ∈ R^m is the gradient of the layer's output (i.e., the "input" of the backward pass).
For hyper-network based approaches (Cao et al., 2021; Mitchell et al., 2022a), a network g is built to transform the gradients. Hence, we insert the language masks m_l here as

x̃ = (1 + m_l^x) ⊙ g_x(x),  δ̃ = (1 + m_l^δ) ⊙ g_δ(δ).

For other approaches that do not manipulate gradients (Sinitsin et al., 2020), g is the identity transformation, and the language masks do not affect the rest of the editing algorithm.
Finally, we construct the full-sized gradient from the rank-1 predictions:

∇̃_W L = x̃ δ̃^⊤.

The extra parameters and computation are on the order of O(h|L|). Since the size of the language set L is likely in the tens while the hidden size h easily reaches the thousands, the extra time-space cost is tiny compared to the original O(h^2) order. We thus obtain an approach for language anisotropic model editing.
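Putting the pieces together, the rank-1 masked gradient construction can be sketched numerically. An identity editor g is assumed for brevity, and all names and values are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 8, 6                  # layer input / output sizes, W ∈ R^{n×m}

x = rng.normal(size=n)       # layer input from the forward pass
delta = rng.normal(size=m)   # gradient of the layer output (backward pass)

# By the chain rule, the raw gradient of W is the rank-1 outer product.
grad_W = np.outer(x, delta)

# Language-specific rank-1 masks: instead of an O(n*m) mask per
# language, keep one vector per side per language.
m_l_x = rng.uniform(size=n)      # mask on the editor's input side
m_l_delta = rng.uniform(size=m)  # mask on the editor's output side

# Identity editor g (as for approaches that do not transform gradients),
# with the language masks applied to its input/output.
x_tilde = (1.0 + m_l_x) * x
delta_tilde = (1.0 + m_l_delta) * delta

# Reconstruct the full-sized masked gradient from the rank-1 pieces.
grad_W_masked = np.outer(x_tilde, delta_tilde)
```

Only n + m extra mask entries are stored per language, yet the construction re-weights every entry of the n×m gradient, which is exactly the claimed O(h|L|) overhead.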

4 Experiments

4.1 Evaluation
To evaluate cross-lingual model editing approaches, we focus on cross-lingual transferability, while continuing to keep our eyes on reliability and locality. Suppose the languages we focus on make up L, and the corpus is D_L. For l ∈ L, each monolingual subset D_l of the corpus contains a number of tuples (x_k, y_k), meaning we desire the model to predict y_k for the input x_k. The y_k does not need to differ from the raw prediction f(x_k; θ). Taking the union of the datasets in all languages, we have the cross-lingual model editing dataset

D_L = ⋃_{l∈L} D_l.

Inspired by Cao et al. (2021), we propose three cross-lingual model editing metrics. Overall, we distinguish the languages where inputs are to be edited from those where predictions are to be updated. Let D_edit be the set of (input, desired prediction) pairs fed to edit the model, which cause the model predictions on inputs in D_update to be updated. In addition, I(x) = {x′ | x′ is parallel to x} refers to the parallel inputs of a specific input x across the languages of interest.
To measure reliability under cross-lingual transferability, we use editing accuracy: the ratio of predictions successfully updated,

acc = E_{(x_e, y_e) ∈ D_edit} E_{x_u ∈ I(x_e)} 1[f(x_u; θ_u) = y_e].

To measure locality, we use editing consistency: the ratio of predictions retained on inputs unrelated to the edit,

con = E_{(x_e, y_e) ∈ D_edit} E_{x_r ∉ I(x_e)} 1[f(x_r; θ_u) = f(x_r; θ)].

The above two metrics are not necessarily consistent and may even conflict, similar to precision and recall in classification. Thus, we define the editing success rate as their harmonic mean:

succ = 2 × acc × con / (acc + con).
Since evaluating over the full set for each edit has a huge overhead of enumerating every pair of inputs, we follow existing work on model editing (Cao et al., 2021; Mitchell et al., 2022a,b) and estimate it with a mini-batched expectation. Notably, in this work D_edit and D_update are finite datasets. Thus we enumerate each (x_e, y_e) ∈ D_edit, and uniformly sample a subset of a certain size of testing inputs x_u from I(x_e) or x_r from the complement of I(x_e) (for acc or con, respectively) to form the pairs used to calculate the metrics.
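A minimal sketch of estimating the three metrics over finite sets (function and argument names are illustrative, not from the released code):

```python
def editing_metrics(edited_predict, raw_predict, edits, parallel, test_pool):
    """Estimate editing accuracy (acc), consistency (con), and the
    success rate (succ, their harmonic mean) over finite edit/test sets.

    edits        : list of (x_e, y_e) pairs used to edit the model
    parallel[x]  : parallel inputs of x across languages, i.e. I(x)
    test_pool    : inputs sampled for the locality check
    """
    acc_hits = acc_total = con_hits = con_total = 0
    for x_e, y_e in edits:
        for x_u in parallel[x_e]:          # x_u ∈ I(x_e): must be updated
            acc_hits += edited_predict(x_e, y_e, x_u) == y_e
            acc_total += 1
        for x_r in test_pool:
            if x_r in parallel[x_e]:       # keep only x_r ∉ I(x_e)
                continue
            con_hits += edited_predict(x_e, y_e, x_r) == raw_predict(x_r)
            con_total += 1
    acc, con = acc_hits / acc_total, con_hits / con_total
    succ = 2 * acc * con / (acc + con) if acc + con else 0.0
    return acc, con, succ

# Toy example: a perfect editor updates all parallel inputs, retains the rest.
parallel = {"en_q": ["en_q", "zh_q"]}
raw_predict = lambda x: "Barcelona"
edited_predict = lambda x_e, y_e, x: y_e if x in parallel[x_e] else raw_predict(x)
acc, con, succ = editing_metrics(edited_predict, raw_predict,
                                 [("en_q", "Paris SG")], parallel,
                                 ["en_q", "zh_q", "fr_q_other"])
```

An editor that only edits the queried language scores high on con but low on acc (and hence succ), which is exactly the failure mode the harmonic mean is meant to expose.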
To obtain an average metric over all the languages, we calculate the macro average over editing languages. Specifically, to avoid enumerating all language pairs, we mix all the languages into D_update = D_L, and use the single-language edit sets {D_l}_{l∈L} as D_edit one by one before calculating the macro average. The success rate is then calculated from the averaged accuracy and consistency rates.

Baselines
Finetuning As the most common baseline for model editing, we use finetuning (a degenerate editor). Since there is no editor to train, finetuning has no cross-lingual variant and makes no use of the parallel corpus.
Learned Editors Considering that the proposed approaches are compatible with various learned editors, we use three monolingual editors as the basis: Editable Training (Sinitsin et al., 2020), KnowledgeEditor (Cao et al., 2021), and MEND (Mitchell et al., 2022a). We compare the editing performance of each editor with and without our approaches.

Datasets
Following the widely used setting, we construct synthetic editing datasets from existing data (Sinitsin et al., 2020; Cao et al., 2021; Mitchell et al., 2022a,b). We use the knowledge-intensive task mLAMA (Kassner et al., 2021) for fact editing, a natural choice because its predictions involve only specific knowledge, which is prone to change. Nevertheless, a usable dataset with a parallel corpus for another task, such as classification, is lacking due to the difficulty of translating entities. Therefore, to demonstrate the generic task-agnostic editing ability of cross-lingual model editing, we also use the semantics-focused dataset XNLI (Conneau et al., 2018) for error correction.
mLAMA is a multilingual dataset for the knowledge probing task through (masked) language modeling, providing facts expressed as masked sentences in 53 languages. Each fact is a triple ⟨[X], type, [Y]⟩ including two entities, e.g., ⟨Messi, play-for, Paris SG⟩. To produce the textual input, the [X] slot is filled with the entity name and the [Y] slot is replaced with [MASK] tokens, where the number of [MASK] tokens is sampled from the length distribution of entity names in the corresponding language. Note that the translation of an entity may be unseen by the edited model or even nonexistent. Consequently, editing with entity names, which involves the entity linking problem, can be intractable in purely cross-lingual model editing. Therefore, we always treat the entity in the edit input as the desired prediction.
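The construction of a masked edit input from a triple can be sketched as follows, assuming a given per-language length distribution for entity names (all names and values here are illustrative):

```python
import random

def build_masked_input(template, subject, object_length_dist, rng=None):
    """Turn an mLAMA-style triple template into a masked edit input.

    `template` contains [X]/[Y] slots, e.g. "[X] plays for [Y].";
    the [Y] entity is replaced by a sampled number of [MASK] tokens,
    drawn from the (assumed given) per-language length distribution.
    """
    rng = rng or random.Random(0)
    n_masks = rng.choices(list(object_length_dist.keys()),
                          weights=list(object_length_dist.values()))[0]
    masked = " ".join(["[MASK]"] * n_masks)
    return template.replace("[X]", subject).replace("[Y]", masked)

# ⟨Messi, play-for, Paris SG⟩ with an illustrative English length distribution
x_e = build_masked_input("[X] plays for [Y].", "Messi", {1: 0.4, 2: 0.6})
```

Sampling the mask count rather than using the gold entity length avoids leaking the answer's length into the edit input.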
XNLI is a parallel corpus of natural language inference in 15 languages, which can be modeled as a three-way sentence-pair classification task, where we ask the model to predict the relation of a premise-hypothesis pair from {entailment, neutral, contradiction}.
In the model editing scenario, we treat the premise-hypothesis pair as a single input sentence to classify. Unfortunately, since the raw model has already been finetuned on the training and dev sets, a dedicated training setting for error correction cannot be built. Thus, we train the editor to edit arbitrarily, which subsumes the error correction ability. During training, we sample edit inputs over the training set and give a uniformly random label as the desired prediction. To evaluate an editor on reliability, we use data in the test set on which the raw model gives wrong predictions, with the corresponding gold labels as the desired predictions. As for locality, we continue to sample the inputs to be retained over the whole test set.

Cross-Lingual Model Editing
In this part, we demonstrate that the cross-lingual scenario exceeds the capability of monolingual model editing approaches. Specifically, we compare the editing performance of the monolingual approaches and the proposed cross-lingual variants. Recall that we use L to represent the full language set, i.e., the 15 languages of XNLI and the 53 of mLAMA. In the case of XNLI, the data is inherently parallel, while in mLAMA, each language excluding English relies on a subset translated from English. Given this, we train the editors using the English subset to ensure uniform exposure to knowledge during training, thereby mitigating potential issues arising from training set disparities. More specifically, we select en as the editing language and expect the approaches to update predictions across all the languages. Hence, we have l_e = en and l_u, l_r ∈ L during evaluation. Table 1 shows the results averaged over en → all languages, while Figure 3 illustrates the distribution of results across languages.
Finetuning suffers from severe cross-lingual underfitting, as indicated by its low editing accuracy, resulting in a low overall success rate.
The monolingual editors work much better than finetuning. Although they have never seen other languages, the editors demonstrate partial cross-lingual transferability. Moreover, they acquire the ability to perform edits with locality, achieving nearly the highest editing consistency on XNLI.
However, only editors trained with the proposed cross-lingual editing framework truly generalize the desired predictions to inputs in other languages.
On XNLI, editors trained cross-lingually on all languages improve the editing accuracy by a large margin with much less loss of editing consistency, resulting in a large growth in editing success rate. On mLAMA, where the model faces a much larger output space, editors trained on all languages show high consistency and improve all three metrics significantly. Moreover, Figure 3 shows that the performance gap across languages is smaller under the cross-lingual training framework.

Language Anisotropic Model Editing
After confirming the effectiveness of cross-lingual model editing, we conduct experiments to study how the proposed language anisotropic model editing improves performance. Here we always train and evaluate the approaches in all languages (l_e, l_u, l_r ∈ L). Table 2 shows the averaged all → all results, with the per-language distribution plotted in Figure 4. The editors using parallel training data in Table 1 are the same as the editors without language anisotropic model editing in Table 2. The difference is that we no longer limit the editing language, thus the editing task becomes harder, making the results in the two tables not directly comparable.
Finetuning still falls into underfitting across languages, performing similarly to the single-editing-language setting.
With language anisotropic model editing, the performance of the editors reaches a new high on both datasets. Note that on XNLI, the small growth (97.80% → 98.22%) corresponds to a large error reduction (2.20% → 1.78%, 19% relatively).
Though trained with parallel data, a performance gap still exists between some languages and the others. Language anisotropic model editing helps the editors close this performance gap and increases the overall editing success rates.
To illustrate the function of the language-specific masks, we conduct analyses using one of the final MEND-based checkpoints on XNLI. We observe that the parameters of the masks are very close in most dimensions across all languages. However, masks for different languages show preferences on small but distinct dimension subsets. Therefore, we plot the cosine similarities of the learned mask parameters as a heatmap in Figure 5, where we limit the size of the preferred subset to 1% of the full hidden size. The heatmap demonstrates that language anisotropic model editing captures the language anisotropy of the multilingual pre-trained language model. Through adaptively re-weighting the gradients of a small subset of parameters for each language, language anisotropic model editing improves the performance of cross-lingual model editing.

Related Work
Model Editing Sinitsin et al. (2020) initially present the model editing problem and propose a MAML-like method called Editable Training. Our cross-lingual model editing problem definition and metrics mostly extend their work. The proposed language anisotropic model editing approach can be applied to Editable Training by using the rank-1 masks to construct a full gradient/parameter mask.
A series of works model editing as a learning-to-update problem and develop hyper-network based approaches, such as KnowledgeEditor (Cao et al., 2021), MEND (Mitchell et al., 2022a), and SLAG (Hase et al., 2021). They build editors that constrain gradients during finetuning. We gained a lot of inspiration from their work when designing our methods.
Another category of approaches regards the language model as a knowledge base and utilizes a wider range of editing formulas (Santurkar et al., 2021; Meng et al., 2022a,b; Geva et al., 2021; Dai et al., 2022; Mitchell et al., 2022b). Their cross-lingual variants can be obtained using the parallel corpus, while whether language anisotropic model editing works depends on the algorithm details.

Cross-Lingual Transferability
In recent work, multilingual pre-trained language models show their cross-lingual transferability (Devlin et al., 2019; Conneau et al., 2020; Xue et al., 2021; Chi et al., 2021), where the multiple languages included in the training corpus benefit each other. Opposite to this positive cross-lingual transfer, Wang et al. (2020) study the negative interference phenomenon. They show the existence of language-specific parameters, which is also a theoretical basis of our work. Based on this prior, their work and our proposed language anisotropic model editing share similar underlying ideas: identifying the language-specific parameters and using them to improve cross-lingual transferability.
However, our work differs from theirs in both method and task. They leverage language-specific pruning to identify the preferred parameter subset of each language, and then propose an iterative second-order meta-optimization algorithm to improve pre-training. Our approach performs no pruning; the masks play the role of re-weighting coefficients. Our approach also makes no change to the training algorithm, maintaining maximum compatibility with various model editing approaches.

Conclusion
In this work, we define the task and metrics of cross-lingual model editing. After summarizing the training formula of various monolingual model editing approaches, we naturally extend the formula to a cross-lingual variant using the parallel corpus. Further, we propose language anisotropic model editing to improve cross-lingual model editing. We conduct experiments verifying that the cross-lingual model editing problem is necessary and that the proposed approaches are effective.

Limitations
Our work depends mainly on parallel data. Although tasks focusing on language abilities can leverage machine translation to obtain parallel data (Hu et al., 2020), doing so is much harder for tasks about knowledge and facts. Using parallel data to train cross-lingual model editors amounts to full supervision, whereas leveraging weakly labeled data would be needed to mitigate data scarcity.
On the other hand, whether monolingual or cross-lingual, model editing still struggles with the continual learning problem. In the real world, knowledge constantly emerges and fades, so learning can never stop. However, most studies, including our work, focus on a single input or a batch of inputs. Thus, an effective solution for continuously updating a series of inputs is necessary before model editing becomes a practical technique.
Note that our work focuses on the editor's generalized cross-lingual editing ability, and we expect the editor to perform the editing honestly. This capability potentially offers the possibility of modifying model behavior maliciously. Though editing may not become a practical technique soon, the potential risk does exist.
for architectures and training from their corresponding base editors.

B.3 Training Details
We utilize an early-stopping strategy with up to 500,000 training steps. When training on the full datasets, we evaluate the model every 100,000 steps and finalize training when the editing success rate has not improved for 200,000 steps. When training on the English-only subset, the validation interval is set to 20,000 steps and the early-stopping patience to 40,000 steps.
All experiments fit on a single NVIDIA RTX 2080Ti GPU, where a single run takes one to three days.

C Additional Results
The large versions with raw data points of Figure 3 and Figure 4 are as follows.

Figure 1 :
Figure 1: An example of cross-lingual model editing for updating facts, where we represent facts inferred from models as sentences in the dashed boxes. The goal is to update the given fact while retaining unrelated facts. Further, cross-lingual editing requires the edit in one language (e.g., en) to affect all languages (en, zh, ...).

Figure 2 :
Figure 2: The overall framework of the proposed cross-lingual model editing. Each training step consists of two stages. The editor edits the model first; then losses for reliability and locality are obtained from the outputs of the edited model to supervise the editor. The languages for editing/updating/retaining are randomly sampled in each training step to endow the editor with cross-lingual transferability. Our novel language anisotropic model editing applies soft masks according to the editing language, which are supervised using the re-parameterized L0 loss.

Figure 3 :
Figure 3: Editing performance varies across different languages. Training editors with parallel data improves overall editing performance while decreasing the performance variance among languages.

Figure 4 :
Figure 4: Distribution of editing performance across languages. Language Anisotropic Model Editing (LiME for short) provides an overall performance improvement and closes the performance gap across languages.

Figure 5 :
Figure 5: Cosine similarities of the learned parameters of the language-specific masks. In each row, we inspect the top-1% preferred dimensions of a certain language l by value, on which we calculate the cosine similarities between l and every other language.

Figure 7 :
Figure 7: A large version of Figure 4, where each grey point corresponds to an l → all result averaged over three runs. LiME stands for Language Anisotropic Model Editing.

Table 1 :
Experiment results comparing monolingual and cross-lingual model editing approaches. During evaluation, the editing language is limited to en, while the updating and retaining languages contain all languages.

Table 2 :
Experiment results showing that all three editors benefit from language anisotropic model editing on both datasets. All approaches are trained and evaluated in all languages.
Figure 6: A large version of Figure 3, where each grey point corresponds to an en → l result averaged over three runs.