What Knowledge Is Needed? Towards Explainable Memory for kNN-MT Domain Adaptation

kNN-MT presents a new paradigm for domain adaptation by building an external datastore, which usually saves all target-language token occurrences in the parallel corpus. As a result, the constructed datastore is usually large and possibly redundant. In this paper, we investigate the interpretability issue of this approach: what knowledge does the NMT model need? We propose the notion of local correctness (LAC) as a new angle, which describes the potential translation correctness for a single entry and for a given neighborhood. Empirical study shows that our investigation successfully finds the conditions under which the NMT model could easily fail and needs related knowledge. Experiments on six diverse target domains and two language pairs show that pruning according to local correctness brings a lighter and more explainable memory for kNN-MT domain adaptation.

Recently, Khandelwal et al. (2021) proposed kNN-MT, showing a new paradigm for domain adaptation. kNN-MT first explicitly extracts translation knowledge from the target-domain training data into a key-value datastore with a pre-trained NMT model. For each datastore entry, the key is a continuous representation and the value is a symbolic token. The datastore is then used to assist the NMT model during translation. The kNN-MT framework circumvents the need to disturb the parameters of the pre-trained NMT model and enables quick adaptation by switching datastores.
kNN-MT incorporates the symbolic datastore to assist the neural model (Khandelwal et al., 2021; Zheng et al., 2021; Jiang et al., 2021). However, the datastore usually stores all the target tokens in the parallel data, without considering the capability of the neural model. As a result, the datastore is usually huge in size and possibly redundant.
To understand the relationship between the datastore and the NMT model, this paper conducts investigations on the interpretability issue: what knowledge does the NMT model need? Intuitively, the pre-trained NMT model only needs knowledge that remedies its weaknesses. Thus, we propose to explore this issue from the point of local correctness (Section 3). Our local correctness includes two aspects: the correctness of translating a given entry (entry correctness) and, more importantly, the correctness of performing translation in a given neighborhood in the representation space (neighborhood correctness).
For entry correctness, we check whether the NMT model could make a correct translation for the entry itself and accordingly split the datastore entries into two categories, namely known and unknown. Based on entry correctness, we examine neighborhood correctness to more comprehensively evaluate the NMT model's underlying capability. Specifically, we propose a knowledge margin metric to evaluate the maximum size of the neighborhood in which the NMT model could make correct translations. Intuitively, the NMT model may fail when the knowledge margin is small.
To verify our interpretation, we devise a datastore pruning algorithm, PLAC (Pruning with LocAl Correctness), which simply removes entries with a high knowledge margin value (Section 4). These entries are less useful for adaptation, because the NMT model already translates well in their neighborhood.
We conduct experiments on six diverse target domains in two language pairs (Section 6). Compared with existing pruning baselines (Martins et al., 2022; Wang et al., 2022), PLAC prunes more entries (up to 45%) in the four OPUS domains' datastores without hurting translation performance. Through an ablation study, we reveal that relying on entry correctness alone is not enough, showing that the novel knowledge margin metric for neighborhood correctness could be the key to building a light and more explainable memory for kNN-MT domain adaptation.

Background
For NMT domain adaptation, kNN-MT constructs a datastore D based on a given target-domain bilingual corpus C and uses it to provide helpful target-domain translation knowledge to the pre-trained NMT model M. In this section, we briefly introduce kNN-MT and its advanced variant, adaptive kNN-MT (Zheng et al., 2021).

Building a Domain Specific Datastore
Given the target-domain bilingual corpus C, all translation pairs in C are fed into the frozen pre-trained NMT model for decoding with teacher forcing (Williams and Zipser, 1989). At decoding time step t, the hidden state from the last decoder layer h(x, y_{<t}) is taken as the key and the t-th target token y_t is taken as the value, resulting in a key-value pair. For the entire corpus, the datastore D consists of all such key-value pairs:

$$\mathcal{D} = \{ (h(x, y_{<t}),\, y_t) \mid \forall y_t \in y,\ (x, y) \in \mathcal{C} \} \quad (1)$$

where y_{<t} denotes the previous tokens in the sequence y. Each entry in the datastore explicitly memorizes one piece of translation knowledge: generating the value token at the decoder hidden state key. The datastore covers all target-language token occurrences.
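The construction loop above can be sketched as follows. This is a minimal sketch, not the authors' released code: `hidden_fn` is a hypothetical stand-in for the frozen NMT decoder's last-layer hidden state under teacher forcing (a real implementation would run the NMT model batch-wise).

```python
import numpy as np

def build_datastore(corpus, hidden_fn):
    """Build a kNN-MT key-value datastore from a parallel corpus.

    corpus: iterable of (src_tokens, tgt_tokens) pairs.
    hidden_fn(src, tgt_prefix) -> np.ndarray: hypothetical stand-in for
    the frozen decoder's last-layer hidden state under teacher forcing.
    Returns (keys, values): keys[i] is the hidden state at step t,
    values[i] is the gold target token y_t at that step.
    """
    keys, values = [], []
    for src, tgt in corpus:
        for t, y_t in enumerate(tgt):
            # Teacher forcing: condition on the gold prefix y_<t.
            keys.append(hidden_fn(src, tgt[:t]))
            values.append(y_t)
    return np.stack(keys), values
```

One entry is produced per target-token occurrence, which is exactly why the datastore grows linearly with the training corpus.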

Translating with the Datastore
During inference, given a source-language sentence x, kNN-MT simultaneously leverages M and D to generate the target-language translation y = {y_1, y_2, ..., y_{|y|}}. More specifically, at decoding time step t, kNN-MT queries the datastore with the decoder hidden state h(x, y_{<t}) generated by M. The k nearest neighbors of the query are retrieved, i.e., the k entries whose keys are closest to the query according to the squared-L2 distance d. The retrieved entries are converted into a distribution over the vocabulary:

$$p_{\mathrm{kNN}}(y_t \mid x, y_{<t}) \propto \sum_{(h_i, v_i)} \mathbb{1}[y_t = v_i] \exp\!\left(\frac{-d(h_i, h(x, y_{<t}))}{T}\right) \quad (2)$$

where T is the temperature. Then, kNN-MT interpolates p_kNN with the pre-trained NMT model's output distribution to obtain the final translation distribution:

$$p(y_t \mid x, y_{<t}) = \lambda\, p_{\mathrm{kNN}}(y_t \mid x, y_{<t}) + (1 - \lambda)\, p_{\mathrm{NMT}}(y_t \mid x, y_{<t}) \quad (3)$$

The complete translation y can be generated by beam search.
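Equations (2) and (3) can be sketched in a few lines; this is a toy brute-force version (real systems use a Faiss index for the search, and vocabulary sizes are far larger):

```python
import numpy as np

def knn_distribution(query, keys, values, vocab, k=2, T=10.0):
    """p_kNN over the vocabulary from the k nearest datastore entries,
    following Eq. (2): squared-L2 distances, temperature-scaled weights,
    aggregated per retrieved value token, then normalized."""
    d = ((keys - query) ** 2).sum(axis=1)      # squared-L2 distance to every key
    nn = np.argsort(d)[:k]                     # indices of the k nearest entries
    w = np.exp(-d[nn] / T)                     # unnormalized retrieval weights
    p = np.zeros(len(vocab))
    for weight, idx in zip(w, nn):
        p[vocab.index(values[idx])] += weight  # sum weights of entries sharing a value
    return p / p.sum()

def final_distribution(p_knn, p_nmt, lam):
    """Eq. (3): interpolate the kNN distribution with the NMT output."""
    return lam * p_knn + (1.0 - lam) * p_nmt
```

In vanilla kNN-MT, λ and T are fixed hyper-parameters; adaptive kNN-MT (next subsection) replaces the fixed combination with a learned meta-k network.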

Adaptive kNN-MT
For vanilla kNN-MT, the selection of hyper-parameters such as k and λ highly affects the final translation performance, which is less stable across languages and domains. Adaptive kNN-MT uses a lightweight meta-k neural network to dynamically determine the usage of retrieved entries, which avoids hyper-parameter tuning and achieves more stable performance (Zheng et al., 2021).
3 What Knowledge Does the NMT Model Need?
Although less accurate, the pre-trained NMT model could perform translation without the datastore. This fact suggests that the NMT model already knows some bilingual knowledge of the target domain. However, the construction of the datastore dismisses this point and results in a huge number of entries being stored.
Intuitively, the pre-trained NMT model only needs knowledge that remedies its weaknesses. To find these weaknesses and build a more explainable memory, we start by investigating entry correctness. Based on this basic concept, we further study neighborhood correctness and find that it precisely reflects the NMT model's strengths and weaknesses.

Known vs. Unknown for Entry Correctness
The capability of the NMT model in the target domain is difficult to describe directly. However, as the datastore consists of entries constructed on the training set, it is easy to check whether the NMT model could make a correct translation for each of them. This can be efficiently accomplished by an extra evaluation during the teacher-forcing decoding. More specifically, at each time step t of the teacher-forcing process, we not only record the hidden state h(x, y_{<t}) and the correct target token y_t, but also evaluate the prediction of the NMT model y′_t, which is the target token with the highest probability p_NMT(y′_t | x, y_{<t}). Then we call an entry a known entry if the NMT model could predict it correctly, and unknown otherwise (Equation 4).
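The known/unknown check described above amounts to one extra argmax comparison per teacher-forcing step. A minimal sketch, where `predict_fn` is a hypothetical stand-in for the frozen NMT model's top-1 prediction:

```python
def classify_entries(corpus, predict_fn):
    """Label each datastore entry as known (True) / unknown (False).

    predict_fn(src, tgt_prefix) -> token: hypothetical stand-in for the
    frozen NMT model's argmax prediction under teacher forcing.
    An entry is 'known' iff the model's top-1 prediction y'_t equals
    the gold target token y_t at that time step.
    """
    labels = []
    for src, tgt in corpus:
        for t, y_t in enumerate(tgt):
            y_pred = predict_fn(src, tgt[:t])   # evaluated during teacher forcing
            labels.append(y_pred == y_t)        # True -> known, False -> unknown
    return labels
```

Because the same teacher-forcing pass already produces the datastore keys, these labels come at essentially no extra decoding cost.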
Obviously, the unknown entries in the datastore are important, because these are the points where the NMT model tends to make a mistake.

The Knowledge Margin Metric for Neighborhood Correctness
However, entry correctness alone could not fully reveal the NMT model's weaknesses, because for known entries the NMT model may still fail during inference, where the context could be similar but not identical. Considering that the contextualized representations of similar contexts stay close in the representation space (Peters et al., 2018), we propose to investigate the NMT model's translation performance in a neighborhood. We propose a metric called knowledge margin, denoted as km, to measure neighborhood correctness. Given an entry (h, y), its neighborhood N_k(h) is defined by its k nearest neighbors in the datastore. The knowledge margin of the entry, km(h), is defined as:

$$km(h) = \max \{\, i \le k : \text{all entries in } N_i(h) \text{ are known} \,\} \quad (5)$$

Intuitively, km is the maximum size of the neighborhood of the entry h in which the NMT model could make correct translations. If considering at most k nearest neighbors of h, the knowledge margin will be a number between 0 and k.
Please note that the definition of knowledge margin applies to any point in the representation space, because for each point (e.g., an actual query q during inference), its neighborhood N_k(q) can be defined by querying the datastore. This extension allows the investigation of the NMT model at any given point in the representation space.
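Under the definition in Equation (5), computing km for a query is a walk outward from the query until the first unknown neighbor. A sketch with brute-force squared-L2 search standing in for the Faiss index used in practice:

```python
import numpy as np

def knowledge_margin(query, keys, known, k=8):
    """km(q): the largest i <= k such that the i nearest datastore
    entries around the query are all known (Eq. (5)).

    keys: (N, d) array of datastore keys; known: length-N booleans
    from the entry-correctness check.
    """
    d = ((keys - query) ** 2).sum(axis=1)
    nn = np.argsort(d)[:k]        # neighbours ordered by distance
    km = 0
    for idx in nn:                # walk outward from the query
        if not known[idx]:        # the first unknown neighbour stops the margin
            break
        km += 1
    return km
```

A margin of 0 means even the closest retrieved entry is one the NMT model gets wrong, which is exactly the failure condition analyzed empirically below.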

Empirical Analysis
We now present an empirical analysis of the relationship between the NMT model and the datastore, and reveal the NMT model's weaknesses.
Settings We follow Zheng et al. (2021) and consider four domains of the German-English OPUS dataset (Tiedemann, 2012) as target domains. Table 1 lists the statistics of the four domains. For the pre-trained NMT model, we use the winner model of the WMT'19 German-English news translation task (Ng et al., 2019). The datastore for each domain is constructed on the corresponding training set with the pre-trained NMT model.

Entry Correctness
We collect statistics about the two categories of entries and report the results in Table 2. The results show that 56%∼73% (averaging 66.7%) of datastore entries are known by the pre-trained NMT model. This high ratio strongly indicates that a large number of datastore entries may be redundant.

Neighborhood Correctness
We measure the neighborhood correctness of each datastore entry and plot the distribution of knowledge margin for known and unknown entries in Figure 1 (k = 2048). The distributions on the four OPUS domains show the same trends. Most unknown entries have a very low knowledge margin, e.g., around 90% of unknown entries have a margin value between 0 and 4. In contrast, the distribution for known entries is more diverse. The results indicate that neighborhood correctness is consistent with entry correctness, but provides more information for known entries.
To verify the relation between the knowledge margin and the NMT model's translation ability, we conduct experiments on the development set of each domain, where the translation contexts are unseen. For each token y_t in the dev set, we perform teacher forcing until time step t−1 and query the datastore for the neighborhood at time step t. We then evaluate the knowledge margin of the query and the prediction accuracy of the NMT model.
Figure 2 shows the results. For tokens with higher margins, e.g., km ≥ 32, the prediction accuracy of the NMT model is higher than 95%. In contrast, for tokens with lower margins, e.g., km < 4, the accuracy is lower than 50%. This is strong evidence that the NMT model could easily fail when the knowledge margin is small.
In Table 3, we also show a translation example for such a condition, where the knowledge margin of the current query is 0 and the NMT model fails to generate the last subword of "Cyanokit".

Building Explainable Memory Based on Local Correctness
Local correctness is a good indicator of translation failures; it can also be interpreted as the importance of datastore entries. To verify this interpretation, we propose a pruning algorithm, Pruning with LocAl Correctness (PLAC), to cut off entries with a high knowledge margin (Algorithm 1).
There are two steps in the algorithm. In the first step, each entry (h, y) in the datastore D is checked for its local correctness. If the knowledge margin of (h, y) is greater than or equal to the threshold k_p, the entry is collected as a pruning candidate.
Table 3: An example where the NMT model fails (sentences are tokenized into subwords). At the current time step, all retrieved entries are unknown to the NMT model, so the knowledge margin is 0 and the prediction of the NMT model is highly likely to be wrong; with the retrieved entries, kNN-MT makes the correct prediction. The retrieved contexts all end in the drug name "Cy@@ an@@ ok@@ it" (e.g., "Warum wurde Cy@@ an@@ ok@@ it zugelassen?" / "Why has Cy@@ an@@ ok@@ it"; "Wie wirkt Cy@@ an@@ ok@@ it?" / "How does Cy@@ an@@ ok@@ it"). NMT's prediction (y′_t): "ite"; correct target token (y_t): "it".
In the second step, entries are randomly removed from the candidates until the desired pruning ratio is reached. The pruned datastore can be used in different kNN-MT models, such as adaptive kNN-MT.
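The two steps of PLAC can be sketched as follows. This is our reading of Algorithm 1, not the released implementation: brute-force neighbor search stands in for Faiss, and the random selection in the second step is an assumption based on the tunable pruning ratio r described in the paper.

```python
import numpy as np

def plac_prune(keys, known, k_p=8, k=8, ratio=0.3, seed=0):
    """PLAC sketch: entries whose knowledge margin is >= k_p become
    pruning candidates (step 1); entries are then removed at random
    from the candidates until the pruning ratio is reached (step 2).
    Returns the indices of surviving datastore entries."""
    n = len(known)
    candidates = []
    for i in range(n):
        d = ((keys - keys[i]) ** 2).sum(axis=1)
        nn = np.argsort(d)[1:k + 1]        # k nearest, excluding the entry itself
        km = 0
        for j in nn:
            if not known[j]:               # first unknown neighbour ends the margin
                break
            km += 1
        if km >= k_p:
            candidates.append(i)           # NMT already translates well around here
    rng = np.random.default_rng(seed)
    n_prune = min(len(candidates), int(ratio * n))
    pruned = set(rng.choice(candidates, size=n_prune, replace=False).tolist()) if n_prune else set()
    return [i for i in range(n) if i not in pruned]
```

Note that k_p caps the achievable pruning ratio: if few entries have a margin of at least k_p, the candidate pool is small, which matches the trade-off discussed in Section 6.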

Experiment Setup
This section introduces the general experiment setup for evaluating the pruning effect. More implementation details can be found in Appendix C.

Data and Processing
We conduct datastore pruning for 6 different domains from 2 language pairs: 4 OPUS domains for De-En experiments and 2 UM domains for Zh-En experiments (Tian et al., 2014). For preprocessing, we use the Moses toolkit to tokenize the German and English corpora and jieba to tokenize the Chinese corpus. Byte pair encoding (BPE) is applied for subword segmentation.

Pre-trained NMT Model
For De-En tasks, we use the winner model of the WMT'19 De-En news translation task, which is based on the Transformer architecture (Vaswani et al., 2017). For Zh-En tasks, we train a base Transformer model from scratch on the CWMT'17 Zh-En dataset (9 million sentence pairs), since we could not find any publicly available Zh-En pre-trained NMT model.
The pre-trained NMT model is the unadapted general domain model for each language pair, which is the starting point for domain adaptation.For kNN methods, it also serves as the base for building the datastore.

Systems for Comparison
We report the performance of the following systems for reference: the pre-trained NMT model (Base), the pre-trained model fine-tuned on each target domain (Finetune) (Luong and Manning, 2015), and adaptive kNN-MT with a full datastore built for each target domain on its training set (Adaptive kNN) (Zheng et al., 2021). Finetune and Adaptive kNN are two popular alternatives for adaptation.
The following pruning methods are applied to the datastore of Adaptive kNN for comparison: random pruning (Random), cluster-based pruning (Cluster) (Wang et al., 2022), merging similar entries (Merge) (Martins et al., 2022), randomly pruning known entries (Known), and pruning all known entries (All Known). Among them, Cluster and Merge are pruning methods based on the context similarity of different entries.
6 Experiment Results and Analysis

Safely Pruning with PLAC
Experiment results on the OPUS domains are presented in Table 5. For reference, the pre-trained NMT model usually does not translate well on the target domains, while Finetune and Adaptive kNN have comparable performance.
We perform datastore pruning with PLAC for different domains and report the largest pruning ratio without significant performance degradation on the test set.
Compared with the full datastore (Adaptive kNN), our method (PLAC) cuts off 25%–45% of the datastore entries while achieving comparable performance. On the two largest domains, "OPUS-Medical" and "OPUS-Law", our method successfully prunes 45% of the datastore (millions of key-value pairs). This excellent pruning performance validates our analysis concerning local correctness.
Cluster and Merge lead to a larger degradation of translation performance, showing that entries with identical target tokens indeed have different importance in assisting the NMT model. Simply pruning all known entries results in a significant drop in performance (All Known). Pruning known entries to the same ratio as PLAC also leads to degradation (Known), although it outperforms Cluster and Merge. These comparisons indicate that entry correctness only partially reflects entry importance, demonstrating the necessity of the neighborhood correctness analysis with the knowledge margin.
The results on the UM domains are presented in Table 6. The datastore could be pruned by 30% for "UM-Law" and 15% for "UM-Thesis" without any sacrifice in translation performance. The other findings are similar to those in the German-English experiments.

How Does the Knowledge Margin Affect Pruning Performance?
In this section, we examine how the knowledge margin affects pruning performance and provide more insight into our proposed method. Figure 3 plots the BLEU scores of adaptive kNN-MT models with pruned datastores under different pruning ratios on the development sets. The trends are mostly similar across domains. Pruning by PLAC achieves the best performance over the other baselines, and its performance is more stable even at higher pruning ratios. Note that Known is a case where neighborhood correctness is ignored during entry pruning. Although it outperforms Random, Cluster and Merge in most scenarios, its performance is still unstable.
When tuning the hyper-parameter k_p among {4, 8, 16, 32}, we can see a trade-off between the BLEU score and the pruning ratio. A large k_p leads to a small sacrifice of BLEU score but a lower pruning ratio. A small k_p allows us to prune more entries but causes a significant BLEU decline after a specific threshold ratio. For example, when k_p = 4, it is possible to prune 55% of the "OPUS-Medical" datastore, but translation performance declines drastically after the pruning ratio reaches 50%. Finally, we choose the top-right point in each subfigure as the best-performing setting for each domain, which is used in the other experiments (hyper-parameter values of these points are reported in Appendix C).

Datastore Entries With Lower Knowledge Margin Are Indeed Valuable

In this section, we verify that entries with a low knowledge margin are truly important for NMT adaptation. For this purpose, we remove entries from the datastore with a reversed strategy, i.e., pruning (h, y) if its knowledge margin is less than k_p.
Table 7 shows the pruning effect. Pruning entries with the reversed strategy suffers a significant performance decline even at a small pruning ratio, demonstrating the importance of these entries for domain adaptation. We also show some cases for each domain in Table 8: the target tokens of these valuable entries are more domain-specific, e.g., "dose" and "Executive".
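The reversed strategy simply flips PLAC's candidate condition. As a sketch (with `margins` assumed precomputed per entry, e.g. by the knowledge-margin procedure of Section 3):

```python
def reversed_candidates(margins, k_p):
    """Ablation: mark the LOW-margin entries (knowledge margin < k_p),
    i.e. exactly the entries PLAC would keep, as pruning candidates.
    Removing these is expected to hurt translation quality."""
    return [i for i, km in enumerate(margins) if km < k_p]
```

Because these are the entries where the NMT model fails in the neighborhood, deleting them removes precisely the knowledge the datastore exists to supply.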

PLAC Is Applicable to Different kNN-MT Variants
For a more comprehensive evaluation, we plug our pruned datastore into different kNN-MT variants, i.e., vanilla kNN-MT (Khandelwal et al., 2021), KSTER (Jiang et al., 2021), and adaptive kNN-MT. Experiment results on the OPUS-Law domain show that our pruned datastore does almost no harm to the translation performance of the different variants, demonstrating the effectiveness of PLAC.
Table 8 (excerpts): example sentences containing the underlined low-margin target tokens, e.g. "The Executive Board shall decide on the organisation of its meetings.", "You may have encounter@@ ed a bu@@ g in the program.", and, for OPUS-Koran, "Das ist eine schmerz@@ hafte P@@ ein." with target sentence "That would be a grie@@ v@@ ous aff@@ li@@ ction."

Table 9: Memory space (MB) comparison between the pruned and full datastores. "Space" denotes the memory space taken by the index file and "∆" denotes the percentage of space saved by our method.

Pruned Datastore Occupies Less Memory Space
In practice, the datastore must be loaded into CPU and GPU memory during inference, so its size affects efficiency. Since a Faiss index is used to represent and search the datastore, we compare the size of the index file before and after pruning (Table 9). For all domains, our pruning method PLAC significantly reduces memory occupation, and the ratio of saved memory space is roughly identical to the PLAC pruning ratio. For the largest datastore, "OPUS-Law", the memory space is reduced by 44%.

Related Work
Little attention has been paid to the interpretability of kNN-MT. To the best of our knowledge, we are the first to systematically study the relationship between the NMT model and the datastore. As for datastore pruning, Wang et al. (2022) and Martins et al. (2022) prune the datastore based on the hypothesis that entries with similar translations are redundant. In fact, entries with similar translations may have different importance to the translation. Our analysis suggests one way to understand these differences.

Conclusion
It is interesting to explore how a neural model and a symbolic model work together. In this paper, we propose to analyze the local correctness of the neural model's predictions to identify the conditions under which the neural model may fail. By introducing the knowledge margin metric to measure local correctness, we find that the NMT model often fails when the knowledge margin is small. These results provide support for building a more explainable machine translation system.
Based on these analyses, we can safely prune the datastore with the proposed PLAC method. Empirically, the datastore could be successfully pruned by up to 45% while retaining translation performance. These results validate our earlier findings about local correctness and translation failures.
Our method generalizes to different kNN-MT variants and is easy to implement. A future direction may be to use local correctness to explore more interpretability issues of NMT domain adaptation, e.g., catastrophic forgetting.

Limitation
During inference, kNN-MT has to query the datastore at each decoding step, which is time-consuming. Although up to 45% of datastore entries can be safely pruned by our method, deploying a high-quality kNN-MT system with fast inference speed remains an open challenge.

Ethical Considerations
In kNN-MT works, the symbolic datastore helps adaptation but also introduces privacy concerns. Since kNN-MT explicitly saves all target-language tokens in the datastore, there is a risk of privacy leakage. In the future, more effort may be put into addressing this issue.

B Involved Scientific Artifacts
In this section, we list the artifacts used in our project:
Moses (LGPL-2.1 License): a statistical machine translation system that allows you to automatically train translation models for any language pair.
Jieba (MIT License): a library for Chinese word segmentation.
Subword-nmt (MIT License): a package containing preprocessing scripts to segment text into subword units.
Fairseq (MIT License): a sequence modeling toolkit that allows researchers and developers to train custom models for translation, summarization, language modeling, and other text generation tasks.
Faiss (MIT License): a library for efficient similarity search and clustering of dense vectors.
For the sake of ethics, our use of these artifacts is consistent with their intended use.

C Implementation Details
We implement adaptive kNN-MT with Zheng et al. (2021)'s released code and scripts based on fairseq (Ott et al., 2019). Due to the large hyper-parameter space, we follow Zheng et al. (2021) and set the number of retrieved entries (k_a) to 8 when training adaptive kNN-MT models for most experiments, and report the pruning performance under different k_a in Appendix D. During inference, we set the beam size to 5 and the length penalty to 1.0.
When implementing PLAC, the hyper-parameter k_p in Algorithm 1 implicitly determines the maximum number of entries that are allowed to be pruned, so we tune k_p within {4, 8, 16, 32} for each pruning ratio r.
After building the datastore, we follow previous kNN-MT works (Khandelwal et al., 2021; Zheng et al., 2021) and use a Faiss index (Johnson et al., 2019) to represent the datastore and accelerate the nearest neighbor search.
In Table 11, we report the hyper-parameters used to reproduce our main results in Tables 5 and 6. In our experiments, it takes at most 1.5 GPU hours to train an adaptive kNN-MT model on a single NVIDIA Titan RTX.

D Pruning Effect Is Insensitive to the Hyper-parameter k_a
To demonstrate the reliability of our pruned datastore, after pruning we train adaptive kNN-MT models with different values of the hyper-parameter k_a and evaluate their translation performance (BLEU) on the "OPUS-Law" domain's test set (Table 12). The results show that our pruning method achieves consistent performance under different k_a.

Figure 1: The ratio distribution over different knowledge margin values for known and unknown entries on four OPUS domains.

Figure 2: The NMT model's prediction accuracy at positions with different margin values on the OPUS domains' unseen development sets.

Figure 3: BLEU scores of adaptive kNN-MT models with pruned datastores on different domains' development sets. Different symbols represent different ways of pruning the datastore.

Table 1: Number of sentences in the OPUS dataset.

Table 2: Statistics of the known and unknown entries for the pre-trained NMT model on the four OPUS domains' training sets. The number of entries and the ratio of known entries are listed.
Note: we split the original UM training set into training, development, and test sets because no development set is provided in the original dataset and there is an overlap between the original training and test sets. A detailed description of the UM domains can be found in Appendix A.

Table 4: Detailed statistics of the UM dataset. We report the sentence number of each subset. "Train", "Dev", and "Test" denote the training, development, and test sets respectively.

Table 5: Pruning effect on four OPUS domains. "Ratio" denotes the pruning ratio. Higher "BLEU" and "COMET" scores indicate better translation quality. "*" means that the performance decline is statistically significant (p < 0.05).

Table 7: Translation performance difference (BLEU) compared with Adaptive kNN using the full datastore, under different pruning ratios.

Table 8: Case study of the remaining knowledge in different domains' pruned datastores. The underlined parts are target tokens of entries with small margin values.

Table 10: Translation performance (BLEU) of different kNN-MT variants with the full and pruned datastores on the OPUS-Law domain's test set.

Table 11: Hyper-parameters for pruning the datastore and training adaptive kNN-MT models.

Table 12: Pruning performance under different k_a. Results show that our pruning method achieves consistent performance under different k_a.