Nearest Neighbor Machine Translation is Meta-Optimizer on Output Projection Layer

Nearest Neighbor Machine Translation ($k$NN-MT) has achieved great success in domain adaptation tasks by integrating pre-trained Neural Machine Translation (NMT) models with domain-specific token-level retrieval. However, the reasons underlying its success have not been thoroughly investigated. In this paper, we comprehensively analyze $k$NN-MT through theoretical and empirical studies. Initially, we provide new insights into the working mechanism of $k$NN-MT as an efficient technique to implicitly execute gradient descent on the output projection layer of NMT, indicating that it is a specific case of model fine-tuning. Subsequently, we conduct multi-domain experiments and word-level analysis to examine the differences in performance between $k$NN-MT and entire-model fine-tuning. Our findings suggest that: (1) Incorporating $k$NN-MT with adapters yields comparable translation performance to fine-tuning on in-domain test sets, while achieving better performance on out-of-domain test sets; (2) Fine-tuning significantly outperforms $k$NN-MT on the recall of in-domain low-frequency words, but this gap could be bridged by optimizing the context representations with additional adapter layers.


Introduction
In recent years, Nearest Neighbor Machine Translation (kNN-MT) and its variants (Khandelwal et al., 2021; Zheng et al., 2021a,b; Jiang et al., 2021; Wang et al., 2022a) have provided a new paradigm and achieved strong performance for fast domain adaptation through retrieval pipelines. Unlike model fine-tuning, which requires additional parameter updates or introduces external adapter layers, kNN-MT combines traditional Neural Machine Translation (NMT) models (Bahdanau et al., 2015; Vaswani et al., 2017; Hassan et al., 2018) with a token-level k-nearest-neighbor retrieval mechanism. This allows for direct access to domain-specific datastores, improving translation accuracy without the need for supervised fine-tuning. Although kNN-MT has achieved great success in domain adaptation tasks, its working mechanism is still an open problem that has not been thoroughly investigated.
In this paper, we propose a novel perspective to understand kNN-MT by describing it as a special case of fine-tuning, specifically a process of meta-optimization on the Output Projection Layer (OPL) of NMT, and establish connections between kNN-MT and model fine-tuning (Section 3). Our novel perspective on kNN-MT posits that (i) the working mechanism of kNN-MT is to implicitly execute gradient descent on OPL, producing meta-gradients via forward computation based on k-nearest-neighbors, and (ii) explicit fine-tuning on OPL shares a similar gradient format with the meta-gradients obtained by kNN-MT, according to the derivation of back-propagation. As illustrated in Figure 1, kNN-MT and explicit OPL fine-tuning share a dual view of gradient descent-based optimization. The key difference between them lies in the method for computing gradients: kNN-MT produces meta-gradients through forward computation and interpolation, while explicit fine-tuning computes gradients of OPL via back-propagation. Hence, it is reasonable to understand kNN-MT as an implicit form of model fine-tuning.
To provide empirical evidence for our understanding, we carry out experiments based on multi-domain datasets (Section 4.1). Specifically, we compare the model predictions of kNN-MT and explicit OPL fine-tuning on five domain adaptation tasks. As expected, the predictions of kNN-MT are highly similar to those of explicit OPL fine-tuning. These findings support our understanding that kNN-MT performs implicit OPL fine-tuning. Next, we conduct comprehensive multi-domain experiments and word-level analysis to examine the differences in translation performance between kNN-MT and other popular fine-tuning methods, such as entire-model fine-tuning and adapter-based fine-tuning (Sections 4.2 and 4.3). Our empirical results suggest that: (i) Introducing kNN-MT on top of adapter-based fine-tuning obtains comparable translation performance to entire-model fine-tuning on in-domain test sets, while achieving better performance on out-of-domain test sets. (ii) Entire-model fine-tuning significantly outperforms kNN-MT in terms of the recall of in-domain low-frequency words, but this difference can be mitigated by optimizing the context representations with lightweight adapter layers.

Neural Machine Translation
NMT employs an encoder-decoder model with neural networks parameterized by $f_\theta$ to establish the mapping between the source sentence $x$ and its corresponding target sentence $y$. For the decoding stage, at time step $m$, NMT utilizes the context representation $h \in \mathbb{R}^{d_{in}}$, which is generated from the source sentence $x$ and the current target context $\hat{y}_{<m}$, to predict the next-token probability:

$$p_{\mathrm{NMT}}(y_m \mid x, \hat{y}_{<m}) = \mathrm{softmax}(W_O h), \quad (1)$$

where $W_O \in \mathbb{R}^{|\mathcal{Y}| \times d_{in}}$ represents the parameter matrix of OPL in the NMT model and $|\mathcal{Y}|$ is the vocabulary size.

Nearest Neighbor Machine Translation

Khandelwal et al. (2021) propose kNN-MT that enhances pre-trained NMT models on the general domain by incorporating a translation memory retriever. It enables the models to leverage external in-domain knowledge and improve the quality of in-domain translations. This approach is generally formulated in two processes: datastore construction and inference with kNN retrieval. The datastore is a translation memory that converts bilingual sentence pairs into a set of key-value pairs. For a given target-domain bilingual corpus $\{(x, y)\}$, the context representation $f_\theta(x, y_{<m})$ generated by the pre-trained NMT model at each timestep $m$ is used as the key, and the $m$-th target token $y_m$ is treated as the corresponding value, resulting in a key-value pair. The entire corpus contributes to the datastore $\mathcal{D}$, which is comprised of all key-value pairs:

$$\mathcal{D} = \big\{\, (f_\theta(x, y_{<m}),\; y_m) \;\big|\; \forall y_m \in y,\; (x, y) \,\big\}. \quad (2)$$
During inference, the model utilizes the current context representation $h = f_\theta(x, \hat{y}_{<m})$ at the $m$-th decoding step to produce a probability distribution over a restricted vocabulary obtained through a nearest-neighbor approach:

$$p_{k\mathrm{NN}}(y_m \mid x, \hat{y}_{<m}) \propto \sum_{(K_j, V_j) \in \mathcal{N}(h)} \mathbb{1}_{y_m = V_j} \exp\!\Big(\frac{-d(K_j, h)}{T}\Big), \quad (3)$$

where $T$ denotes the temperature to control the sharpness of the softmax function and $\mathcal{N}(h) = \{(K_j, V_j)\}_{j=1}^{k}$ is the set of $k$ nearest neighbors retrieved from $\mathcal{D}$ using a pre-defined distance function $d(\cdot, \cdot)$. In practice, we can use either the dot-product function or the negative $l_2$ distance to implement $d(\cdot, \cdot)$. Xu et al. (2023) have demonstrated that the performance of these two functions is almost identical, so we adopt the dot-product function for theoretical analysis in this paper. Finally, kNN-MT interpolates the vanilla NMT prediction $p_{\mathrm{NMT}}$ with the kNN prediction $p_{k\mathrm{NN}}$ to obtain the final next-token probability:

$$p(y_m \mid x, \hat{y}_{<m}) = \lambda\, p_{k\mathrm{NN}}(y_m \mid x, \hat{y}_{<m}) + (1 - \lambda)\, p_{\mathrm{NMT}}(y_m \mid x, \hat{y}_{<m}), \quad (4)$$

where $\lambda$ is a tuned interpolation coefficient. In addition, this prediction could also be substituted with other kNN variants (Zheng et al., 2021a; Wang et al., 2022a; Dai et al., 2023b) to achieve better model performance or inference speed.
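To make the two-step procedure concrete, the following minimal Python sketch (toy shapes and hypothetical helper names, not the Fairseq implementation; dot-product similarity is used as adopted above) builds a small datastore and computes the interpolated next-token distribution of Equations (3) and (4):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# --- Datastore construction (Equation 2) ---
# keys[i]  : context representation f_theta(x, y_<m) for one reference position
# values[i]: id of the target token y_m observed at that position
rng = np.random.default_rng(0)
d_in, vocab, n_entries = 4, 10, 100
keys = rng.normal(size=(n_entries, d_in))          # datastore keys
values = rng.integers(0, vocab, size=n_entries)    # datastore values (token ids)

# --- Inference with kNN retrieval (Equations 3 and 4) ---
def knn_mt_step(h, W_O, k=8, T=10.0, lam=0.5):
    # vanilla NMT distribution: p_NMT = softmax(W_O h)          (Equation 1)
    p_nmt = softmax(W_O @ h)
    # retrieve the k nearest neighbors under dot-product similarity
    scores = keys @ h
    top = np.argsort(-scores)[:k]
    # p_kNN aggregates neighbors that share the same target token (Equation 3)
    weights = softmax(scores[top] / T)
    p_knn = np.zeros(vocab)
    np.add.at(p_knn, values[top], weights)
    # interpolate the two distributions                          (Equation 4)
    return lam * p_knn + (1 - lam) * p_nmt

h = rng.normal(size=d_in)                 # current context representation
W_O = rng.normal(size=(vocab, d_in))      # output projection layer
print(knn_mt_step(h, W_O).sum())          # ~1.0, a valid distribution
```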

Dual Form Between Gradient Descent Based Optimization and Attention

Irie et al. (2022) present that linear layers optimized by gradient descent have a dual form of linear attention, which motivates us to view kNN-MT as a meta-optimizer. Concretely, a linear layer optimized via gradient descent can be formulated as:

$$F(q) = (W_0 + \Delta W)\, q, \quad (5)$$

where $q \in \mathbb{R}^{d_{in}}$ is the input representation, and $W_0, \Delta W \in \mathbb{R}^{d_{out} \times d_{in}}$ are the initialized parameter matrix and the updated matrix, respectively. In the back-propagation algorithm, $\Delta W$ is computed by accumulating the $n$ training inputs to this layer, $Q = (q_1, \ldots, q_n) \in \mathbb{R}^{d_{in} \times n}$, and the corresponding (back-propagation) error signals $E = (e_1, \ldots, e_n) \in \mathbb{R}^{d_{out} \times n}$ obtained by gradient descent:

$$\Delta W = \sum_{i=1}^{n} e_i \otimes q_i = E Q^{\top}. \quad (6)$$

The dual form of a linear layer trained by gradient descent is a key-value memory with attention storing the entire training experience:

$$F(q) = (W_0 + \Delta W)\, q = W_0 q + E Q^{\top} q = W_0 q + \mathrm{LinearAttn}(Q, E, q), \quad (7)$$

where $\mathrm{LinearAttn}(K, V, q) = V K^{\top} q$ denotes the linear attention operation, and we regard the training inputs $Q$ as keys, the error signals $E$ as values, and the current input $q$ as the query. Instead of using the regular softmax-normalized dot-product attention, i.e., $\mathrm{Attention}(K, V, q) = V\,\mathrm{softmax}(K^{\top} q)$, we investigate the working mechanism of kNN-MT under a relaxed linear attention form, following the approach of Irie et al. (2022).
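A minimal numerical check of this dual form, assuming the linear attention operation $\mathrm{LinearAttn}(K, V, q) = V K^{\top} q$ as above (toy dimensions, NumPy only):

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, n = 5, 3, 7

W0 = rng.normal(size=(d_out, d_in))     # initialized parameter matrix
Q = rng.normal(size=(d_in, n))          # n training inputs (columns)
E = rng.normal(size=(d_out, n))         # error signals from gradient descent
q = rng.normal(size=d_in)               # current query input

# Gradient-descent view: accumulate outer products e_i ⊗ q_i   (Equation 6)
delta_W = sum(np.outer(E[:, i], Q[:, i]) for i in range(n))

# Attention view: keys = training inputs Q, values = error signals E (Equation 7)
def linear_attn(K, V, query):
    return V @ (K.T @ query)

lhs = (W0 + delta_W) @ q
rhs = W0 @ q + linear_attn(Q, E, q)
print(np.allclose(lhs, rhs))            # True: the two views coincide
```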
kNN-MT Performs Implicit Gradient Descent on Output Projection Layer

In this section, we first demonstrate that the probability distributions in kNN-MT, including $p_{k\mathrm{NN}}$ and $p_{\mathrm{NMT}}$, are equivalent to Transformer attention. On top of that, we argue that kNN-MT implicitly performs gradient descent on OPL, producing meta-gradients via forward computation and interpolation based on k-nearest-neighbors. Next, we draw comparisons between kNN-MT and explicit OPL fine-tuning, establishing connections between these two forms.

Output Distributions are Attentions
Let $h = f_\theta(x, \hat{y}_{<m})$ be the context representation at each timestep $m$, and let $\mathcal{N}(h) = \{(K_j^m, V_j^m)\}_{j=1}^{k}$ be the retrieved nearest-neighbors set. Let $K_m \in \mathbb{R}^{k \times d_{in}}$ and $V_m \in \mathbb{R}^{k \times |\mathcal{Y}|}$ denote matrices representing all key and value vectors in $\mathcal{N}(h)$, in which we replace the original token value with a one-hot vector for $V_j^m$. Then, we reformulate the computation of $p_{k\mathrm{NN}}$ in Equation (3):

$$p_{k\mathrm{NN}}(y_m \mid x, \hat{y}_{<m}) = V_m^{\top}\, \mathrm{softmax}\!\Big(\frac{K_m h}{T}\Big), \quad (8)$$

where we use the dot-product function for the distance metric $d(\cdot, \cdot)$. According to the above equation, $p_{k\mathrm{NN}}$ is a key-value memory with attention storing all nearest neighbors from the datastore.
For the computation of $p_{\mathrm{NMT}}$, we introduce an identity matrix $I_{|\mathcal{Y}|}$ and convert it into the attention format:

$$p_{\mathrm{NMT}}(y_m \mid x, \hat{y}_{<m}) = I_{|\mathcal{Y}|}\, \mathrm{softmax}(W_O h), \quad (9)$$

where $W_O \in \mathbb{R}^{|\mathcal{Y}| \times d_{in}}$ is the matrix that represents key vectors for each token in the vocabulary. Similarly, $p_{\mathrm{NMT}}$ is a key-value memory with attention storing the representations of the entire vocabulary.
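Both attention forms can be verified directly with toy shapes (the one-hot value matrix $V_m$ and the identity values are assumptions that follow the construction above):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(0)
d_in, vocab, k, T = 4, 10, 8, 10.0

h = rng.normal(size=d_in)                         # context representation
W_O = rng.normal(size=(vocab, d_in))              # OPL parameter matrix
K_m = rng.normal(size=(k, d_in))                  # retrieved neighbor keys
tokens = rng.integers(0, vocab, size=k)           # retrieved neighbor tokens
V_m = np.eye(vocab)[tokens]                       # one-hot value matrix (k x |Y|)

# Equation (8): p_kNN is attention over the k retrieved neighbors
p_knn = V_m.T @ softmax(K_m @ h / T)

# Equation (9): p_NMT is attention over the whole vocabulary with identity values
p_nmt = np.eye(vocab) @ softmax(W_O @ h)

print(p_knn.sum(), np.allclose(p_nmt, softmax(W_O @ h)))   # 1.0, True
```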

kNN-MT as Meta-Optimization
For the ease of qualitative analysis, we follow Irie et al. (2022) to understand the working mechanism of kNN-MT under a relaxed linear attention form, i.e., we remove the softmax operation in the computation of $p_{k\mathrm{NN}}$ and $p_{\mathrm{NMT}}$, resulting in the following rewritten expressions:

$$p_{k\mathrm{NN}} \approx \frac{V_m^{\top} K_m h}{T}, \qquad p_{\mathrm{NMT}} \approx I_{|\mathcal{Y}|} W_O h. \quad (10)$$

Then the next-token prediction probability of kNN-MT is the weighted sum of two attentions:

$$p(y_m \mid x, \hat{y}_{<m}) = \lambda\, p_{k\mathrm{NN}} + (1 - \lambda)\, p_{\mathrm{NMT}}. \quad (11)$$

Combining Equations (7), (10) and (11), we derive the dual form between gradient descent-based optimization and kNN-MT:

$$p(y_m \mid x, \hat{y}_{<m}) = \lambda\, \frac{V_m^{\top} K_m h}{T} + (1 - \lambda)\, I_{|\mathcal{Y}|} W_O h = (1 - \lambda)\, I_{|\mathcal{Y}|} \Big(W_O + \frac{\lambda}{(1 - \lambda) T}\, E_m^{\top} K_m\Big) h = (1 - \lambda)\, I_{|\mathcal{Y}|} \big(W_O + \Delta W_{k\mathrm{NN}}\big) h, \quad (12)$$

where $\Delta W_{k\mathrm{NN}}$ represents the meta-gradient update applied to the output projection layer (the dual form of a linear layer optimized together with an $l_2$-regularization objective), $K_m$ stands for the nearest-neighbors training inputs to the output projection layer in NMT, and $E_m = V_m$ is the corresponding error signals obtained by gradient descent. As shown in the above equations, the introduced probability difference, i.e., $p_{k\mathrm{NN}} - p_{\mathrm{NMT}}$, is equivalent to the parameter update $\Delta W_{k\mathrm{NN}}$ that affects $W_O$. We can also regard this probability difference as meta-gradients, which are leveraged to compute the updated parameter matrix $\Delta W_{k\mathrm{NN}}$.
In summary, we introduce a new perspective to explain kNN-MT as a process of meta-optimization on the output projection layer of NMT, in which kNN-MT produces meta-gradients via the computation of $p_{k\mathrm{NN}} - p_{\mathrm{NMT}}$ based on k-nearest-neighbors and implicitly applies the gradients to the original output projection layer.
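The identity behind Equation (12) can be checked numerically under the relaxed, softmax-free attention; the sketch below assumes the scaling factor $\lambda / ((1-\lambda)T)$ is absorbed into $\Delta W_{k\mathrm{NN}}$ exactly as written above:

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, vocab, k = 4, 10, 8
T, lam = 10.0, 0.4

h = rng.normal(size=d_in)
W_O = rng.normal(size=(vocab, d_in))
K_m = rng.normal(size=(k, d_in))
V_m = np.eye(vocab)[rng.integers(0, vocab, size=k)]

# Relaxed (softmax-free) attention forms of Equation (10)
p_knn = V_m.T @ (K_m @ h) / T
p_nmt = W_O @ h

# Left-hand side: interpolation of Equation (11)
lhs = lam * p_knn + (1 - lam) * p_nmt

# Right-hand side: implicit OPL update of Equation (12)
delta_W_knn = lam / ((1 - lam) * T) * V_m.T @ K_m      # meta-gradient update
rhs = (1 - lam) * (W_O + delta_W_knn) @ h

print(np.allclose(lhs, rhs))                            # True
```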

Comparing kNN-MT with Fine-tuning
As Equation (12) indicates that the nearest-neighbors set $\mathcal{N}(h) = \{(K_j^m, V_j^m)\}_{j=1}^{k}$ serves as the training inputs to the output projection layer in the dual form of kNN-MT, we proceed to compare the meta-optimization of kNN-MT with explicit OPL fine-tuning. This explicit OPL fine-tuning approach maximizes the log-likelihood of the nearest-neighbors set:

$$\max_{W_O}\; \sum_{j=1}^{k} V_j^{m\top} \log \mathrm{softmax}(W_O K_j^m) \;-\; \frac{\alpha}{2}\, \|W_O\|_2^2, \quad (13)$$

where $\alpha$ is the hyper-parameter of the $l_2$-regularization objective and we optimize the parameter matrix of OPL using $K_j^m$ and $V_j^m$ as input and label, respectively. By applying the back-propagation algorithm, we obtain the updated matrix $\Delta W_{\mathrm{FT}}$ as follows:

$$\Delta W_{\mathrm{FT}} = (V_m - P_m)^{\top} K_m - \alpha\, W_O, \quad (14)$$

where $P_m \in \mathbb{R}^{k \times |\mathcal{Y}|}$ represents all prediction probabilities for the entire nearest-neighbors set, and the complete derivation process is presented in Appendix A.1. In the case of standard gradient descent, the new parameter matrix of OPL, i.e., $W'_O$, is computed as:

$$W'_O = W_O + \eta\, \Delta W_{\mathrm{FT}}, \quad (15)$$
where $\eta$ is the learning rate. Similar to Equation (12), $K_m$ denotes the training inputs and $E_m = V_m - P_m$ is the corresponding error signals via explicit OPL fine-tuning. Table 1 displays the similarities and differences between kNN-MT and explicit OPL fine-tuning, both of which aim to maximize the log-likelihood of a nearest-neighbors set. The main distinction lies in the fact that kNN-MT generates meta-gradients through forward computation and interpolation, while fine-tuning computes gradients of OPL through back-propagation. Moreover, we discover that explicit OPL fine-tuning produces a gradient format that is very similar to the meta-gradients acquired through kNN-MT. Therefore, it is reasonable to view kNN-MT as an implicit model fine-tuning process on OPL, in which kNN-MT produces a distinct parameter matrix $W'_O$ at each decoding time step. As kNN-MT only involves the optimization of OPL compared to entire-model fine-tuning, its performance is evidently constrained by the context representations produced by the base NMT model.

Table 1:
The similarities and differences between kNN-MT and explicit OPL fine-tuning (columns: Methods, Training Data, Error Signals, Gradients, Optimizer), where error signals and gradients are provided in Equations (12) and (14).
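As a sanity check of Equations (13)-(15), the following PyTorch sketch (toy shapes, hypothetical variable names) compares the autograd gradient of the regularized log-likelihood with the closed form $(V_m - P_m)^{\top} K_m - \alpha W_O$:

```python
import torch

torch.manual_seed(0)
d_in, vocab, k, alpha = 4, 10, 8, 0.1

K_m = torch.randn(k, d_in)                       # neighbor keys (inputs)
labels = torch.randint(0, vocab, (k,))           # neighbor tokens (labels)
V_m = torch.nn.functional.one_hot(labels, vocab).float()
W_O = torch.randn(vocab, d_in, requires_grad=True)

# Objective of Equation (13): log-likelihood with l2-regularization
logits = K_m @ W_O.T                             # (k, |Y|), row j is Z_j^m
log_probs = torch.log_softmax(logits, dim=-1)
objective = (V_m * log_probs).sum() - 0.5 * alpha * (W_O ** 2).sum()
objective.backward()

# Closed form of Equation (14): (V_m - P_m)^T K_m - alpha * W_O
P_m = torch.softmax(logits, dim=-1)
delta_W_ft = (V_m - P_m).T @ K_m - alpha * W_O

print(torch.allclose(W_O.grad, delta_W_ft.detach(), atol=1e-5))   # True
```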

Experiments
In this section, we begin by comparing the model predictions of kNN-MT and explicit OPL fine-tuning (OPL-FT) using multi-domain datasets to verify our earlier analysis. Then we carry out comprehensive multi-domain experiments and word-level analysis to gain a better understanding of the translation performance differences between kNN-MT and current popular fine-tuning methods.
kNN-MT vs. Explicit OPL Fine-tuning

Setup. We mainly compare kNN-MT and OPL-FT on five domain adaptation datasets, including the multi-domain German-English datasets in Khandelwal et al. (2021) (IT, Law, Medical, and Koran) and the IWSLT'14 German-English translation dataset. The details of the multi-domain datasets are listed in Appendix A.2. The pre-trained NMT model from the WMT'19 German-English news translation task winner (Ng et al., 2019) is used as the basic model for kNN-MT and OPL-FT. We employ both inner-product (IP) and negative $l_2$ distance (L2) as distance metrics; the datastore size and hyper-parameter settings for kNN-MT are included in Appendix A.3, and we maintain consistency with previous work (Zheng et al., 2021a) for most details. As for OPL-FT, the parameter matrix of OPL is trained with the same k-nearest-neighbors retrieved by kNN-MT via either IP or L2 at each timestep. We perform a grid search and use the perplexity (PPL) on the validation set to determine the optimal learning rate and hyper-parameters for SGD optimization. More details are presented in Appendix A.3. As kNN-MT and OPL-FT only involve the optimization of OPL, we adopt a teacher-forcing decoding strategy and evaluate the similarity between them by measuring the mean and variance of the difference between their model predictions on the golden labels. Specifically, for a test set containing $n$ target tokens, we compute the mean $M(A, B)$ and variance $V(A, B)$ of the per-token differences $p_A(y_i) - p_B(y_i)$, where $A, B \in \{\mathrm{NMT}, k\mathrm{NN\text{-}MT}, \mathrm{OPL\text{-}FT}, \mathrm{FT}\}$ and $p(y_i)$ denotes the model prediction probability on each golden label $y_i$.
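A small sketch of this similarity metric (hypothetical inputs: arrays of golden-label probabilities from two systems collected under teacher forcing; taking the signed rather than absolute difference is an assumption here):

```python
import numpy as np

def prediction_gap(p_a, p_b):
    """Mean and variance of the difference between two systems'
    probabilities on the golden labels (teacher-forcing decoding).

    p_a, p_b: arrays of shape (n,) holding p_A(y_i) and p_B(y_i)
    for the n target tokens of the test set.
    """
    diff = np.asarray(p_a) - np.asarray(p_b)
    return diff.mean(), diff.var()

# toy example: two systems that agree closely on most golden tokens
p_knn_mt = np.array([0.91, 0.40, 0.75, 0.88])
p_opl_ft = np.array([0.90, 0.42, 0.73, 0.90])
print(prediction_gap(p_knn_mt, p_opl_ft))   # small mean and variance
```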
Results. As shown in Table 2, we find that kNN-MT has a more similar model prediction to OPL-FT (lower mean/variance) compared to the base NMT model or entire-model fine-tuning (FT). The experimental results indicate that kNN-MT and OPL-FT are closer than the other tuned models. These findings provide empirical evidence supporting our understanding that kNN-MT performs implicit OPL fine-tuning. Additionally, we observe that kNN-MT achieves a slightly higher mean of model predictions than OPL-FT on average. We suspect that this is because kNN-MT solely utilizes the label of k-nearest-neighbors as error signals to update the model, without considering the prediction of the NMT model, which may weaken the label signal.

Translation Performance
Setup. As kNN-MT could be viewed as a special case of model fine-tuning, we further compare the translation performance of two kNN-based models, i.e., traditional kNN-MT and adaptive kNN-MT (AK-MT) (Zheng et al., 2021a), with other popular fine-tuning methods, including entire-model fine-tuning (FT) and adapter-based fine-tuning (Adapter). We adopt the previous multi-domain datasets for this experiment but integrate the test sets of the other 4 domains as the out-of-domain (OOD) test set for each domain. The evaluation metric is SacreBLEU, a case-sensitive detokenized BLEU score (Papineni et al., 2002).
All experiments are conducted based on the Fairseq toolkit (Ott et al., 2019). For the Adapter, we build adapter layers according to the approach proposed in Houlsby et al. (2019), with the intermediate dimension r selected from {64, 128, 256}. For the kNN-based models, we adopt L2 as the distance metric and the same hyper-parameters as in the previous section. We also explore the performance of combining AK-MT and Adapter (AK-MT Adapter), which keeps the same hyper-parameters as AK-MT. The Adam algorithm (Kingma and Ba, 2015) is used for FT, Adapter and OPL-FT, with a learning rate of 1e-4 and a batch size of 32k tokens. The training process is executed on 4 NVIDIA Tesla V100 GPUs and the maximum number of training steps is set to 100k, with validation occurring every 500 steps. During decoding, the beam size is set to 4 with a length penalty of 0.6.
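For reference, a minimal Houlsby-style bottleneck adapter looks roughly as follows (a sketch, not the exact implementation used in our experiments; the model dimension of 1024 and the residual placement are assumptions):

```python
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Bottleneck adapter inserted after a Transformer sub-layer:
    down-projection -> non-linearity -> up-projection, plus a residual add."""
    def __init__(self, d_model: int, r: int = 64):
        super().__init__()
        self.down = nn.Linear(d_model, r)   # r is the searched bottleneck size
        self.up = nn.Linear(r, d_model)
        self.act = nn.ReLU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.up(self.act(self.down(x)))

# usage: wrap the frozen NMT model's hidden states with trainable adapters
adapter = Adapter(d_model=1024, r=64)
h = torch.randn(2, 7, 1024)          # (batch, length, d_model)
print(adapter(h).shape)              # torch.Size([2, 7, 1024])
```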

Results. As illustrated in Table 3, we evaluate the translation performance of all models and obtain the following findings:
• OPL-FT, which optimizes the parameter matrix of OPL, also brings significant improvements. This proves that only updating the parameters of OPL can achieve relatively high domain adaptation performance for NMT, since the model already produces precise context representations due to large-scale pre-training. All in all, as a meta-optimizer on OPL, kNN-MT works quite well on domain adaptation tasks but still requires tuning of the context representations generated by the original model to achieve comparable performance to FT.

Word-Level Empirical Analysis
Setup. Apart from the BLEU score, we conduct a word-level analysis to investigate the translation differences between kNN-MT and FT, and to determine the bottleneck of kNN-MT. Specifically, we analyze the translation results of kNN-MT, AK-MT, FT, and AK-MT Adapter by calculating the recall of different target words (as shown in Appendix A.6, we calculate the precision, recall, and F1 score (P/R/F1) for each word in the translation results and observe that the correlation between translation performance and word recall is strongest). We first use spaces as delimiters to extract target words and define the domain-specific degree of each word $w$ as $\gamma(w) = \frac{f_{\mathrm{ID}}(w)}{f_{\mathrm{GD}}(w)}$, where $f_{\mathrm{ID}}(\cdot)$ and $f_{\mathrm{GD}}(\cdot)$ are the word frequencies in domain-specific and general-domain training data, respectively. Then we split the target words into four buckets based on $\gamma$: $\{0 \le \gamma(w) < 1,\; 1 \le \gamma(w) < 2,\; 2 \le \gamma(w) < 5,\; \gamma(w) \ge 5\}$, with words having a higher domain frequency ratio $\gamma$ indicating a higher degree of domain-specificity. To better illustrate the gap between kNN-based methods and FT, we define the incremental word recall $\Delta R$ for kNN-MT, AK-MT and AK-MT Adapter as the difference in word recall compared to FT: $\Delta R(w) = R(w) - R_{\mathrm{FT}}(w)$.
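A sketch of the bucketing and incremental-recall computation (the frequency tables, per-word recalls, and the add-one smoothing on the denominator are hypothetical details for illustration):

```python
from collections import defaultdict

def domain_degree(word, freq_in_domain, freq_general):
    """gamma(w) = f_ID(w) / f_GD(w); the add-one on the denominator is an
    assumption to keep the ratio finite for words unseen in general-domain data."""
    return freq_in_domain.get(word, 0) / (freq_general.get(word, 0) + 1)

def bucket(gamma):
    # the four gamma buckets used in the analysis
    if gamma < 1:
        return "0<=g<1"
    if gamma < 2:
        return "1<=g<2"
    if gamma < 5:
        return "2<=g<5"
    return "g>=5"

def incremental_recall(recall_model, recall_ft, gammas):
    """Mean Delta-R(w) = R(w) - R_FT(w) per domain-specificity bucket."""
    sums, counts = defaultdict(float), defaultdict(int)
    for word, gamma in gammas.items():
        b = bucket(gamma)
        sums[b] += recall_model[word] - recall_ft[word]
        counts[b] += 1
    return {b: sums[b] / counts[b] for b in counts}

# toy usage with made-up recalls
gammas = {"server": 7.2, "the": 0.3, "statute": 6.1}
r_knn = {"server": 0.55, "the": 0.98, "statute": 0.40}
r_ft = {"server": 0.80, "the": 0.97, "statute": 0.75}
print(incremental_recall(r_knn, r_ft, gammas))   # negative Delta-R for g>=5 words
```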
Results. Figure 2a presents ∆R values for words in different buckets, indicating that, compared to FT, kNN-MT and AK-MT have poor word recall for words with γ(w) ≥ 2, particularly when γ(w) ≥ 5. However, AK-MT Adapter achieves comparable performance to FT, suggesting that enhancing the context representations with adapter layers could handle this issue. Moreover, we focus on words with γ(w) ≥ 5 and evaluate word recalls in different buckets based on word frequency, dividing words into four buckets based on their in-domain frequency ranking: top 1%, top 1~5%, top 5~20%, and top 20~100%. As shown in Figure 2b, for in-domain low-frequency words, particularly those ranking behind the top 20%, kNN-MT and AK-MT perform significantly worse than FT in terms of word recall. Similarly, AK-MT Adapter yields comparable word recall to FT. These results demonstrate that the performance differences between kNN-based models and FT mainly lie in the low recall of in-domain low-frequency words, which can be alleviated by optimizing the context representations with additional adapter layers.
Nearest Neighbors Analysis. We verify the performance of kNN retrieval for the words with γ(w) ≥ 5 to better understand the quality of the context representations. We use the teacher-forcing decoding strategy to calculate the non-retrieval rate of words in each bucket, where a word is defined as non-retrieved if any sub-word of it is not retrieved in the k-nearest-neighbors of AK-MT and AK-MT Adapter. The k-nearest-neighbors of kNN-MT and AK-MT are exactly the same. Figure 3 shows that the non-retrieval rate (Unretrieved%) of AK-MT increases as word frequency decreases, consistent with the results of word recall in Figure 2b.
Related Work

For the NMT system, Khandelwal et al. (2021) propose kNN-MT, which utilizes a kNN classifier over a large datastore together with traditional NMT models (Bahdanau et al., 2015; Vaswani et al., 2017; Hassan et al., 2018) to achieve significant improvements. Recently, several attempts have been made to improve its robustness and scalability. Meng et al. (2022) and Martins et al. (2022a) propose fast versions of kNN-MT. Zheng et al. (2021a) develop adaptive kNN-MT by dynamically determining the number of retrieved tokens k and the interpolation coefficient λ at each step, while Martins et al. (2022b) attempt to retrieve chunks of tokens from the datastore instead of a single token. Wang et al. (2022a) adopt a lightweight neural network and a cluster-based pruning method to reduce retrieval redundancy. Dai et al. (2023b) improve both decoding speed and storage overhead by dynamically constructing an extremely small datastore and introducing a distance-aware adapter for inference, and further observe similar behaviours between kNN-based methods and translation memory approaches (Gu et al., 2018; Zhang et al., 2018; Hao et al., 2023).
Despite the great success of the kNN-MT family, the working mechanism of these methods remains an open question. Zhu et al. (2023) analyze the relationship between the datastore and the NMT model to better understand the behaviour of kNN-MT. To the best of our knowledge, we are the first to provide a meta-optimization perspective for kNN-MT, i.e., kNN-MT performs implicit gradient descent on the output projection layer.

Conclusion
In this paper, we present a new meta-optimization perspective to understand kNN-MT and establish connections between kNN-MT and model fine-tuning. Our results on multi-domain datasets provide strong evidence for the reasonability of this perspective. Additional experiments indicate that (i) incorporating kNN-MT with adapter-based fine-tuning achieves comparable translation quality to entire-model fine-tuning, with better performance on out-of-domain test sets; (ii) kNN-based models suffer from the low recall of in-domain low-frequency words, which could be mitigated by optimizing the representation vectors with lightweight adapter layers. We hope our understanding can enlighten kNN-based applications and model design in the future.

Limitations
In this section, we discuss the limitations and future research directions of our work:
• In the theoretical interpretation of kNN-MT, we adopt a relaxed form of attention in the computation of p kNN and p NMT for qualitative analysis, following the approach of previous work (Irie et al., 2022; Garg et al., 2022; Dai et al., 2023a). Whether this conclusion holds for standard softmax attention is not rigorously proven, but empirical results provide strong evidence of the plausibility of this perspective.
• This paper does not include the results of combining other parameter-efficient fine-tuning methods, such as Prefix-tuning (Li and Liang, 2021) and LoRA (Hu et al., 2022), with kNN-MT. However, these methods actually share a similar composition function to optimize the context representations (He et al., 2022). We leave this exploration as future work.
• The word-level empirical analysis indicates that kNN-based models suffer from the low recall of in-domain low-frequency words. Apart from adapter-based fine-tuning, this issue may be mitigated by enhancing the context representations of low-frequency words via more efficient approaches, e.g., introducing a frequency-aware token-level contrastive learning method (Zhang et al., 2022) at the pre-training stage and leveraging large-scale pre-trained models (Devlin et al., 2019; Brown et al., 2020; Guo et al., 2020; Li et al., 2022).
• Theoretical and empirical analysis of kNN-MT could actually be directly applied to nearest neighbor language models (kNN-LM) (Khandelwal et al., 2020). In the future, we would like to follow this research line and conduct more in-depth explorations on kNN-LM. Moreover, the theoretical analysis in this paper is limited to the last hidden states of NMT, and we are also interested in investigating the effectiveness of our analysis on other hidden states of NMT, such as the output of the last attention layer in the decoder (Xu et al., 2023).

A Appendix
A.1 Derivation Process of ∆W_FT

According to the chain rule, the updated matrix $\Delta W_{\mathrm{FT}}$ is calculated as follows:

$$\Delta W_{\mathrm{FT}} = \frac{\partial}{\partial W_O}\Big(\sum_{j=1}^{k} V_j^{m\top} \log \mathrm{softmax}(W_O K_j^m) - \frac{\alpha}{2}\|W_O\|_2^2\Big) = \sum_{j=1}^{k} \frac{\partial F}{\partial Z_j^m}\, K_j^{m\top} - \alpha W_O, \quad (16)$$

where $Z_j^m = W_O K_j^m$ and $F = V_j^{m\top} \log(\mathrm{softmax}(Z_j^m))$. Then we provide the derivation process for the rest part. Assume that $l$ denotes the vocabulary index of $V_j^m$, $p_i$ is the $i$-th probability computed by $\mathrm{softmax}(Z_j^m)$ and $z_i$ stands for the $i$-th value of the vector $Z_j^m$. The calculation of $F = V_j^{m\top} \log(\mathrm{softmax}(Z_j^m))$ can be re-written as $F = \log(p_l)$. When $i = l$, the partial derivative of $F$ with respect to $z_i$ is calculated as:

$$\frac{\partial F}{\partial z_i} = \frac{\partial \log(p_l)}{\partial z_l} = \frac{1}{p_l} \cdot p_l (1 - p_l) = 1 - p_l. \quad (17)$$

If $i \neq l$, we have:

$$\frac{\partial F}{\partial z_i} = \frac{\partial \log(p_l)}{\partial z_i} = \frac{1}{p_l} \cdot (-p_l p_i) = -p_i. \quad (18)$$

Combining the above equations, we have:

$$\frac{\partial F}{\partial Z_j^m} = V_j^m - P_j^m, \quad (19)$$

where $V_j^m$ is the one-hot vector whose $l$-th value is 1, and $P_j^m = \mathrm{softmax}(W_O K_j^m)$ is the whole vector of prediction probabilities. Finally, Equation (16) is re-written as:

$$\Delta W_{\mathrm{FT}} = \sum_{j=1}^{k} (V_j^m - P_j^m)\, K_j^{m\top} - \alpha W_O = (V_m - P_m)^{\top} K_m - \alpha W_O. \quad (20)$$
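Equation (19) can be verified with a quick autograd check (toy vocabulary size):

```python
import torch

torch.manual_seed(0)
vocab = 6
z = torch.randn(vocab, requires_grad=True)       # Z_j^m = W_O K_j^m
l = 2                                            # vocabulary index of V_j^m
v = torch.zeros(vocab)
v[l] = 1.0                                       # one-hot label vector V_j^m

F = (v * torch.log_softmax(z, dim=-1)).sum()     # F = log(p_l)
F.backward()

p = torch.softmax(z, dim=-1)                     # P_j^m
print(torch.allclose(z.grad, v - p))             # True: dF/dZ = V - P
```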

A.2 Dataset Statistics
We adopt a multi-domain dataset and consider domains including IT, Medical, Koran and Law, together with the IWSLT'14 German-English (DE-EN) dataset, in all our experiments. The sentence statistics of the datasets are illustrated in Table 4. For data preprocessing, we use the Moses toolkit to tokenize the sentences and split the words into sub-word units (Sennrich et al., 2016) using the bpecodes provided by Ng et al. (2019).

A.3 Datastore Size and Hyper-parameters
The datastore size of each domain and the choices of hyper-parameters in kNN-MT are shown in Table 5. The learning-rate search space of OPL-FT is the same for all datasets: the base values are {1, 2, 3, 4, 5, 6, 7, 8, 9} and we scale them by 1e-1, 1e-2, 1e-3 and 1e-4, i.e., we have 9 × 4 = 36 values to search. In Table 6, we present the details of the selected learning rates on the five datasets.

A.4 Translation Performance on Out-of-Domain Test Sets
As shown in Table 7, we report the whole out-of-domain results for the experiment in Section 4.2.

A.5 Translation Performance of Recent Advancements in kNN-MT
We provide a comprehensive comparison of translation performance between recent advancements in kNN-MT and the methods mentioned in Section 4.2. The results are shown in Table 8. The results of FK-MT, EK-MT, CK-MT and SK-MT are excerpted from Dai et al. (2023b).

A.6 More Details of Word-Level Analysis
We report the overall P/R/F1 results on multi-domain test sets in Table 9. Compared with precision and F1 score, the defect of kNN-MT is more obvious on word recall. In addition, as shown in Table 10, we focus on words with γ(w) ≥ 5 and calculate word recalls in different buckets based on word frequency. For the nearest-neighbors analysis, in addition to the non-retrieval rate mentioned in Section 4.3, we evaluate the following metrics: ① Gold Rank/Gold Dist: the average gold-label rank/distance in the top-k list, taking the rank and distance of the last word in the top-k list (i.e., the farthest neighbor) if unretrieved; ② #Gold Labels: the average number of gold labels in the top-k list; ③ #Labels: the average number of distinct labels in the top-k list, indicating diversity. For in-domain words (γ(w) ≥ 5), the detailed results of the k-nearest-neighbors analysis in the above metrics are shown in Table 11. We observe that after adapter-based fine-tuning, the non-retrieval rate is reduced as the average distance of the gold label increases.

Table 9 :
Overall P/R/F1 of all models on multi-domain test sets, in which we count P/R/F1 in different buckets based on the domain-specific degree of each word γ(w). AK-MT A is short for AK-MT Adapter(r=256).

Figure 1 :
kNN-MT implicitly executes gradient descent on the Output Projection Layer (OPL) of NMT and produces meta-gradients via forward computation based on k-nearest-neighbors. The meta-optimization process of kNN-MT shares a dual view with explicit OPL fine-tuning, which updates the parameters of OPL with back-propagated gradients.

Figure 2 :
Incremental word recall ∆R of different words on multi-domain test sets. We plot the mean ∆R of the five datasets with standard deviation in both figures. For the left figure (a), we count word recalls in different buckets based on γ, while for the right figure (b), we focus on words with γ(w) ≥ 5 and calculate word recalls in different buckets based on word frequency.

Figure 3 :
Non-retrieval rate (Unretrieved%) of AK-MT and AK-MT Adapter for words with γ(w) ≥ 5 in different word-frequency buckets (see Section 4.3).

Table 3 :
The BLEU score (%) and decoding speed of all models on multi-domain test sets, including IT, Law, Medical, Koran, and IWSLT. "# Params" refers to the number of fine-tuned parameters. The test sets of the other four domains are integrated as out-of-domain (OOD) test sets for each domain, and "OOD Avg." represents the average performance of all models on OOD test sets. For detailed results on the OOD test sets, please refer to Appendix A.4. "# Speed" indicates the relative inference speed using vanilla NMT as a baseline with a batch size of 50k tokens.

Table 4 :
Sentence statistics of multi-domain datasets.

Table 5 :
The datastore size (number of tokens) and hyper-parameter choices (i.e., k, λ and T ) of kNN-MT (IP) and kNN-MT (L2) in each domain.

Table 6 :
The optimal learning rates for explicit OPL fine-tuning based on the perplexity of the validation set.

Table 10 :
The word recall of all models on multi-domain test sets, in which we focus on words with γ(w) ≥ 5 and calculate word recalls in different buckets based on word frequency. "# Words" denotes the total number of examples in different buckets. AK-MT A is short for AK-MT Adapter(r=256).

Table 11 :
Detailed results of the k-nearest-neighbors analysis of in-domain words (γ(w) ≥ 5) on multi-domain test sets. AK-MT A is short for AK-MT Adapter(r=256).