Learning Kernel-Smoothed Machine Translation with Retrieved Examples

How can neural machine translation (NMT) models be adapted effectively to emerging cases without retraining? Despite the great success of neural machine translation, updating deployed models online remains a challenge. Existing non-parametric approaches that retrieve similar examples from a database to guide the translation process are promising but are prone to overfitting the retrieved examples. In this work, we propose to learn Kernel-Smoothed Translation with Example Retrieval (KSTER), an effective approach to adapt neural machine translation models online. Experiments on domain adaptation and multi-domain machine translation datasets show that even without expensive retraining, KSTER achieves improvements of 1.1 to 1.5 BLEU scores over the best existing online adaptation methods. The code and trained models are released at https://github.com/jiangqn/KSTER.


Introduction
Over the past years, end-to-end Neural Machine Translation (NMT) models have achieved great success (Bahdanau et al., 2015; Wu et al., 2016; Vaswani et al., 2017). How can a deployed NMT model be updated effectively and adapted to emerging cases? For example, after a generic NMT model has been trained on WMT data, a customer wants to use the service to translate financial documents. The customer may have a handful of translation pairs for the finance domain, but does not have the capacity to perform a full retraining.
Non-parametric adaptation methods enable incorporating individual examples on-the-fly, by retrieving similar source-target pairs from an external database to guide the translation process (Gu et al., 2018; Zhang et al., 2018; Cao and Xiong, 2018). The external database can be easily updated online. Most of these methods rely on effective sentence-level retrieval. In contrast to sentence-level retrieval, k-nearest-neighbour machine translation (kNN-MT) introduces token-level retrieval to improve translation (Khandelwal et al., 2021). It shows promising results for online domain adaptation.

Figure 1: Base is a Transformer model trained on general domain WMT data. kNN-MT and our proposed KSTER are adapted for domain-specific translation with an in-domain database. Both kNN-MT and KSTER improve over Base in domain-specific translation performance. However, kNN-MT overfits to in-domain data and performs poorly in general domain translation, while the proposed KSTER achieves general domain translation performance comparable to Base.

* Work is done while at ByteDance. † Corresponding author.
Existing non-parametric methods for online adaptation still have limitations. First, since it is not easy for sentence-level retrieval to find examples that are similar enough to the test example, the low overlap between test examples and retrieved examples brings noise to translation. Second, completely non-parametric methods are prone to overfitting the retrieved examples. For example, although kNN-MT improves domain-specific translation, it overfits severely and cannot generalize to the general domain, as is shown in Figure 1. An ideal online adaptation method should introduce less noise to the translation process and generalize to varying test examples as the database changes incrementally.
In this paper, we propose to learn Kernel-Smoothed Translation with Example Retrieval (KSTER), to effectively learn and adapt neural machine translation online. KSTER retains the online adaptation advantage of non-parametric methods and avoids the drawback of easy overfitting. More specifically, KSTER improves the generalization ability of non-parametric methods in three aspects. First, we introduce a learnable kernel to dynamically measure the relevance of the retrieved examples based on the current context. Second, the example-based distribution is combined with the model-based distribution computed by NMT, using an adaptive mixing weight, for next word prediction. Third, to make the learning of KSTER stable, we introduce a retrieval dropout strategy. The intuition is that similar examples can constantly be retrieved during training, which is not necessarily the case during inference. We therefore drop the most similar examples during training to reduce this discrepancy.
With the above improvements, KSTER shows the following advantages:
• Extensive experiments show that KSTER outperforms kNN-MT, a strong competitor, by 1.1 to 1.5 BLEU scores in specific domains while maintaining performance in the general domain.
• KSTER outperforms kNN-MT by 1.8 BLEU scores on average in unseen domains. There is therefore no strong restriction on the input domain, which makes KSTER much more practical for industry applications.

Related Work
This work is mostly related to two research areas in machine translation (MT), i.e., domain adaptation for machine translation and online adaptation of MT models by non-parametric retrieval.
Domain Adaptation for MT. Domain adaptation for MT aims to adapt general domain MT models for domain-specific language translation (Chu and Wang, 2018). The most popular method for this task is finetuning general domain MT models on in-domain data. However, finetuning suffers from the notorious catastrophic forgetting problem (McCloskey and Cohen, 1989; Santoro et al., 2016). There are also some sparse domain adaptation methods that only update part of the MT parameters (Wuebker et al., 2018; Guo et al., 2021).
In real-world translation applications, the domain labels of test examples are often not available. This dilemma inspires a closely related research area, multi-domain machine translation (Pham et al., 2021; Farajian et al., 2017; Liang et al., 2020; Lin et al., 2021; Zhu et al., 2021), where one model translates sentences from all domains.
Online Adaptation of MT by Non-parametric Retrieval. Non-parametric methods enable online adaptation of deployed NMT models by updating the database from which similar examples are retrieved.
Traditional non-parametric methods search sentence-level examples to guide the translation process (Cao and Xiong, 2018; Gu et al., 2018; Zhang et al., 2018). Recently, n-gram level and token level (Khandelwal et al., 2021) retrieval have been introduced and show strong empirical results. Generally, similar examples are retrieved based on fuzzy matching (Bulte and Tezcan, 2019; Xu et al., 2020), embedding similarity, or a mixture of the two approaches.

Methodology
In this section, we first formulate kernel-smoothed machine translation (KSTER), which smooths the neural machine translation (NMT) output with retrieved token-level examples. Then we introduce the modeling and training of the learnable kernel and the adaptive mixing weight. An overview of KSTER is shown in Figure 2.

Base Model for Neural Machine Translation
The state-of-the-art NMT models are based on the encoder-decoder architecture. The encoder encodes the source text $x$ into a sequence of hidden states. The decoder takes the representations of the source text as input and generates the target text autoregressively. In each decoding step, the decoder predicts the probability distribution of the next token $p_m(y_i \mid x, \hat{y}_{<i}; \theta)$ conditioned on the source text $x$ and the previously generated target tokens $\hat{y}_{<i}$. The NMT model parameters are denoted as $\theta$.

Kernel Smoothing with Retrieved Examples
The model-based distribution $p_m(y_i \mid x, \hat{y}_{<i}; \theta)$ is then smoothed by an example-based distribution $p_e(y_i \mid x, \hat{y}_{<i})$, which is computed using kernel density estimation (KDE) on retrieved examples. We build a database from which similar examples are retrieved. The database consists of all token-level examples from the training set in the form of key-value pairs. In each key-value pair $(k, v)$, the key $k = f_{\mathrm{NMT}}(x, y_{<i}; \theta)$ is the intermediate representation of a certain layer in the NMT decoder. The value $v$ is the corresponding ground-truth target token $y_i$. The key $k$ can be seen as a vector representation of the context of the value $v$. We obtain the key-value pairs from $(x, y)$ by running forced decoding with a trained NMT model.
In each decoding step, we compute the query $q_i = f_{\mathrm{NMT}}(x, \hat{y}_{<i}; \theta)$ and retrieve $k$ similar examples based on the $L_2$ distance (see footnote 1) between the query and the keys. Each retrieved example forms a triple $(k_j, v_j, d_j)$, where $k_j$ is the key, $v_j$ is the corresponding value token, and $d_j$ is the $L_2$ distance between the query $q_i$ and the key $k_j$. The example-based distribution $p_e(y_i \mid x, \hat{y}_{<i})$ is then computed from these retrieved examples by kernel density estimation with a kernel function $K(q, k; \sigma)$, where $\sigma$ is the parameter of the kernel.

Footnote 1: The $L_2$ distance is computed between two $d$-dimensional vectors $x$ and $y$.
Finally, the NMT output is smoothed by combining the model-based distribution and the example-based distribution with a mixing weight $\lambda$.
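As an illustration of the smoothing step described above, the following NumPy sketch normalizes kernel values over the retrieved examples to obtain an example-based distribution and then interpolates it with the model-based distribution. The function names, the toy inputs, the Gaussian kernel form exp(-d^2 / sigma) and the interpolation lam * p_e + (1 - lam) * p_m are assumptions made for this sketch, since the exact equations are not shown in this text. The interpolation direction is chosen to agree with the later observation that a higher mixing weight relies more on the retrieved examples.

```python
import numpy as np


def example_based_distribution(kernel_values, value_tokens, vocab_size):
    """Turn kernel values of k retrieved examples into a distribution over the vocabulary.

    kernel_values: shape (k,), K(q_i, k_j; sigma) for each retrieved key.
    value_tokens:  shape (k,), token id v_j stored with each key.
    """
    weights = np.asarray(kernel_values, dtype=np.float64)
    weights = weights / weights.sum()  # normalize over the retrieved examples
    p_e = np.zeros(vocab_size)
    # Examples that share the same value token pool their probability mass.
    np.add.at(p_e, np.asarray(value_tokens), weights)
    return p_e


def smooth(p_m, p_e, lam):
    """Assumed interpolation: lam weights the example-based distribution."""
    return lam * np.asarray(p_e) + (1.0 - lam) * np.asarray(p_m)


# Toy usage with a vocabulary of 6 tokens and k = 3 retrieved examples.
distances = np.array([1.0, 2.0, 1.0])            # d_j, L2 distances to the query
value_tokens = np.array([3, 5, 3])               # v_j
sigma = 2.0                                      # kernel bandwidth
kernel_values = np.exp(-distances ** 2 / sigma)  # assumed Gaussian kernel form
p_e = example_based_distribution(kernel_values, value_tokens, vocab_size=6)
p_m = np.full(6, 1.0 / 6.0)                      # stand-in for the NMT softmax output
p = smooth(p_m, p_e, lam=0.4)
```

Tokens that never appear among the retrieved values receive probability only from the model-based distribution, which is why the mixing weight controls how strongly the retrieved examples can override the NMT prediction.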

Learnable Kernel Function
Although any kernel function can be used in KDE, we choose two specific kernels in this study, the Gaussian kernel and the Laplacian kernel, since they are easy to parameterize.
The only parameter of the Gaussian kernel is the bandwidth $\sigma$, which controls the smoothness of the example-based distribution, as is shown in Figure 3.
In a learnable Gaussian kernel, the bandwidth is not a fixed hyper-parameter. Instead, it is estimated in each decoding step by a learned affine network with exponential activation.
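A minimal sketch of a learnable bandwidth, assuming one common parameterization of the two kernels (exp(-d^2 / sigma) for Gaussian and exp(-d / sigma) for Laplacian) and assuming that the affine network sees the query concatenated with the mean of the retrieved keys. The text only states that the bandwidth comes from an affine network with exponential activation, so the input features, the weight shapes, and all function names here are hypothetical.

```python
import numpy as np


def gaussian_kernel(distance, bandwidth):
    """Assumed form K = exp(-d^2 / sigma), with d the L2 distance."""
    return np.exp(-np.square(distance) / bandwidth)


def laplacian_kernel(distance, bandwidth):
    """Assumed form K = exp(-d / sigma)."""
    return np.exp(-distance / bandwidth)


def learnable_bandwidth(features, weight, bias):
    """Affine map followed by exp, so the bandwidth is always positive."""
    return float(np.exp(features @ weight + bias))


rng = np.random.default_rng(0)
query = rng.normal(size=4)                 # q_i
keys = rng.normal(size=(3, 4))             # retrieved keys k_j
# Hypothetical input features: the query concatenated with the mean retrieved key.
features = np.concatenate([query, keys.mean(axis=0)])
weight = rng.normal(scale=0.1, size=features.shape[0])  # would be learned
bias = 0.0                                               # would be learned
sigma = learnable_bandwidth(features, weight, bias)
distances = np.linalg.norm(keys - query, axis=1)
weights = gaussian_kernel(distances, sigma)
weights = weights / weights.sum()          # normalized kernel weights
```

The exponential activation guarantees a positive bandwidth without any constraint on the affine parameters. A small bandwidth concentrates the normalized weights on the nearest example, while a large one spreads them almost uniformly.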
The bandwidth of the learnable Laplacian kernel is modeled in the same way as the bandwidth of the learnable Gaussian kernel.

Table 1: Statistics of the 5 domain-specific datasets.

        Law    Medical  Koran  IT     Subtitles
Train   467k   248k     18k    223k   500k
Dev     2k     2k       2k     2k     2k
Test    2k     2k       2k     2k     2k

Adaptive Mixing of Base Prediction and Retrieved Examples
To mix the model-based distribution and the example-based distribution adaptively, we model the mixing weight $\lambda$ with a learnable neural network. The mixing weight $\lambda$ is computed by a multilayer perceptron that takes the query $q_i$ and a weighted sum of the retrieved keys as inputs. In this way, kNN-MT (Khandelwal et al., 2021) can be seen as a special case of KSTER, with a fixed Gaussian kernel and a fixed mixing weight.
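The adaptive mixing weight can be sketched as a small multilayer perceptron. The text specifies only the inputs (the query and a weighted sum of keys), so the following choices are assumptions of this sketch: the keys are weighted by the normalized kernel values, the hidden layer uses ReLU, and a sigmoid keeps the output in (0, 1). All names and shapes are hypothetical.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def mixing_weight(query, keys, kernel_values, w1, b1, w2, b2):
    """Compute lambda from the query and a weighted sum of the retrieved keys.

    query: (d,), keys: (k, d), kernel_values: (k,)
    w1: (h, 2d), b1: (h,), w2: (h,), b2: scalar  (all would be learned)
    """
    weights = kernel_values / kernel_values.sum()
    key_summary = weights @ keys                     # weighted sum of keys, shape (d,)
    features = np.concatenate([query, key_summary])  # shape (2d,)
    hidden = np.maximum(0.0, w1 @ features + b1)     # assumed ReLU hidden layer
    return float(sigmoid(w2 @ hidden + b2))          # assumed sigmoid output


rng = np.random.default_rng(0)
d, h, k = 4, 8, 3
query = rng.normal(size=d)
keys = rng.normal(size=(k, d))
kernel_values = np.exp(-np.linalg.norm(keys - query, axis=1))
w1, b1 = rng.normal(size=(h, 2 * d)), np.zeros(h)
w2, b2 = rng.normal(size=h), 0.0
lam = mixing_weight(query, keys, kernel_values, w1, b1, w2, b2)
```

Replacing `mixing_weight` with a constant and the bandwidth with a fixed value corresponds to the kNN-MT special case mentioned above.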

Training
We optimize the KSTER model by minimizing the cross-entropy loss between the mixed distribution and the ground-truth target tokens, computed over all $n$ positions of a target sentence $y$, where $n$ is the length of $y$.
We keep the NMT model parameters $\theta$ fixed. Only the parameters of the learnable kernel and the mixing weight are trained.
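Written out in the notation used so far, a plausible form of this objective is the following. The symbol $\phi$ is introduced here for the trainable kernel and mixing-weight parameters, and the unnormalized sum over positions is an assumption; the original equation is not shown in this text.

```latex
\mathcal{L}(\phi) = -\sum_{i=1}^{n} \log p(y_i \mid x, y_{<i}; \theta, \phi),
\qquad
p(y_i \mid x, y_{<i}; \theta, \phi)
  = \lambda \, p_e(y_i \mid x, y_{<i}) + (1 - \lambda) \, p_m(y_i \mid x, y_{<i}; \theta)
```

Here both $\lambda$ and the bandwidth inside $p_e$ depend on $\phi$, while gradients do not flow into $\theta$.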

Retrieval Dropout
Since the database is built from the training data and KSTER is trained on the same training data, similar examples can constantly be retrieved from the database during training. However, at test time, there may be no example in the database that is similar to the query. This discrepancy between training and inference may lead to overfitting.

We use WMT data (Bojar et al., 2014) as general domain training data, which consists of 4.5M sentence pairs. newstest2013 and newstest2014 are used as the general domain development set and test set, respectively. Five domain-specific datasets proposed by Koehn and Knowles (2017) and re-split by Aharoni and Goldberg (2020) are used to evaluate the domain-specific translation performance. The detailed statistics of the 5 datasets are shown in Table 1.

Implementation Details
We use joint Byte Pair Encoding (BPE) (Sennrich et al., 2016) with 30k merge operations for subword segmentation. The resulting vocabulary is shared between source and target languages. We employ Transformer Base (Vaswani et al., 2017) as the base model. Following Khandelwal et al. (2021), the normalized inputs of the feed-forward network in the last Transformer decoder block are used as keys to build the databases and as queries for retrieval. The translation performance is evaluated with detokenized BLEU scores (Papineni et al., 2002), computed by SacreBLEU (Post, 2018). We build a FAISS (Johnson et al., 2017) index for nearest neighbour search. We employ inverted file and product quantization to accelerate retrieval in large-scale databases. The keys of examples are stored in the fp16 format to reduce the memory demand. We set k = 16 to balance translation quality and inference speed.
We train the base model for 200k steps. The best 5 checkpoints are averaged to obtain the final model. We train KSTER for 30k steps. For the training procedures of all models, each batch contains approximately 32,768 tokens. The models are optimized by the Adam optimizer (Kingma and Ba, 2015) with the learning rate set to 0.0002.
KSTER introduces 526k trainable parameters, which is 0.85% of the base model. We implement all the models based on JoeyNMT (Kreutzer et al., 2019).

Domain Adaptation for Machine Translation
We build an individual database for each specific domain with in-domain training data in the domain adaptation for machine translation (DAMT) setting. The sizes of the databases are shown in Table 2.

Main results. The DAMT experiment results are shown in Table 3. For domain-specific translation, KSTER outperforms kNN-MT by 1.2 and 1.4 BLEU scores on average in EN-DE and DE-EN translation, respectively. Finetuning achieves the best domain-specific performance on average. However, the performance of finetuned models on the general domain drops significantly due to the catastrophic forgetting problem. The even worse general domain performance of kNN-MT indicates that it overfits severely to the retrieved examples. KSTER performs far better than finetuning and kNN-MT in the general domain, which shows strong generalization ability. We notice that KSTER with the Laplacian kernel performs slightly better than with the Gaussian kernel, since KSTER with the Gaussian kernel tends to ignore the long-tailed retrieved examples.

Robustness test
The performance of an MT model with non-parametric retrieval is influenced by the size and quality of the database. Khandelwal et al. (2021) studied how the performance of kNN-MT changes with the size of the database. In this work, we study how the performance of kNN-MT and KSTER changes with a low-quality database. Specifically, we test the robustness of these models in DAMT when the database is noisy. We add token-level noise to the English sentences in the parallel training data with EDA (Wei and Zou, 2019) (see footnote 5). Each word in a sentence is modified with a probability of 0.1. The candidate modifications are synonym replacement, random insertion, random swap and random deletion, chosen with equal probability. We then use the noisy training data to construct the noisy database.
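The noise procedure described above can be sketched as follows. This is not the EDA library itself but a simplified illustration of the description in the text: each word is modified with probability p, and one of four operations is chosen uniformly. The function name, the `synonyms` dictionary (a stand-in for a WordNet lookup) and the exact behaviour of insertion and swap are assumptions of this sketch.

```python
import random

OPERATIONS = ("synonym", "insert", "swap", "delete")


def add_token_noise(words, synonyms, p=0.1, rng=None):
    """Return a noisy copy of `words`; `synonyms` maps a word to candidate replacements."""
    if rng is None:
        rng = random.Random()
    out = []
    for word in words:
        if rng.random() >= p:
            out.append(word)          # keep the word unchanged
            continue
        op = rng.choice(OPERATIONS)   # four operations, equal probability
        candidates = synonyms.get(word, [])
        if op == "synonym":
            # Replace with a synonym, or keep the word if none is known.
            out.append(rng.choice(candidates) if candidates else word)
        elif op == "insert":
            # Keep the word and insert one of its synonyms at a random position.
            out.append(word)
            if candidates:
                out.insert(rng.randrange(len(out) + 1), rng.choice(candidates))
        elif op == "swap":
            # Swap the word with a randomly chosen earlier word.
            out.append(word)
            if len(out) > 1:
                j = rng.randrange(len(out) - 1)
                out[j], out[-1] = out[-1], out[j]
        # op == "delete": the word is dropped
    return out


words = "the quick brown fox jumps".split()
synonyms = {"quick": ["fast"], "jumps": ["leaps"]}  # hypothetical synonym table
noisy = add_token_noise(words, synonyms, p=0.1, rng=random.Random(0))
```

Passing a seeded `random.Random` makes the noisy database reproducible, which matters when comparing kNN-MT and KSTER on the same corrupted data.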
We study the effects of source-side noise and target-side noise on translation performance. The experiment results are presented in Table 4. Target-side noise has a more negative effect on translation performance than source-side noise. The BLEU scores of KSTER drop less in all settings, which indicates that the proposed method is more robust to a low-quality database.

Multi-Domain Machine Translation
In MDMT, since no domain label is available at test time, examples from all domains are translated with one model. We build a mixed database with the training data of the general domain and the 5 specific domains, which is used in all MDMT experiments. The mixed databases for EN-DE translation and DE-EN translation contain 172M and 167M key-value pairs, respectively.
Baselines. We compare the proposed method with the following baselines.

Besides, KSTER also significantly outperforms joint-training in both general domain performance and averaged domain-specific performance. The proposed method thus has advantages over joint-training in both online adaptation and translation performance.

Footnote 5: We do not experiment with adding noise to the German side, since a German WordNet, which is necessary for synonym replacement, is not available to us.

General-specific domain performance trade-off
We plot the general domain performance and averaged domain-specific performance of kNN-MT with different hyper-parameter selections in Figure 4. We find that kNN-MT performs better in domain-specific translation when the system prediction relies more on the retrieved examples (low bandwidth and high mixing weight). In contrast, better general domain translation performance is achieved when the system prediction relies more on the NMT prediction (high bandwidth and low mixing weight).
There is a trade-off between general and specific domain performance in kNN-MT. Applying an identical kernel and mixing weight to all test examples cannot achieve the best performance in the general domain and in specific domains simultaneously.
KSTER with the Gaussian kernel, which is a generalization of kNN-MT, achieves better performance in both general domain and domain-specific translation, since it applies an adaptive kernel and mixing weight to different test examples. The distributions in Figure 5 indicate that KSTER learns different kernels and different weights for different examples.

Generalization ability over unseen domains
We test the generalization ability of the baselines and KSTER on unseen domains; the results are reported in Table 6. KSTER outperforms all baselines, which shows strong generalization ability.

Inference Speed
A common concern about non-parametric methods in MT is that searching for similar examples may slow down inference. We test the inference speed of KSTER in MDMT for EN-DE translation, which is the setting with the largest database. The averaged inference time of kNN-MT on the general domain and 5 specific domain test sets is 1.15 times that of the base model. The averaged inference time of KSTER is 1.19 times that of the base model, which is only slightly slower than the baselines.

Analysis
In this section, we first conduct ablation studies to verify the effectiveness of each part of the proposed method. Then we conduct a detailed analysis of how kernel-smoothing with retrieved examples helps translation.

Figure 6: The ablation study of k selections in DAMT and MDMT settings in EN-DE translation. In DAMT, we report the averaged performance on 5 specific domains. In MDMT, we report the averaged performance on general domain and 5 specific domains. With all k selections, KSTER outperforms kNN-MT consistently.

Ablation Studies of Proposed Methods
Both learnable kernel and learnable mixing weight bring improvement. In KSTER, both the kernel and the mixing weight are learnable. We study the effect of keeping only one of the two parts learnable in MDMT.
We take KSTER with Gaussian kernel for analysis. The ablation experiment results are presented in Table 7.
Both learnable kernel and learnable mixing weight bring improvement in both general domain and domain-specific translation. Keeping the two parts learnable simultaneously brings additional improvement. Overall, learnable mixing weight is more important than learnable kernel function.
KSTER outperforms kNN-MT with all k selections. We conduct an ablation study over different k selections in both DAMT and MDMT settings in EN-DE translation. We experiment with four k selections, {4, 8, 16, 32}, and plot the results in Figure 6. In DAMT, KSTER achieves the best performance with k = 16. In MDMT, the performance of our method increases with k. With all k selections, KSTER outperforms kNN-MT consistently.

Retrieval dropout improves generalization
We study the effect of retrieval dropout in MDMT and select the KSTER with Laplacian kernel for analysis.
We plot the general domain and averaged domain-specific translation performance of models trained with and without retrieval dropout in Figure 7. Without retrieval dropout, the performance of both general domain and domain-specific translation drops dramatically. The discrepancy between training and inference leads to severe overfitting. This problem is alleviated by the proposed retrieval dropout, which shows that this training strategy improves the generalization ability of KSTER.
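The retrieval dropout idea studied here (dropping the most similar examples during training) can be sketched as below. The sketch assumes that exactly one nearest neighbour is dropped and that k + 1 candidates are therefore considered during training; the number of dropped examples is not stated in this text. Exact brute-force search stands in for the FAISS index, and the function name is hypothetical.

```python
import numpy as np


def retrieve(query, keys, k, training=False):
    """Return indices and L2 distances of k retrieved examples.

    During training, the single nearest neighbour is discarded (assumed form of
    retrieval dropout), so the model cannot rely on a near-exact match.
    """
    distances = np.linalg.norm(keys - query, axis=1)
    order = np.argsort(distances, kind="stable")
    if training:
        order = order[1:]  # drop the most similar example
    top = order[:k]
    return top, distances[top]


keys = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [3.0, 3.0]])
query = np.array([0.0, 0.0])  # identical to keys[0], as happens on training data
train_ids, train_d = retrieve(query, keys, k=2, training=True)
test_ids, test_d = retrieve(query, keys, k=2, training=False)
```

In the toy data the query coincides with a stored key, which mimics the training-time situation where the database was built from the same sentences. With `training=True` that exact match is excluded, so the retrieved set looks more like what the model would see at test time.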

Fine-grained Effects of Kernel-smoothing with Retrieved Examples on Translation
To better understand the effects of kernel-smoothing with retrieved examples on translation, we study the following two research questions.
Kernel-smoothing influences verbs, adverbs and nouns most. To study the first research question, we categorize the predicted words by their Part-of-Speech tags (POS tags). In each decoding step, if the predicted word $y_i$ obtains the highest probability under the example-based distribution but does not obtain the highest probability under the model-based distribution, it is recognized as a prediction determined by kernel-smoothing with retrieved examples.
We compute the ratio of predictions determined by kernel-smoothing across different POS tags. We conduct this analysis on the DAMT task in the EN-DE direction and select the Medical and Subtitles domains as representatives.
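The criterion and the per-tag ratio described above can be expressed compactly. The following sketch uses hypothetical function names and toy distributions over a two-token vocabulary; the POS tags are assumed to come from an external tagger.

```python
from collections import Counter

import numpy as np


def determined_by_smoothing(predicted, p_m, p_e):
    """True if the prediction is the argmax of p_e but not the argmax of p_m."""
    return int(np.argmax(p_e)) == predicted and int(np.argmax(p_m)) != predicted


def ratio_by_pos(steps):
    """steps: iterable of (pos_tag, predicted_token_id, p_m, p_e), one per decoding step."""
    total, hits = Counter(), Counter()
    for tag, predicted, p_m, p_e in steps:
        total[tag] += 1
        hits[tag] += int(determined_by_smoothing(predicted, p_m, p_e))
    return {tag: hits[tag] / total[tag] for tag in total}


steps = [
    ("VERB", 1, [0.6, 0.4], [0.1, 0.9]),  # examples overrule the model
    ("VERB", 0, [0.6, 0.4], [0.7, 0.3]),  # both distributions agree
    ("NOUN", 1, [0.2, 0.8], [0.1, 0.9]),  # both distributions agree
]
ratios = ratio_by_pos(steps)
```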
We report the results in Figure 8. Medical and Subtitles represent two opposite cases: non-parametric retrieval contributes more in the former and less in the latter. We find that across the 2 different domains, kernel-smoothing contributes most to the predictions of verbs, adverbs and nouns, which are morphologically complex word types. Retrieving words in similar contexts may help select the correct form of morphologically complex words.
Kernel-smoothing helps word sense disambiguation. In kernel-smoothing, we search for examples with similar keys, which are contextualized hidden states. We hypothesize that the retrieved examples contain useful context information which helps word sense disambiguation. We test this hypothesis with contrastive translation pairs. A contrastive translation pair contains a source, a reference and one or more contrastive translations. Contrastive translations are constructed by replacing a word in the reference with a word which is another translation of an ambiguous word in the source. NMT systems are used to score the reference and contrastive translations. If an NMT system assigns a higher score to the reference than to all contrastive translations in an example, the NMT system is recognized as making a correct prediction on this example.
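The scoring rule for contrastive pairs reduces to a simple comparison. In this sketch the scores are assumed to be sentence-level log-probabilities produced by the system under test; the function names and the toy numbers are illustrative only.

```python
def is_correct(reference_score, contrastive_scores):
    """Correct only if the reference scores strictly higher than every contrastive translation."""
    return all(reference_score > score for score in contrastive_scores)


def contrastive_accuracy(examples):
    """examples: iterable of (reference_score, [contrastive_score, ...])."""
    examples = list(examples)
    correct = sum(is_correct(ref, contrastive) for ref, contrastive in examples)
    return correct / len(examples)


examples = [
    (-1.0, [-2.0, -3.0]),  # reference preferred over both alternatives: correct
    (-2.0, [-1.5]),        # a contrastive translation scores higher: wrong
    (-0.5, [-0.5]),        # a tie does not count as correct
]
accuracy = contrastive_accuracy(examples)
```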
We use ContraWSD (Gonzales et al., 2017) as the test suite, which contains 7,359 contrastive translation pairs for DE-EN translation. We encode the source sentences from ContraWSD and from the training data of the 5 specific domains by averaged BERT embeddings (Devlin et al., 2018). Then we whiten the sentence embeddings with BERT-whitening proposed by Huang et al. (2021). For each domain, we select the 300 examples from ContraWSD that are most similar to the in-domain data based on the cosine similarity of sentence embeddings. We evaluate the translation performance and word sense disambiguation ability of the base model and KSTER for MDMT on the selected examples for each domain. The results are shown in Figure 9. Experimental results show that KSTER consistently outperforms the base model in both translation performance and word sense disambiguation accuracy, which indicates that kernel-smoothing helps word sense disambiguation in machine translation.
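The example selection step can be illustrated with a generic whitening transform followed by a cosine-similarity ranking. This is a sketch under several assumptions that the text does not settle: whitening is fit on all embeddings at once using the SVD of the covariance matrix, similarity to "the in-domain data" is measured against the centroid of the in-domain embeddings, and random vectors stand in for averaged BERT embeddings. The function names are hypothetical.

```python
import numpy as np


def fit_whitening(embeddings):
    """Return the mean and a linear map that gives the data identity covariance."""
    mu = embeddings.mean(axis=0, keepdims=True)
    cov = np.cov((embeddings - mu).T)
    u, s, _ = np.linalg.svd(cov)
    w = u @ np.diag(1.0 / np.sqrt(s))
    return mu, w


def whiten(embeddings, mu, w):
    return (embeddings - mu) @ w


def select_most_similar(candidates, in_domain, n):
    """Indices of the n candidates with highest cosine similarity to the in-domain centroid."""
    centroid = in_domain.mean(axis=0)
    sims = candidates @ centroid / (
        np.linalg.norm(candidates, axis=1) * np.linalg.norm(centroid)
    )
    return np.argsort(-sims)[:n]


rng = np.random.default_rng(0)
raw = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 5))  # stand-in sentence embeddings
mu, w = fit_whitening(raw)
white = whiten(raw, mu, w)
test_suite, in_domain = white[:50], white[50:]
selected = select_most_similar(test_suite, in_domain, n=10)
```

Whitening removes the dominant shared directions in the embedding space, so the cosine ranking reflects differences between sentences rather than a common offset.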

Conclusion
In this work, we propose kernel-smoothed machine translation with retrieved examples. It improves the generalization ability over existing non-parametric methods, while keeping the advantage of online adaptation.