Chunk-based Nearest Neighbor Machine Translation

Semi-parametric models, which augment generation with retrieval, have led to impressive results in language modeling and machine translation, due to their ability to retrieve fine-grained information from a datastore of examples. One of the most prominent approaches, kNN-MT, exhibits strong domain adaptation capabilities by retrieving tokens from domain-specific datastores (Khandelwal et al., 2021). However, kNN-MT requires an expensive retrieval operation for every single generated token, leading to a very low decoding speed (around 8 times slower than a parametric model). In this paper, we introduce a chunk-based kNN-MT model which retrieves chunks of tokens from the datastore, instead of a single token. We propose several strategies for incorporating the retrieved chunks into the generation process, and for selecting the steps at which the model needs to search for neighbors in the datastore. Experiments on machine translation in two settings, static and “on-the-fly” domain adaptation, show that the chunk-based kNN-MT model leads to significant speed-ups (up to 4 times) with only a small drop in translation quality.


Introduction
Machine translation has seen remarkable advances due to increasingly powerful neural models (Sutskever et al., 2014; Bahdanau et al., 2015; Vaswani et al., 2017). Most deployed systems are fully-parametric (the training data is fully compressed into the parameters of a neural model), but they often struggle when translating rare words or out-of-domain sentences (Koehn and Knowles, 2017), commonly requiring several stages of fine-tuning to adapt to data drift or to new domains. Recently, semi-parametric methods have shown great promise, by combining the strengths of parametric models with external databases of parallel sentences, such as translation memories (Gu et al., 2018; Zhang et al., 2018; Bapna and Firat, 2019a; Meng et al., 2021; Zheng et al., 2021a; Jiang et al., 2021; Martins et al., 2022).
One of the most prominent semi-parametric models for machine translation is the k-Nearest Neighbor Machine Translation model (kNN-MT) (Khandelwal et al., 2021), which has led to impressive results, particularly in domain adaptation settings, without requiring fine-tuning. The kNN-MT model constructs domain-specific datastores of parallel sentences and, at inference time, retrieves similar examples from these datastores, which are used to improve the generation process through the interpolation of probability distributions. However, kNN-MT only retrieves single tokens. This is inefficient, since the model needs to consult the datastore at every generation step, an expensive operation. Consequently, its decoding speed is around 8 times slower than that of a fully-parametric model.
Recent work has introduced several techniques to speed up kNN-MT. Meng et al. (2021) proposed Fast kNN-MT, which constructs a different datastore for each example, by first searching for the nearest neighbors of the source tokens. Wang et al. (2021) introduced Faster kNN-MT, which is similar to Fast kNN-MT but has reduced memory requirements. Martins et al. (2022) proposed pruning the datastore, reducing the size of the keys' representations, and using a cache of retrieval distributions. However, despite yielding some decoding speed-ups, these methods are limited in that they still retrieve a single token at each time step.
In this paper, we propose a simple and efficient chunk-based kNN-MT model. Inspired by RETRO (Borgeaud et al., 2021), the chunk-based kNN-MT model retrieves chunks of tokens instead of single tokens. However, similarly to kNN-MT and unlike RETRO, it does not require any training or fine-tuning of the parametric component: it simply uses a combination of caching and interpolation of probability distributions to incorporate the retrieved tokens. As a result, the model achieves similar translation quality while searching the datastore for neighbors far less often. This leads to decoding speeds up to 4 times faster than those of the vanilla kNN-MT model, and only twice as slow as a fully-parametric model but with considerably higher translation quality.
In sum, our main contributions are:
• We introduce a chunk-based kNN-MT model, which retrieves chunks of tokens from a datastore of examples.
• We propose and compare several approaches for handling the tokens of the retrieved chunks and for selecting the steps at which the model performs retrieval from the datastore.
• We compare translation quality and decoding efficiency on domain adaptation, showing the benefits of chunk-based kNN-MT.
• We propose using chunk-based kNN-MT for on-the-fly adaptation.

Background
In machine translation, a model is given a sentence or document in a source language, x = [x_1, . . ., x_L], and the goal is to output a translation in a target language, y = [y_1, . . ., y_N]. This is commonly done using a parametric sequence-to-sequence model (Sutskever et al., 2014; Bahdanau et al., 2015; Vaswani et al., 2017), in which the encoder receives the source sentence as input and outputs a set of hidden states. Then, at each step t, the decoder attends to these hidden states and outputs a probability distribution over the vocabulary, p_NMT(y_t | y_{<t}, x). Finally, these probability distributions are used in a search procedure to generate the translation, typically beam search (Reddy, 1977).

k-Nearest Neighbor Machine Translation

The kNN-MT datastore D consists of a key-value memory, where each key is the decoder's output representation, f(x, y_{<t}) ∈ R^d, and the value is the corresponding target token y_t ∈ V:

D = {(f(x, y_{<t}), y_t) : (x, y) ∈ S, 1 ≤ t ≤ N},   (1)

where S denotes a set of parallel sentences. Then, at inference time, the model searches the datastore to (approximately) retrieve the set of k nearest neighbors N. The retrieval distribution, p_kNN(y_t | y_{<t}, x), is computed using the neighbors' distances to the current decoder's output representation, d(f(x, y_{<t}), ·):

p_kNN(y_t | y_{<t}, x) ∝ Σ_{(k_j, v_j) ∈ N} 1[y_t = v_j] exp(−d(k_j, f(x, y_{<t})) / T),   (2)

where T is the softmax temperature, k_j denotes the key of the j-th neighbor, and v_j its value. Finally, the two distributions, p_NMT(y_t | y_{<t}, x) and p_kNN(y_t | y_{<t}, x), are interpolated to obtain the final distribution, which is used to generate the translation through beam search:

p(y_t | y_{<t}, x) = (1 − λ) p_NMT(y_t | y_{<t}, x) + λ p_kNN(y_t | y_{<t}, x),   (3)

where λ ∈ [0, 1] is a hyperparameter that controls the weight given to the two distributions.
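As a concrete reference, the retrieval distribution and interpolation of Eqs. 2 and 3 can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's implementation; the function and variable names are ours, and an exact (rather than approximate) search over the retrieved keys is assumed.

```python
import numpy as np

def knn_distribution(query, keys, values, vocab_size, T=10.0):
    """Softmax over negative distances to the k retrieved keys,
    with the mass aggregated per target token (Eq. 2)."""
    d = np.sum((keys - query) ** 2, axis=1)  # squared L2 distance to each key
    w = np.exp(-d / T)
    w = w / w.sum()                          # softmax over the k neighbors
    p = np.zeros(vocab_size)
    for wj, vj in zip(w, values):            # aggregate weight by token id
        p[vj] += wj
    return p

def interpolate(p_nmt, p_knn, lam):
    """Final distribution used by beam search (Eq. 3)."""
    return (1.0 - lam) * p_nmt + lam * p_knn
```

Both functions return proper distributions, so the interpolated result can be fed directly into beam search.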

Chunk-based kNN-MT
We now describe our chunk-based kNN-MT model, illustrated in Figure 1. We first describe the datastore creation ( §3.1), how the model retrieves chunks of tokens from it ( §3.2), and how the retrieved chunks are used during generation ( §3.2.2). Finally, we describe how to select the steps at which the model performs retrieval ( §3.3).

Building the datastore
For the model to retrieve a chunk of tokens of size c instead of a single token, we first need to build a datastore D which also consists of a key-value memory, where each entry's key is the decoder's output representation f(x, y_{<t}), but the value is now a chunk of target tokens y_{t:(t+c−1)}:

D = {(f(x, y_{<t}), y_{t:(t+c−1)}) : (x, y) ∈ S, 1 ≤ t ≤ N}.   (4)
Note that the chunks are sliding windows, i.e., they overlap.
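The datastore construction above can be sketched as follows. Here `states` stands in for the decoder representations f(x, y_{<t}) of one target sentence; the names are illustrative and chunks near the end of the sentence are simply shorter.

```python
def build_chunk_datastore(states, targets, c):
    """Each entry: key = decoder state at step t,
    value = the chunk y_{t:(t+c-1)} (Eq. 4).
    Chunks are sliding windows, so consecutive entries overlap."""
    entries = []
    for t in range(len(targets)):
        chunk = targets[t:t + c]  # shorter near the end of the sentence
        entries.append((states[t], chunk))
    return entries
```

With c = 1 this reduces to the vanilla kNN-MT datastore of Eq. 1.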

Retrieving chunks of tokens
At inference time, when performing retrieval, the model searches the datastore and retrieves the set of k nearest neighbors N, for each beam hypothesis and each example in the current batch. We now describe several strategies for using the retrieved chunks of tokens during generation.

Maintaining the chunk order
Since, in the original sentences used to build the datastore, each chunk is an ordered sequence of tokens, the simplest way for the model to use the retrieved tokens is to consider them in that same order. For this, we only need to compute a retrieval distribution for each token in the chunk, always reusing the same retrieval distances but aligning the chunk tokens with the corresponding time step: at the retrieval step we consider the first token of each neighbor chunk, at the following step the second token, and so on. This retrieval distribution is computed as in kNN-MT, Eq. 2, simply shifting the token indices according to the step. However, by doing this, we ignore the remaining tokens in the chunk, which can also contain relevant information for the current prediction. Moreover, the tokens generated at the previous steps t, . . ., t + j − 1 might not be well "aligned" with the token at the j-th position of the chunk for all neighbors.
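The order-preserving strategy can be sketched as follows: the distances computed at the retrieval step are reused at the subsequent steps, while the chunk position being read advances. This is an illustrative sketch with hypothetical names, assuming squared-L2 distances as in Eq. 2.

```python
import numpy as np

def ordered_chunk_distribution(distances, chunks, offset, vocab_size, T=10.0):
    """Retrieval distribution at step t + offset: reuse the distances from
    the retrieval step t, but read position `offset` of each neighbor chunk."""
    w = np.exp(-np.asarray(distances, float) / T)
    w /= w.sum()                           # softmax over the k neighbors
    p = np.zeros(vocab_size)
    for wj, chunk in zip(w, chunks):
        if offset < len(chunk):            # chunks near sentence end are shorter
            p[chunk[offset]] += wj
    return p
```

Note that the neighbor weights are frozen between retrieval steps, which is exactly the "alignment" weakness discussed above: the weights reflect similarity at step t, not at step t + offset.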

Neighbors' Cache
To avoid the limitation stated above, we propose using a neighbors' cache instead. We keep the tokens of the retrieved chunks in this cache, so that the model has more flexibility about which tokens to select at the current step: it has access to all the tokens present in the retrieved chunks. The neighbors' cache, M, consists of a key-value memory, where each key is a decoder output representation and each value is the corresponding target token, as in the token-level datastore:

M = {(f(x, y_{<t+j−1}), y_{t+j−1}) : 1 ≤ j ≤ c, for each retrieved chunk y_{t:(t+c−1)}}.   (5)

Note, however, that this cache requires the decoder state for every token in the chunk, not just for the first one. Therefore, this information needs to be available in the datastore, whose entries are extended to store the decoder representation of each token in the chunk:

D = {(f(x, y_{<t}), [(f(x, y_{<t+j−1}), y_{t+j−1})]_{j=1..c}) : (x, y) ∈ S}.   (6)

Then, as shown in the diagram of Figure 1, at retrieval steps the model first searches for the nearest neighbors in the datastore, then builds the neighbors' cache with the tokens of the retrieved chunks, and finally uses the first token of each chunk to compute the current retrieval distribution. In contrast, at non-retrieval steps, the model searches for the nearest neighbors in the neighbors' cache instead of retrieving from the datastore. To compute the retrieval distribution it also uses the softmax, as in Eq. 2, but with a different softmax temperature, T′. To incorporate the retrieved tokens, it performs interpolation as before, Eq. 3, replacing the hyperparameter λ by λ′.

Considering batch-beam-level neighbors. By building this neighbors' cache, the model can use all the tokens in the retrieved chunks that correspond to each beam hypothesis of the sentence being translated. However, it still ignores the chunks of tokens corresponding to the other beam hypotheses, which are often quite similar, as well as the other sentences being translated in the same batch, which can contain relevant contextual information when they belong to the same document.
To also leverage these, we propose increasing the number of tokens that the model has access to, by combining the chunks retrieved for the different beam hypotheses and the different examples of the same batch.To do so, we simply need to build a single neighbors' cache for the current batch, to which we feed all the retrieved chunks' tokens.
Considering sentence-level neighbors. To also consider the chunks of tokens retrieved at previous steps of the generation of the current sentence, we propose keeping these in the neighbors' cache, instead of resetting the cache at each retrieval step.
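The cache variants above differ only in which retrieved tokens are pooled and whether the cache is reset between retrieval steps. A minimal sketch, with illustrative names and an exact nearest-neighbor search standing in for the FAISS index used in the paper:

```python
import numpy as np

class NeighborsCache:
    """Flat key-value store over the tokens of retrieved chunks.
    Feeding chunks from all beams/batch examples gives the batch-beam-level
    variant; update(..., reset=False) accumulates entries across retrieval
    steps, giving the sentence-level variant."""
    def __init__(self):
        self.keys, self.values = [], []

    def update(self, retrieved, reset=True):
        # retrieved: iterable of (decoder_state, token) pairs, one per
        # position of every retrieved chunk
        if reset:
            self.keys, self.values = [], []
        for state, tok in retrieved:
            self.keys.append(state)
            self.values.append(tok)

    def search(self, query, k):
        # exact search; a real system would use an ANN index such as FAISS
        d = np.sum((np.asarray(self.keys) - query) ** 2, axis=1)
        idx = np.argsort(d)[:k]
        return [(float(d[i]), self.values[i]) for i in idx]
```

The distances returned by `search` feed the softmax of Eq. 2 (with temperature T′) at non-retrieval steps.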
We empirically compare these different proposed approaches in §4.1.3.

Retrieval Steps Schedule
As the need to perform retrieval slows down decoding considerably, an efficient retrieval schedule is key to achieving decoding speed-ups. The simplest scheduling option is to perform retrieval every i steps. However, we noticed empirically that it is beneficial to perform retrieval more frequently at the beginning of the sentence, as we will see in Table 3 of §4.1.4. To leverage this, we introduce the following schedule.
Let k ∈ {1, 2, . . .} be the retrieval step's index and t_k the corresponding time step (i.e., t_k is the position of the token generated after the k-th retrieval step). We propose using a geometric progression to compute the interval (in tokens) between the k-th and (k + 1)-th retrieval steps, i_k = t_{k+1} − t_k:

i_k = min(i_max, ⌊i_min · r^{k−1}⌋),  with  r = 2^{(1/2) i_max / |x|},   (7)

where i_max and i_min are hyperparameters that define the maximum and minimum interval between retrieval steps, r is the rate at which the interval increases, |x| is the source sentence length, and ⌊·⌋ denotes the floor function. With this progression, the frequency with which the model performs retrieval decays along the generation, until the interval between retrieval steps reaches i_max. For example, with i_min = 2, i_max = 16, and |x| = 20 the model performs retrieval at steps: {1, 3, 7, 20, 36, 52, . . .}. Note that the chunk size, c, is independent of the interval between retrieval steps.
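The schedule can be sketched as follows. This implements one consistent reading of Eq. 7; the exact form of the rate r is a reconstruction from the surrounding text, so treat the constants as assumptions rather than the paper's definitive recipe.

```python
import math

def retrieval_steps(i_min, i_max, src_len, n_tokens):
    """Positions (1-indexed) at which retrieval is performed.
    Interval between the k-th and (k+1)-th retrieval steps:
    i_k = min(i_max, floor(i_min * r**(k-1))), r = 2**(0.5 * i_max / src_len).
    Intervals grow geometrically until they are capped at i_max."""
    r = 2.0 ** (0.5 * i_max / src_len)
    steps, t, k = [], 1, 1
    while t <= n_tokens:
        steps.append(t)
        t += min(i_max, math.floor(i_min * r ** (k - 1)))
        k += 1
    return steps
```

The first interval equals i_min, the intervals never decrease, and retrieval becomes periodic (every i_max tokens) once the cap is reached.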

Experiments
To understand whether chunk-based kNN-MT is able to maintain translation quality while speeding up decoding, we performed experiments on domain adaptation ( §4.1) and on on-the-fly adaptation ( §4.2).

Domain Adaptation
Dataset and metrics. For domain adaptation, we perform experiments on the Medical, Law, IT, and Koran domain data of the multi-domains dataset introduced by Koehn and Knowles (2017), using the splits redefined by Aharoni and Goldberg (2020). To build the datastores, we use the training sets, which have 6,903,141, 19,061,382, 3,602,862, and 524,374 tokens, respectively. The validation and test sets have 2,000 examples for each domain. For evaluation, we use BLEU (Papineni et al., 2002; Post, 2018) and COMET (Rei et al., 2020).
Models. As a baseline, we consider the fully-parametric base MT model: the winning system from the WMT'19 German-English news translation task (Ng et al., 2019) (with 269M parameters), available in Fairseq (Ott et al., 2019). We also compare our chunk-based kNN-MT model with other models that have access to the domain-specific training data: the base model fine-tuned on the domain-specific datasets, the vanilla kNN-MT model (Khandelwal et al., 2021), and the efficient kNN-MT model from Martins et al. (2022).

Settings. For all models that perform retrieval, we retrieve k = 8 neighbors and select the hyperparameter λ for each method and each domain by performing grid search on λ ∈ {0.5, 0.6, 0.7, 0.8}. For the chunk-based kNN-MT, we also perform grid search on λ′ ∈ {0.4, 0.5, 0.6} and T′ ∈ {1, 2, 3}. The hyperparameter values selected for each model and each domain, chosen on the validation set, are stated in Table 10 of App. D. We use the softmax temperatures proposed by Khandelwal et al. (2021), and for the efficient kNN-MT we use the efficiency methods' parameters proposed by Martins et al. (2022). Where not stated otherwise, we use the chunk-based kNN-MT with chunks of size c = 16, with a sentence-level neighbors' cache, and use the geometric progression heuristic, Eq. 7, to select the retrieval steps, with i_min = 2 and i_max = 16, since this is the setting that leads to the best trade-off between translation quality and decoding speed on the validation set. We also follow Martins et al. (2022) and use PCA to reduce the datastore keys' dimension to 256 and the neighbors' cache keys' dimension to 64. To perform search in the datastore and in the neighbors' cache, we use FAISS (Johnson et al., 2019).
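FAISS provides optimized PCA and approximate search; the dimension-reduction step alone can be illustrated with a small NumPy sketch. The function names are ours, and this is only the mathematical idea (project keys onto the top principal components), not the FAISS implementation used in the experiments.

```python
import numpy as np

def fit_pca(keys, d_out):
    """Fit a projection to d_out dimensions (the paper uses 256 for the
    datastore keys and 64 for the neighbors' cache keys)."""
    mean = keys.mean(axis=0)
    _, _, vt = np.linalg.svd(keys - mean, full_matrices=False)
    proj = vt[:d_out].T  # (d_in, d_out), columns are top principal axes
    return mean, proj

def reduce_keys(keys, mean, proj):
    """Center and project keys before indexing / querying."""
    return (keys - mean) @ proj
```

Both datastore keys and queries must be projected with the same `mean` and `proj` so that distances remain comparable.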
For the fine-tuned model, we perform fine-tuning for a maximum of 20 epochs on each domain. We perform grid search on the validation set, using different learning rates, η ∈ {5 × 10^−6, 1 × 10^−5, 5 × 10^−5, 1 × 10^−4}, and two different learning rate schedules (reducing the learning rate on plateau and by the inverse square root), with and without warm-up during 1 epoch. The selected hyperparameters are stated in Table 11 of App. D.
Computational infrastructure. All experiments were performed on a server with 3 RTX 2080 Ti GPUs (11 GB each), a 12-core AMD Ryzen 2920X CPU (24 threads), and 128 GB of RAM. For the decoding speed measurements, we ran each model on a single GPU while no other process was running on the server, to have a controlled environment. The nearest neighbor search in the datastore is performed on the CPU, since not all datastores fit into GPU memory.

Results
The translation scores are reported in Table 1. When comparing fine-tuning the base model with the use of semi-parametric models, the results are not conclusive: in terms of BLEU, the semi-parametric models lead to better translations, but according to COMET this is not the case. We present translation examples for the different domains in App. F.

Decoding speed
As can be seen in Figure 2, the chunk-based kNN-MT model leads to a decoding speed up to two times higher than that of the efficient kNN-MT model of Martins et al. (2022) and up to four times higher than that of the vanilla kNN-MT model of Khandelwal et al. (2021). The chunk-based kNN-MT model also reduces the decoding speed gap to the base MT model to a factor of two, compared to a factor of four for previous work. Moreover, according to the results in Table 1, this speed-up comes without substantially harming the model's translation quality.

What is the best way to incorporate the retrieved tokens?
To understand which chunk incorporation strategy works best, we perform a comparison using chunks of size c = 6 and performing retrieval every 6 steps (i = 6). The results reported in Table 2 show that using a neighbors' cache leads to substantially better BLEU scores. We can also see that having a batch-beam-level cache improves the BLEU score, and that keeping the tokens from previously retrieved chunks in the neighbors' cache further improves the translation quality.

When to perform retrieval?
To understand how we should select the retrieval steps, we compare performing retrieval every i = 6 or i = 8 steps against using the proposed geometric progression (GP), Eq. 7, with i_min = 2 and i_max = 8, i_max = 16, or i_max = 32, to compute the interval between retrieval steps. For this comparison, we used the model with a sentence-level neighbors' cache and considered c = i, or c = i_max when using the geometric progression. We report the BLEU scores in Table 3 and the corresponding decoding speeds in Table 4. This comparison shows that performing retrieval more frequently at the beginning of the translation, by using the proposed geometric progression heuristic (GP), leads to better BLEU scores while achieving a higher decoding speed.

On-the-fly Adaptation
To understand how the chunk-based kNN-MT model behaves in a realistic scenario where the data arrives in streams, we performed experiments in which the model is continuously adapted on the fly.
Task description. In this task, we attempt to simulate a real scenario, described in Figure 3, in which a model translates sentences and a human translator (e.g., a post-editor) corrects the generated translations. Our goal is to understand whether it is better to use the corrected translations to repeatedly fine-tune the base model or to add them to the datastore without touching the base model. To do so, we assume that, before starting translation, we have access to 10% of the dataset (we use the training sets of the medical and law domains), and the goal is to translate the remaining 90% of the examples. To simulate human translators who correct the generated translations, we assume that, after the model translates a block of sentences, it has access to the corresponding reference sentences.
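The datastore side of this loop is simply an append: each corrected translation contributes one entry per target position, with no gradient updates. A sketch under stated assumptions: `encode` is a hypothetical stand-in for the decoder representation f(x, y_{<t}), and the datastore is a plain list rather than a FAISS index.

```python
def on_the_fly_update(datastore, corrected_pairs, encode, c=16):
    """Append chunk entries (Eq. 4) for each newly corrected translation,
    instead of fine-tuning the base model.
    corrected_pairs: iterable of (source, corrected_target_tokens)."""
    for src, tgt in corrected_pairs:
        for t in range(len(tgt)):
            key = encode(src, tgt[:t])        # decoder state for prefix y_<t
            datastore.append((key, tgt[t:t + c]))
    return datastore
```

In a real system this append would be followed by (re)adding the new keys to the search index, which, as the experiments show, is far cheaper than repeated fine-tuning.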

Models.
As baselines, we use the base MT model and the base model fine-tuned on the initially available 10% of the data (fine-tune (once)). We also compare fine-tuning the base model on the initially available data and then fine-tuning after every 32,000 or every 64,000 examples, using all the data available at the time. In all cases, we fine-tuned the model for a maximum of 5 epochs with a learning rate of 5 × 10^−5, using the inverse square root scheduler. Concerning the semi-parametric models, we compare the kNN-MT model and the chunk-based kNN-MT model, building the initial datastore with the initially available data and adding new examples to the datastore after every 250 or 1,000 sentences. We used the same configuration for the chunk-based kNN-MT model as in §4.1: chunks of size c = 16, a sentence-level neighbors' cache, and the geometric progression heuristic to select when to perform retrieval, with i_min = 2 and i_max = 16. We also use the same hyperparameters, stated in Table 10 of App. D.

Results
Figure 4 contains the results of the on-the-fly adaptation experiments on the medical domain. In the top left plot, we see that both the kNN-MT and the chunk-based kNN-MT models lead to higher BLEU scores than the fine-tuned models. Also, in the top right plot, we see that the time needed to add examples to the datastore is much shorter than the time needed to fine-tune the model. This comes at the cost of a higher inference time (bottom left). However, the chunk-based kNN-MT model substantially reduces the inference time gap to the fully-parametric models. The bottom right plot also shows that the chunk-based kNN-MT has a shorter total time than kNN-MT and than the models fine-tuned every 32,000 and 64,000 examples. Concerning the fine-tuned models, there is a BLEU increase when the model is fine-tuned more often; however, this leads to a substantially higher training time. On the other hand, increasing the frequency of datastore updates leads to small BLEU improvements and small increases in training time. The results on the law domain are similar (App. E).

Related Work
Semi-parametric models. Semi-parametric models, which augment a parametric model with a retrieval component, have been shown to be effective on several text generation tasks. For language modeling, Khandelwal et al. (2019) proposed the k-nearest neighbor language model (kNN-LM), in which a language model is augmented with token-based retrieval and uses probability interpolation to incorporate the retrieved tokens. Yogatama et al. (2021) proposed integrating the retrieved tokens with a gating mechanism. Borgeaud et al. (2021) proposed retrieving chunks of tokens and incorporating them with cross-attention, using datastores with trillions of tokens. To increase the kNN-LM's decoding speed, He et al. (2021) proposed a range of techniques, such as datastore pruning, dimension reduction, and adaptive retrieval. Alon et al. (2022) proposed adding, to the datastore entries, pointers to the next token in the original corpus, so that the model can consider the pointed-to entries instead of performing retrieval. Similarly to our approach, this saves retrieval steps by leveraging the original corpus sequences; in our case, however, we do not limit the candidate tokens to the immediately following ones, and we consider the succeeding tokens even if the model has not generated the same prefix token(s).
For machine translation, Gu et al. (2018) introduced a semi-parametric model which uses an out-of-the-box search engine to retrieve similar sentence pairs and incorporates them with shallow and deep fusion. Zhang et al. (2018) proposed retrieving n-grams and using them to up-weight token probabilities. Bapna and Firat (2019a) proposed retrieving sentences similar to the source's n-grams and incorporating them with attention. More recently, Khandelwal et al. (2021) proposed the kNN-MT model, which Zheng et al. (2021a) extended with a network that determines the number of retrieved tokens to consider, and Zheng et al. (2021b) proposed building the datastore using monolingual sentences. As kNN-MT can be up to two orders of magnitude slower than a fully-parametric model, methods that improve its efficiency have been proposed. Meng et al. (2021) and Wang et al. (2021) proposed Fast and Faster kNN-MT, which achieve a higher decoding speed by creating a different datastore, based on the source sentence, for each example. Martins et al. (2022) proposed efficient kNN-MT, which we use as a baseline ( §4.1), by adapting the methods introduced by He et al. (2021) to machine translation and introducing a cache of retrieval distributions to speed up decoding. In this paper, we show that the chunk-based kNN-MT model can further speed up decoding, by retrieving chunks of tokens instead of a single token.
Semi-parametric models have also been applied to other tasks, such as question answering (Lewis et al., 2020; Izacard and Grave, 2021a,b) and dialogue generation (Weston et al., 2014; Fan et al., 2021).

Domain adaptation for machine translation.
Domain adaptation consists of adapting generic models to domain-specific data. The most common method for domain adaptation in machine translation is fine-tuning the model on each domain, but this can be expensive and often leads to catastrophic forgetting (Saunders, 2021). To alleviate this, some works have proposed fine-tuning only part of the model (Wuebker et al., 2018; Bapna and Firat, 2019b; Lin et al., 2021; Liang et al., 2021). Farajian et al. (2017) performed on-the-fly adaptation, by fine-tuning the model on a set of retrieved examples for each source sentence. However, this still requires fine-tuning model parameters.
Several works have introduced domain adaptation methods that do not require fine-tuning the model. Eidelman et al. (2012), Hasler et al. (2014), and Su et al. (2015) proposed using topic models, while Bertoldi et al. (2014) proposed leveraging post-editing information. More recently, Khandelwal et al. (2021) proposed using semi-parametric models which retrieve from domain-specific datastores. The aim of our chunk-based kNN-MT method is to speed up kNN-MT's decoding while maintaining its high translation quality.

Conclusions
In this paper, we proposed a chunk-based kNN-MT model, which retrieves chunks of tokens from a datastore instead of a single token. To use the retrieved chunks' tokens, we proposed several alternatives: keeping the original order or building a neighbors' cache. We also analyzed two approaches for selecting the retrieval steps: retrieving every i steps, or using a geometric progression heuristic to define the interval between retrieval steps. Through experiments on domain adaptation, we showed that chunk-based kNN-MT leads to a considerable speed-up without substantially compromising translation quality. Experiments on on-the-fly adaptation showed that chunk-based kNN-MT produces high-quality translations while being more efficient than previously proposed methods.

Limitations
The scope of this paper is limited to small and medium-sized datastores, due to the memory requirements of very large datastores, for which the proposed model could be even more beneficial. Additionally, we use decoding speed (tokens per second), training time (in minutes), and inference time (in minutes) to compare the efficiency of the different models. However, these metrics depend on the computational infrastructure used, and, consequently, the speed-up gains can vary across different hardware.

A Varying the Chunk Size along the Generation
When using the geometric progression to compute the interval between retrieval steps ( §3.3), the model performs retrieval more frequently at the beginning of the generation of the translation. Because of this, we compare having a fixed chunk size equal to the maximum interval between retrieval steps (c = i_max) with having the chunk size vary along the generation (c_k = i_k). For this comparison, we compute the retrieval steps using the geometric progression with i_min = 2 and i_max = 16, and use a sentence-level cache. The results, in Table 5, indicate that keeping the chunk size fixed leads to slightly better translation quality.

B Using Different Values for k.
In order to understand how the number of retrieved chunks (k) affects the translation quality and the decoding speed, we compare using different values of k. For this comparison, we compute the retrieval steps using the geometric progression with i_min = 2 and i_max = 16, and use a sentence-level cache. We report the BLEU scores and decoding speeds for different values of k in Tables 6 and 7, respectively. These results show that there is a trade-off between translation quality and decoding speed when varying the number of retrieved neighbors (k).

C Experiments on ES-FR and ET-IT.
To understand whether the proposed chunk-based kNN-MT model performs well on language pairs and datasets other than the ones used in the main experiments ( §4.1.1), we perform experiments on Spanish-French (es-fr) and Estonian-Italian (et-it) on two datasets: EMEA and JRC-Acquis (Tiedemann, 2012). For this experiment, we use the multilingual model mBART50 (Tang et al., 2020), compute the retrieval steps using the geometric progression with i_min = 2 and i_max = 16, and use a sentence-level cache. We report the BLEU scores and the decoding speeds (in tokens per second) in Tables 8 and 9, respectively. As can be seen, the chunk-based kNN-MT model improves the translation quality considerably compared with the base MT model, while decoding around 3 times faster than the vanilla kNN-MT model.
D Hyperparameters

In Table 11 we report the values of the hyperparameters used to fine-tune the base model on each domain: learning rate, learning rate scheduler, and whether warm-up steps were used.

E On-the-fly Adaptation on Law Domain
We report the results of the on-the-fly adaptation experiment on the law domain in Figure 5. As in the medical domain, the top left plot shows that the kNN-MT and the chunk-based kNN-MT models lead to higher BLEU scores than the fine-tuned models. We can also see, in the top right plot, that the time the models take to add examples to the datastore along the generation is much shorter than the time needed to fine-tune the model. This comes at the cost of a higher inference time (bottom left plot). However, the chunk-based kNN-MT model substantially reduces the inference time gap between fully-parametric and semi-parametric models, having a shorter total time than the kNN-MT model and the models fine-tuned every 32,000 and 64,000 examples (bottom right plot). Concerning the fine-tuned models, fine-tuning more often leads to slightly better BLEU scores; however, it also leads to a substantially higher training time.

Figure 1 :
Figure 1: Chunk-based kNN-MT scheme. Top left: model procedure when retrieving neighbors from the datastore. Top right: procedure when not performing retrieval. Bottom: retrieval schedule scheme.

Figure 3 :
Figure 3: On-the-fly adaptation scheme. After the model translates an example, a human translator (e.g., a post-editor) corrects the generated translation, which is added to the fine-tuning data or the domain-specific datastore. We use the references to simulate human translations.

Figure 4 :
Figure 4: Analysis of the on-the-fly adaptation experiments on the medical domain. Top left: BLEU scores measured every 4,000 examples. Top right: Time (in minutes) spent by each model on training / creating and updating the datastore and the corresponding BLEU score. Bottom left: Inference time (in minutes) spent by each model to translate the whole set of examples and its BLEU score. Bottom right: Total time (training + inference) spent by each model to translate the set of examples and its BLEU score. For the time plots, the lower right corner is better.
, and on the IT domain in Figure 9. To simplify the examples, we use a batch size of 1 and a beam size of 1.

Table 1 :
BLEU and COMET scores on the multi-domains test set, for a batch size of 8.

Figure 2 :
Plots of the decoding speed (tokens per second) for the different models on the medical, law, IT, and Koran domains, for different batch sizes (1, 8, 16). The generation speed (y-axis) is in log scale.

We can see that the decrease in translation quality when comparing the chunk-based kNN-MT model with the vanilla kNN-MT model is not substantial in terms of BLEU (−1.5 points on average) or COMET (−0.04 points on average). It can also be seen that the chunk-based kNN-MT model leads to considerably better translation scores than the base MT model (+9.1 BLEU and +0.06 COMET points on average) and to slightly better results than the efficient kNN-MT model in terms of BLEU.

Table 2 :
BLEU scores on the multi-domains test set, for a batch size of 8, with c = 6 and i = 6.

Table 3 :
BLEU scores on the multi-domains test set, for a batch size of 8. When using the geometric progression heuristic (GP), the average interval with i_max = 16 is 5.97, and with i_max = 32 it is 6.85.

Table 4 :
Decoding speed (tokens per second) on the multi-domains test set, for a batch size of 8, with c = i max .

Table 5 :
BLEU scores on the multi-domains test set, for a batch size of 8.

Table 6 :
BLEU scores on the multi-domains test set, for a batch size of 8.

Table 7 :
Decoding speed (tokens per second) on the multi-domains test set, for a batch size of 8.

Table 8 :
BLEU scores on the EMEA and JRC test sets, for a batch size of 8.

Table 9 :
Decoding speed (tokens per second) on the EMEA and JRC test sets, for a batch size of 8.

Table 10 :
Values of the hyperparameters: number of neighbors to be retrieved k, interpolation coefficient λ, and retrieval softmax temperature T .