Towards Example-Based NMT with Multi-Levenshtein Transformers

Retrieval-Augmented Machine Translation (RAMT) is attracting growing attention. This is because RAMT not only improves translation metrics, but is also assumed to implement some form of domain adaptation. In this contribution, we study another salient trait of RAMT, its ability to make translation decisions more transparent by allowing users to go back to examples that contributed to these decisions. For this, we propose a novel architecture aiming to increase this transparency. This model adapts a retrieval-augmented version of the Levenshtein Transformer and makes it amenable to simultaneously edit multiple fuzzy matches found in memory. We discuss how to perform training and inference in this model, based on multi-way alignment algorithms and imitation learning. Our experiments show that editing several examples positively impacts translation scores, notably increasing the number of target spans that are copied from existing instances.


Introduction
Neural Machine Translation (NMT) has become increasingly efficient and effective thanks to the development of ever larger encoder-decoder architectures relying on Transformer models (Vaswani et al., 2017). Furthermore, these architectures can readily integrate instances retrieved from a Translation Memory (TM) (Bulte and Tezcan, 2019; Xu et al., 2020; Hoang et al., 2022), thereby improving the overall consistency of new translations compared to past ones. In this context, the autoregressive and generative nature of the decoder can make the process (a) computationally inefficient, when the new translation has very close matches in the TM; (b) practically ineffective, as there is no guarantee that the output translation, regenerated from scratch, will resemble that of similar texts.
An alternative that is attracting growing attention is to rely on computational models tailored to edit existing examples and adapt them to new source sentences, such as the Levenshtein Transformer (LevT) model of Gu et al. (2019). This model can effectively handle fuzzy matches retrieved from memories, performing minimal edits wherever necessary. As decoding in this model is non-autoregressive, it is likely to be computationally more efficient. More important for this work, the reuse of large portions of existing translation examples is expected to yield translations that (a) are more correct; (b) can be transparently traced back to the original instance(s), enabling the user to inspect the edit operations that were performed. To evaluate claim (a), we translate our test data (details in Section 5.1) using a basic implementation of a retrieval-augmented LevT with TM (TM-LevT). We separately compute the modified unigram and bigram precisions (Papineni et al., 2002) for tokens that are copied from the fuzzy match and tokens that are generated by the model.1 We observe that copies account for the largest share of output units and have better precision (see Table 1).

                     precision   % units
  unigram  copy         87.5       64.9
           gen          52.6       35.1
  bigram   copy-copy    81.4       55.0
           copy-gen     40.1        8.9
           gen-copy     39.5       10.7
           gen-gen      34.2       25.4

Table 1: Modified precision of copy vs. generated unigrams and bigrams for TM-LevT. For bigrams, we consider four cases: bigrams made of two copy tokens, two generated tokens, and one token of each type.
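The copy/gen split of Table 1 can be reproduced with a small amount of bookkeeping. Below is a minimal sketch, assuming each output token carries a provenance label ('copy' or 'gen'); the function names and the labeling scheme are illustrative, not the authors' implementation.

```python
from collections import Counter

def modified_unigram_precision(candidate, reference, labels, label):
    """Clipped (modified) unigram precision, restricted to candidate
    tokens whose provenance label matches `label` ('copy' or 'gen')."""
    ref_counts = Counter(reference)
    subset = [tok for tok, lab in zip(candidate, labels) if lab == label]
    matched, cand_counts = 0, Counter()
    for tok in subset:
        cand_counts[tok] += 1
        if cand_counts[tok] <= ref_counts[tok]:  # clip by reference count
            matched += 1
    return matched / len(subset) if subset else 0.0
```

The same counting logic extends to bigrams by sliding a window of two tokens and labeling each bigram with the pair of provenance labels.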
Based on this observation, our primary goal is to further optimize the number of tokens copied from the TM. To do so, we propose simultaneously editing multiple fuzzy matches retrieved from memory, using a computational architecture, Multi-LevT (or TM N -LevT for short), which extends TM-LevT to handle several initial translations. The benefit is twofold: (a) an increase in translation accuracy; (b) more transparency in the translation process. Extending TM-LevT to TM N -LevT however requires solving multiple algorithmic and computational challenges: the need to compute Multiple String Alignments (MSAs) between the matches and the reference translation, a notoriously difficult problem, and the design of appropriate training procedures for this alignment module.
Our main contributions are the following:

1. a new variant of the LevT model that explicitly maximizes target coverage ( §4.2);
2. a new training regime to handle an extended set of editing operations ( §3.3);
3. two novel multi-way alignment ( §4.2) and realignment ( §6.2) algorithms;
4. experiments in 11 domains where we observe an increase in BLEU scores, COMET scores, and the proportion of copied tokens ( §6).
Our code and experimental configurations are available on GitHub.2

2 Preliminaries / Background

TM-based machine translation
Translation Memories (TMs), which store examples of past translations, are a primary component of professional Computer Assisted Translation (CAT) environments (Bowker, 2002). Given a translation request for source sentence x, TM-based translation is a two-step process: (a) retrieval of one or several instances (x̃, ỹ) whose source side resembles x; (b) adaptation of the retrieved example(s) to produce a translation. In this work, we mainly focus on step (b) and assume that the retrieval part is based on a fixed similarity measure ∆ between x and stored examples. In our experiments, we use

    ∆(x, x̃) = 1 − ED(x, x̃) / max(|x|, |x̃|),

with ED(x, x̃) the edit distance between x and x̃ and |x| the length of x. We only consider TM matches for which ∆ exceeds a predefined threshold τ and filter out the remaining ones. The next step, adaptation, is usually performed by humans with CAT tools.2 Here, we instead explore ways to perform this step automatically, as in Example-Based MT (Nagao, 1984; Somers, 1999; Carl et al., 2004).

2 https://github.com/Maxwell1447/fairseq/
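The retrieval step can be sketched in a few lines. The similarity below follows the max-normalized edit-distance formula given in the text; the function names (`retrieve`, `memory`) and the cap of 3 matches are illustrative assumptions, not the paper's exact implementation.

```python
def edit_distance(a, b):
    """Token-level Levenshtein distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ta in enumerate(a, 1):
        cur = [i]
        for j, tb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ta != tb)))   # substitution
        prev = cur
    return prev[-1]

def similarity(x, x_match):
    """Fuzzy-match score Delta in [0, 1]."""
    return 1.0 - edit_distance(x, x_match) / max(len(x), len(x_match))

def retrieve(x, memory, tau=0.4, n_max=3):
    """Keep the n_max most similar TM entries with Delta > tau."""
    scored = [(similarity(x, src), src, tgt) for src, tgt in memory]
    scored = [s for s in scored if s[0] > tau]
    return sorted(scored, key=lambda s: s[0], reverse=True)[:n_max]
```

In practice, retrieval over a large TM would use an index rather than this exhaustive scan; only the scoring and thresholding logic is meant to mirror the text.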

Adapting fuzzy matches with LevT
The Levenshtein Transformer of Gu et al. (2019) is an encoder-decoder model which, given a source sentence, predicts edits that are applied to an initial translation in order to generate a revised output (Figure 1). The initial translation can either be empty or correspond to a match from a TM. Two editing operations are considered: insertion and deletion. The former is composed of two steps: first, placeholder insertion, which predicts the position and number of new tokens; second, the prediction of tokens to fill these positions. Editing operations are applied iteratively in rounds of refinement steps until a final translation is obtained.
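One LevT editing round can be illustrated with a toy function where the edit decisions (which tokens to keep, how many placeholders to insert at each gap, and the tokens filling them) are given explicitly rather than predicted by classifiers; this is a sketch of the data flow only, not the model itself.

```python
def apply_levt_edits(tokens, keep_mask, plh_counts, fills):
    """One LevT editing round: deletion, placeholder insertion, token
    prediction. keep_mask[i] says whether tokens[i] survives deletion;
    plh_counts[j] is the number of <PLH> inserted in gap j of the
    deleted sequence (gap 0 = before the first kept token); `fills`
    are the tokens predicted for the placeholders, left to right."""
    kept = [t for t, k in zip(tokens, keep_mask) if k]
    out, fills = [], list(fills)
    for j in range(len(kept) + 1):
        for _ in range(plh_counts[j]):
            out.append(fills.pop(0))   # token prediction fills each <PLH>
        if j < len(kept):
            out.append(kept[j])
    return out
```

For instance, editing the match "the cat sits on the mat" towards "the cat sleeps on the mat" deletes "sits" and inserts one placeholder in its gap, later filled by "sleeps".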
In LevT, these predictions rely on a joint encoding of the source and the current target and apply in parallel for all positions, which makes LevT a representative of non-autoregressive translation (NAT) models. As editing operations are not observed in the training data, LevT resorts to imitation learning, based on the generation of decoding configurations for which the optimal prediction is easy to compute. Details are in (Gu et al., 2019); see also (Xu and Carpuat, 2021), which extends LevT with a repositioning operation and uses it to decode with terminology constraints, as well as the studies of Niwa et al. (2022) and Xu et al. (2023), who also explore the use of LevT in conjunction with TMs.

Processing multiple fuzzy matches
One of the core differences between TM N -LevT and LevT is its ability to handle multiple matches. This implies adapting the edit steps (in inference) and the roll-in policy (in imitation learning).
Inference in TM N -LevT. Decoding follows the same key ideas as for LevT (see Figure 1) but enables co-editing an arbitrary number N of sentences. Our implementation (1) applies deletion, then placeholder insertion simultaneously on each retrieved example; (2) combines all examples position-wise into one single candidate sentence; (3) performs additional steps as in LevT: this includes first completing token prediction, then performing iterative refinement operations that edit the sentence to correct mistakes and improve it ( §3.2).
Training in TM N -LevT. TM N -LevT is trained with imitation learning and needs to learn the edit steps described above, both for the first pass (1-2) and for the iterative refinement steps (3). This means that we teach the model to perform the sequence of correct edit operations needed to iteratively generate the reference output, based on the step-by-step reproduction of what an expert 'teacher' would do. For this, we need to compute the optimal operation associated with each configuration (or state) ( §4). The roll-in and roll-out policies specify how the model is trained ( §3.3).
3 Multi-Levenshtein Transformer

3.1 Global architecture

TM N -LevT has two modes of operation: (a) the combination of multiple TM matches into one single sequence through alignment; (b) the iterative refinement of the resulting sequence. In step (a), we use the Transformer encoder-decoder architecture, extended with additional embedding and linear layers (see Figure 2) to accommodate multiple matches. In each of the N retrieved instances y = (y_1, ..., y_N), the token y_{n,i} (the i-th token of the n-th instance) is encoded as E_{y_{n,i}} + P_i + S_n, where E ∈ R^{|V|×d_model}, P ∈ R^{L_max×d_model} and S ∈ R^{(N+1)×d_model} are respectively the token, position, and sequence embeddings. The sequence embedding identifies TM matches, and the positional encodings are reset for each y_n. The extra row in S is used to identify the results of the combination and will yield a different representation for these single sequences.3 Once embedded, TM matches are concatenated and passed through multiple Transformer blocks, until reaching the last layer, which outputs one hidden state per input token. The learned policy π_θ computes its decisions from these hidden states. We use four classifiers, one for each sub-policy:

1. deletion: predicts keep or delete for each token y^del_{n,i}, with a projection matrix A ∈ R^{2×d_model};

2. insertion: predicts the number of placeholder insertions between y^plh_{n,i} and y^plh_{n,i+1}, with a projection matrix B ∈ R^{(K_max+1)×2d_model}, where K_max is the maximum number of insertions;

3. combination: predicts whether token y^cmb_{n,i} in sequence n must be kept in the combination, with a projection matrix C ∈ R^{2×d_model};

4. prediction: predicts a token of the vocabulary V at each placeholder position, with a projection matrix D ∈ R^{|V|×d_model}.

Except for step 3, these classifiers are similar to those used in the original LevT.
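The input encoding E_{y_{n,i}} + P_i + S_n can be sketched with toy numpy arrays standing in for the learned embedding tables; all dimensions below are illustrative, not the paper's configuration.

```python
import numpy as np

rng = np.random.default_rng(0)
V, L_MAX, N, D = 100, 32, 3, 8          # toy sizes (d_model = 8)
E = rng.normal(size=(V, D))             # token embeddings
P = rng.normal(size=(L_MAX, D))         # positional embeddings (reset per match)
S = rng.normal(size=(N + 1, D))         # sequence embeddings (+1 for the combined seq)

def embed_matches(matches):
    """Each token y_{n,i} is encoded as E[y_{n,i}] + P[i] + S[n]; the
    embedded matches are then concatenated along the time axis."""
    rows = [E[tok] + P[i] + S[n]
            for n, seq in enumerate(matches)
            for i, tok in enumerate(seq)]
    return np.stack(rows)

matches = [[5, 7, 2], [5, 9]]           # two toy TM matches (token ids)
H = embed_matches(matches)
```

Note how the positional index restarts at 0 for the second match, while the sequence embedding S[n] disambiguates which match each token comes from.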

Decoding
Decoding is an iterative process: in a first pass, the N fuzzy matches are combined to compute a candidate translation; then, as in LevT, an additional series of iterative refinement rounds (Gu et al., 2019) is applied until convergence or timeout. Figure 3 illustrates the first pass, where N = 2 matches are first edited in parallel, then combined into one output.
To predict deletions (resp. insertions and token predictions), we apply the argmax operator to π^del_θ (resp. π^plh_θ, π^tok_θ). For combinations, we need to aggregate separate decisions π^cmb_θ (one per token and match) into one sequence. For this, at each position, we pick the most likely token.
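The position-wise aggregation can be sketched as follows: each aligned position collects one (probability, token) candidate per match that keeps a token there, and the most likely one wins. This is a toy illustration of the selection rule, not the tensorized implementation.

```python
def combine(columns):
    """Position-wise combination: columns[j] holds, for each match that
    proposes a token at position j, a (probability, token) pair from the
    combination classifier; we keep the most likely token per position."""
    return [max(col)[1] for col in columns if col]

merged = combine([[(0.9, "the"), (0.8, "a")],
                  [(0.2, "dog"), (0.7, "cat")],
                  [(0.6, "sits")]])
```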
During iterative refinement, we bias the model towards generating longer sentences, since LevT outputs tend to be too short (Gu et al., 2019). As in LevT, we add a penalty to the probability of inserting zero placeholders in π^plh_θ (Stern et al., 2019). This only applies in the refinement steps, to avoid creating more misalignments (see §6.2).
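The length bias amounts to down-weighting the "insert 0 placeholders" option before taking the argmax. A minimal sketch (the default penalty of 3 follows the decoding setup in Appendix A; everything else is illustrative):

```python
def insertions_with_penalty(plh_logprobs, penalty=3.0):
    """Choose the number of placeholders per gap after subtracting a
    penalty from the log-probability of inserting zero, biasing the
    model towards longer outputs (Stern et al., 2019)."""
    choices = []
    for gap in plh_logprobs:           # one row of log-probs per gap
        scores = list(gap)
        scores[0] -= penalty           # penalize "insert 0 <PLH>"
        choices.append(max(range(len(scores)), key=scores.__getitem__))
    return choices
```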

Imitation learning
We train TM N -LevT with imitation learning (Daumé et al., 2009; Ross et al., 2011), teaching the system to perform the right edit operation for each decoding state. As these operations are unobserved in the training data, the standard approach is to simulate decoding states via a roll-in policy; for each of these, the optimal decision is computed via an expert policy π*, composed of intermediate experts π*_del, π*_plh, π*_cmb, and π*_tok. First, from the initial set of sentences y^init, the unrolling of π* produces intermediate states (y^del, del*), (y^plh, plh*), (y^cmb, cmb*), (y^tok, tok*) (see top left in Figure 4). Moreover, in this framework, it is critical to mitigate the exposure bias and generate states that result from non-optimal past decisions (Zheng et al., 2023). For each training sample (x, y_1, ..., y_N, y*), we simulate multiple additional states as follows (see Figure 4 for the full picture). We begin with the operations involved in the first decoding pass:4

1. Additional triplets ♯: π_{rnd•del•N} turns y* into N random substrings, which simulates the editing of N artificial examples.
2. Token selection ♯ (uses π^sel): our expert policy never aligns two distinct tokens at a given position ( §4.3). We simulate such cases, which may occur at inference, as follows: with probability γ, each <PLH> is replaced with a random token from the fuzzy matches (Figure 5).
The expert always completes its translation in one decoding pass. Policies used in iterative refinement are thus trained with the following simulated states, based on the roll-in and roll-out policies used in LevT and its variants (Gu et al., 2019; Xu et al., 2023; Zheng et al., 2023):

3. Re-insert deleted tokens (uses π^plh_θ, π^tok_θ): tokens are randomly deleted from the target, and the model is trained to re-insert them.

4. Correct mistakes (uses π^tok_θ): using the output of token prediction y^{post•del}, teach the model to erase the wrongly predicted tokens.

5. Remove extra tokens ♯ (uses π^ins_θ, π^tok_θ): insert placeholders in y^{post•tok} and predict tokens, yielding y^{post•del•extra}, which trains the model to delete wrong tokens. These sequences differ from case (4) in the way <PLH> are inserted.
As token prediction applies for both decoding steps, these states also improve the first pass.
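One of these simulated states, random deletion of reference tokens, can be sketched as below; the deletion rate β and the generator are illustrative, but the invariant is the one the expert relies on: the per-gap placeholder counts must suffice to restore y*.

```python
import random

def random_deletion_state(y_star, beta=0.3, seed=0):
    """Simulate one roll-in state: drop each reference token with
    probability beta; the expert insertion decision for each gap is the
    number of tokens deleted there, so that applying it restores y*."""
    rnd = random.Random(seed)
    kept, expert_plh = [], [0]
    for tok in y_star:
        if rnd.random() < beta:
            expert_plh[-1] += 1        # one more <PLH> needed in this gap
        else:
            kept.append(tok)
            expert_plh.append(0)       # open a new gap after this token
    return kept, expert_plh

kept, expert_plh = random_deletion_state("the cat sits on the mat".split())
```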
The expert decisions associated with most states are obvious (e.g. inserting deleted tokens like in state (3), or deleting wrongly predicted tokens in state (4)), except for the initial state and state (5), which require an optimal alignment computation.

Figure 5: Noising y^cmb with π^sel, using tokens from the fuzzy matches.

Optimal Alignment
Training the combination operation introduced above requires specifying the expert decision for each state. While LevT derives its expert policy π* from the computation of edit distances, we introduce another formulation based on the computation of maximal covers. For N = 1, these formulations can be made equivalent5 (Gusfield, 1997).

N-way alignments
We formulate the problem of optimal editing as an N-way alignment problem (see Figure 6), which we define as follows. Given N sentences y_1, ..., y_N and the target sentence y*, an N-way alignment is a graph (V, V*, E), where each node set V_n (resp. V*) contains one node per token of y_n (resp. y*), and each edge (n, i, j) ∈ E connects node i in V_n to node j in V*. An N-way alignment satisfies properties (i)-(ii):

(i) edges connect identical (matching) tokens: (n, i, j) ∈ E ⇒ y_{n,i} = y*_j;

(ii) edges that are incident to the same subset V_n do not cross: for (n, i, j) and (n, i', j') in E, i < i' ⇒ j < j'.

An optimal N-way alignment E* maximizes the coverage of tokens in y*, then the total number of edges, where y*_j is covered if there exists at least one edge (n, i, j) ∈ E. Denoting E_cov the set of alignments maximizing target coverage, E* = argmax_{E ∈ E_cov} |E|.

5 When the cost of replace is higher than insertion + deletion; this is the case in the original LevT code.
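The two-level criterion (coverage first, then number of edges) can be sketched with a toy selector over candidate alignments, each represented as a set of (n, i, j) edges; this brute-force comparison is for illustration only, not the paper's search procedure.

```python
def coverage(alignment):
    """Number of target positions covered by at least one edge (n, i, j)."""
    return len({j for (_, _, j) in alignment})

def best_alignment(candidates):
    """Pick the alignment maximizing target coverage first, breaking
    ties by the total number of edges."""
    return max(candidates, key=lambda E: (coverage(E), len(E)))

E1 = {(1, 0, 0), (1, 1, 1)}              # covers 2 target positions
E2 = {(1, 0, 0), (2, 0, 1), (2, 1, 2)}   # covers 3 target positions
```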

Solving optimal alignment
Computing the optimal N-way alignment is NP-hard (see Appendix D). The problem can in principle be solved using Dynamic Programming (DP) techniques similar to Multiple Sequence Alignment (MSA) (Carrillo and Lipman, 1988; Gusfield, 1997), with a complexity O(N |y*| ∏_n |y_n|). We instead implemented the following two-step heuristic approach:

1. separately compute alignment graphs between each y_n and y*, then extract the k-best 1-way alignments {E_{n,1}, ..., E_{n,k}}; this requires time O(k |y_n| |y*|) using DP;

2. search for the optimal recombination of these graphs, selecting one 1-way alignment per example. Assuming N and k are small, we perform an exhaustive search in O(k^N).

Figure 6: Illustration of the optimal N-way alignment, which maximizes a global coverage criterion (6b), while independent alignments do not guarantee optimal usage of the information present in TM examples (6a): (a) combination of the optimal 1-way alignments; (b) optimal 2-way alignment.
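Step 1 reduces, for each match, to a non-crossing alignment between equal tokens, i.e. a longest-common-subsequence computation. The sketch below extracts a single best 1-way alignment (the paper extracts the k-best; generalizing the backtracking is omitted here for brevity).

```python
def one_way_alignment(y_n, y_star, n=0):
    """Best 1-way alignment between one match and the target, computed
    as a longest common subsequence (non-crossing edges between equal
    tokens), returned as a list of (n, i, j) edges."""
    L1, L2 = len(y_n), len(y_star)
    dp = [[0] * (L2 + 1) for _ in range(L1 + 1)]
    for i in range(L1):
        for j in range(L2):
            dp[i + 1][j + 1] = (dp[i][j] + 1 if y_n[i] == y_star[j]
                                else max(dp[i][j + 1], dp[i + 1][j]))
    edges, i, j = [], L1, L2
    while i and j:                      # backtrack one optimal alignment
        if y_n[i - 1] == y_star[j - 1] and dp[i][j] == dp[i - 1][j - 1] + 1:
            edges.append((n, i - 1, j - 1)); i -= 1; j -= 1
        elif dp[i - 1][j] >= dp[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return edges[::-1]
```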

From alignments to edits
From an alignment (V, V*, E), we derive the optimal edits needed to compute y* and the associated intermediary sequences. Edges in E indicate the tokens that are preserved throughout this process: for all n and i, y_{n,i} is kept only if (n, i, j) ∈ E for some j, and is deleted otherwise; placeholders are then inserted wherever target tokens remain uncovered, and the remaining <PLH> symbols are replaced by the corresponding target tokens of y*. The expert policy π* thus edits the examples y_1, ..., y_N into y* based on the optimal alignment (V, V*, E*).

5.1 Data and metrics

We focus on translation from English to French and consider multiple domains, which allows us to cover a wide range of scenarios with a varying density of matching examples. For each training sample (x, y), we retrieve up to 3 in-domain matches. We filter matches x̃_n to keep only those with ∆(x, x̃_n) > 0.4. We then manually split each of the 11 datasets into train, valid, test-0.4, and test-0.6, where the valid and test sets contain 1,000 lines each. test-0.4 (resp. test-0.6) contains samples whose best match is in the range [0.4, 0.6[ (resp. [0.6, 1[). As these two test sets are only defined based on the best match score, some test instances may only retrieve 1 or 2 close matches (statistics are in Table 6).

The benefits of multiple matches
We compare two models in Table 2: one trained with one TM match, the other with three. We report the performance of systems trained using N = 1, 2, 3 for each domain and test set in Table 4 (BLEU) and Table 12 (COMET). We see comparable average BLEU scores for N = 1 and N = 3, with large variations across domains, from which we conclude that: (a) using 3 examples has a smaller return when the best match is poor, meaning that bad matches are less likely to help (test-0.4 vs. test-0.6); (b) using 3 examples seems advantageous for narrow domains, where training actually exploits several close matches (see also Appendix F). We finally note that COMET scores9 for TM 3 -LevT are always slightly lower than for TM 1 -LevT, which prompted us to develop several extensions.
(Figure 7: a misalignment example, contrasting the faulty combination "The cat _ on the the , it seems ." with the intended "The cat _ on the mat , it seems .")

Realignment
In preliminary experiments, we observed that small placeholder prediction errors in the first decoding pass could turn into catastrophic misalignments (Figure 7). To mitigate such cases, we introduce an additional realignment step during inference, where some predicted placeholders are added or removed if this improves the global alignment. Realignment is formulated as an optimization problem that trades off the score −log π^plh_θ of placeholder insertion against an alignment cost (see Appendix C).
We assess realignment for N = 3 (Tables 4 and 12) and observe small yet consistent average gains (+0.2 BLEU, +1.5 COMET) for both test sets.

Pre-training
Another improvement uses pre-training with synthetic data. For each source/target pair (x, y) in the pre-training corpus, we simulate N fuzzy matches by extracting from y N substrings ỹ_n of length ≈ |y| · r, with r ∈ [0, 1]. Each ỹ_n is then augmented as follows:

1. We randomly insert placeholders to increase the length by a random factor between 1 and 1 + f (f = 0.5 in our experiments).
2. We use the CamemBERT language model (Martin et al., 2020) to fill the masked tokens.
These artificial instances simulate diverse fuzzy matches and are used to pre-train a model, using the same architecture and setup as in §5. The autoregressive (AR) system is our implementation of (Bulte and Tezcan, 2019).
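The two augmentation steps above can be sketched as follows, with a random contiguous substring followed by placeholder insertion; a masked language model such as CamemBERT would then fill the <PLH> slots, which is left out here. Parameter values follow the text; the function itself is an illustrative assumption.

```python
import random

def make_synthetic_match(y, r=0.6, f=0.5, seed=0):
    """Simulate one fuzzy match from target y: extract a substring of
    length ~ r*|y|, then insert <PLH> tokens to grow it by a random
    factor in [1, 1+f]. A masked LM would subsequently fill the <PLH>."""
    rnd = random.Random(seed)
    length = max(1, round(len(y) * r))
    start = rnd.randrange(len(y) - length + 1)
    sub = y[start:start + length]            # contiguous substring of y
    n_plh = rnd.randrange(int(len(sub) * f) + 1)
    for _ in range(n_plh):
        sub.insert(rnd.randrange(len(sub) + 1), "<PLH>")
    return sub

match = make_synthetic_match(list("abcdefghij"))
```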
Knowledge distillation. Knowledge Distillation (KD) (Kim and Rush, 2016) is used to mitigate the effect of the multimodality of NAT models (Zhou et al., 2020) and to ease the learning process. We trained a TM N -LevT model with distilled samples (x, ỹ_1, ..., ỹ_N, ỹ), where the automatic translations ỹ_n and ỹ are derived from their respective sources x̃_n and x with an autoregressive teacher trained on a concatenation of all the training data.
We observe that KD is beneficial (+0.3 BLEU) for low-scoring matches (test-0.4) but hurts performance (-1.7 BLEU) for the better ones in test-0.6. This may be because the teacher model, with a BLEU score of 56.7 on test-0.6, fails to provide the excellent starting translations the model can access when using non-distilled data.

Ablation study
We evaluate the impact of the various elements in the mixture roll-in policy via an ablation study (Table 13). Except for π^sel, every new element in the roll-in policy increases performance. As for π^sel, our system seems to be slightly better with it than without. An explanation is that, in case of misalignment, the model is biased towards selecting the first, most similar example sentence. As an ablation, instead of aligning by globally maximizing coverage ( §4.2), we also compute alignments that maximize coverage independently, as in Figure 6a.
A complete run of TM N -LevT is in Appendix F.

Related Work
As for other machine learning applications, such as text generation (Guu et al., 2018), efforts to integrate a retrieval component in neural-based MT have intensified in recent years. One motivation is to increase the transparency of ML models by providing users with tangible traces of their internal computations in the form of retrieved examples (Rudin, 2019). For MT, this is achieved by integrating fuzzy matches retrieved from memory as an additional conditioning context. This can be performed simply by concatenating the retrieved target instance to the source text (Bulte and Tezcan, 2019), an approach that straightforwardly accommodates several TM matches (Xu et al., 2020), or the simultaneous exploitation of their source and target sides (Pham et al., 2020). More complex schemes to combine retrieved examples with the source sentence are in (Gu et al., 2018; Xia et al., 2019; He et al., 2021b). The recent work of Cheng et al. (2022) handles multiple complementary TM examples retrieved in a contrastive manner that aims to enhance source coverage. Cai et al. (2021) also handle multiple matches and introduce two novelties: (a) retrieval is performed in the target language and (b) similarity scores are trainable, which makes it possible to evaluate retrieved instances based on their usefulness in translation. Most of these attempts rely on an autoregressive (AR) decoder, meaning that the impact of TM match(es) on the final output is only indirect. The use of TM matches with a NAT decoder is studied in (Niwa et al., 2022; Xu et al., 2023; Zheng et al., 2023), which adapt LevT for this specific setting, using one single retrieved instance to initialize the edit-based decoder. Other evolutions of LevT, notably in the context of constrained decoding, are in (Susanto et al., 2020; Xu and Carpuat, 2021), while a more general account of NAT systems is in (Xiao et al., 2023). Zhang et al. (2018) explore a different set of techniques to improve translation using retrieved segments instead of full sentences. Extending kNN-based language models (He et al., 2021a) to the conditional case, Khandelwal et al. (2021) propose k-nearest neighbor MT, searching for target tokens that have similar contextualized representations at each decoding step, an approach further elaborated by Zheng et al. (2021) and Meng et al. (2022), and extended to chunks by Martins et al. (2022).

Conclusion and Outlook
In this work, we have extended the Levenshtein Transformer with a new combination operation, making it able to simultaneously edit multiple fuzzy matches and merge them into an initial translation that is then refined. Owing to multiple algorithmic contributions and improved training schemes, we have been able to (a) increase the number of output tokens that are copied from retrieved examples and (b) obtain performance improvements compared to using one single match. We have also argued that retrieval-based NMT is a simple way to make the translation process more transparent for end users.
Next, we would like to work on the retrieval side of the model: first, to increase the diversity of fuzzy matches, e.g. thanks to contrastive retrieval, but also to study ways to train the retrieval mechanism and to extend this approach to search monolingual (target-side) corpora. Another line of work will combine our techniques with other approaches to TM-based NMT, such as keeping track of the initial translation(s) on the encoder side.

Limitations
As this work was primarily designed as a feasibility study, we have left aside several issues related to performance, which may explain the remaining gap with published results on similar datasets. First, we have restricted the encoder to only encode the source sentence, even though enriching the input side with the initial target(s) has often been found to increase performance (Bulte and Tezcan, 2019), also for NAT systems (Xu et al., 2023). It is also likely that increasing the number of training epochs would yield higher absolute scores (see Appendix F).
These choices were made for the sake of efficiency, as our training already had to cope with the extra computing costs incurred by the alignment procedure required to learn the expert policy. Note that, in comparison, the extra cost of the realignment procedure is much smaller, as it is only paid during inference and can be parallelized on GPUs.
We would also like to point out that our systems do not match the performance of an equivalent AR decoder, a gap that remains for many NAT systems (Xiao et al., 2023). Finally, we have only reported results for one language pair (favoring domain diversity over language diversity) and would need to confirm the observed improvements on other language pairs and conditions.
References

Chunting Zhou, Jiatao Gu, and Graham Neubig. 2020. Understanding knowledge distillation in non-autoregressive machine translation. In Proceedings of the International Conference on Learning Representations.

A Model Configuration
We use a Transformer architecture with embeddings of dimension 512, feed-forward layers of size 2048, 8 attention heads, 6 encoder and 6 decoder layers, a batch size of 3,000 tokens, shared embeddings, and a dropout rate of 0.3, trained on 6 GPUs. The maximal number of additional placeholders is K_max ( §3.1). During training, we use the Adam optimizer with (β_1, β_2) = (0.9, 0.98), an inverse-sqrt scheduler, a learning rate of 5e-4, label smoothing of 0.1, 10,000 warmup updates, and float16 precision. We fixed the number of iterations at 60k. For decoding, we use iterative refinement with an empty-placeholder penalty of 3 and a maximum number of iterations of 10 (Gu et al., 2019).
The hyper-parameters of the realigner ( §C) were tuned on a subset of 1k samples extracted from the ECB training set.

B Data Analysis
Table 6 contains statistics for all 11 domains. They notably highlight the relationship between the average number of retrieved sentences during training and the ability of TM 3 -LevT to outperform TM 1 -LevT in Table 4. The domains with retrieval rates less than 1 (Epp, News, TED, Ubu) have quite broad content, meaning that training instances have fewer close matches; for these domains, TM 3 -LevT hardly ever sees the two or three examples that it needs to use at inference.

C Realignment
The realignment process is an extra inference step aiming to improve the result of the placeholder insertion stage. To motivate our approach, consider a set of retrieved sentences before placeholder insertion, where letters represent tokens, × denotes padding, and < and > respectively stand for <BOS> and <EOS>.
The output of this stage is a prediction for all pairs of consecutive tokens. This prediction takes the form of a tensor log π^plh_θ of dimensions N × (L−1) × (K_max+1), corresponding respectively to the number of retrieved sentences N, the maximum sentence length L, and the maximum number of additional placeholders K_max.
Let P (an N × (L−1) tensor) denote the argmax, e.g.

    P = [0 0 0 2 0
         0 0 0 1 0
         0 0 1 0 0]

Inserting the prescribed number of placeholders (figured by _) then yields the sequences y^cmb. This result can be far from perfect, e.g. when it fails to align repeated occurrences of the same token, whereas a preferable alignment may only require a few changes (one change consists in a modification of ±1 in some entry of P). The general goal of realignment is to improve such alignments by performing a small number of changes in P. We formalize this problem as the search for a good tradeoff between (a) the individual placeholder prediction scores, aggregated in a likelihood loss L_L, and (b) an alignment loss L_A. In its simplest form, this problem is again an optimal multi-sequence alignment problem, for which exact dynamic programming solutions are computationally intractable in our setting.
We instead develop a continuous relaxation that can be solved effectively with SGD and is also easy to parallelize on GPUs. We therefore relax the integer condition on P and assume that P_{i,j} can take continuous real values in [0, K_max]; we solve the continuous optimization problem, then turn the P_{i,j} values back into integers.
The likelihood loss aims to keep the P_{i,j} values close to the model predictions. Denoting (μ, σ) respectively the mean and variance of the model predictions, our initial version of this loss penalizes the normalized deviation of P from μ. In practice, we found that using a weighted average μ̂ and clamping the variance σ̂² both yield better realignments. To define the alignment loss, we introduce a position matrix X of dimension N × L in R+, where X_{n,i} corresponds to the (continuous) position of token y_{n,i} after inserting a real number of placeholders:

    X_{n,i} = i + Σ_{j<i} P_{n,j},

with i the number of tokens occurring before X_{n,i} and Σ_{j<i} P_{n,j} the cumulated sum of placeholders. Using X, we derive the distance tensor D of dimension N × L × N × L in R+ as D_{n,i,m,j} = |X_{n,i} − X_{m,j}|. Finally, let G be an N × L × N × L alignment graph tensor, where G_{n,i,m,j} = 1 if and only if y_{n,i} = y_{m,j}, n ≠ m, and D_{n,i,m,j} < D_max. G connects identical tokens in different sentences when their distance after placeholder insertion is at most D_max. This last condition avoids perturbations from remote tokens that coincidentally happen to be identical.
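The position matrix X and the alignment graph G can be sketched as follows; the cumulative-sum definition of X follows the text, while `d_max` and the brute-force pairwise scan are illustrative (the paper's version is tensorized on GPU).

```python
def positions(P_row):
    """Continuous token positions after inserting P_row[j] placeholders
    in gap j: X_{n,i} = i + sum_{j<i} P_{n,j}."""
    X, total = [], 0.0
    for i in range(len(P_row) + 1):
        X.append(i + total)
        if i < len(P_row):
            total += P_row[i]
    return X

def alignment_graph(seqs, P, d_max=1.5):
    """Connect identical tokens of different sentences whose distance
    after placeholder insertion is below d_max."""
    X = [positions(p) for p in P]
    edges = set()
    for n, sn in enumerate(seqs):
        for m, sm in enumerate(seqs):
            if n == m:
                continue
            for i, tn in enumerate(sn):
                for j, tm in enumerate(sm):
                    if tn == tm and abs(X[n][i] - X[m][j]) < d_max:
                        edges.add((n, i, m, j))
    return edges

g = alignment_graph([["a", "b"], ["b", "a"]], [[0], [0]])
```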
Each token y_{n,i} is associated with an individual loss, and the alignment loss L_A aggregates these values over sentences and positions. A final ingredient in our realignment model is related to the final discretization step. To avoid rounding errors, we further constrain the optimization process to deliver near-integer solutions. For this, we also include an integer constraint loss defined as

    L_int(P) = μ_t Σ_{i,j} sin²(π P_{i,j}),

where μ_t controls the scale of L_int(P). As x ↦ sin²(πx) reaches its minimum 0 for integer values, minimizing L_int(P) has the effect of enforcing a near-integer constraint on our solutions. Overall, we minimize in P the loss L = L_L(P) + L_A(P) + L_int(P), slowly increasing the scale μ_t according to a schedule with two timestamps t_0 and T, for respectively the activation of the integer constraint loss and the activation of the clamping. This optimization is performed with gradient descent directly on GPUs, with a small additional cost to the inference procedure.
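The integer-constraint term is simple enough to verify in isolation: it vanishes exactly when every entry of P is an integer and peaks halfway between integers.

```python
import math

def integer_constraint_loss(P, mu_t=1.0):
    """L_int(P) = mu_t * sum sin^2(pi * P_ij): (near-)zero when every
    entry of P is an integer, pushing the relaxed solution back to
    integer placeholder counts."""
    return mu_t * sum(math.sin(math.pi * p) ** 2 for row in P for p in row)
```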

D NP-hardness of Coverage Maximization in N-way Alignment
Given the set of possible N-way alignments, the problem of finding the one that maximizes the target coverage is NP-hard. To prove it, we reduce the NP-hard set cover problem (Garey and Johnson, 1979) to the N-way alignment coverage maximization problem.
• Set cover decision problem (A);
• N-way alignment coverage maximization decision problem (B).

A solution of (B) can be certified in polynomial time: we simply compute the cardinality of a union. Any instance of (A) can be transformed in polynomial time and space into a special instance of (B) in which all C_k = C_0 and p = |X|.

E Results for fr-en
Table 7 reports the BLEU scores for the reverse direction (fr→en), using exactly the same configuration as in Table 2. Note that, since we used the same data split (retrieving examples based on similarity in English) and since the retrieval procedure is asymmetrical, 4,749 test samples happen to have no match; these would correspond to an extra column labeled "0", which is not represented here. The reverse direction follows a similar pattern, providing further evidence of the method's effectiveness.

F Complementary Analyses
Diversity and difficulty. Results in Table 4 show that some datasets do not seem to benefit from multiple examples. This is notably the case for Europarl, News-Commentary, TED2013, and Ubuntu. We claim that this is due to the lack of retrieved examples at training time (as stated in §B), to the lack of diversity, and to the noise in fuzzy matches. To further investigate this issue, we report two scores in Table 8. The first is the increase in bag-of-words coverage of the target gained by using N = 3 instead of N = 1; the second is the increase in the noise in the examples, computed as the proportion of tokens in the examples that do not occur in the target. We observe that, in fact, low diversity is often associated with poor scores for TM 3 -LevT, and higher diversity with better performance.

Long run. All results in the main text were obtained with models trained for 60k iterations, which was enough to compare the various models while saving computational resources. For completeness, we also performed one longer training run of 300k iterations for TM 3 -LevT.

The benefits of realignment. Table 10 shows that realignment also decreases the average number of refinement steps needed to converge. These results suggest that editing is made easier with realignment.

COMET scores. We compute COMET scores (Rei et al., 2020) separately for each domain with the default wmt20-comet-da model, similarly to Table 4 (see Table 12). We observe that the basic version of TM 3 -LevT underperforms TM 1 -LevT; we also see a great variation in the scores. A possible explanation is a decline in fluency when using multiple examples, which is not captured by the precision scores computed by BLEU. The improved version, using realignment and pre-training, confirms that adding more matches is overall beneficial for MT quality.
Per-domain ablation study Table 13 details the results of our ablation study separately for each domain.
Illustration A full inference run is shown in Table 14, illustrating the benefits of considering multiple examples and realignment. Even though realignment does not change the final output here, it reduces the number of atomic edits needed to generate it, making the inference more robust.

The ablation variants are defined as follows: -delx: no extra deletion loss; -rd-del: no random deletion (β=0); -mask: no random mask (δ=0); -dum-plh: null probability to start with y^post·del = y* (α=0); -indep-align: the alignments are performed independently.

Figure 1: First decoding pass of TM-LevT, a variant of LevT augmented with Translation Memories.

Figure 2: A high-level overview of TMN-LevT's architecture. Additions w.r.t. TM-LevT are shown in a dashed box.

Figure 3: The first decoding pass in TMN-LevT.
with a complexity O(N |y*| ∏_n |y_n|). We instead implemented the following two-step heuristic approach: 1. separately compute alignment graphs between each y_n and y*, then extract the k-best 1-way alignments {E_{n,1}, ..., E_{n,k}}; this requires time O(k |y_n| |y*|) using dynamic programming.

Edges in E indicate the tokens that are preserved throughout this process: 1. deletion: ∀n, ∀i, y_{n,i} is kept only if (n, i, j) ∈ E for some j; otherwise it is deleted. The resulting sequences are {y_n^plh}_{n=1...N}. [...] = <PLH>. 4. prediction: the remaining <PLH> symbols in y^tok are replaced by the corresponding target tokens in y* at the same positions.

The expert policy π* edits examples y_1, ..., y_N into y* based on the optimal alignment (V, V*, E*).

5.1 Data and metrics
We focus on translation from English to French and consider multiple domains. This allows us to consider a wide range of scenarios, with a varying density of matching examples: our datasets include
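To make the first step of the heuristic concrete, here is a sketch of a 1-best 1-way alignment between one example y_n and the target y*, using an LCS-style dynamic program, followed by the deletion step that keeps only aligned tokens. This is an illustrative reconstruction (identifiers are ours): the actual implementation extracts k-best alignments rather than the single best one shown here.

```python
def align(y_n, y_star):
    """Return edges (i, j) aligning y_n[i] to y_star[j] via an LCS-style DP.

    Runs in O(|y_n| * |y_star|) time, matching the per-example cost of the
    heuristic's first step (for k = 1).
    """
    n, m = len(y_n), len(y_star)
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n):
        for j in range(m):
            if y_n[i] == y_star[j]:
                dp[i + 1][j + 1] = dp[i][j] + 1
            else:
                dp[i + 1][j + 1] = max(dp[i][j + 1], dp[i + 1][j])
    # Backtrace to recover one optimal set of alignment edges.
    edges = []
    i, j = n, m
    while i > 0 and j > 0:
        if y_n[i - 1] == y_star[j - 1] and dp[i][j] == dp[i - 1][j - 1] + 1:
            edges.append((i - 1, j - 1))
            i -= 1
            j -= 1
        elif dp[i - 1][j] >= dp[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return edges[::-1]

def apply_deletion(y_n, edges):
    """Deletion step: keep y_n[i] only if some edge (i, j) aligns it to y*."""
    keep = {i for i, _ in edges}
    return [tok for i, tok in enumerate(y_n) if i in keep]
```

For instance, aligning ["the", "cat", "sat"] against ["the", "dog", "sat"] keeps "the" and "sat" and deletes the unaligned token "cat".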

Table 2: BLEU/ChrF and COMET scores on the full test set. All BLEU/ChrF differences are significant (p = 0.05).

Table 2 compares two models: one trained with one TM match, the other with three. Each model is evaluated with, at most, the same number of matches seen in training. This means that TM1-LevT only uses the 1-best match, even when more examples are found. In this table, test sets test-0.4 and test-0.6 are concatenated, then partitioned between samples for which exactly 1, 2, and 3 matches are retrieved. We observe that TM3-LevT, trained with 3 examples, consistently achieves better BLEU and ChrF scores than TM1-LevT, even in the case N=1, where we only edit the closest match. The better BLEU scores are associated with a larger number of copies from the retrieved instances, which was our main goal (Table 3). Similar results for the other direction are reported in Appendix E (Table 7).

Table 3: Proportion of unigrams and bigrams from a given origin (copy vs. generation) for various models.
9 BLEU, +4.6 COMET for test-0.6). Training curves also suggest that pre-trained models converge faster. Combining pre-training with realignment yields additional gains for TM3-LevT, which then outperforms TM1-LevT in all domains and on both metrics.

Table 6: Number of samples, average number of retrieved sentences, and average sentence length after tokenization for all 11 domains.

Table 7: BLEU scores on the full test set. TM3-LevT is improved with pre-training and realignment. All BLEU differences are significant (p = 0.05); p-values from SacreBLEU paired bootstrap resampling (n = 1000).
In Table 11, we present detailed results of the unigram modified precision of LevT, TM3-LevT, and TM3-LevT+realign. Using more examples indeed increases copying (+4.4), even though it diminishes copy precision (-1.7). Again, we observe the positive effect of realignment, which amplifies the tendency of our model to copy input tokens.

Table 10: Average number of extra refinement rounds.
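The origin-split precision numbers of this kind can be reproduced with a small routine. The sketch below assumes each hypothesis token carries an origin label ("copy" or "gen") and clips counts against the reference independently for each origin, which is a simplifying assumption of ours rather than the paper's exact procedure.

```python
from collections import Counter

def precision_by_origin(hyp, origins, ref):
    """Modified unigram precision, split by token origin (copy vs. gen).

    hyp: hypothesis tokens; origins: parallel list of 'copy'/'gen' labels;
    ref: reference tokens. Counts are clipped per origin (a simplification:
    standard modified precision clips over the whole hypothesis at once).
    """
    ref_counts = Counter(ref)
    result = {}
    for origin in ("copy", "gen"):
        toks = [t for t, o in zip(hyp, origins) if o == origin]
        clipped = sum(min(c, ref_counts[t]) for t, c in Counter(toks).items())
        result[origin] = clipped / len(toks) if toks else 0.0
    return result
```

For example, if both copied tokens appear in the reference but only one of two generated tokens does, the routine returns a copy precision of 1.0 and a generation precision of 0.5, mirroring the tendency reported here for copied tokens to be more reliable.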

Table 12: Per-domain COMET scores (×100) for TMn-LevT and variants; columns correspond to the ECB, EME, Epp, GNO, JRC, KDE, News, PHP, TED, Ubu, and Wiki domains. Bold marks scores better than TM1-LevT.