Bilingual Synchronization: Restoring Translational Relationships with Editing Operations

Machine Translation (MT) is usually viewed as a one-shot process that generates the target language equivalent of some source text from scratch. We consider here a more general setting that assumes an initial target sequence, which must be transformed into a valid translation of the source, thereby restoring parallelism between source and target. For this bilingual synchronization task, we consider several architectures (both autoregressive and non-autoregressive) and training regimes, and experiment with multiple practical settings such as simulated interactive MT, translating with Translation Memory (TM) and TM cleaning. Our results suggest that one single generic edit-based system, once fine-tuned, can compare with, or even outperform, dedicated systems specifically trained for these tasks.


Introduction
Neural Machine Translation (NMT) systems have made tangible progress in recent years (Bahdanau et al., 2015; Vaswani et al., 2017), as they started to produce usable translations in production environments. In autoregressive approaches, NMT is generally viewed as a one-shot process that generates the target translation based solely on the source-side input. Recently, Non-autoregressive Machine Translation (NAT) models have proposed to perform iterative refinement decoding (Lee et al., 2018; Ghazvininejad et al., 2019; Gu et al., 2019), where translations are generated through an iterative revision process, starting with a possibly empty initial hypothesis.
This paper focuses on the revision part of the machine translation (MT) process and considers bilingual synchronization (Bi-sync), which we define as follows: given a pair of a source (f) and a target (ẽ) sentence, which may or may not be mutual translations, the task is to compute a revised version e of ẽ, such that e is an actual translation of f. This is necessary when the source side of an existing translation is edited, requiring the target to be updated to keep both sides synchronized. Bi-sync subsumes standard MT, where the synchronization starts with an empty target (ẽ = []). Other interesting cases occur when parts of the initial target can be reused, so that the synchronization only requires a few changes.
Bi-sync encompasses several tasks: synchronization is needed in interactive MT (IMT, Knowles and Koehn, 2016) and bilingual editing (Bronner et al., 2012), with ẽ the translation of a previous version of f ; in MT with lexical constraints (Hokamp and Liu, 2017), where ẽ contains target-side constraints (Susanto et al., 2020;Xu and Carpuat, 2021); in Translation Memory (TM) based approaches (Bulte and Tezcan, 2019), where ẽ is a TM match for a similar example; in automatic post-editing (APE) (do Carmo et al., 2021), where ẽ is an MT output.
We consider here several implementations of sequence-to-sequence models dedicated to these situations, contrasting an autoregressive model with a non-autoregressive approach. The former is similar to Bulte and Tezcan (2019), where the source sentence and the initial translation are concatenated as one input sequence; the latter uses the Levenshtein Transformer (LevT) of Gu et al. (2019). We also study various ways to generate appropriate training samples (f, ẽ, e). Our experiments consider several tasks, including TM cleaning, which attempts to fix and synchronize noisy segments in a parallel corpus. This setting is more difficult than Bi-sync, as many initial translations are already correct and need to be left unchanged. Our results suggest that one single AR system, once fine-tuned, can favorably compare with dedicated systems for each of these tasks. To recap, our main contributions are (a) the generalization of several tasks subsumed by a generic synchronization objective, allowing us to develop a unified perspective on otherwise unrelated subdomains of MT; (b) the design of a training procedure for a generic edit-based model; (c) an empirical validation on five settings and domains.

Generating Editing Data
We consider a general scenario where, given a pair of sentences f and ẽ, assumed to be related but not necessarily parallel, we aim to generate a target sentence e that is parallel to f. We would also like ẽ and e to be close, as ẽ is often a valid translation of a sentence f′ that is close to f. Training such models requires triplets (f, ẽ, e). While large amounts of parallel bilingual data are available for many language pairs, they are hardly ever associated with related translations ẽ (except for APE). We therefore study ways to simulate synthetic ẽ from e, while preserving large portions of e in ẽ. Since string edits can be decomposed into a sequence of three basic operations (insertions, substitutions and deletions), we design our artificial samples so that the edits from ẽ to e only involve one type of operation (Figure 1).

Insertions
We mainly follow Xiao et al. (2022) to generate initial translations ẽins for insertion by randomly deleting segments from e. For each e, we first randomly sample an integer k ∈ [1, 5], then randomly remove k non-overlapping segments from e. The length of each removed segment is also randomly sampled, with a maximum of 5 tokens. We also impose that the overall ratio of removed segments does not exceed 0.5 of e. Unlike Xiao et al. (2022), ẽins does not include any placeholders marking the positions of the removed segments. This makes ẽins a more realistic starting point, as the insertion positions are rarely known in practical settings. Our preliminary experiments also show that identifying insertion positions makes the infilling task easier than when they are unknown.
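A minimal sketch of this sampling procedure, assuming whitespace-tokenized sentences (the actual implementation may differ in how segment boundaries are drawn):

```python
import random

def make_insertion_source(e_tokens, max_segments=5, max_seg_len=5,
                          max_ratio=0.5, rng=random):
    """Simulate an initial translation ẽins by removing k non-overlapping
    segments from the reference e, keeping no placeholders."""
    n = len(e_tokens)
    if n == 0:
        return []
    k = rng.randint(1, max_segments)
    removed = [False] * n
    budget = int(n * max_ratio)  # at most half of e may be removed
    for _ in range(k):
        seg_len = rng.randint(1, max_seg_len)
        if seg_len > budget or seg_len > n:
            break
        start = rng.randrange(0, n - seg_len + 1)
        span = range(start, start + seg_len)
        if any(removed[i] for i in span):  # keep segments non-overlapping
            continue
        for i in span:
            removed[i] = True
        budget -= seg_len
    return [t for i, t in enumerate(e_tokens) if not removed[i]]
```

For example, `make_insertion_source("the quick brown fox jumps over the lazy dog".split())` yields a shortened token list in which at most half of the tokens are missing.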

Substitutions
To simulate substitutions, we apply round-trip translation with lexically constrained decoding (LCD, Post and Vilar, 2018) to generate initial translations ẽsub. Round-trip translation has already been used for the APE task by Junczys-Dowmunt and Grundkiewicz (2016). This requires two standard NMT models separately trained on parallel data, one for each direction. For each training example (f, e), we first (a) translate e into an intermediate source sentence f* using top-5 sampling (Edunov et al., 2018); (b) generate an abbreviated version ẽins using the method described above for insertions. We then translate f* using LCD, with ẽins as constraints, to obtain ẽsub. In this way, we ensure that at least half of e remains unchanged in ẽsub, while the other parts have been substituted. To increase diversity, the ẽins used to create ẽsub is sampled with a different random seed than the ẽins used for the insertion task.

Deletions
Simulating deletions requires the initial translation ẽdel to be an extension of e. We propose two strategies to generate ẽdel. The first uses a GAP insertion model as in Xiao et al. (2022), in which word segments are randomly replaced with a placeholder [gap] to generate ẽgap. The task is then to predict the missing segments based on the concatenation of f and ẽgap as input. This differs from our own insertion task, as (a) insertion positions are identified by a [gap] symbol in ẽgap and (b) generation only computes the sequence of missing segments e_seg, rather than a complete sentence.
We use GAP to generate extra segments for a pair of parallel sentences as follows. We randomly insert k ∈ [1, 5] [gap] tokens into e, concatenate it with f and use GAP to predict the extra segments, yielding the synthetic target sentence ẽdel1. This method always extends parallel sentences with additional segments on the target side. However, these segments are arbitrary and may not carry any valid semantic information, nor be syntactically correct.
We thus consider a second strategy, based on actual edit operations collected in the WikiAtomicEdits dataset (Faruqui et al., 2018), which contains edits consisting of an original segment x and the resulting segment x′, with exactly one insertion or deletion operation per example, collected from the Wikipedia edit history. This notably ensures that both versions of each utterance are syntactically correct. We treat the deletion data of WikiAtomicEdits as "reversed" insertions, and use both subsets to train a seq-to-seq wiki model (x_short → x_long), generating longer sentences from shorter ones. The wiki model is then used to expand e into ẽdel2. Compared to ẽdel1, ẽdel2 is syntactically more correct. However, it is also, by design, very close (one edit away) to e.
As both simulation methods have merits and flaws, we randomly select examples from ẽdel1 and ẽdel2 to build the final synthetic initial translations ẽdel for the deletion operation.

Copy and Translate Operations
To handle parallel sentences that do not require any changes, we add a fourth copy operation, where the initial translation ẽcp is equal to the target sentence (ẽcp = e). Hence, the data used to learn edit operations is built with triplets (f, ẽ, e) where ẽ is uniformly selected at random from ẽins, ẽsub, ẽdel and ẽcp. Finally, to maintain the capacity to perform standard MT from scratch, we also consider samples where ẽ is empty. The implementation of standard MT varies slightly across approaches, as we explain below.

Model Architectures
We implement Bi-sync with Transformer-based (Vaswani et al., 2017) autoregressive and non-autoregressive models. The former (Edit-MT) is a regular Transformer with a combined input made of the concatenation of f and ẽ; the latter (Edit-LevT) is the LevT of Gu et al. (2019).
Edit-MT In this model, ẽ is simply concatenated to f, with a special token separating the two sentences. This technique has been used, e.g., in Dabre et al. (2017) for multi-source MT or in Bulte and Tezcan (2019) for translating with a similar example. The input side of the editing training data is thus f [sep] ẽ, as shown in Figure 1 (top).
On the target side, we add a categorical prefix to indicate the type of edit(s) associated with a given training sample, as is commonly done for multi-domain or multilingual MT. For each basic edit (insertion, substitution and deletion), we use a binary tag to indicate whether the operation is required. For instance, an ẽins needing insertions would have the tags [ins] [!sub] [!del] prepended to e. Copy corresponds to all three tags set to negative. The tagging scheme provides us with various ways to perform edit-based MT: (a) we can perform inference without knowing the required edit type of ẽ by generating tags, then translations; (b) when the edits are known, we can generate translations with the desired edits by using the corresponding tags as a forced prefix; (c) inference can also output only the edit tags, thereby predicting the relation between f and ẽ. The ability to perform standard MT is preserved by training with a balanced mixture of editing data and parallel data.
The latter corresponds to an empty ẽ. For these examples, the target side does not contain any tags.
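The input/output formatting described above can be sketched as follows. This is a minimal illustration: the token spellings ([sep], [ins], [!sub], etc.) follow the paper, while the function name and the exact handling of empty ẽ are assumptions.

```python
EDIT_OPS = ("ins", "sub", "del")

def make_edit_mt_sample(f_tokens, e_init_tokens, e_tokens, ops):
    """Assemble one Edit-MT training pair (input tokens, target tokens)."""
    if not e_init_tokens:
        # Standard MT sample: empty initial translation, no tags on the target
        # (we assume the input then reduces to f alone).
        return list(f_tokens), list(e_tokens)
    src = list(f_tokens) + ["[sep]"] + list(e_init_tokens)
    # One binary tag per basic edit: [op] if required, [!op] otherwise
    tags = [f"[{op}]" if op in ops else f"[!{op}]" for op in EDIT_OPS]
    return src, tags + list(e_tokens)
```

With `ops=set()`, the target is prefixed with [!ins] [!sub] [!del], i.e. the copy case.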
Edit-LevT LevT is trained with a randomly noised version of the reference as the initial target, and decodes from an empty sentence. In Bi-sync, we instead initialize the target side with the given ẽ, both for training and inference. To perform standard MT with the same model, we train with a tunable mixture of these two strategies, where p controls the proportion of each type of sample. Taking p = 0 is equivalent to training a LevT model with only parallel data. We use p = 0.5 in our experiments, making it equivalent to the mixture of editing and parallel data used for the Edit-MT model. The value of p could be carefully designed with a schedule or curriculum to optimize the behavior of Edit-LevT, which we leave for future work. For Edit-LevT, we do not use any tags, as LevT already includes an internal mechanism to predict the edit operation(s).
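The mixing strategy for Edit-LevT training can be sketched as below; the random-dropout fallback is only a rough stand-in for LevT's actual corruption procedure, which is more involved.

```python
import random

def pick_levt_initial(e_init_tokens, e_tokens, p=0.5, noise_ratio=0.3, rng=random):
    """Choose the initial target for one Edit-LevT training example:
    with probability p, start from the given initial translation ẽ (Bi-sync);
    otherwise fall back to a noised reference, approximated here by
    randomly dropping tokens from e."""
    if rng.random() < p:
        return list(e_init_tokens)
    return [t for t in e_tokens if rng.random() > noise_ratio]
```

Setting p = 0 recovers pure parallel-data training; p = 0.5 mirrors the balanced mixture used for Edit-MT.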

Datasets and Experimental Settings
We first evaluate Edit-MT and Edit-LevT on a basic resynchronization task where ẽ is assumed to be the translation of a former version of f, and only a limited number of edits is needed to restore parallelism. We conduct experiments on WMT14 English-French data in both directions (En-Fr & Fr-En) and evaluate on two test sets. The first is an artificial derivation of the standard newstest2014 set, and the second is the small parallel sentence compression dataset of Ive and Yvon (2016).
For the original newstest2014, we generate an initial version ẽ for each test sentence and each edit operation according to the methods of Section 2.1. For deletion, we test the performance of both generation methods, resulting in four versions (Ins, Sub, Del1, Del2) of newstest2014 with 3,003 sentences each. The sentence compression dataset contains a subset of documents also selected from newstest2014, in which sentences are compressed by human annotators while remaining parallel in the two languages. We only retain utterances for which the compressed and original versions actually differ on both sides, resulting in 526 test sentences.
Both experiments use the same training data, from which we discard examples with an invalid language tag as computed by the fasttext language identification model (Bojanowski et al., 2017), yielding a training corpus of 33.9M examples. We tokenize all data using Moses and build a shared source-target vocabulary with 32K Byte Pair Encoding (BPE) units (Sennrich et al., 2016) learned with subword-nmt. Since we use both parallel and artificial editing data to train edit-based models, the total training data contains about 68M utterances.
We conduct experiments using fairseq (Ott et al., 2019). Edit-MT relies on the Transformer-base model of Vaswani et al. (2017). Model and training configurations are in Appendix A. Performance is computed with SacreBLEU (Post, 2018).

Results
We first separately evaluate the learnability of each edit operation on our synthetic newstest2014 sets. We also derive two tasks from the compression dataset: parallel sentence compression (comp) and extension (ext). For compression, the task consists of producing a compressed target sentence e_comp given the compressed source f_comp and the original target e. For extension, the model should produce e given f and the compressed target e_comp. These two tasks are respectively similar to the deletion and insertion tasks. There are slight differences, though, as (a) ẽ in these settings is always syntactically correct, and (b) the segments that are removed or inserted are selected for their lower informativeness. Therefore, these tasks are more about restoring an adequate, rather than a fluent, translation.
As mentioned in Section 2.2, the generation of translations in Edit-MT models is conditioned on predicted or oracle editing tags that are prefixed to the output: these two situations are contrasted using forced-prefix decoding with the correct tags. For the compression and extension tasks, we use the deletion and insertion tags, respectively.
Tables 2 and 3 report results for both directions, to be compared with a "do-nothing" baseline which simply copies ẽ as the output. Edit-MT is able to edit the given ẽ for all types of required edits much better than Edit-LevT. It obtains large gains over the copy baseline for insertion, substitution and deletion in both directions. When tested on the compression and extension tasks, whose edit distributions differ from those of the artificial editing data, Edit-MT still improves ẽ by 1.2-4.7 BLEU for En-Fr and 2.4-7.9 BLEU for Fr-En. By prefixing Edit-MT with the oracle editing type tags, we can further boost the performance on almost every task in both directions. Edit-LevT can also improve ẽ in most test situations, even though the gains are lower than for Edit-MT. However, due to its non-autoregressive nature, Edit-LevT obtains a decoding speedup of 2.3-3x with respect to Edit-MT when tested with the same inference batch size on the same hardware, as recommended by Helcl et al. (2022).
To better understand Edit-MT and Edit-LevT on the resynchronization task, we further analyze their performance with respect to the edit distance ∆ between ẽ and e. For the results in Table 1, we merge the test sentences of all edit types (Ins, Sub, Del1, Del2) into one test set, then break them down by the value of ∆. For both directions, prefixing the oracle editing tags for Edit-MT yields a stable improvement for almost all ∆. Edit-LevT performs similarly to Edit-MT when no edits are needed, but only starts to improve over the copy baseline when more edits are required (∆ ≥ 8).

Translating with Translation Memories
As explained above, Bi-sync encompasses example-based MT, whereby an existing similar translation retrieved from a TM is turned into an adequate translation of the source. Edit-MT actually uses the same architecture as the retrieval-based models of Bulte and Tezcan (2019) and Xu et al. (2020). In this section, we study the performance of our synchronization models in this practical scenario.

Datasets
We use the same multi-domain corpus as Xu et al. (2020), which contains 11 different domains for the En-Fr direction, collected from OPUS (Tiedemann, 2012). We search for similar translations using Fuzzy Match. The similarity between two English source sentences f and f′ is computed as:

sim(f, f′) = 1 − ED(f, f′) / max(|f|, |f′|)    (1)

where ED(f, f′) is the edit distance between f and f′, and |f| is the length of f. The intuition is that the closer f and f′ are, the more suitable ẽ will be. Xu et al. (2020) proposed an alternative where, for each similar translation, segments that were not aligned with the source are masked out. This means that the initial similar translation only contains segments that are directly related to the source input. We also reproduce this related setting, which is very similar to the insertion task.
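The fuzzy-match score can be sketched at the token level as follows; this assumes the standard normalization by the length of the longer sentence, and the actual FuzzyMatch tool may differ in implementation details.

```python
def edit_distance(a, b):
    """Token-level Levenshtein distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (x != y)))  # substitution
        prev = cur
    return prev[-1]

def fm_similarity(f, f_prime):
    """Fuzzy-match score: 1 - ED(f, f') / max(|f|, |f'|)."""
    f, f_prime = f.split(), f_prime.split()
    return 1.0 - edit_distance(f, f_prime) / max(len(f), len(f_prime))
```

A pair of identical sentences scores 1.0, and the threshold sim > 0.6 used below then keeps only reasonably close matches.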
As we are mostly interested in the editing behavior, we split the data by keeping, for each domain, 1,000 sentences with a sufficiently similar translation (sim > 0.6) in the TM as the test set. The remaining data is used for training. We use all retrieved similar translations for training and only the best similar translation for testing. This results in 4.4M parallel sentences (para), of which 2.6M examples are also associated with a similar translation (similar) or just the related segments (related). Data preprocessing is the same as in Section 3.1.

Experimental Settings
Our baselines reproduce two settings for TM-based MT: the FM setting of Bulte and Tezcan (2019) and the FM# setting of Xu et al. (2020). The former is trained using para + similar data, and the latter uses para + related. These two baselines are trained with the same configuration as in Section 3.1. We also report scores obtained by simply copying the retrieved similar translations, as in Section 3.2.
Edit-MT and Edit-LevT from Section 3 differ from FM and FM# both in task and in training domains. Hence, we consider fine-tuning (FT) our models. For Edit-MT, we use para + similar + related data and fine-tune for only 1 epoch with a learning rate of 8e−5. As we have no information about the edit operations required to change a similar translation into the reference, we set all editing tags on for similar data, and prefix the output with [ins] [sub] [del]. For the related data, we conjecture that mostly insertions are needed, as the irrelevant segments have already been removed; we thus only activate the insertion tag. For Edit-LevT, we only use similar + related data and fine-tune for 1 epoch with a learning rate of 9e−5. Our fine-tuned models can perform both translation with a similar sentence and translation with related segments.

Results and Analysis
We evaluate with BLEU, and also report TER scores in Appendix C. We reproduce in Table 4 the overall good performance of FM and FM#. Both significantly improve over the initial similar translations. The generic Edit-MT performs much worse, and does not even match the copy results. When prefixed with the editing tags (+tag), we observe small improvements (+0.8 BLEU on average), which are further increased in the related scenario (+R). FT yields a much larger boost in performance (+13.4 BLEU). This highlights the effect of the task and domain mismatches on our initial results with Edit-MT. The related setting also benefits from FT, albeit by a smaller margin (+9 BLEU). Our best overall results, using FT, are superior to FM# and close to those of FM. This has practical implications, since FM# and FM are specifically trained to transform a retrieved translation, whereas the generic Edit-MT is initially trained with artificial edits, then only slightly fine-tuned on the in-domain data. To appreciate this difference, we evaluate our models on two unseen domains (Office and ENV), neither of which is used to train FM and FM#, nor to fine-tune Edit-MT. Results in Table 6 unambiguously show that in this setting, the fine-tuned Edit-MT outperforms FM, suggesting that our edit-based model has not only adapted to the domain, but also to the task, as it can effectively perform zero-shot TM-based translation. Results obtained with the Edit-LevT model on these test sets lag far behind: even with FT on in-domain data, Edit-LevT still struggles to improve over the copy baseline.
We also perform the analysis with a breakdown by edit distance ∆, as in Section 3.2, on the multi-domain test sets. For the results in Table 5, all 11k test sentences are merged into one test set, then broken down by the value of ∆. The generic Edit-MT starts to improve the similar translation for ∆ ≥ 3. However, it is difficult for Edit-MT to detect very small changes (∆ < 3) without FT. Once fine-tuned, Edit-MT performs similarly to FM for small changes, which further confirms that Edit-MT adapts to the TM-based translation task. Edit-LevT models, however, only slightly improve over the copy baseline for similar translations requiring large changes (∆ ≥ 8). On the other hand, Edit-LevT is better than the other models at detecting ẽs that do not need any edits (∆ = 0). We also provide results for the merged test set broken down by edit operation in Appendix C.

Parallel Corpus Cleaning
Our model restores the synchronization between a pair of sentences. This is also useful for parallel TM cleaning tasks. Given a source sentence f and a possibly incorrect translation ẽ, we want to detect non-parallelism and perform appropriate fixes. We study how Edit-MT fares with this new problem on two publicly available datasets: first on the SemEval 2012 & 2013 Task 8: Cross-lingual Textual Entailment (CLTE, Negri et al., 2011, 2012, 2013), then with the OpenSubtitles corpus (Lison and Tiedemann, 2016).

Cross-lingual Textual Entailment
The CLTE task aims to identify multi-directional entailment relationships between two sentences x1 and x2 written in different languages. We evaluate on the Fr-En direction, where x1 is in French and x2 is in English. The tagging mechanism of Edit-MT (see Section 2.2) can readily be used for this classification task. Data descriptions and the slight adjustments to the tagging scheme are in Appendix D.
We treat x1 as f and x2 as ẽ to match the input format of Edit-MT, and perform zero-shot inference reusing the same Edit-MT model as in Section 3. We concatenate x1 and x2 as input, and truncate the target sequence by only taking the first three edit tags as the predicted label for the corresponding input pair, treating Edit-MT as a mere classification (CLF) model. We also slightly fine-tune Edit-MT with the 500 examples of the CLTE training data for 5 epochs with a learning rate of 8e−5. Results are reported in Table 7, together with the best scores reported in Negri et al. (2013) for the years 2012 and 2013 and the scores reported by Carpuat et al. (2017), which are the best reported performance we could find. Note that these scores may be quite weak, as pre-trained language models did not exist at that time. We tried to apply a pre-trained XLM model (Conneau and Lample, 2019) to establish stronger baselines for CLTE. However, fine-tuning XLM with only 500 sentences for 5 epochs did not outperform the reported baselines, as the fine-tuned XLM model needs to train new parameters for the linear output layer. Our Edit-MT, on the contrary, does not require any additional parameters during fine-tuning. As can be seen in Table 7, out-of-the-box Edit-MT fails to clearly detect the entailment relationships. This is not surprising, as there is a significant difference between our editing data and the CLTE test sets. For instance, the insertion initial translation ẽins is always grammatically incorrect, while all sentences in CLTE are syntactically correct. However, after a slight fine-tuning on the CLTE data, Edit-MT for both directions can quickly adapt to the task, achieving state-of-the-art performance. This again hints that Edit-MT actually learns to identify various cases of non-parallelism.
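The tag-truncation step used to read a classification label off the generated output might look like the sketch below; the mapping from tag patterns to CLTE entailment labels, detailed in Appendix D, is omitted here.

```python
def predict_relation(output_tokens, max_tags=3):
    """Keep only the leading edit tags of a generated sequence and discard
    the generated text, turning Edit-MT into a classifier over tag patterns."""
    tags = []
    for tok in output_tokens[:max_tags]:
        if tok.startswith("[") and tok.endswith("]"):
            tags.append(tok)
        else:
            break  # stop at the first non-tag token
    return tags
```

An output beginning with [!ins] [!sub] [!del] would thus be read as "no edit needed", i.e. the pair is predicted parallel.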

Fixing OpenSubtitles Corpus
We further evaluate the ability of Edit-MT to detect parallel sentences and fix noisy data. We experiment with the OpenSubtitles data (Lison and Tiedemann, 2016) for the En-Fr direction. In this corpus, the French side is translated from English, but contains noisy segments. A standard approach is to filter out noisy sentences from the training data when building systems. We aim to study whether Edit-MT can automatically identify and edit, rather than discard, noisy sentence pairs, so that training can use the full set of parallel data. We measure performance on the 10,159 segments of the En-Fr Microsoft Spoken Language Translation (MSLT) task (Federmann and Lewis, 2016), which simulates a real-world MT scenario.
The OpenSubtitles data is processed as in Section 3.1. We first use the CLF model fine-tuned on the CLTE task to predict the relation for all sentence pairs in the OpenSubtitles data. About 60% of the data is classified as parallel, meaning that no edit operation is predicted for these segments. Models trained on this 60% clean data are denoted filtered. For the remaining 40% of presumably noisy data, we reuse the Edit-MT En-Fr model of Section 3 to fix the translations, using the predicted edit tags as a prefix on the target side (see Sections 3 and 4). Models trained on the edited data are denoted fixed. As baselines, we train NMT models with either all the data (full) or just the 40% noisy data. For comparison, we also train a model using the same data size (15.8M) as the noisy subset, randomly selected from the filtered subset.
As shown in Table 8, aggressively filtering the noisy data improves over using the full training corpus (+2 BLEU for filtered) more than revising it does (+1 BLEU for filtered + fixed). The second set of results yields similar conclusions with smaller datasets: here, automatically fixing a set of initially noisy data improves the BLEU score by 7.2 points and closes half of the gap with a clean corpus of the same size. Note that these results are obtained without adaptation, simply reusing the pre-trained Edit-MT model of Section 3. This suggests that in situations where the training data is small and noisy, editing-based strategies may provide an effective alternative to filtering.

Related Work
The prediction of translations based on a source sentence and an initial translation was first explored in the context of IMT, using the left-to-right sentence completion framework proposed by Langlais et al. (2000). The proposals of Green et al. (2014); Knowles and Koehn (2016); Santy et al. (2019) explore ways to generate translations based on given prefix hints. A more general setting, enabling arbitrary insertions thanks to LCD, is studied for online IMT systems by Huang et al. (2021). Note that LCD was initially developed for other purposes, namely enforcing lexical or terminological constraints (Hokamp and Liu, 2017; Post and Vilar, 2018; Hu et al., 2019). As this approach induces large decoding overheads, recent works in this thread explore NAT techniques: Susanto et al. (2020) propose to inject lexical constraints into an edit-based LevT (Gu et al., 2019), an approach improved by Xu and Carpuat (2021) with an additional repositioning operator.
Recent attempts to revise initial translations are explored by Marie and Max (2015), who propose a touch-based scenario where users select usable translation segments, while the more questionable ones are iteratively retranslated automatically. This idea is revisited by Grangier and Auli (2018), where undesired words in the initial translation are crossed out. The authors use a dual-source encoder to represent the initial translation along with the source sentence, an approach also explored by Wang et al. (2020) in a touch-editing scenario. The text infilling task is also considered by Xiao et al. (2022), based on a single source encoder; see also Yang et al. (2021) and Lee et al. (2021) for related proposals. These studies consider a slightly different task than ours, as they only predict the missing part of the initial translation. Nevertheless, they can all be adapted to our generic Bi-sync scenario. Similar approaches have also been studied in APE. Multi-source architectures have been explored in e.g. Junczys-Dowmunt and Grundkiewicz (2018); Tebbifakhr et al. (2018); Shin and Lee (2018); Pal et al. (2018), whereas Hokamp (2017) and Lopes et al. (2019) jointly encode the source and the translation as one input. Wisniewski et al. (2015); Libovický et al. (2016); Bérard et al. (2017) focus on learning edit operations. Junczys-Dowmunt and Grundkiewicz (2016) also propose to generate APE training data with round-trip translation.
Bi-sync also encompasses TM-based methods. Gu et al. (2018) use a second encoder to represent TM matches, an idea extended with a more compact representation of TM matches by Xia et al. (2019). As explained above, Bulte and Tezcan (2019) use a single encoder, concatenating TM segments with the source. Xu et al. (2020) further add a second embedding feature indicating related segments in TM matches, and Pham et al. (2020) propose to simultaneously consider the source and target sides of retrieved TMs. Retrieval-based MT is also explored by He et al. (2021); Khandelwal et al. (2021); Cai et al. (2021), who try to make the performance gain less dependent on the quality of the retrieved TM matches, or to enforce a tighter coupling between TM matches and translations.

Conclusion
This work introduced Bi-sync, the task of generating translations of a source sentence by editing a related target sentence. We have proposed various ways to create the artificial initial translations for different editing types that are needed for training. We have explored both autoregressive and non-autoregressive architectures, observing experimentally that our autoregressive Edit-MT model trained with artificial triplets performs bilingual resynchronization in several real-world scenarios. Edit-MT can also be quickly adapted to retrieval-based MT tasks, where it compares favorably to dedicated models. Finally, Edit-MT can also fix TMs by detecting parallel sentences and correcting imperfect translations without adaptation. Another application that we wish to explore is APE.
Our NAT approach, Edit-LevT, lags behind Edit-MT. In the future, we would like to explore more NAT systems, which are computationally faster, and improve their performance. We intend to consider training curricula and to modify the LevT model to better fit the Bi-sync task. We would also like to study ways to reduce the cost of fully re-decoding the input sequence, especially when small changes, which need to be reproduced in the target, are iteratively applied to the source sentence.

Limitations
The generation of editing data for each type of edit operation requires substantial effort and resources. Generating the data for substitutions via round-trip translation with LCD requires two separately trained NMT models. The computational cost of LCD is very high compared to regular beam search, therefore consuming many computational resources. The generation of editing data for deletions also requires one separately trained model for each method, with a complete decoding of the entire training corpus. Even though our data generation procedure is effective, the generation process may not be environmentally friendly. Due to computational limits, we were not able to conduct experiments on other language pairs, nor on tasks such as APE, for which large public datasets are available for language pairs other than En-Fr.
As we decompose the edits from an initial translation to the reference into basic operations (insertion, substitution and deletion), the generic Edit-MT and Edit-LevT models mostly perform one type of edit at a time. It might be worth studying how to combine several edit types into a single generated example, in order to approach more realistic scenarios for the generic models.
We mainly measured our results with BLEU, with some additional scores in TER. However, other metrics like COMET (Rei et al., 2020) could also be of interest: as pointed out by Helcl et al. (2022), BLEU may be less appropriate than COMET for measuring the validity of translations produced by NAT models.

A Edit-based Model Configurations
We conduct our experiments with fairseq (Ott et al., 2019). Edit-MT relies on the Transformer-base model of Vaswani et al. (2017). We use a hidden size of 512 and a feed-forward size of 2,048. We optimize with Adam, using a maximum learning rate of 0.0007, an inverse square root decay schedule, and 4,000 warmup steps. We also tie all input and output embedding matrices (Press and Wolf, 2017; Inan et al., 2017). Edit-MT is trained with mixed precision and a batch size of 8,192 tokens on 4 V100 GPUs for 300k iterations. We save a checkpoint every 3,000 iterations and average the last 10 checkpoints for inference. For Edit-LevT, we follow Gu et al. (2019), using a maximum learning rate of 0.0005 with 10,000 warmup steps and a larger batch size of 16,384 tokens. For inference, we set a maximum of 10 decoding rounds.

Our experiments use the same multi-domain corpus as Xu et al. (2020). This corpus covers 11 domains for the En-Fr direction, collected from OPUS (Tiedemann, 2012): documents from the European Central Bank (ECB); documents from the European Medicines Agency (EMEA); proceedings of the European Parliament (Epps); legislative texts of the European Union (JRC); News Commentaries (News); TED talk subtitles (TED); parallel sentences extracted from Wikipedia (Wiki); localization files (GNOME, KDE and Ubuntu); and manuals (PHP). All these data were deduplicated prior to training. To evaluate the ability of our models to actually make use of TMs instead of memorizing training examples, we also test on two unseen domains: OpenOffice from OPUS and the PANACEA environment corpus (ENV). We follow Xu et al. (2020) and retrieve the top 3 similar translations with a Fuzzy Match similarity score greater than 0.6, as computed by Equation (1) on the source side, excluding exact matches. Note that the ratio of sentences with at least one similar translation varies greatly across domains, as shown in Table 9. When reproducing the related setting, Xu et al. (2020) used a placeholder token to mark the positions where segments are deleted. As discussed in Section 2.1, our models do not include such information; we therefore do not use placeholders for the related data.
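Equation (1) is not reproduced in this appendix. As a minimal sketch, Fuzzy Match similarity is commonly defined as one minus the word-level edit distance normalized by the length of the longer sentence; the exact formulation used in the paper may differ, so the function below is an illustrative assumption rather than a reimplementation.

```python
def edit_distance(a, b):
    # Word-level Levenshtein distance via dynamic programming.
    m, n = len(a), len(b)
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        cur = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            cur[j] = min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + cost)
        prev = cur
    return prev[n]

def fuzzy_match_score(src, tm_src):
    # Similarity in [0, 1]: 1 - normalized edit distance (assumed formulation).
    a, b = src.split(), tm_src.split()
    if not a and not b:
        return 1.0
    return 1.0 - edit_distance(a, b) / max(len(a), len(b))

# Retrieval would keep TM entries with score > 0.6, discarding exact matches
# (score == 1.0), mirroring the threshold described above.
score = fuzzy_match_score("the cat sat on the mat", "the cat sat on a mat")
```

With this definition, a single-word substitution in a six-word sentence yields a score of about 0.83, comfortably above the 0.6 retrieval threshold.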

C Additional Results
Table 10 reports results on the multi-domain test sets for the task of translating with TMs, measured in TER as computed by SacreBLEU (Post, 2018). The TER results show that even the generic Edit-MT model actually identifies useful edits, as we see improvements with respect to the copy baseline.
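For readers unfamiliar with the metric, TER counts the minimum number of edits (insertions, deletions, substitutions, and phrase shifts) needed to turn a hypothesis into the reference, normalized by reference length; lower is better. The sketch below is a deliberately simplified variant that omits shift operations, so its values only approximate those reported by SacreBLEU.

```python
def edit_distance(hyp, ref):
    # Word-level Levenshtein distance via dynamic programming.
    m, n = len(hyp), len(ref)
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        cur = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if hyp[i - 1] == ref[j - 1] else 1
            cur[j] = min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + cost)
        prev = cur
    return prev[n]

def simple_ter(hypothesis, reference):
    # Simplified TER: edits turning the hypothesis into the reference,
    # normalized by reference length (shift operations are omitted here).
    hyp, ref = hypothesis.split(), reference.split()
    return edit_distance(hyp, ref) / max(len(ref), 1)

# 0.0 means the hypothesis already matches the reference; one missing word
# out of a four-word reference gives 0.25.
score = simple_ter("the cat sat", "the cat sat down")
```

This makes concrete why copying a similar translation gives a non-trivial TER baseline: any tokens shared with the reference already count as "no edit needed".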
Table 11 reports BLEU scores on the aggregate test set, broken down by edit operation. We observe that the generic Edit-MT struggles when substitutions are needed. Fine-tuning (FT) vastly improves the ability to substitute and delete from ẽ, whether separately or in combination. The fine-tuned Edit-MT even outperforms FM when only deletions are required.

D Details for Parallel Corpus Cleaning
In the CLTE task, the goal is to identify multidirectional entailment relationships between two sentences x1 and x2, written in different languages. Each (x1, x2) pair in the dataset is annotated with one of the following relations: Bidirectional (x1 ⇔ x2): the two fragments entail each

Table 1 :
BLEU scores for all edit types (Ins, Sub, Del1, Del2) broken down by the edit distance ∆ between ẽ and e, for both En-Fr and Fr-En. Each column represents a range of distances. N denotes the number of sentences in each group. All is computed by concatenating all test sentences.

Table 2 :
BLEU scores for Edit-MT and Edit-LevT on resynchronization tasks for En-Fr. Deletions are evaluated separately for the two generation methods (Del1 and Del2). +tag refers to decoding with the oracle tag as a forced prefix. Best performance is in bold.

Table 3 :
BLEU scores for Edit-MT and Edit-LevT models on resynchronization tasks for Fr-En.

Table 4 :
BLEU scores for the multi-domain test sets. All is computed by concatenating the test sets from all domains, with 11k sentences in total. Copy refers to copying the similar translation into the output. +R implies using the related segments instead of a full initial sentence for inference. Best performance in each block is in bold.

Table 5 :
BLEU scores for the multi-domain test sets broken down by the edit distance ∆ between ẽ and e. Each column represents a range of distances. N denotes the number of sentences in each group.

Table 6 :
BLEU scores on unseen domains.

Table 7 :
Accuracy scores on the SemEval CLTE tasks. FT denotes Edit-MT fine-tuned for classification.

B.1 Datasets and Processing Details

Table 9 :
Data used for the experiments of Section 4.2. FM ratio is the ratio of sentences with at least one matched similar translation; FM train is the actual number of examples augmented with a similar translation used for training, after setting aside 1,000 test sentences for each domain. Each training sentence is matched with up to 3 similar translations.