Synthetic Pre-Training Tasks for Neural Machine Translation

Pre-training models with large crawled corpora can lead to issues such as toxicity and bias, as well as copyright and privacy concerns. A promising way of alleviating such concerns is to conduct pre-training with synthetic tasks and data, since no real-world information is ingested by the model. Our goal in this paper is to understand the factors that contribute to the effectiveness of pre-training models when using synthetic resources, particularly in the context of neural machine translation. We propose several novel approaches to pre-training translation models that involve different levels of lexical and structural knowledge, including: 1) generating obfuscated data from a large parallel corpus, 2) concatenating phrase pairs extracted from a small word-aligned corpus, and 3) generating synthetic parallel data without real human language corpora. Our experiments on multiple language pairs reveal that pre-training benefits can be realized even with high levels of obfuscation or purely synthetic parallel data. We hope the findings from our comprehensive empirical analysis will shed light on what matters for NMT pre-training, as well as pave the way for the development of more efficient and less toxic models.


Introduction and Motivation
Neural Machine Translation (NMT) models depend on large quantities of aligned training data (Aharoni et al., 2019; Fan et al., 2021; NLLB Team et al., 2022). For many language pairs of interest, however, high-quality parallel data is either unavailable or exists only in limited quantities. Training robust NMT systems with such limited data can be a significant challenge.
Even for high-resource language pairs, parallel data can be noisy and frequently contains toxic speech or biased language. Such problems are particularly acute for comparable corpora crawled automatically from the web (Kreutzer et al., 2022), since such data can cause catastrophic mistranslations (Costa-jussà et al., 2022) or hallucinated toxicity. It is preferable to avoid exposing the model to such data in order to prevent accidental generation of offensive content or egregiously embarrassing translations. Crawled data can also present problematic copyright, attribution, and privacy issues. As an example, the JW300 corpus of Jehovah's Witnesses publications (Agić and Vulić, 2019) was recently withdrawn due to a copyright infringement claim.

Figure 1: A comparison of the extent to which the synthetic data generation methods described in Section 3 encode lexical and/or structural translation knowledge. The vertical axis compares methods with respect to lexical knowledge; the horizontal axis compares structural knowledge. BLEU scores correspond to the Indonesian-to-English translation task described in Section 4.
Our primary motivation is to investigate how knowledge transfer from NMT pre-training can help to avoid or minimize the data issues described above. We study the impact of pre-training and transfer learning on translation tasks by comparing various procedural approaches to synthetic data generation. Each approach has varying degrees of inherited or artificially constructed lexical and structural translation knowledge. The degree to which each method encodes lexical and/or structural translation knowledge is plotted in abstract form in Figure 1. We describe each of our synthetic data generation methods in Section 3.
Our first approach (§3.1) studies the extent to which the transfer benefits of regular pre-training can be realized when using obfuscated or encrypted data. Our obfuscated corpus is derived from real parallel data by mapping the original words to a vocabulary of 'nonsense' tokens. Experiments on six different language pairs show that obfuscated pre-training is able to capture much of the transferable knowledge: pre-training with an obfuscation ratio as high as 75% is still able to achieve BLEU scores close to those obtained by a model pre-trained on the original un-obfuscated parallel data.
Our second approach (§3.2) seeks to maximize the benefit that can be derived from a specific limited quantity of fine-tuning data. We do this by pre-training on newly constructed artificial sentence pairs synthesized directly from the fine-tuning corpus. The synthetic sentence pairs are created by concatenating randomly sampled aligned phrase pairs extracted from the fine-tuning corpus. Although the sentence-level fluency and grammaticality of sentences constructed using this technique are both quite poor, they do retain word- and phrase-level correspondences and local reordering information that can greatly improve translation quality and robustness compared to models trained using only the original fine-tuning data.
Our third approach (§3.3) explores the pre-training impact of important translation phenomena such as alignments and reordering. We pre-train models on procedurally generated synthetic parallel data that does not derive from any real human language corpus. We design three simple synthetic sequence-to-sequence translation tasks and associated data sets. Since our data is procedurally generated, problems of toxicity, attribution, and copyright can be avoided. We evaluate the effectiveness of pre-training and transfer for our synthetic tasks in the context of low-resource NMT. Our results show that, to a surprising degree, the transfer benefits of pre-training can be realized even with purely synthetic tasks and data. Our analysis shows that structure, in the form of aligned sub-trees, matters in synthetic pre-training for NMT.
We empirically evaluate the impact of each of our proposed synthetic pre-training methods in low-resource MT settings (§4), followed by a discussion and analysis explaining our insights into what makes for a good pre-trained model (§5). We also consider the question of model toxicity. We measure the extent of hallucinated toxicity in each synthetic data generation method, showing that synthetic methods can result in substantially reduced toxicity compared to models pre-trained on real parallel corpora.
The primary contributions of our paper are as follows: (i) we propose several novel synthetic pre-training tasks that encode varying degrees of structural and lexical knowledge, in order to gain insights into what makes for a good pre-trained NMT model; (ii) we conduct a comprehensive empirical evaluation of knowledge transfer in NMT from synthetic data pre-training, considering metrics of both translation quality and toxicity; and (iii) we demonstrate that synthetic data is a promising stepping stone towards relieving the data burden in low-resource translation and building more accurate and trustworthy NMT systems.

Related Work
Transferring knowledge from pre-trained language models (Devlin et al., 2018; Raffel et al., 2019; Brown et al., 2020) is a common technique for ensuring robust NLP downstream task performance. Early work by Zoph et al. (2016) explored transfer learning for NMT from a model pre-trained on a single language pair. More recently, methods that transfer from large-scale multilingual pre-trained models (Conneau et al., 2019; Liu et al., 2020; Goyal et al., 2022; NLLB Team et al., 2022) have achieved improved translation performance across a wide range of language pairs. Aji et al. (2020) conducted a study on pre-training and transfer for low-resource NMT. These works depend on real human language for pre-training and therefore inherit data issues such as toxicity and bias. In contrast, our work studies NMT pre-training and transfer from synthetic data based on 'nonsense' words.
Only a few methods have addressed the problem of pre-training from synthetic data in NLP. Krishna et al. (2021) proposed pre-training for summarization using synthetic article and summary pairs derived from manually curated tasks and a vocabulary of nonsense symbols. Sinha et al. (2021) have shown that masked language model pre-training with limited word-order information can be almost as effective as regular pre-training. Chiang and Lee (2020, 2021) show that non-human language data and artificial datasets (e.g. nested sequences of parentheses) can still demonstrate knowledge transfer to downstream NLP tasks. Wu et al. (2022) compare the effect of pre-training on many simple synthetic tasks. Our work in this paper represents the first empirical evaluation of synthetic pre-training for neural machine translation. To the best of our knowledge, our proposed synthetic tasks have not been explored in previous work.
The quality of a pre-trained model should not be measured purely by performance; we should also consider trustworthiness (He et al., 2022; Xu et al., 2022; He et al., 2021). Recent works have noted that translation systems pre-trained on web-scale corpora are prone to produce toxic (Costa-jussà et al., 2022) or biased outputs (Prates et al., 2020; Cho et al., 2021; Costa-jussà et al., 2020), and/or present privacy issues (Prates et al., 2020; Kamocki and O'Regan, 2016), all of which reduce user trust. Bias mitigation for NMT has been well investigated, while privacy and toxicity issues for translation are still not extensively explored (Costa-jussà et al., 2022). Wang et al. (2021) propose federated neural machine translation to protect against privacy risks such as commercial leakage or copyright violations. Costa-jussà et al. (2022) mitigate toxicity by filtering training data that matches pre-defined multilingual toxic word lists.
Synthetic Pre-Training for NMT

Pre-training followed by fine-tuning is a common approach to training robust NMT models (Conneau et al., 2019; Liu et al., 2020). Our motivation is to understand the extent to which the transfer benefits of pre-training can be replicated using synthetic tasks and data. In this section, we describe three approaches to the programmatic generation of synthetic data: (i) pre-training with obfuscated parallel data that implicitly preserves certain language properties such as distributional frequencies, (ii) pre-training with synthetic data created by concatenating aligned phrases, and (iii) pre-training with synthetic tasks designed to encourage transfer learning of important translation properties such as long-distance reordering.

Pre-Training on Obfuscated Parallel Data
In order to gain insight into what makes a good pre-trained model, we design an obfuscated pre-training experiment in which the model learns to translate obfuscated source sequences to obfuscated target sequences. The synthetic training data for this experiment is created by obfuscating words in the original parallel data. We define separate 1-to-1 nonsense token vocabulary mappings for the sets of all words that occur in the source and target sides of the data: each source word s_i and target word t_j has a corresponding obfuscated nonsense source token O(s_i) and target token O(t_j). The synthetic pre-training corpus is created by replacing, with probability R, each source and target word with its corresponding obfuscated nonsense token. R thus determines the proportion of obfuscated tokens, allowing us to evaluate the extent to which pre-training knowledge transfer occurs with different obfuscation ratios. This method of obfuscation can be viewed as a trivial form of encrypted training. Although the original word identities are obscured, a great deal of useful information such as distributional frequencies, word order, dependency relations, alignments, and grammatical structure remains implicit in the obfuscated data. For example, at R = 0.25 a quarter of the tokens in a German-English parallel sentence pair are replaced with nonsense tokens, while at R = 1.00 every token is obfuscated.

Pre-Training by Concatenating Aligned Phrases

We first extract a collection of aligned phrases P using the standard recipe implemented in the Moses SMT Toolkit (Koehn et al., 2007). The accuracy of the aligned phrases depends on the size and quality of the parallel data: we target low-resource MT and assume there is only a limited quantity of parallel data available. We generate synthetic parallel sentence pairs by first sampling a normally distributed number of phrases, then sampling each phrase uniformly at random from P.
The source and target sentences thus consist of concatenated source and target phrases. The word order within each sampled phrase is fluent, and local reordering may also be captured. The boundaries between phrases, however, typically do not respect natural word order or grammar. In spite of these limitations, we show in Section 4.3 that this simple method of data augmentation can significantly improve the quality of an NMT model when training data is limited. An example Indonesian-to-English synthetic sentence pair thus consists of fluent aligned phrases joined at boundaries that ignore sentence-level grammar.

Pre-Training on Purely Synthetic Tasks

In this section, we define three completely synthetic task variants that can be used for NMT pre-training: (1) the identity operation, (2) case-mapping, and (3) permuted binary trees. All three tasks are based on a procedural data generation model and can thus be used to generate arbitrary quantities of synthetic data. Procedural generation of synthetic parallel sentence pairs allows for complete control over the alignments, length distribution, token frequency distribution, and level of noise in the data.
All three synthetic tasks are based on a 1-to-1 paired dictionary of source and target synthetic tokens: S for source and T for target. We define a pairwise mapping between the two vocabularies such that each synthetic source token S_i is paired with a corresponding synthetic target token T_i for each i ∈ {1, . . . , N}, where N is the size of the paired vocabulary. In the examples below, the source vocabulary consists of all 26^3 = 17,576 three-character synthetic tokens that can be created using the lowercase English letters {a, . . . , z}.

Synthetic Task 1: Identity Operation
The simplest of the pre-training tasks we consider is the identity operation, which has been previously proposed by Wu et al. (2022) as a synthetic task for language model pre-training. For this task, the source and target sentences are identical. We include it not because we believe it to be in any way a proxy for the true translation task, but to serve as the simplest possible baseline sequence-to-sequence synthetic task. We generate parallel sentence pairs by first sampling a sentence length L from a normal distribution. Each source token s_i for i = 1 . . . L is sampled uniformly from the source vocabulary S. The target sentence is simply a copy of the source:

src: cea qne jda rnu jkq ozf dke kzl hpo
trg: cea qne jda rnu jkq ozf dke kzl hpo
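The generation procedure above can be sketched in a few lines of Python. The mean and standard deviation of the length distribution are illustrative assumptions, since the paper does not specify them here.

```python
import itertools
import random
import string

def make_vocab():
    # All 26^3 = 17,576 three-letter synthetic tokens: "aaa", "aab", ...
    return ["".join(t) for t in itertools.product(string.ascii_lowercase, repeat=3)]

def identity_pair(vocab, mean_len=10.0, std_len=3.0, rng=random):
    # Sample a sentence length from a normal distribution (clamped to >= 1),
    # then draw each token uniformly from the synthetic vocabulary.
    length = max(1, round(rng.gauss(mean_len, std_len)))
    src = [rng.choice(vocab) for _ in range(length)]
    # The target is an exact copy of the source.
    return " ".join(src), " ".join(src)
```

Generating a pre-training corpus then reduces to calling `identity_pair` the desired number of times.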

Synthetic Task 2: Case-Mapping
Our second pre-training task defines a case-mapping operation. Each synthetic parallel sentence pair consists of the same sequence of tokens, but the source sentence is lowercase and the target sentence is uppercase. We also design an extension of this task that includes insertions and deletions. Source and target tokens can be deleted with fixed probability d_s (for source) and d_t (for target). Random insertions and deletions are added to avoid having identical source and target lengths for every sentence pair, which might entrench the tendency of the model to mimic such behavior even at the fine-tuning stage, where it is likely inappropriate. From the perspective of the translation task, a sentence pair with a missing target token corresponds to a deletion, while a missing source token corresponds to an insertion. The following example shows a parallel sentence pair for the case-mapping task with fixed source and target deletion probabilities d_s = d_t = 0.15:

src: qdo zwj iub uxj pls nsn igk mrz ojw
trg: QDO ZWJ IUB KWP UXJ PLS NSN IGK MRZ OJW
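A minimal sketch of the extended case-mapping task, assuming the token sequence has already been sampled as in the identity task and that each side drops tokens independently:

```python
import random

def case_mapping_pair(tokens, d_src=0.15, d_tgt=0.15, rng=random):
    # Source is lowercase, target is uppercase. Each side independently
    # drops tokens with probability d_src / d_tgt, so a token missing only
    # from the target acts as a deletion and one missing only from the
    # source acts as an insertion.
    src = [t for t in tokens if rng.random() >= d_src]
    tgt = [t.upper() for t in tokens if rng.random() >= d_tgt]
    return " ".join(src), " ".join(tgt)
```

With both deletion probabilities set to zero, the task reduces to a pure lowercase-to-uppercase mapping.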

Synthetic Task 3: Permuted Trees
The third of our synthetic pre-training tasks is designed to reflect some aspects of the reordering process that occurs during natural language translation. We first generate random sentences with normally distributed lengths and uniformly distributed synthetic tokens, as for tasks 1 and 2. We then induce an artificial binary tree over the source sentence by picking a random point at which to split the sentence, and recursively repeating this process for the left and right sub-strings. The resulting binary tree structure allows us to generate synthetic parallel data with reordering that preserves the alignment of contiguous source-to-target token spans. The target tree is generated as a permutation of the source tree: we randomly swap left and right sub-trees with some fixed probability r. Generating synthetic sentence pairs in this way implies the existence of lexicalised synchronous context-free grammar (SCFG) rules (Chiang, 2007) that could be used to generate the sentence pair as a parallel derivation. During pre-training, only the source and target synthetic token sequences are seen by the model; the underlying tree structure is not. In one example pair, the source token 'ktp' is reordered with respect to the sub-tree containing the tokens 'hme nmc'; Figure 2 shows the token-level alignment and reordering operations encoded by this parallel sentence pair.
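The permuted-tree generation can be sketched recursively. Choosing the split point uniformly at random among interior positions is our assumption, as the paper does not specify the split distribution:

```python
import random

def permuted_tree_pair(tokens, r=0.15, rng=random):
    """Build a random binary tree over `tokens`; emit a target in which
    left and right sub-trees are swapped with probability r at every
    internal node, preserving contiguous-span alignments."""
    def permute(span):
        if len(span) <= 1:
            return list(span)
        split = rng.randint(1, len(span) - 1)      # random split point
        left, right = permute(span[:split]), permute(span[split:])
        if rng.random() < r:                       # swap sub-trees
            left, right = right, left
        return left + right
    return " ".join(tokens), " ".join(permute(tokens))
```

With r = 0 the target equals the source; with r = 1 every node swaps, which mirrors the tree and reverses the token sequence.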

Experimental Framework
We evaluate our synthetic pre-training data generation methods for NMT using both English-centric and non-English-centric language pairs.

Experiment Setup
English-Centric Language Pairs. For English-centric translation directions, we use fine-tuning data sets similar to Aji et al. (2020). For German-English, we use the official data from the WMT 2014 News Translation Task. For Myanmar-English, the fine-tuning data consists of 18.0k parallel sentence pairs in the news domain collected for the Asian Language Treebank (ALT) project (Ding et al., 2018). We use the original train, dev, and test split. For Indonesian-English, we use a filtered set of 24.6k parallel sentence pairs from the IDENTIC v1.0 corpus (Larasati, 2012), which covers various genres. We randomly divide the original corpus into distinct train (90%), dev (5%), and test (5%) sets. For Turkish-English, we use data from the WMT 2017 News Translation Task (Yepes et al., 2017). The training set includes 207.7k parallel sentence pairs. We use the WMT newsdev2016 set for validation, and report results on newstest2017.

Non-English-Centric Language Pairs. For non-English-centric directions, we simulate low-resource translation conditions by sampling data from OPUS NLP (Tiedemann, 2012). The non-English-centric language pairs we evaluate are as follows: Indonesian-Myanmar, Indonesian-Turkish, Indonesian-Tagalog, Myanmar-Turkish, Myanmar-Tagalog, Tagalog-Turkish, German-Indonesian, and German-Myanmar. For each pair, we simulate low-resource conditions by creating fine-tuning sets of size 10k, 25k, 50k, and 100k via sampling from the set of all parallel corpora for that language pair on OPUS NLP. Minimal filtering is applied to our parallel data sets: we remove duplicates, discard sentences with extreme length ratios, and keep only sentence pairs for which the fasttext (Joulin et al., 2016) language ID matches the stated source and target.
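The minimal filtering described above can be sketched as follows. Here `lang_id` is a hypothetical callable standing in for a fasttext language-ID model, and the length-ratio threshold is an illustrative assumption:

```python
def filter_pairs(pairs, max_ratio=3.0, lang_id=None, src_lang="id", tgt_lang="en"):
    """Minimal parallel-corpus filtering: de-duplicate, drop pairs with
    extreme token-length ratios, and optionally keep only pairs whose
    predicted language IDs match the stated source and target."""
    seen, kept = set(), []
    for src, tgt in pairs:
        if (src, tgt) in seen:          # remove exact duplicates
            continue
        seen.add((src, tgt))
        ls, lt = len(src.split()), len(tgt.split())
        if ls == 0 or lt == 0 or max(ls, lt) / min(ls, lt) > max_ratio:
            continue                    # extreme length ratio
        if lang_id and (lang_id(src) != src_lang or lang_id(tgt) != tgt_lang):
            continue                    # language-ID mismatch
        kept.append((src, tgt))
    return kept
```

In practice the language-ID step would load a real fasttext model; the sketch leaves it pluggable.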
Evaluation. Following the evaluation setting of large-scale multilingual models such as FLORES-101 (Goyal et al., 2022), we score our translation hypotheses using sentencepiece BLEU (Papineni et al., 2002) (spBLEU). This avoids the need for custom post-processing for individual languages with unusual scripts and/or complex morphology such as Burmese.

Model Training Strategy
Our experiments consist of a pre-training stage followed by a fine-tuning stage. We use the transformer sequence-to-sequence 'base' model architecture (Vaswani et al., 2017) for all translation experiments. Since our goal is to gain insight into the relative importance of various aspects of synthetic pre-training, our baseline models are created by fine-tuning randomly initialized models using only the downstream task parallel data.
We use fairseq (Ott et al., 2019) to train our models with the Adam (Kingma and Ba, 2014) optimizer. We reset the learning rate scheduler and optimizer before starting the fine-tuning stage. Pre-training and fine-tuning continue until the BLEU score on the validation set converges. Further implementation details can be found in Appendix B.

Pre-training with Obfuscated Data
Following previous work that showed German-to-English to be a good pre-training direction for several language pairs (Aji et al., 2020), we also use German-to-English (de-en) for pre-training and randomly sample two million pairs from its training corpus to use as obfuscated parallel data. We vary the obfuscation ratio R from 0% to 100% in 25% increments. After pre-training, we fine-tune the models on the real-world parallel training corpus (described in Section 4.1) for each downstream language pair. We also investigate the scaling effect of different fine-tuning set sizes and show the results in Appendix A.1.
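As a concrete illustration, the obfuscation procedure of §3.1 might be implemented as below. The `src_0000`-style nonsense-token format is our own assumption, since the paper does not specify how the nonsense vocabulary is constructed:

```python
import random

def obfuscate_corpus(pairs, ratio, rng=random):
    """Replace each word with its 'nonsense' token with probability
    `ratio`, using a separate deterministic 1-to-1 mapping per side."""
    vocab = [set(), set()]
    for pair in pairs:
        for side, sent in enumerate(pair):
            vocab[side].update(sent.split())
    # Hypothetical nonsense-token format: "src_0000", "tgt_0001", ...
    maps = [{w: f"{tag}_{i:04d}" for i, w in enumerate(sorted(words))}
            for tag, words in zip(("src", "tgt"), vocab)]
    out = []
    for src, tgt in pairs:
        obf = tuple(
            " ".join(m[w] if rng.random() < ratio else w for w in sent.split())
            for sent, m in zip((src, tgt), maps)
        )
        out.append(obf)
    return out
```

Because the mapping is 1-to-1 and deterministic, distributional frequencies, word order, and alignments survive obfuscation even at ratio 1.0.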
We report spBLEU scores on the test set for each language pair in Figure 3. We find that, surprisingly, even when as much as 75% of the pre-training data is obfuscated, the models still achieve spBLEU scores comparable to those of real-world pre-trained models (i.e., those with 0% obfuscation). Additionally, most of the models pre-trained on obfuscated data performed better than those trained from scratch on real-world fine-tuning data, even when the pre-training data was 100% obfuscated (e.g., 100% in id-en, my-en, and my-tl). This suggests that a small proportion of real-world data can provide the majority of the benefits of large-scale regular pre-training, implying a promising research direction for efficient pre-training and improving low-resource NMT.

Pre-training with Phrase Concatenation
The translation decoding results in Table 1 show substantial transfer learning benefits from pre-training with 2M sentence pairs of synthetic data generated by concatenating uniformly sampled aligned phrase pairs (phrase-cat). Compared to a model with no pre-training, i.e. one that trains from random initialization using only the fine-tuning data (random-init), we observe large gains of up to +9.9 spBLEU for language pairs with less than 25k pairs of fine-tuning data (my↔en and id↔en). The gains of +1.4 to +2.1 for tr↔en are smaller: this pair has more fine-tuning data (>200k pairs), so the improved coverage and robustness from synthetic pre-training is less critical for good performance. It is important to note that this method does not utilize any additional real parallel or monolingual data, but instead derives new data directly from the existing fine-tuning corpus. Our synthetic pre-training corpus, although unnatural at the sentence level, contains many phrase-level alignments and much local reordering information, which reinforces the translation knowledge captured by the model. Any destructive effect of presenting sentence pairs with unnatural word order or bad grammar to the model during pre-training can be rectified in the fine-tuning stage by showing the model the original fluent source and target sentences.
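The phrase-cat generation procedure of §3.2 can be sketched as follows. The phrase table would come from a tool such as Moses; the mean number of phrases per synthetic sentence is an illustrative assumption:

```python
import random

def phrase_concat_pair(phrase_table, mean_phrases=4.0, std=1.5, rng=random):
    # phrase_table: list of (src_phrase, tgt_phrase) pairs, e.g. extracted
    # with the Moses recipe. Sample a number of phrases from a normal
    # distribution, then draw each phrase uniformly at random and
    # concatenate the source and target sides in the same order.
    n = max(1, round(rng.gauss(mean_phrases, std)))
    sampled = [rng.choice(phrase_table) for _ in range(n)]
    src = " ".join(p[0] for p in sampled)
    tgt = " ".join(p[1] for p in sampled)
    return src, tgt
```

Word order is fluent inside each phrase but arbitrary across phrase boundaries, matching the behavior described above.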

Pre-Training with Synthetic Data
We pre-train transformer (Vaswani et al., 2017) models using 2M sentence pairs of synthetic parallel data to match the data size used in our obfuscation experiments. We further explore the effect of scaling the synthetic pre-training data size in Appendix A.4. Separate synthetic training sets were generated for each of the three task variants described in Section 3.3. Additional sets of 4,000 synthetic pairs were generated as validation data. Each pre-trained model is subsequently fine-tuned with real parallel data for a specific language pair: my↔en, id↔en, and tr↔en. In Table 1, we report sentencepiece BLEU (spBLEU) (Goyal et al., 2022) scores for our three synthetic pre-training task variants. For comparison purposes, we also show the scores obtained without pre-training, i.e. from a randomly initialized model trained on only the fine-tuning data.
Our first observation is that synthetic pre-training with the identity operation task (§3.3.1) does not perform well. For all three language pairs it is slightly worse than simply fine-tuning from a randomly initialized model. This is to be expected, since the pre-training task is too crude: a simple copy operation from source to target with identical lengths. Pre-training with the case-mapping synthetic task (§3.3.2) and deletion probability d_s = d_t = 0 improves the scores, with gains of +1.0 to +5.0 spBLEU over the identity operation on our test set. Although the case-mapping pre-training task is still quite crude, it is able to beat fine-tuning from a randomly initialized model for both Myanmar-to-English and Indonesian-to-English. Our best performing synthetic task is pb-trees (§3.3.3) with a node reordering probability r = 0.15. This model shows that transfer learning from synthetic pre-training to real-world tasks can be substantial, with scores as high as +7.3 spBLEU over the baseline for Myanmar-to-English and +4.9 for Indonesian-to-English. We do not see gains for Turkish-to-English for any of our purely synthetic pre-training tasks. The fine-tuning data for this language pair is much larger than that of the other language pairs. As the fine-tuning data size increases, the benefits of transfer learning from pre-training diminish.
We also evaluate the strongest of our three purely synthetic pre-training tasks, pb-trees, on additional non-English-centric language pairs. Table 8 in Appendix A.7 shows spBLEU decoding results for these additional pairs. We compare performance over a range of different fine-tuning set sizes. On both OPUS-Test and FLORES-devtest, and for the majority of fine-tuning set sizes, synthetic pre-training with the pb-trees task typically outperforms fine-tuning from a randomly initialized baseline.

Synthetic Knowledge Transfer
In this section, we discuss what kind of useful representations are actually learned by the model when pre-training with purely synthetic tasks and data. Our empirical study has shown that pre-training on synthetic data can result in improved translation quality after fine-tuning for a specific language pair. Even though the pre-training data is entirely synthetic, the model must have successfully learned representations and structures relevant to translation that can be leveraged via transfer learning on the downstream task.
In Table 2, we show the word piece overlap between our tokenized synthetic pre-training corpus and the real human language corpus for three fine-tuning language pairs. Our vocabulary consists of 26^3 paired lowercase-uppercase synthetic tokens, but after tokenization the number of unique word pieces is much lower. For example, there are only 3,541 unique source and 2,405 unique target word pieces after tokenizing a corpus of 2M synthetic parallel sentence pairs. The fine-tuning data, although much smaller, has a far greater token diversity for English, Indonesian, and Turkish. Myanmar is the exception: it is aggressively segmented by the XLMR sentencepiece model, which results in far fewer unique word pieces. We compute the intersection between the set of word pieces in the synthetic pre-training data and those in the fine-tuning data in the last column of Table 2. We observe low word piece overlap. For example, only 35 of the 3,541 word pieces that occur in the source side of the synthetic corpus also occur in the source side of the my-en corpus. This number is low because the Myanmar script is so different from English. But overlap remains low even for languages such as Indonesian and Turkish, which have alphabets similar to English. Low levels of overlap were also observed in our obfuscated pre-training experiments (Table 6). The low word piece overlap means that most of the word embeddings learned during pre-training have little relevance to the fine-tuning or inference stages. We conclude that any transfer learning benefit exhibited by the model on the downstream task must be captured in the inner layers of the transformer.
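The overlap statistic above amounts to a set intersection over tokenized corpora. A minimal sketch, assuming both corpora have already been segmented into whitespace-separated word pieces:

```python
def word_piece_overlap(pretrain_sents, finetune_sents):
    # Collect the unique word pieces on each side and intersect them.
    pre = {p for s in pretrain_sents for p in s.split()}
    fin = {p for s in finetune_sents for p in s.split()}
    return len(pre), len(fin), len(pre & fin)
```

In practice the segmentation would be produced by a sentencepiece model such as the XLMR tokenizer mentioned above.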

Lexical and Structural Knowledge
The results in Table 1 show phrase-cat to be an effective pre-training strategy for low-resource NMT. Both lexical and structural knowledge are captured in the aligned phrases. However, since the phrases are sampled from the uniform distribution, long-

Translation Quality vs. Toxicity
To evaluate model toxicity, we consider catastrophic mistranslations (Costa-jussà et al., 2022). These errors occur when a model hallucinates toxic terms in the translated text, even though no such terms occur in the source text. Following the toxicity measurement setup of Goyal et al. (2022), we use the FLORES Toxicity-200 word lists to calculate the toxicity rate of translations produced by a model. The lists cover 200 languages and contain frequently used profanities, insults, and hate speech terms. We consider a sentence toxic if it contains words that match entries in these lists. The toxicity rate for each model is defined as the proportion of sentences with hallucinated toxicity in translations of the test set and a larger set of 100k monolingual sentences randomly sampled from CC-100 (Wenzek et al., 2020; Conneau et al., 2019). We compare BLEU scores and toxicity rates for various models, including current state-of-the-art large pre-trained multilingual translation models, in Table 3.
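The toxicity-rate computation described above can be approximated with a simple word-list match. This is a simplification: the actual Toxicity-200 matching may be phrase-based and language-specific, and real tokenization differs by script.

```python
def toxicity_rate(translations, toxic_words):
    """Fraction of output sentences containing at least one term from a
    toxic word list (a sketch of the hallucinated-toxicity measurement)."""
    toxic = {w.lower() for w in toxic_words}
    flagged = sum(
        1 for sent in translations
        if any(tok in toxic for tok in sent.lower().split())
    )
    return flagged / len(translations) if translations else 0.0
```

Hallucinated toxicity would additionally require checking that the matched term has no counterpart in the source sentence.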

Results and Analysis
We first observe that models pre-trained on synthetic data obtain significantly higher BLEU scores than baselines trained from scratch using only the fine-tuning data. This confirms that our proposed synthetic tasks indeed capture useful knowledge that can be applied through transfer learning to low-resource NMT tasks. When compared to the multilingual translation models FLORES-101 (615M parameters) and M2M-100 (1.2B parameters), we note that models pre-trained on synthetic data obtain comparable performance on my-en and even outperform the multilingual models by a large margin on de-my, id-en, and my-tl, though with inferior translation quality on de-id. It should be noted that some of these language pairs represent zero-shot directions for M2M-100. We compare our synthetic methods with the standard NMT data augmentation technique of back-translation in Appendix A.3.
While these results are quite promising, we note that our goal in this paper is not to surpass the state-of-the-art translation quality achieved by large-scale massively multilingual models on low-resource NMT. Instead, we seek to further understand which properties of pre-training based on synthetic tasks and data, along the structural and lexical knowledge axes of Figure 1, enhance transfer learning performance while minimizing toxicity and other data issues inherent in models that rely on large-scale pre-training using real data.
Analyzing toxicity, we observe the presence of catastrophic mistranslations in all models, but in most cases less frequently when training from scratch. This is because the low-resource fine-tuning data contains very little toxic content. On the other hand, as noted above, the BLEU scores when training models from scratch are very low. We see that the FLORES-101 and M2M-100 models both exhibit toxicity, since they were pre-trained on real-world corpora that can include toxic content. Our results show that synthetic pre-training can produce models with comparable BLEU scores while significantly reducing catastrophic mistranslations. We observe that parallel data generated from permuted binary trees has the lowest toxicity among the three synthetic pre-training methods, since it relies on purely synthetic data. This may indicate that patterns in the data can still trigger toxic terms, even after the words have been obfuscated or phrases have been shuffled. We include additional toxicity results and analysis in Appendix A.5.

Table 3: BLEU scores and toxicity rates for various models on low-resource language pairs. The baseline, trained only on real-world fine-tuning data, serves as a lower bound on performance; the large pre-trained models serve as an upper bound.

Conclusion
Our study of synthetic pre-training tasks for NMT showed that pre-training benefits can still be achieved even when using synthetic or obfuscated data. Additionally, we have shown that synthetic data has the potential to reduce model toxicity compared to models trained on web-scale crawled corpora. Our research provides insights into what types of knowledge transfer make for a good pre-trained model. We believe that data augmentation techniques based on synthetic tasks and procedurally generated data are a promising solution for addressing pre-training data concerns, and can lead to efficient, accurate, and trustworthy NMT. In future work, we plan to further investigate synthetic pre-training by exploring more advanced data generation models and directly optimizing their parameters for specific downstream fine-tuning tasks. Increasing the effectiveness of synthetic data at different data scales is also worthy of further exploration.

Limitations
Our work seeks to gain insight into what pre-training knowledge is transferred and useful for downstream fine-tuning in NMT using synthetic tasks and data. We note that changes to the data generation methods require re-running the pre-training stage, which is computationally expensive compared to the fine-tuning stage.
Our current synthetic data generation methods are somewhat crude. Although they are designed to encode varying degrees of lexical and structural translation knowledge, they do so in a rather simplistic way. For example, sampling phrases from a normal distribution ignores distributional frequencies, information that is likely useful for the synthetic data generation task. In this paper we have presented some interesting initial findings regarding the suitability of synthetic pre-training for NMT; we plan to explore more sophisticated data generation models in future work.
We acknowledge that synthetic pre-training is unlikely to surpass the quality of real-world massively multilingual pre-trained models, especially if synthetic data is the only data used for pre-training. However, good performance can probably be achieved by combining synthetic pre-training and real-data pre-training. Of course, this risks exposing the model to toxic, sensitive, or private content. Therefore, concerns about both model quality and data quality should be considered when evaluating the impact and benefits of synthetic pre-training. We view synthetic pre-training as a complementary approach for finding an optimal balance rather than as a replacement for previous state-of-the-art NMT pre-training methods.

Ethics Statement
All of the training data used in our experiments come from official releases of publicly available benchmarks. The toxic word lists used to measure toxicity are obtained from the public FLORES repository, which is password-protected to reduce the risk of misuse by malicious users or adversarial bots. In addition, regarding the issue of hallucinated toxicity discussed previously, we note that our work also has the potential to address other problematic translation behaviors, such as hallucinated bias.

A.1 Scaling Effect of Obfuscated Pre-training
We first evaluate the performance of regular pre-training and fine-tuning with various quantities of real-world German-to-English data. The results in Figure 4 show that the highest BLEU scores are obtained by using regular real-world parallel data (i.e., 0% obfuscation). We also compare against models trained solely on the fine-tuning data ('Scratch'): the resulting BLEU scores are quite low when the training data size is small, confirming the importance and benefits of pre-training for NMT.
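For concreteness, one simple way to realize word-level obfuscation at a given ratio is to replace a fraction of the vocabulary with opaque placeholder tokens, applied consistently across the corpus. The sketch below is illustrative only; the function name, placeholder format, and type-level sampling are our assumptions, not the exact procedure from Section 3:

```python
import random

def obfuscate_corpus(sentences, ratio, seed=0):
    """Replace a fraction `ratio` of word types with opaque placeholder
    tokens, applied consistently across the corpus (illustrative sketch)."""
    rng = random.Random(seed)
    vocab = sorted({tok for sent in sentences for tok in sent.split()})
    chosen = rng.sample(vocab, int(len(vocab) * ratio))
    mapping = {w: f"<obf{i}>" for i, w in enumerate(chosen)}
    return [" ".join(mapping.get(tok, tok) for tok in sent.split())
            for sent in sentences]

corpus = ["the cat sat", "the dog ran"]
print(obfuscate_corpus(corpus, ratio=1.0))  # every word type replaced
```

Because the mapping is applied at the type level, an obfuscated word is replaced by the same placeholder everywhere, preserving co-occurrence structure while destroying lexical identity.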

A.2 FLORES Obfuscated Pre-training Results
We show additional decoding results on the FLORES devtest set in Figure 5 for obfuscated pre-training under the matched condition (source and target fine-tuning languages are the same as the pre-training languages: de-en) and the unmatched condition (source or target fine-tuning language differs from the pre-training languages: de-id, de-my, id-en, my-en, my-tl).

A.3 Back-Translation Comparison
Back-translation (Sennrich et al., 2016) is an effective technique for improving the quality of machine translation. It creates new parallel sentence pairs by translating target-side monolingual data into the source language using an inverse-direction MT system. The new pairs consist of a (possibly noisy) back-translated source sentence paired with a high-quality target sentence. We compare our synthetic training methods to an NMT system trained on an augmented data set that includes back-translated parallel data. We use our baseline models for my-en and en-my to produce the back-translated sentences, generating an additional set of 2M back-translated sentence pairs for each direction. The results are shown in Table 4. Back-translation provides only limited improvements over the baseline model trained from scratch for my-en, and actually hurts for en-my. This is because back-translation requires a good-quality model in the target-to-source direction in order to produce accurate and relevant translations, and the my-en baseline model is not of sufficiently high quality to produce useful back-translations. Our synthetic methods significantly outperform back-translation in both translation directions, confirming our expectation about the limitations of back-translation in low-resource conditions and further illustrating the potential of our proposed synthetic approaches.
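The back-translation pipeline described above can be sketched as follows; `translate_tgt_to_src` is a hypothetical stand-in for the inverse-direction MT system, and the toy word-reversal model is purely for illustration:

```python
def back_translate(monolingual_tgt, translate_tgt_to_src):
    """Create synthetic parallel pairs: (possibly noisy back-translated
    source, clean target). `translate_tgt_to_src` stands in for the
    inverse-direction MT system (hypothetical)."""
    pairs = []
    for tgt in monolingual_tgt:
        src = translate_tgt_to_src(tgt)  # noisy synthetic source side
        pairs.append((src, tgt))
    return pairs

# Toy stand-in for an inverse-direction system: reverse the word order.
toy_inverse = lambda s: " ".join(reversed(s.split()))
print(back_translate(["hello world"], toy_inverse))
```

The quality ceiling of the resulting pairs is set by the inverse-direction model, which is exactly why back-translation struggles in the low-resource my-en condition.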

A.4 Synthetic Pre-training Data Scaling
Figure 6 shows the data scaling behavior of the pb-trees and phrase-cat synthetic pre-training methods. We pre-train each model with proper subsets of varying sizes sampled from the full 2M pairs used in the rest of our experiments. For pb-trees, the scaling is mostly flat: the BLEU scores, while consistently higher than the baseline (which uses no pre-training at all), increase only gradually with additional synthetic training data. The BLEU gains over the baseline are therefore a result of priming the model for the task of translation, rather than of learning any useful real-world lexical relationships between the source and target languages. For phrase-cat, the data scaling curve is much more pronounced: for all three tasks, we observe steadily increasing BLEU scores with larger synthetic training set sizes, reaching a plateau at around 1M pairs. The phrase-cat method benefits from additional samples and combinations of real phrase pairs, since the synthetic pairs provide additional coverage of the word orders and translation relationships that aid subsequent fine-tuning and decoding of the test set.
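As a rough illustration of the pb-trees idea (cf. Figure 2), the sketch below builds a random binary tree over synthetic tokens and emits a "source" and "target" that differ only in that some nodes' children are swapped. The token format, balanced tree shape, and swap probability here are illustrative assumptions, not the exact generation procedure from our experiments:

```python
import random

def permuted_pair(depth, swap_prob, rng):
    """Generate an aligned synthetic sentence pair from a random binary
    tree: the source reads the leaves in order, the target reads them
    after swapping each node's children with probability `swap_prob`."""
    if depth == 0:
        tok = f"w{rng.randrange(1000)}"  # synthetic vocabulary item
        return [tok], [tok]
    left_src, left_tgt = permuted_pair(depth - 1, swap_prob, rng)
    right_src, right_tgt = permuted_pair(depth - 1, swap_prob, rng)
    src = left_src + right_src
    if rng.random() < swap_prob:  # reorder this non-terminal's children
        tgt = right_tgt + left_tgt
    else:
        tgt = left_tgt + right_tgt
    return src, tgt

rng = random.Random(42)
src, tgt = permuted_pair(3, 0.3, rng)
print(" ".join(src), "|||", " ".join(tgt))
```

The two sides always contain the same tokens, so the pair encodes structural reordering knowledge without any real-world lexical content.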

A.5 Further Analysis of Toxicity
We further analyze the toxicity of our models by comparing the toxicity rate of source-language sentences with that of their translations. First, we test de-en translation systems with obfuscated pre-training on the WMT test set, as shown in Table 5. We observe that training with real-world data (i.e., obfuscation ratio R = 0%) generates translations that contain toxic terms more frequently than they occur in the source. This indicates a toxicity amplification effect, a problem highlighted previously for toxicity (Costa-jussà et al., 2022) and bias (Leino et al., 2018). Pre-training with obfuscated data, however, is a promising way of mitigating this phenomenon, as shown by the large reduction in toxicity rate as the obfuscation ratio is increased. We observe a similar pattern for CC-100 data as well; the sentences in the CC-100 corpus are more toxic than those in the WMT test set (0.57% > 0.43%).
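A minimal sketch of a word-list-based toxicity rate, assuming a simple exact-match matcher (the actual FLORES-list-based evaluation is more involved; the function name and matching rule are our assumptions):

```python
def toxicity_rate(sentences, toxic_terms):
    """Percentage of sentences containing at least one listed term
    (simplified word-list matcher, illustrative only)."""
    toxic = {t.lower() for t in toxic_terms}
    hits = sum(any(tok.lower() in toxic for tok in s.split())
               for s in sentences)
    return 100.0 * hits / len(sentences)

src = ["a clean sentence", "a badword here", "fine again"]
hyp = ["clean output", "output with badword twice badword", "still fine"]
print(toxicity_rate(src, {"badword"}), toxicity_rate(hyp, {"badword"}))
```

Amplification then corresponds to the hypothesis-side rate exceeding the source-side rate over the same test set.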

A.6 Word-Piece Overlap Statistics for Obfuscated Pre-Training
Similar to Section 5.1, we also report the token overlap between completely encrypted pre-training data (both source and target corpora) and real-world fine-tuning data: for de-en in Table 6, and for the other language directions id-en, my-en, and tr-en in Table 7. In de-en translation, the overlap is just 0.08% on the source language and 0.04% on the target language with the largest fine-tuning set (1M). On the low-resource language pairs, there is almost no overlap between pre-training and fine-tuning on either the source or the target side, as shown in Table 7. This strongly supports the conclusion in Section 5.1: most of the representations in the first layers are not touched during pre-training, and the benefits of pre-training may come from the inner layers, which capture transferable high-level knowledge for downstream tasks.
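One plausible way to compute the overlap statistics reported here is to measure the share of fine-tuning word-piece types that also occur in the pre-training corpus; the sketch below uses that definition, which is an assumption on our part (the reported tables may use a different normalization):

```python
def overlap_pct(pretrain_sents, finetune_sents):
    """Percentage of fine-tuning word-piece types that also occur in
    the pre-training corpus (one plausible definition of 'overlap')."""
    pt_types = {tok for s in pretrain_sents for tok in s.split()}
    ft_types = {tok for s in finetune_sents for tok in s.split()}
    return 100.0 * len(pt_types & ft_types) / len(ft_types)

# One of three fine-tuning types ("x1") also appears in pre-training.
print(overlap_pct(["x1 x2 x3"], ["the cat x1"]))
```

Under this definition, a near-zero overlap means the embedding rows used at fine-tuning time were essentially untrained during pre-training, which is what motivates the inner-layer transfer interpretation above.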

A.7 Synthetic Pre-Training: Additional Language Pairs
Table 8 shows translation decoding results (spBLEU) for additional non-English-centric language pairs. We compare synthetic pre-training on permuted binary trees vs. fine-tuning from a randomly initialized model as a function of the fine-tuning set size. Cells marked 'n/a' indicate that there was insufficient parallel data to create a fine-tuning set of the specified size.

B Implementation Details
This section describes implementation details for facilitating the reproduction of our work.

B.1 Model Architectures
All translation models described in our experiments are based on the sequence-to-sequence transformer 'base' architecture (Vaswani et al., 2017) as implemented in fairseq (Ott et al., 2019). The models have six encoder layers, six decoder layers, and eight attention heads; the word embedding size is 512, and the feed-forward layers have 2048 dimensions. All BLEU scores are computed using SacreBLEU (Post, 2018) with sentencepiece tokenization (Goyal et al., 2022). Our SacreBLEU scoring signature² indicates that both source and reference are sentencepiece tokenized prior to scoring.
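As a sanity check on the model size implied by these settings, a rough parameter count for the transformer-base configuration can be computed as below (biases and layer norms are omitted for brevity, and the shared 32k-entry embedding table is an assumption, not a reported setting):

```python
def transformer_base_params(vocab_size, d_model=512, d_ff=2048,
                            n_enc=6, n_dec=6):
    """Approximate parameter count for a transformer-base model
    (ignores biases and layer norms; assumes one shared embedding)."""
    attn = 4 * d_model * d_model      # Q, K, V, and output projections
    ffn = 2 * d_model * d_ff          # two feed-forward projections
    enc_layer = attn + ffn
    dec_layer = 2 * attn + ffn        # self-attention + cross-attention
    embed = vocab_size * d_model      # shared embedding table
    return n_enc * enc_layer + n_dec * dec_layer + embed

print(f"{transformer_base_params(32000) / 1e6:.1f}M parameters")  # 60.4M
```

This lands in the expected ~60-65M range for transformer-base, with the exact total depending on the vocabulary size and whether embeddings are shared.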

B.2 Hyper-Parameters and Training Configuration
Table 9 shows the hyper-parameters and training settings used for our experiments. We found that different warm-up schedules were appropriate for the

Figure 2: Example synthetic sentence pair and partial derivation for the aligned permuted binary trees task. In this example, a single non-terminal node was reordered.

Figure 3: Translation spBLEU scores after pre-training with different levels of obfuscation and real-world fine-tuning on downstream language pairs. Scratch refers to training from scratch using only fine-tuning data. Similar results on FLORES can be found in Appendix A.2.
This material is based upon work supported by the Defense Advanced Research Projects Agency under Contract No. FA8750-19-C-1001. Disclaimer: Any opinions, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the Defense Advanced Research Projects Agency. Zexue He is supported by an IBM Ph.D.

Figure 4: Translation results after pre-training with different levels of obfuscation and real-world fine-tuning on the same language pairs, with various quantities of de-en fine-tuning data. Scratch refers to training from scratch using only fine-tuning data.

Figure 5: Translation decoding results on WMT for (a) regular parallel corpus (0%) vs. obfuscated pre-training as a function of fine-tuning set size (x-axis) and obfuscation ratio (in different colors), and (b) unmatched conditions.

Figure 6: Effect on BLEU score of scaling up the size of the procedurally generated parallel data used during pre-training for two of our synthetic tasks: permuted binary trees 'pb-trees' (top) and concatenated aligned phrases 'phrase-cat' (bottom).

Table 1: Translation decoding results (spBLEU) for three purely synthetic pre-training variants and concatenation of aligned phrases vs. fine-tuning from a randomly initialized baseline ('scratch'), on English-centric language pairs.

Table 4: Synthetic pre-training vs. back-translation on the WMT test set and FLORES devtest set.

Table 5: Toxicity rate (%) on the WMT test set (left) and sampled CC-100 data (right). Results that increase toxicity relative to the source (0.43% for WMT, 0.57% for CC-100) are colored red; otherwise they are colored green. The degree of toxicity is indicated by the darkness of the color.

² BLEU+case.mixed+numrefs.1+smooth.exp+tok.spm+version.1.5.1

pre-training and fine-tuning stages. We choose the best model during training by maximizing the tokenized BLEU score on the validation set. For both pre-training and fine-tuning, we allow training to continue until the BLEU score has fully converged.

Table 6: Tokenized pre-training (PT) and fine-tuning (FT) word-piece counts and overlap statistics comparing obfuscated pre-training (upper part) vs. regular pre-training (lower part) for German-to-English parallel data with various fine-tuning data set sizes.

Table 7: Tokenized pre-training (PT) and fine-tuning (FT) word-piece counts and overlap statistics comparing obfuscated pre-training (upper part) vs. regular pre-training (lower part) for additional language directions.