Recipes for Adapting Pre-trained Monolingual and Multilingual Models to Machine Translation

There has been recent success in pre-training on monolingual data and fine-tuning on Machine Translation (MT), but it remains unclear how to best leverage a pre-trained model for a given MT task. This paper investigates the benefits and drawbacks of freezing parameters, and adding new ones, when fine-tuning a pre-trained model on MT. We focus on: 1) fine-tuning a model trained only on English monolingual data, BART; 2) fine-tuning a model trained on monolingual data from 25 languages, mBART. For BART we get the best performance by freezing most of the model parameters and adding extra positional embeddings. For mBART we match or outperform the performance of naive fine-tuning for most language pairs with the encoder, and most of the decoder, frozen. The encoder-decoder attention parameters are the most important to fine-tune. When constraining ourselves to an out-of-domain training set for Vietnamese to English, we see the largest improvements over the fine-tuning baseline.


Introduction
Machine Translation (MT) has recently seen significant advances, with improvements in modeling, especially since the advent of neural models (Sutskever et al., 2014; Bahdanau et al., 2015), and the availability of large parallel corpora for training such systems (Smith et al., 2013; Kocmi and Bojar, 2017; Tiedemann, 2012). However, often standard neural systems do not perform well on low-resource language pairs (Koehn and Knowles, 2017), especially when the language pairs are only distantly related. Since these languages are spoken by a large fraction of the world's population, reducing the gap in performance between high and low-resource MT could have a large impact.
Figure 1: Schematic diagram showing the components of our system for adapting BART to MT. We learn a new encoder that takes as input the source language, with a potentially different vocabulary to the original BART system. We freeze most BART parameters (frozen model components are shown in blue). [Components: BART Encoder (pre-trained); New Encoder (randomly initialized); Adapters (randomly initialized); BART Decoder (pre-trained).]

An explosion of interest in large-scale pre-training in Natural Language Processing has led to increased performance on smaller datasets, by simple fine-tuning of large pre-trained models on downstream tasks. The typical approach is to train a large model on text from the web (for example English Wikipedia), with a common objective of predicting masked-out tokens using the unmasked context. For Natural Language Generation (for example summarization of text), performance can be improved by pre-training a sequence-to-sequence model (Song et al., 2019). However, previous work has shown that on NLP tasks such as Natural Language Inference, the relative performance of fine-tuning vs. keeping the pre-trained model frozen depends on the similarity of the pre-training and downstream tasks (Peters et al., 2019). We observe empirically that simple fine-tuning of a monolingual model for MT can result in worse performance than training from scratch (e.g. Table 1). For MT, the more common monolingual (usually English-only) pre-training (Peters et al., 2018; Radford et al., 2018; Devlin et al., 2019; Yang et al., 2019b) may be inadequate, since the input or output domain for the downstream task will be a non-English language. Multilingual pre-training offers a solution, by modifying the pre-training objective to include many languages. Using a multilingual pre-trained model for MT gives good performance, especially on lower-resource language directions. However, it is challenging to balance the training data so that higher-resource languages do not overwhelm lower-resource ones (Arivazhagan et al., 2019; Conneau et al., 2019). For a particular language it may be hard to source monolingual data, or it may simply not be included in training.
We also consider multilingual MT (training on many language pairs and sharing all or most model parameters) as a downstream task. Sharing 'knowledge' across language directions can improve performance on low-resource language pairs by transfer from other pairs included in training. Previous work observed problems of performance degradation, often on high-resource languages, due to interference and constrained capacity (Johnson et al., 2017). When initialising from a pre-trained model, we also want to avoid 'catastrophic forgetting', whereby fine-tuning on a particular language pair loses the knowledge about another language pair that is stored in the model weights. Previous work has explored how to improve on simple fine-tuning, by freezing pre-trained model parameters (Peters et al., 2019; Houlsby et al., 2019) and using lightweight 'adapter modules' (Houlsby et al., 2019; Stickland and Murray, 2019) which are inserted between the layers of the pre-trained network. We aim to explore and improve on these approaches for both bilingual and multilingual MT (in contrast to previous work largely focusing on text classification), and we explore freezing different subsections of the pre-trained model. We expect freezing to be particularly useful when the parallel data is of low quality, in which case naive fine-tuning may, for example, over-specialise the pre-trained model to a particular domain.
Our main contributions are: • A novel fine-tuning approach, similar to Lewis et al. (2019) but with adapter modules in the encoder of the pre-trained sequence-to-sequence model, and combining both learnable and fixed sinusoidal positional embeddings in the input module (see sections 3.1 and 3.2) that feeds into the pre-trained encoder.
• Extensive experiments with fine-tuning a multilingual pre-trained model for MT, showing the benefits and drawbacks of freezing various parameters. We find we should freeze the decoder but unfreeze the encoder-decoder attention when fine-tuning on Xx → En data, and in the other direction we should freeze the encoder but unfreeze the entire decoder (section 5.3). We find monolingual models benefit more from freezing parameters than multilingual models (section 5.2).
• Results on fine-tuning a multilingual pretrained model for multilingual MT showing that freezing parameters improves performance on some, mostly distantly related, language directions (section 5.5).

Background and Related Work

BART and mBART
We briefly describe the pre-trained models we focus on in this work. In order to perform machine translation with a minimum of modifications to the pre-trained model, we prefer models that can perform conditional sequence generation. We concentrate on the BART (Bidirectional and Auto-Regressive Transformer) model (Lewis et al., 2019) and the multilingual BART (mBART; Liu et al., 2020) model. BART and mBART are sequence-to-sequence models with the standard transformer-based neural machine translation architecture, i.e. an encoder and an autoregressive decoder. The pre-training task they are trained on is reconstructing a document from a noisy version of that document (a so-called 'denoising autoencoder'). Examples of noise added to the training data include randomly shuffling the order of the original sentences, randomly changing the start position of the document, and using a masking scheme where arbitrary-length spans of text are replaced with a single mask token. BART and mBART are trained entirely on monolingual data from the web, with English data for BART and data from 25 different languages for mBART.
BART and mBART have almost identical architectures, with 12 encoder layers and 12 decoder layers, a model dimension of 1024, and 16 attention heads. BART has a vocabulary of approximately 40k tokens and ∼406M parameters, whereas mBART has a larger vocabulary of 250k tokens and ∼610M parameters.
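Most of that parameter difference is the token-embedding matrix (shared with the softmax layer); a quick back-of-the-envelope check, assuming tied input/output embeddings:

```python
# Back-of-the-envelope parameter accounting for BART vs. mBART embeddings.
# Assumes tied input/output (softmax) embeddings; all figures approximate.
d_model = 1024

bart_vocab, mbart_vocab = 40_000, 250_000
bart_embed = bart_vocab * d_model    # ~41M embedding parameters
mbart_embed = mbart_vocab * d_model  # ~256M embedding parameters

# The embedding matrices alone roughly match the ~200M parameter gap
# between BART (~406M) and mBART (~610M).
print(f"BART embeddings:  {bart_embed / 1e6:.0f}M")
print(f"mBART embeddings: {mbart_embed / 1e6:.0f}M")
print(f"difference:       {(mbart_embed - bart_embed) / 1e6:.0f}M")
```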

Pre-trained Models for MT
There has been much recent progress in pre-training for NLP applications (Peters et al., 2018; Radford et al., 2018; Devlin et al., 2019; Yang et al., 2019b), with the most relevant for our work focusing on text generation (Radford et al., 2019; Song et al., 2019; Dong et al., 2019; Raffel et al., 2019). Specifically for MT, Ramachandran et al. (2017) proposed pre-training the encoder and decoder modules as two separate language models, and Yang et al. (2019a) and Zhu et al. (2020) explored approaches incorporating BERT model weights into the usual sequence-to-sequence architecture.
Multilingual MT Multilingual translation (Firat et al., 2016; Viégas et al., 2016; Aharoni et al., 2019; Arivazhagan et al., 2019) aims to jointly train one translation model that translates multiple language directions, and shares representations to improve the translation performance on low-resource languages (Gu et al., 2018). Our freezing approach is similar in spirit to Sachan and Neubig (2018), who investigate which parameters are most useful to share for multilingual MT with transformer models. We start from a multilingual pre-trained model, and decide between sharing or freezing parameters.
Transfer Learning for MT Transfer learning hopes to leverage a related task to perform well on a target task, for example by initialising the model weights from those resulting from training on a related task. For MT various approaches have been explored, with a common method training on high-resource language(s) and fine-tuning on a low-resource language (Neubig and Hu, 2018).
Closely related to our work is that of Bapna and Firat (2019), who introduce freezing and adapters (extra parameters inserted within the transformer) for domain adaptation in MT. They take an MT model trained on a large parallel corpus, and fine-tune it on a different domain (e.g. legal text). We differ in that we start from a pre-trained model that has not been trained on parallel text, and study adapting it to MT. Approaches based on freezing various model components have also been proposed (Thompson et al., 2018; Zoph et al., 2016), but have focused on RNN models pre-trained with parallel data, not transformer models pre-trained on monolingual data.

Methods
Because BART has been trained on only English input, we need different techniques when fine-tuning BART and mBART for MT, with a schematic overview shown in Figure 1 and Figure 2. BART and mBART are standard sequence-to-sequence models, where an encoder consumes a sequence of source-side tokens, and a decoder acts as a conditional language model, generating target tokens given a source sequence. Intuitively, we want the encoder and decoder to be performing roughly the same tasks during fine-tuning as they were during pre-training. For BART this means the input to the encoder should be similar to (embedding vectors of) noisy English text. Therefore when training on, say, Vietnamese to English, we first transform the Vietnamese source sentence into a representation useful for BART. We introduce new parameters (the 'Input Module') that consume the source sentence and produce hidden vectors we can feed into the BART encoder. We describe the Input Module architecture in section 3.1.
mBART can be fine-tuned without modification since during pre-training it saw the languages it will be fine-tuned on. To increase flexibility when freezing parts of the network, we optionally add extra parameters to both BART and mBART, described in section 3.3.

Input Module Architecture
We refer to the network that takes in the source language text and outputs hidden vectors useful for BART as an 'Input Module', or IM(·). To improve performance on low-resource MT, we use smaller token embedding vectors on the source side, of size $d_s = 512$, whereas BART uses hidden vectors of size $d_{\text{BART}} = 1024$. The full network is as follows, with $\{e_t\}_{t=0}^{l}$ the token embeddings for a source sentence with $l$ tokens, where BART(·) is the full BART encoder-decoder model. Where we would normally input token embeddings to the BART model, we instead use the outputs of the Input Module. The $t$-th element of $\mathrm{IM}(\{e_t\}_{t=0}^{l})$ is

$$\mathrm{IM}(\{e_t\}_{t=0}^{l})_t = \alpha \, W \, \mathrm{LN}\big(\mathrm{Transformer}(\{e_t\}_{t=0}^{l})_t\big),$$

where LN(·) is layer-norm, $W$ is a matrix projecting up from $d_s$ to $d_{\text{BART}}$, and Transformer(·) is the application of a series of Transformer layers. $\alpha$ is a scalar, in our case equal to $\sqrt{d_{\text{BART}}}$, which is required to ensure the input to BART is on the same scale as the embedding vectors BART was trained on. If we remove LN(·), $W$ and $\alpha$, and set $d_s = d_{\text{BART}}$, we recover the method introduced by Lewis et al. (2019) for fine-tuning BART on MT.
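The following is a minimal PyTorch sketch of the Input Module; the layer and head counts are illustrative choices, and the learned positional embeddings at the embedding layer are omitted for brevity:

```python
import math
import torch
import torch.nn as nn

class InputModule(nn.Module):
    """Maps source-language tokens to vectors the (frozen) BART encoder can consume."""

    def __init__(self, vocab_size, d_s=512, d_bart=1024, n_layers=6, n_heads=8):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_s)
        layer = nn.TransformerEncoderLayer(
            d_model=d_s, nhead=n_heads, dim_feedforward=4 * d_s, batch_first=True
        )
        self.transformer = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.layer_norm = nn.LayerNorm(d_s)
        self.proj = nn.Linear(d_s, d_bart, bias=False)  # W: projects d_s -> d_BART
        self.alpha = math.sqrt(d_bart)                  # rescale to BART's embedding scale

    def forward(self, src_tokens):                        # src_tokens: (batch, seq_len)
        h = self.transformer(self.embed(src_tokens))      # Transformer({e_t})
        return self.alpha * self.proj(self.layer_norm(h)) # alpha * W * LN(...)
```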

Extra Positional Embeddings
We found empirically that the details of positional embedding vectors are important for good performance (see Table 1), perhaps because of the need for the BART model to deal with different word order to that it was trained on. Transformer models normally have either learnable positional embedding vectors, or fixed sinusoidal positional embedding vectors $p^t$ (Vaswani et al., 2017), with

$$p^t_i = \sin\!\big(t/10000^{\,i/(d_s/2-1)}\big) \quad \text{if } 0 \le i < d_s/2,$$
$$p^t_i = \cos\!\big(t/10000^{\,(i-(d_s/2-1))/(d_s/2-1)}\big) \quad \text{if } d_s/2 \le i < d_s,$$

where $t$ indexes position and $i$ indexes dimension.
Note that positional embeddings are typically only added to the token embeddings. We use learnable positional embeddings at the embedding layer, but to get extra positional information we optionally add fixed sinusoidal positional embeddings to the input of each transformer layer in IM(·), i.e. to the output of the previous layer. This means the network has access to both learned positional embeddings (only at the embedding layer) and fixed sinusoidal ones at the input to each layer.
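A sketch of these fixed sinusoidal embeddings, and of how they would be added at every layer (the helper name and loop structure are our own illustration):

```python
import math
import torch

def sinusoidal_embeddings(max_len, d_s=512):
    """Fixed sinusoidal embeddings p[t, i], following the formulas above."""
    p = torch.zeros(max_len, d_s)
    half = d_s // 2
    for t in range(max_len):
        for i in range(d_s):
            if i < half:
                p[t, i] = math.sin(t / 10000 ** (i / (half - 1)))
            else:
                p[t, i] = math.cos(t / 10000 ** ((i - (half - 1)) / (half - 1)))
    return p

# Inside the Input Module, each transformer layer would then see
#   h = layer(h + pos[: h.size(1)])
# so the network gets learned positional embeddings once (at the embedding
# layer) plus these fixed ones at the input to every layer.
```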

Within-Network Adapter Architecture
When freezing parts of a pre-trained model (either BART or mBART in our case), we may want to add flexibility by modifying the pre-trained model architecture. One approach is to use 'adapters', introduced by Houlsby et al. (2019); Stickland and Murray (2019) which are newly-initialised neural network layers that can be 'slotted in' to the layers of the pre-trained model.
We only considered simple adapter architectures: essentially feed-forward networks with one hidden layer and a residual connection to the output. The dimension of the hidden layer can be much smaller than the model dimension, to reduce computational cost and parameter count. We use one adapter per transformer layer, inserting them at the end of the layer (Stickland and Murray, 2019). We use the following architecture, with $h$ the hidden state of a particular token after the usual transformer layer, and $h_{\text{out}}$ the hidden state of the token after the adapter layer:

$$z = \tanh(W_d h), \qquad h_{\text{out}} = h + W_u z,$$

where $W_d$ projects down to the adapter hidden dimension and $W_u$ projects back up to the model dimension. The tanh non-linearity helped with stability in early experiments, probably because it prevents the adapter output exploding by constraining it between -1 and 1. We also considered a version of the adapter based on the 'gated linear unit' (GLU; Dauphin et al., 2016) architecture:

$$h_{\text{out}} = h + W_u\big(W_1 h \odot 2\,\sigma(W_2 h)\big),$$

where $\odot$ is element-wise multiplication and $\sigma$ the sigmoid function. We found the network was sensitive to changes in the magnitude of the hidden states the adapter produced, and therefore multiply the sigmoid gate by 2 so that it approximately leaves the magnitude of the hidden states unchanged.
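A sketch of the two adapter variants following the equations above; the weight names are ours, and the bottleneck dropout and hidden sizes follow the experimental settings section:

```python
import torch
import torch.nn as nn

class TanhAdapter(nn.Module):
    """Bottleneck adapter with a tanh non-linearity and a residual connection."""

    def __init__(self, d_model=1024, d_hidden=128, dropout=0.1):
        super().__init__()
        self.down = nn.Linear(d_model, d_hidden)   # W_d
        self.up = nn.Linear(d_hidden, d_model)     # W_u
        self.dropout = nn.Dropout(dropout)         # dropout on the bottleneck z

    def forward(self, h):
        z = self.dropout(torch.tanh(self.down(h)))
        return h + self.up(z)                      # residual connection

class GLUAdapter(nn.Module):
    """GLU variant; the sigmoid gate is doubled so the adapter roughly
    preserves the magnitude of the hidden states."""

    def __init__(self, d_model=1024, d_hidden=85, dropout=0.1):  # 85 ~ 2/3 * 128
        super().__init__()
        self.value = nn.Linear(d_model, d_hidden)  # W_1
        self.gate = nn.Linear(d_model, d_hidden)   # W_2
        self.up = nn.Linear(d_hidden, d_model)     # W_u
        self.dropout = nn.Dropout(dropout)

    def forward(self, h):
        z = self.dropout(self.value(h) * 2 * torch.sigmoid(self.gate(h)))
        return h + self.up(z)
```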

Freezing Details
BART We freeze all parameters of BART except the weights and biases of the layer-norm modules (following Houlsby et al. (2019)), and additionally unfreeze the self-attention module of the first layer in the BART encoder; together these are a small fraction of total BART parameters ($24 \cdot 2d_{\text{BART}}$ from layer-norm parameters and $4d_{\text{BART}}^2$ from the self-attention module). We freeze BART token embeddings (which are also used in the softmax layer).
mBART In most of our experiments we unfreeze layer-norm parameters, positional and token embeddings, and either the entire encoder or decoder module (or the encoder and subsections of the decoder). We also unfreeze the self-attention module of the first layer in the mBART encoder and decoder.
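A sketch of the BART freezing scheme above in PyTorch (the mBART variant would unfreeze additional modules in the same way); the parameter-name substrings are assumptions about the module layout, not fairseq's actual names:

```python
def freeze_for_bart(model):
    """Freeze everything except layer-norm weights/biases and the first
    encoder layer's self-attention. The name patterns below are illustrative;
    adjust them to the real parameter names of the model in use."""
    for name, param in model.named_parameters():
        param.requires_grad = (
            "layer_norm" in name                           # layer-norm weights and biases
            or name.startswith("encoder.layers.0.self_attn")  # first-layer self-attention
        )
```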

Experimental Settings
We use the fairseq (Ott et al., 2019) library for all experiments. The final models are selected based on validation likelihood, except for multilingual fine-tuning, where we evaluate the models after 10,000 training steps. We use beam search with beam size 5 for decoding, and evaluate all BLEU scores using SacreBLEU (Post, 2018). We use ISO 639-2 language codes in this work for convenience, and use the same parallel data as Liu et al. (2020), both listed in the Appendix.
We fine-tune frozen BART and an Input Module on bilingual parallel text, feeding the source language into the Input Module. For mBART we feed the source language into the encoder, and use the same hyper-parameters as Liu et al. (2020). When using adapters we use 0.1 dropout in the adapter bottleneck layer ($z$ in section 3.3), and a hidden dimension of either 128, or $2/3 \cdot 128$ when using a gated linear unit adapter. We use the Adam (Kingma and Ba, 2015) optimizer. Hyper-parameters are listed in Appendix B, and we use the same hyper-parameter search space for frozen and non-frozen models.

Multilingual MT
We train with a very large effective batch size, training on 32 GPUs with a per-GPU batch size of 4096 tokens, meaning our total batch size is $N \cdot 32 \cdot 4096$ tokens, where $N$ is the number of language pairs. We evaluate our model after 10,000 training steps (amounting to $N \cdot 10000$ forward-backward passes through the model).

Vocabulary
Table 3: Test set BLEU scores for frozen BART, frozen mBART and baselines (some rows reproduced from Table 2 or Table 1). 'Test (random init)' refers to training models (of various sizes) from scratch on the bitext for that language pair. 'Pars (m)' refers to the number of tunable parameters for each method in millions (note token embeddings are tuned in every method and account for 256m parameters). Bold indicates the best test set score and all scores whose difference from the best is not statistically significant (with p-value less than 0.05; statistical significance is computed via bootstrapping (Koehn, 2004)). Training set sizes per language pair:

Languages  Vi-En†  Vi-En  It-En  My-En  Ne-En  Si-En  Cs-En  Es-En
Size       110k    133k   250k   259k   564k   647k   11M    15M

[Table 4: Vi-En, It-En, My-En, Ne-En and Si-En results for freezing configurations including 'Freeze decoder (don't ft layer-norm)'; scores omitted.]

BART uses the GPT-2 tokenizer, which uses the BPE (Sennrich et al., 2016) approach (on the level of bytes, not characters). BART could technically take any Unicode string as input; however, the BPE is learned on English text. When fine-tuning BART on machine translation we therefore learn a new subword vocabulary (using the sentencepiece (Kudo and Richardson, 2018) library) on the source data from the fine-tuning dataset, and use a smaller vocabulary size of 5000, which empirically performs better for low-resource MT (Guzmán et al., 2019; Sennrich and Zhang, 2019); see the sketch at the end of this section. We don't change the mBART tokenizer or vocabulary.

Adding extra flexibility with within-network adapters helps performance, especially when added to the BART encoder. It is important to use learned positional embeddings at the embedding layer in the Input Module, with a 10.1 BLEU score drop if we use fixed positional embeddings (at the embedding layer). We see consistent gains in Table 1 and Table 2 by adding additional, fixed sinusoidal positional embeddings to the input of every transformer layer of the Input Module (see section 3.2), even when using an unfrozen BART. The BART encoder 'expects' English input, and it may be that the Input Module with extra fixed embeddings can better account for the different word order in the input language. In the next section we compare to mBART and baselines.
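As a concrete illustration of the vocabulary step above, learning a 5k-piece source-side sentencepiece model might look like this (file names are placeholders; the choice of BPE rather than unigram segmentation is our assumption):

```python
import sentencepiece as spm

# Train a 5k-piece subword model on the source side of the fine-tuning bitext.
spm.SentencePieceTrainer.train(
    input="train.vi",        # placeholder: source-language training text
    model_prefix="src_spm",
    vocab_size=5000,
    model_type="bpe",        # assumption; sentencepiece also supports unigram
)

sp = spm.SentencePieceProcessor(model_file="src_spm.model")
print(sp.encode("Xin chào thế giới", out_type=str))
```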

Frozen mBART
In Table 3 and Table 5 we list results from freezing various parts of mBART. We get better performance than naive fine-tuning ('ft all' in Table 3) with our freeze decoder + fine-tune encoder-decoder attention method ('ft enc-attn' in Table 3) for several language pairs, and match the baseline otherwise. We believe a benefit to freezing, when fine-tuning on training data from a different domain to the test data, will be avoiding specialising the pre-trained model to the fine-tuning training data domain. To test this we constructed a new Vi-En parallel dataset (Vi-En† in Table 3) using some of the same sources as the Flores (Guzmán et al., 2019) training data (the Si-En and Ne-En training sets used in this work), specifically the GNOME/KDE/Ubuntu domain from the OPUS repository (http://opus.nlpl.eu/) and Bible translations from the bible-corpus (https://github.com/christos-c/bible-corpus/), and we use the same test and validation sets as the IWSLT15 Vi-En dataset. By constraining ourselves to this out-of-domain training set we see the largest gains over the fine-tuning baseline (0.9 BLEU) out of the language pairs we considered.
We also consider the effect of the size of the fine-tuning dataset. If we constrain the training data to a random subset of 200k training examples from Ro-En (Table 6), the 'ft enc-attn' method outperforms simple fine-tuning. This effect generalises to an mBART variant that was pre-trained on only Ro and En monolingual data (using the same data as Liu et al. (2020)). Further results on Ro-En data are available in the Appendix, Table 10, and show similar trends to Table 3, with fine-tuning encoder-decoder attention the most important.

Table 3 shows the relative performance of frozen BART, frozen mBART and baselines. Fine-tuning mBART gave consistently better results than frozen BART, especially for distantly related languages. For Si, Ne and My the performance of frozen BART is roughly on par with a randomly initialised model (or much worse in the case of Ne-En). The parallel data for these languages is often lower quality, and the BART system has to learn about the non-English language from noisy or out-of-domain text (e.g. text from the Ubuntu manual for the En-Ne pair). For Vi and It, we have high quality parallel data, and the frozen BART method is only approximately 1.5 BLEU points behind the best mBART results. We note mBART was trained on more English data than BART, and with different noising function hyper-parameters.

What Should be Unfrozen?
Layer-Norm We find large benefits from simply fine-tuning the weights and biases of the pre-trained layer-norm modules (recall that after normalisation, the layer-norm module multiplies each hidden dimension by a weight and adds a bias); this was observed in the setting of BERT by Houlsby et al. (2019). This gains e.g. 0.5 BLEU for frozen BART (see Table 1) and an average of 0.8 BLEU across five languages for mBART (compare Table 4 to Table 3). Since these weights and biases amount to only 2d parameters per layer-norm, where d is the model dimension, this is parameter-efficient; adding more parameters with 'Adapters' on top of unfrozen layer-norm provides a smaller improvement.
Encoder vs Decoder For the Xx → En direction (Table 3) we can see that freezing the decoder always performs better than freezing the encoder (except for It-En, where they perform roughly the same). For the En → Xx direction (Table 5) we see slightly weaker evidence for the opposite trend, with the decoder more useful to fine-tune; but for the high-resource languages Es and Cs freezing the decoder works better. There is more English data in mBART pre-training than data in other languages, which may account for better results with a frozen encoder (when English is the source language) or decoder (when English is the target language). Adding flexibility with adapters in the frozen layers improves performance in all languages and directions, except for Ne→En.

Table 7: Test set BLEU score on many-to-one (Xx → En) multilingual MT with a simple round-robin training schedule. 'Ft enc-attn' refers to fine-tuning the encoder, and fine-tuning the encoder-decoder attention module in every decoder layer, leaving the other decoder sub-modules frozen. The 'Ft enc-attn' model setting uses adapter modules in the decoder to increase flexibility after freezing parameters. Bold indicates the best score and all scores whose difference from the best is not statistically significant (with p-value less than 0.05). For clarity we underline language pairs where the 'Ft enc-attn' method matches or outperforms naive fine-tuning.

We explore more fine-grained unfreezing for the Xx → En direction (Table 3). We fine-tuned three equally sized subsets of the decoder: the encoder-decoder attention layers (approx. $12 \cdot 4d_{\text{BART}}^2$ parameters), the self-attention layers in the decoder (approx. $12 \cdot 4d_{\text{BART}}^2$ parameters), or the entire last three layers of the decoder (approx. $3 \cdot 16d_{\text{BART}}^2$ parameters). We observe that fine-tuning the encoder-decoder attention performed well (note the last three layers include three encoder-decoder attention layers), with fine-tuning self-attention the least useful. We hypothesize that the pre-training task of mBART (reconstructing noisy monolingual sentences) does not help with teaching the encoder-decoder attention to align source and target text of different languages.

Memory Cost
Freezing parameters means we no longer need to allocate memory to storing their gradients. We obtain additional memory savings when using an optimizer that stores various other quantities (e.g. the Adam optimizer stores running averages of the first and second moments of the gradients). The memory savings allow for roughly 45-75% larger batches for the methods we consider in this work (see Table 8 for our mBART methods), and for larger pre-trained models the proportion of GPU memory freed up by freezing will increase. At inference time gradients are not required in any case, so the memory cost is unchanged.
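To make the saving concrete, a rough per-parameter accounting under fp32 Adam (our own illustration; real savings also depend on activations and implementation details):

```python
BYTES = 4  # fp32

def train_state_gb(n_params, frozen_fraction):
    """Weights + gradients + Adam first/second moments for trainable params;
    weights only for frozen params. Ignores activation memory."""
    trainable = n_params * (1 - frozen_fraction)
    frozen = n_params * frozen_fraction
    return (trainable * 4 + frozen * 1) * BYTES / 1e9

n = 610e6  # approximate mBART parameter count
print(f"fully fine-tuned: {train_state_gb(n, 0.0):.1f} GB")  # ~9.8 GB
print(f"80% frozen:       {train_state_gb(n, 0.8):.1f} GB")  # ~3.9 GB
```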

Multilingual Fine-tuning of mBART
We explore freezing parts of the mBART model when fine-tuning on a challenging multilingual MT task. Table 7 lists results from a naive fine-tuning baseline, and results from freezing most of the decoder but unfreezing the encoder-decoder attention (when freezing we use GLU adapters in the decoder, see section 3.3). Freezing parameters hurts performance on some language pairs; since freezing removes flexibility from the model and we have to adapt to 25 different directions, this is perhaps not surprising. The language pairs where we match or improve on the baseline are Zh, Es, Fi, Ne, Ja, Vi and Kk. These are mostly (five out of seven) non-European languages, distantly related to En. However, since most of these results are not statistically significant, further study is needed to verify this. Note we see a clear benefit over bilingual fine-tuning for some language pairs (e.g. compare our best Ne result from Table 3, 14.6 BLEU, vs. 20.8 BLEU for multilingual fine-tuning). We leave a more thorough investigation of the multilingual MT setting to future work.

Conclusion
We recommend: For a language with high quality parallel data but without a pre-trained model trained on monolingual data from that language, using a frozen (English-only) BART model with additional parameters at the source side (the 'Input Module') improves performance over a randomly initialised baseline. For this approach it is important to freeze the pre-trained model. We also give the model both learned positional embeddings at the embedding layer, and fixed sinusoidal positional embeddings at each layer of the Input Module. For a multilingual pre-trained model, we found performance improvements on some (mostly distantly related) languages for multilingual many-to-one fine-tuning. For bilingual En → Xx fine-tuning we did not see any improvement, although the performance drops are small, and by freezing parameters we need less memory at training time compared to fine-tuning. For Xx → En bilingual fine-tuning it is important to unfreeze the encoder-decoder attention and keep the rest of the decoder frozen. This can improve on simple fine-tuning, especially for distantly related language pairs or those with out-of-domain training data.
We recommend fine-tuning layer-norm parameters as a parameter-efficient complement to adapter layers. For our mBART experiments we found it was necessary to fine-tune the token embeddings, which correspond to a large number of parameters; future work could remove this cost by identifying a subset of the vocabulary to fine-tune, or by another method.

A Additional Ablation Study
In Table 9 we reproduce Table 4 of the main paper with more context, to study the effect of unfreezing layer-norm parameters when fine-tuning mBART. Across all language pairs we see improvements from fine-tuning layer-norm parameters over not fine-tuning them, and additional, smaller improvements from adding adapters, indicating that both forms of added flexibility are useful. In Table 10 we present additional results on the Ro-En pre-trained model (see section 3.2 of the main body).

B Fine-tuning Hyper-parameters
For all experiments with bilingual datasets we use a batch size of 2048×16 tokens, i.e. 2048 tokens per GPU and 16 GPUs (we investigate larger batch sizes for frozen models only to test GPU memory usage, and do not evaluate models trained with larger batch sizes). Ranking of hyper-parameters was done by validation set BLEU score.
Frozen BART We train with 0.3 dropout for the frozen BART parameters, and 0.2 dropout for the Input Module parameters, 0.1 label smoothing, 0.2 dropout for the self-attention scores in the Input Module, 5000 warm-up steps, and 7e−4 maximum learning rate. We performed a grid search over learning rates in {7e−4, 5e−4, 3e−4}, dropout for Input Module parameters in {0.2, 0.1}, and dropout for self-attention scores in {0.2, 0.1}. We train for a maximum of 50K training updates for all low and medium resource pairs and 100K for high resource pairs (which takes roughly 8 hours and 16 hours respectively).
Frozen mBART We train with 0.3 dropout, 0.2 label smoothing, 2500 warm-up steps, and 3e−5 maximum learning rate. We did not search over hyper-parameters, simply re-using those of Liu et al. (2020). Despite the adapter parameters being randomly initialised, the small learning rate did not affect performance (we performed a small sweep of larger learning rates and found only marginal gains, and so kept the same settings for simplicity). We use a maximum of 40K training updates for all low and medium resource pairs and 100K for high resource pairs (Es and Cs in our case); this takes roughly 12 hours and 30 hours respectively.
Multi-lingual MT We train with 0.3 dropout, 0.1 dropout for self-attention scores, 4000 warm-up steps, and 1e−4 maximum learning rate.
Out-of-domain Vi-En Baseline To train a randomly initialised baseline for the out-of-domain Vi-En data (Vi-En† in Table 3 of the main body) we used the same model architecture and training settings as those of Guzmán et al. (2019) for training MT systems on similar data (but with Si or Ne as the source language): a seq2seq transformer with 5 encoder and decoder layers, hidden dimension 512, shared embeddings between the input and softmax layers, and strong regularisation (e.g. 0.4 dropout on hidden states, 0.2 dropout on attention scores, 0.2 label smoothing). We learn a BPE vocabulary (joint across source and target data) of size 5000 on the training data. For full details of the hyper-parameters we refer the reader to Guzmán et al. (2019) and the associated GitHub repository.

C Pre-training Languages
We reproduce in Table 11 the details from Liu et al. (2020) of the size of each pre-training language corpus for mBART.

Table 9: Validation BLEU score (unless stated otherwise) obtained by fine-tuning layer-norm parameters and adding adapters for mBART, for Xx → En. 'ft' refers to fine-tuning, i.e. unfreezing. Note we are simply reproducing rows from Table 3 and Table 4.