When does Parameter-Efficient Transfer Learning Work for Machine Translation?

Parameter-efficient fine-tuning methods (PEFTs) offer the promise of adapting large pre-trained models while only tuning a small number of parameters. They have been shown to be competitive with full model fine-tuning for many downstream tasks. However, prior work indicates that PEFTs may not work as well for machine translation (MT), and there is no comprehensive study showing when PEFTs work for MT. We conduct a comprehensive empirical study of PEFTs for MT, considering (1) various parameter budgets, (2) a diverse set of language-pairs, and (3) different pre-trained models. We find that ‘adapters’, in which small feed-forward networks are added after every layer, are indeed on par with full model fine-tuning when the parameter budget corresponds to 10% of total model parameters. Nevertheless, as the number of tuned parameters decreases, the performance of PEFTs decreases. The magnitude of this decrease depends on the language pair, with PEFTs particularly struggling for distantly related language-pairs. We find that using PEFTs with a larger pre-trained model outperforms full fine-tuning with a smaller model, and for smaller training data sizes, PEFTs outperform full fine-tuning for the same pre-trained model.


Introduction
There has been enormous progress on scaling up neural machine translation (NMT) in recent years, resulting in 'massively multilingual' models that are capable of translating across many languages (Bapna et al., 2022). Most successful applications rely on sequence-to-sequence pre-training that (1) leverages web-scale monolingual data with a masking objective to build a multilingual backbone (parent) model (Liu et al., 2020; Song et al., 2019), or (2) directly targets a many-to-many NMT system by mining parallel corpora (Fan et al., 2020). Our code and scripts for reproducing the experiments are available at https://github.com/ahmetustun/fairseq.
Standard practice is to fine-tune every parameter of a pre-trained model to specialize it to a language pair (or domain) of interest (Zoph et al., 2016; Neubig and Hu, 2018). However, if we require specialization to many language pairs or domains, the storage and time costs of full fine-tuning may become prohibitive. Moreover, as models grow ever larger, more efficient methods become increasingly attractive.
As an alternative to full model fine-tuning, several parameter-efficient fine-tuning methods (PEFTs) have been proposed. Such methods only fine-tune a small number of parameters, reducing storage cost, and avoid calculating the gradients for every model parameter, reducing training time and memory cost. Examples include adapters (Houlsby et al., 2019; Bapna and Firat, 2019) and prefix-tuning (Li and Liang, 2021), which introduce a few extra parameters to fine-tune, keeping the pre-trained model fixed. Others, like BitFit (Zaken et al., 2021), tune only the bias vectors of the backbone model, and similarly Gheini et al. (2021) update only the cross-attention layers.
PEFTs can produce results that are competitive with full fine-tuning. For instance, adapters can match full fine-tuning performance on the GLUE benchmark using only 2-4% additional parameters (Houlsby et al., 2019). However, their potential for MT has not been fully explored. Prior studies indicate that PEFTs designed for classification tasks can fail for MT (Stickland et al., 2021a), and it is not known how source and target language characteristics affect PEFTs' performance.
In this work, we provide a comprehensive analysis of PEFTs for MT. For our analysis, we consider: (1) different pre-trained models, varying in size from 484 million to 1.2 billion total parameters, (2) several PEFTs, and (3) typologically and geographically diverse languages. Moreover, we vary the number of tuned parameters, resulting in different parameter 'budgets' for each fine-tuning experiment, ranging from 0.03% to 10% of the total number of model parameters. Our main research questions are:
RQ1: For a given parameter budget, which PEFT works best?
RQ2: How does language similarity affect the performance of PEFTs for different parameter 'budgets'?
RQ3: How does (i) the pre-training objective, and (ii) the size of the parent model affect the performance of PEFTs?
RQ4: Do PEFTs work better than fine-tuning for small dataset sizes?
Key Findings 1) We find that methods which introduce new parameters into a pre-trained model, namely adapters and prefix-tuning, give the best performance ( § 5.1). Of these, adapters retain good performance as the number of new parameters increases, while prefix-tuning falls behind. 2) We find large variation across language pairs in PEFTs' performance. Specifically, the distance between the source and target languages is negatively correlated with performance, especially for methods that tune the smallest number of parameters and methods that tune a subset of existing parameters (such as bias terms or cross-attention) ( § 5.2).

3) We observe that increasing model size, while keeping the same number of fine-tuned parameters, substantially increases MT performance ( § 5.3). Finally, 4) we observe that adapters perform better than full fine-tuning for small datasets, with the advantage for adapters increasing as the dataset size gets smaller ( § 5.4).

Background
This section briefly describes the two multilingual pre-trained models that we focus on in this work, namely mBART and M2M-100.
Multilingual Denoising Pre-training Multilingual BART, mBART (Liu et al., 2020), is a sequence-to-sequence transformer model (Vaswani et al., 2017) that consists of an encoder and an autoregressive decoder. It is pre-trained with a denoising objective, reconstructing a document from a noisy version. mBART uses span masking and sentence permutation to noise the original document. It consists of 12 encoder and 12 decoder layers, with hidden dimension of 1024 and 16 attention heads. mBART is trained entirely on monolingual data that includes multiple languages and it has a large multilingual vocabulary of 250k tokens.
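To make the denoising setup concrete, the following is a minimal, illustrative Python sketch of the noising procedure (sentence permutation plus span masking with Poisson-distributed span lengths); the helper name and hyper-parameter values are our own assumptions, not the exact mBART settings.

```python
import random
import numpy as np

def noise_document(sentences, mask_token="<mask>", mask_ratio=0.35, poisson_lambda=3.5):
    """Sketch of mBART-style noising: permute sentence order, then replace token
    spans (lengths drawn from a Poisson distribution) with a single mask token."""
    # (1) sentence permutation
    shuffled = sentences[:]
    random.shuffle(shuffled)
    tokens = [tok for sent in shuffled for tok in sent.split()]

    # (2) span masking until roughly mask_ratio of the tokens have been masked
    budget = int(len(tokens) * mask_ratio)
    while budget > 0 and len(tokens) > 1:
        span = min(max(1, np.random.poisson(poisson_lambda)), len(tokens) - 1)
        start = random.randrange(len(tokens) - span)
        tokens[start:start + span] = [mask_token]
        budget -= span
    return " ".join(tokens)
```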
In our experiments, we use mBART-50 (Tang et al., 2020) which was pre-trained on 50 languages.
Many-to-Many Multilingual MT The M2M-100 model (Fan et al., 2020) is a many-to-many multilingual translation system that is pre-trained on a large-scale parallel dataset covering 100 languages and 100×99 translation directions. This dataset is automatically constructed with a novel data-mining method based on language similarities and back-translation. The model is trained in a many-to-many fashion, balancing languages using Sinkhorn temperature sampling. In our experiments, we use the base-size M2M-100 with 484M parameters, which consists of 12 encoder and 12 decoder layers, a hidden dimension of 1024 and a feed-forward dimension of 4096. To study the effect of model size, we also use the medium-size M2M-100 with 1.2B parameters, which has 24 encoder and 24 decoder layers and a feed-forward dimension of 8192. Both models have a multilingual vocabulary of 128K unique tokens that are distributed across 100 languages with temperature sampling.
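For intuition, standard temperature-based sampling over per-language (or per-pair) corpus sizes can be written as below; this is a simplified sketch of the general idea, not the Sinkhorn-based procedure that M2M-100 actually uses.

```python
import numpy as np

def temperature_sampling_probs(corpus_sizes, T=5.0):
    """Sampling probabilities p_i proportional to (n_i / sum_j n_j) ** (1 / T).
    T = 1 reproduces the data distribution; larger T up-samples low-resource languages."""
    sizes = np.asarray(corpus_sizes, dtype=float)
    p = (sizes / sizes.sum()) ** (1.0 / T)
    return p / p.sum()
```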

Parameter Efficient Fine-tuning Methods
All of our experiments fall under the umbrella of specialising a pre-trained sequence-to-sequence transformer model for MT of a particular language pair, with source language x and target language y. If the pre-training task was MT, and x and y were included, then a lower bound is simply applying the pre-trained model without any changes. Conversely, an upper bound is fine-tuning 100% of the pre-trained model parameters ('full fine-tuning'). In between full fine-tuning and directly using the pre-trained model, we consider the following parameter-efficient fine-tuning methods (PEFTs) in this work:

Adapters (Houlsby et al., 2019; Bapna and Firat, 2019) add small feed-forward networks after every layer. An adapter module $A_\ell$ at layer $\ell$ consists of a layer-normalization $\mathrm{LN}$ of the input $h_\ell \in \mathbb{R}^d$, followed by a down-projection $W_d \in \mathbb{R}^{d \times b}$ with bottleneck dimension $b$, a non-linear function $f(\cdot)$ and an up-projection $W_u \in \mathbb{R}^{b \times d}$:
$$A_\ell(h_\ell) = W_u \, f\big(W_d \, \mathrm{LN}(h_\ell)\big)$$
Finally, a residual connection with the input $h_\ell$ is added to the output of the adapter: $h_\ell \rightarrow A_\ell(h_\ell) + h_\ell$. We write 'adapter-b' to mean adapters with bottleneck dimension b throughout this work.
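As a concrete illustration, a minimal PyTorch sketch of such an adapter module (module and argument names are ours, not tied to any particular codebase) could look as follows:

```python
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Bottleneck adapter: LN -> down-projection -> non-linearity -> up-projection,
    wrapped in a residual connection (a sketch of the formulation above)."""

    def __init__(self, d_model: int, bottleneck: int):
        super().__init__()
        self.layer_norm = nn.LayerNorm(d_model)
        self.down = nn.Linear(d_model, bottleneck)   # W_d
        self.up = nn.Linear(bottleneck, d_model)     # W_u
        self.activation = nn.ReLU()                  # f(.)

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # output is A(h) + h, with A(h) = W_u f(W_d LN(h))
        return self.up(self.activation(self.down(self.layer_norm(h)))) + h
```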
Prefix-tuning (Li and Liang, 2021) prepends a sequence of continuous task-specific vectors ('prefixes') to the model input, in analogy to natural language prompts (e.g. 'translate this sentence:') which the transformer can attend to, except that the prefix consists entirely of free parameters. For each transformer layer, the prefix is replaced with a new set of vectors, increasing the expressiveness of the method. Concretely, we replace the token embeddings by
$$E \rightarrow [V_0 ; E]$$
with $E \in \mathbb{R}^{L \times d}$ the original token embeddings packed into a matrix, $V_0 \in \mathbb{R}^{p \times d}$ the prefix vectors, $L$ the original sequence length, $p$ the prefix length and $d$ the model dimension. Before transformer layer $\ell$ we additionally set the first $p$ hidden states to a new prefix vector, i.e. $H_\ell[:p, :] = V_\ell$, with $H_\ell \in \mathbb{R}^{(L+p) \times d}$ the hidden states and $V_\ell \in \mathbb{R}^{p \times d}$.
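A minimal sketch of the input-level part of this method (prepending trainable prefix vectors to the token embeddings) is shown below; a full implementation would also overwrite the first p hidden states before every layer, as described above. Module and argument names are illustrative.

```python
import torch
import torch.nn as nn

class PrefixEmbedding(nn.Module):
    """Prepend p trainable prefix vectors (V_0 above) to the token embeddings."""

    def __init__(self, prefix_length: int, d_model: int):
        super().__init__()
        # free parameters, one vector per prefix position
        self.prefix = nn.Parameter(torch.randn(prefix_length, d_model) * 0.02)

    def forward(self, token_embeddings: torch.Tensor) -> torch.Tensor:
        # token_embeddings: (batch, L, d)  ->  (batch, L + p, d)
        batch_size = token_embeddings.size(0)
        prefix = self.prefix.unsqueeze(0).expand(batch_size, -1, -1)
        return torch.cat([prefix, token_embeddings], dim=1)
```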
BitFit (Zaken et al., 2021) Bias-term fine-tuning was introduced in the context of fine-tuning BERT for classification tasks, and consists of training only the bias terms and the task-specific classification layer. To use this method for MT, we additionally fine-tune all decoder bias terms, and do not need a classification head. We introduce a simple improvement to BitFit, based on replacing redundant parameters with ones that increase the expressiveness of the method. Note that BitFit fine-tunes the bias parameters in layer-norm (LN) modules (Ba et al., 2016), since the layer-norm contains the following affine transformation:
$$\mathrm{LN}_{\mathrm{aff}}(z) = \gamma \odot z + \beta$$
where $z$ is the normalized input after a residual connection, and $\gamma, \beta \in \mathbb{R}^d$ are the learnable weight and bias parameters of the LN module. For the standard transformer model, the LN module is always followed by a matrix multiplication plus a bias term, i.e. $W_m \cdot \mathrm{LN}_{\mathrm{aff}}(z) + b_m = W_m \cdot (\gamma \odot z) + W_m \cdot \beta + b_m$. Notice that the same space of functions is available by only updating the $b_m$ term in $W_m \cdot \beta + b_m$. We therefore switch to updating $\gamma$ instead of $\beta$, i.e. unfreezing the LN weight and freezing the bias, in order to increase expressiveness (confirmed empirically in § 5.1). We use this version of BitFit throughout this work unless stated otherwise.
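In PyTorch terms, this amounts to freezing all parameters and then re-enabling gradients only for bias terms, with the layer-norm weight swapped in for the layer-norm bias in our variant. A minimal sketch, assuming layer-norm parameters can be identified by name (naming conventions differ across codebases):

```python
import torch.nn as nn

def mark_trainable_bitfit(model: nn.Module, tune_ln_weight: bool = True) -> None:
    """Freeze everything, then unfreeze bias terms only. With tune_ln_weight=True,
    tune the layer-norm weight (gamma) instead of its bias (beta), as in our variant."""
    for name, param in model.named_parameters():
        is_layer_norm = "layer_norm" in name or "layernorm" in name.lower()
        if is_layer_norm:
            wanted = "weight" if tune_ln_weight else "bias"
            param.requires_grad = name.endswith(wanted)
        else:
            param.requires_grad = name.endswith("bias")
```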
X-attention Tuning (Gheini et al., 2021) refers to fine-tuning only the cross-attention (X-attention) and corresponding layer-norm parameters located in each decoder layer of a transformer model. This method is motivated by the importance of cross-attention for MT.
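Selecting the cross-attention parameters works the same way; the sketch below assumes fairseq-style parameter names, where the decoder's cross-attention module is called encoder_attn (other codebases use different names).

```python
import torch.nn as nn

def mark_trainable_xattention(model: nn.Module) -> None:
    """Freeze everything except the decoder cross-attention (and its layer-norm)."""
    for name, param in model.named_parameters():
        # matches e.g. 'decoder.layers.0.encoder_attn.*' and '...encoder_attn_layer_norm.*'
        param.requires_grad = "encoder_attn" in name
```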

Experiments
Datasets We conduct experiments with a selection of 12 typologically and geographically diverse languages, paired with English. In our experiments, we fine-tune the pre-trained model on only one language pair and translation direction at a time (e.g. Italian → English). The parallel data for all languages is from TED talks, in order to factor out the impact of domain differences (except for Finnish and Estonian, which we only use for a separate control experiment). To pick these languages, we consider variation in language families and scripts. More details of the datasets are given in Appendix A.

Figure 1: For an increasing parameter budget, does prefix-tuning or adapters work best (RQ1)? We show relative MT performance over full fine-tuning vs. number of fine-tuned parameters for mBART and M2M-100. b and p refer to adapter bottleneck dimension and prefix length respectively. Due to the large effective sequence length, we limit prefix-tuning experiments.
Experimental Settings We used mBART-50 (Liu et al., 2020; Tang et al., 2020) and M2M-100 (Fan et al., 2020) as our multilingual pre-trained models, and all the languages we experiment with are included in their pre-training data. mBART needs to learn machine translation from parallel data, but M2M-100 can also be used without fine-tuning, since it is initially pre-trained for MT (see § 2). We conduct experiments with both the base and the medium size M2M-100, to measure the impact of parent model size. For all fine-tuning methods, we fine-tuned models with a maximum learning rate of 1e-4 and 2500 warm-up steps for 100K training updates. We picked the best model based on dev set perplexity. We used a maximum batch size of 1024 tokens for mBART and 600 tokens for M2M-100, with a gradient accumulation step (update frequency) of 2 for both models. All experiments are performed with the fairseq (Ott et al., 2019) library. Additional details, including dataset splits, are in Appendix A.
We use BLEU scores to estimate MT quality, calculated with SacreBLEU (Post, 2018). To compare fine-tuning methods across different languages, we often report relative performance with respect to full fine-tuning (FT) for each language, calculated as the ratio of each method's BLEU score to the full FT BLEU score.

RQ1: Comparing fine-tuning methods
Table 1 shows the performance of PEFTs in terms of BLEU score for it→en and tr→en. In the table, each block (separated by a dashed line) consists of PEFTs with approximately the same number of updated parameters. Adapters outperform the other methods at almost all parameter budgets for both mBART and M2M-100, except the smallest budget, where only 120k parameters are updated. In this block, prefix-tuning (prefix-5) performs better than adapters for mBART. However, when the portion of fine-tuned parameters increases, as shown in Figure 1, prefix-tuning quickly falls behind adapters, confirming previous findings (He et al., 2021). Furthermore, in terms of training speed and memory cost, prefix-tuning slows down training relative to adapters (prefix-13 causes a 30% slow-down in training speed relative to adapter-5) and imposes a significant memory cost due to the large effective sequence length; see also Appendix B. As for the methods that fine-tune existing parameters, BitFit is better than adapters for mBART and worse for M2M, whereas X-attention performs worse than adapters in all cases. We confirm that our method of tuning layer-norm weights rather than biases improves performance, outperforming adapters for mBART on the language pairs in Table 1. Averaging across 10 language pairs, adapters still outperform BitFit for both parent models; see Figure 4.

Table 2: Pearson correlation coefficients between relative performance w.r.t. full fine-tuning and language distance. A negative correlation means that relative performance tends to decrease as the distance between source and target language increases. Numbers in italics are not statistically significant (p=0.05).

RQ2: Impact of language relatedness
In order to evaluate how language similarity between translation pairs affects the performance of different PEFTs, we extend our experiments to 10 languages paired with English (x→en, en→x), representing a diverse set of linguistic typologies. Figure 2 shows performance w.r.t. full fine-tuning, for mBART and M2M respectively (due to space limitations, Figure 2 shows results for x→en; en→x results are given in Appendix C). We found that the similarity between source and target languages impacts the performance of PEFTs, with distantly related languages (e.g. English and Korean) leading to lower performance for methods with a small number of updated parameters, such as BitFit and adapter-5. Consequently, when translating between distantly related languages, we need to tune more parameters to match full fine-tuning and get the most out of the parent model.
More concretely, relative performance w.r.t. full FT is negatively correlated with language distance measured by lang2vec, a Python package based on the URIEL typology database (Littell et al., 2017); for language distance, we compute the cosine distance between typological feature vectors consisting of syntactic, phonological and inventory features. These correlations are stronger for mBART than for M2M. Methods which tune existing parameters (X-attention and BitFit), as well as M2M with no fine-tuning, show a higher correlation than adapters with similar parameter budgets; see Table 2. One explanation could be that adding parameters, and thereby increasing model capacity with adapters, helps overcome the difficulty of translating between distant languages.
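The distance computation itself is simple; below is a minimal sketch assuming the lang2vec package and its precomputed 'knn' typological feature sets (the exact feature-set names used in the paper may differ).

```python
import numpy as np
import lang2vec.lang2vec as l2v

def typological_distance(lang1: str, lang2: str) -> float:
    """Cosine distance between URIEL typological feature vectors (ISO 639-3 codes)."""
    feats = l2v.get_features([lang1, lang2], "syntax_knn+phonology_knn+inventory_knn")
    v1 = np.array(feats[lang1], dtype=float)
    v2 = np.array(feats[lang2], dtype=float)
    return 1.0 - float(v1 @ v2 / (np.linalg.norm(v1) * np.linalg.norm(v2)))

# e.g. typological_distance("eng", "kor") should exceed typological_distance("eng", "ita")
```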
To investigate whether our findings extend beyond English-centric settings, we designed another set of experiments. We picked 3 languages from MultiParaCrawl: Finnish, Estonian and English, where Finnish and Estonian are from the same language family and typologically similar. We measure translation performance into Finnish from Estonian and English for different fine-tuning methods, and similarly for translation into Estonian. Figure 3 shows results for both mBART and M2M-100.
As shown in the first two plots, when translating into Finnish, Estonian as the source language gives an advantage over English for BitFit and adapter-5 (this advantage is larger for M2M-100 than for mBART). Likewise, for translation into Estonian, as the number of trainable parameters decreases, relative MT performance drops less when Finnish is the source language compared to English, for both parent models. Thus, when the source and target languages are typologically similar, PEFTs make better use of the parent model.

RQ3: Impact of parent model
Pre-training Objective Figure 4 shows the overall performance of PEFTs aggregated over all languages (x↔en) when the model is initialized with mBART or M2M-100. In general, PEFTs provide higher relative performance for M2M-100 than for mBART (Fig. 4). This difference is larger when the number of trainable parameters is small (BitFit and adapter-5). While M2M-100 is pre-trained for MT with parallel data, mBART is pre-trained with a monolingual denoising objective. Thus, more parameters are required at fine-tuning time to 'learn' the MT task for mBART. Finally, we note that mBART results have higher variance than M2M-100 (see Fig. 4), due to the stronger negative correlation with language distance.

Model Size
We investigate how parent model size affects the performance of fine-tuning methods, comparing M2M-100's base model (484M) to its medium model (1.2B). Results without any fine-tuning ('no FT') are also shown, representing lower bounds.
Predictably, the medium model outperforms the base model across all fine-tuning methods. The magnitude of this improvement is larger when translating into English (x→en) than out of English (en→x), and the increase for small adapters (0.07% of 484M and 0.03% of 1.2B total parameters) is larger than for the other methods. When translating into English, small adapters with the medium model outperform full fine-tuning of the base model for most languages, despite tuning only 0.03% of the parent model's parameters. For en→x, small adapters are still competitive with full fine-tuning of the base model, with almost the same average performance. But for languages distantly related to English (Farsi, Korean and Turkish), the adapters' (1.2B) performance falls behind full fine-tuning of the base model.
When used without any parameter updates ('no FT'), the medium model (while outperforming the base model in the no-FT setting) is not competitive with small adapters on the base model, in either direction (x↔en). Furthermore, relative performance w.r.t. full fine-tuning is still negatively correlated with language distance (see Appendix Table 5). Therefore, even at large scale, parameter-efficient fine-tuning is useful, taking MT performance to the upper bound of a smaller model.

RQ4: Impact of fine-tuning dataset size
We noticed that for the datasets with the smallest amount of training data (Vietnamese and Czech), PEFTs outperformed full fine-tuning (see Appendix C). We therefore designed a control experiment to test the effect of training data size on PEFTs' performance, taking random subsets of 2,000, 8,000, 32,000 and 128,000 training examples for Italian→English and Turkish→English. We then evaluated full fine-tuning, large adapters (≈50M parameters) and small adapters (≈300k parameters) on each subset; see Figure 5.
For all models, at the smallest dataset size, large adapters outperformed full fine-tuning, and for M2M full fine-tuning only catches up at 128k examples. For mBART, small adapters lag far behind, indicating they do not provide enough capacity to 'learn' the MT task. For M2M however, small adapters are on a par with larger ones for small dataset sizes, but fall behind as dataset size increases. Again, we believe this is because more capacity is needed to get the most out of larger datasets.
Chen et al. (2022) explore the effect of fine-tuning dataset size for RoBERTa fine-tuned on English NLU tasks, finding that PEFTs outperform full fine-tuning for dataset sizes below 1000. Interestingly, for mBART, similarly small dataset sizes are required to outperform full fine-tuning. However, for M2M, we see adapters outperforming full fine-tuning up to dataset sizes of ≈128k. Perhaps the 'gap' between RoBERTa's masked-language-model pre-training objective and the fine-tuning task is similar to the gap between mBART's pre-training objective and MT, whereas since M2M is pre-trained for MT, leaving the base model unchanged is viable up to larger fine-tuning dataset sizes. We leave further exploration of this to future work. Finally, we observe that full fine-tuning always converges in fewer iterations than the adapter methods, a result similar to that of Chen et al. (2022).

Conclusion
Do PEFTs work for MT? We found that the answer depends on multiple factors: the particular method, the backbone model, the number of tuned parameters and the fine-tuning language pair. Adapter layers usually have the highest performance of all parameter-efficient fine-tuning methods ( § 5.1), although for the smallest parameter budgets we consider, prefix-tuning outperforms adapters for mBART. For large parameter budgets (≈50M parameters), adapters almost recover full fine-tuning performance, and even for lower budgets, if the pre-training task was MT (i.e. M2M-100), adapters can recover >90% of full FT performance. However, PEFTs only outperform full FT for smaller dataset sizes ( § 5.4): less than ≈2k examples for mBART and ≈128k for M2M. Future work could explore in detail how the difference between the pre-training objective and the fine-tuning task affects this phenomenon.
Using PEFTs with a larger model (M2M-100 medium size) can outperform full FT of a smaller model (M2M-100 base size). However, when translating in the en→x direction where x is distantly related to English (e.g. Korean), full FT is superior ( § 5.3). More generally, distantly related language pairs require more parameters to be tuned to get close to full FT, for all methods ( § 5.2). Although we attempted to cover a diverse set of languages, future work could explore truly low-resource languages, and those not included in the pre-training data of our models, where one would expect even larger performance gaps.

A Experimental Details
We used a maximum learning rate of 1e-4 with polynomial learning-rate decay, based on the adapter-tuning experiments of (2021). We fine-tune models with a dropout of 0.3, label smoothing of 0.2, and 2500 warm-up steps for 100K training updates, with an early-stopping patience of 10 epochs. We used a maximum batch size of 1024 tokens for mBART and 600 tokens for M2M-100, with a gradient accumulation step (update frequency) of 2 for both models. For full fine-tuning (and not the other methods) with the 1.2-billion-parameter M2M model we use the Adafactor optimizer (Shazeer and Stern, 2018) in order to save memory (and use a learning rate of 5e-5); otherwise we use the Adam optimizer (Kingma and Ba, 2014). We report the result of a single random seed/training run throughout this work whenever we list BLEU scores. All parameter-efficient fine-tuning methods are implemented on top of the Fairseq framework (Ott et al., 2019). We will share our code and scripts to reproduce all experiments.
Computing Budget and Infrastructure All the experiments are conducted using Tesla V100 GPUs with mixed precision (fp16). Parameters that are fine-tuned for each model are reported in the experiments section ( § 4). Each individual experiment took 3-10 hours on one GPU depending on the fine-tuning method and the language-pair.

B Prefix-tuning Details
There is a trade-off between memory cost and training time for prefix-tuning: including virtual tokens in a sentence increases the effective length of that sentence, so we can either incur additional memory cost for the virtual tokens, or reduce the total number of 'real' (i.e. natural-language, as opposed to virtual) tokens in each batch. With the latter method we avoid a large memory cost; however, the time taken to iterate through a given number of training examples will be longer, since the number of real tokens per batch is decreased, increasing training time. We use the latter (decreased 'real' tokens) method in all experiments.
Finally, we note that inference speed will decrease as we increase the number of virtual tokens, since the decoder attention mechanism needs to attend to the virtual tokens, i.e. when decoding token n it attends to n − 1 + p previous tokens for prefix length p.

C Additional Results and Metrics
Table 6 shows chrF scores for the experiments comparing different PEFTs on it→en and tr→en (Table 1). These results confirm that the trends discussed in Section 4 are the same regardless of the metric used to measure MT quality.
In Tables 7, 8 and 9, we show BLEU scores for the experiments that are presented in the main paper only in terms of performance relative to full FT. Additionally, we show adapter-1024 and X-attention scores for M2M-100; in general, adapter-1024 outperforms X-attention, and both methods come close to full FT performance or slightly outperform it. Note that for M2M, for the two smallest dataset sizes (cs and vi), we see adapter-1024 (and adapter-2 for the medium-size M2M) outperforming full fine-tuning, similarly to § 5.4.

Table 9: (en, et, fi) results in terms of BLEU for M2M-100 and mBART experiments. Note that BLEU scores are not directly comparable, as the datasets differ for each language pair. For a comparison between fine-tuning methods, we refer to the relative performance over full fine-tuning (Fig. 3).

In Table 8 we show results for a smaller (40M-parameter) transformer model trained from scratch on each dataset separately, with an architecture consisting of 6 encoder and 6 decoder layers, a hidden dimension of 512 and a feed-forward dimension of 1024. We train a separate sentencepiece (Kudo and Richardson, 2018) vocabulary for each dataset, shared between the source and target language, with a size of approximately 16k. Training hyper-parameters were the same as for our other experiments.
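As an illustration of the shared-vocabulary training step, a hedged sketch using the sentencepiece Python API is given below; the file names are placeholders, not the actual paths used in our experiments.

```python
import sentencepiece as spm

# train a joint source+target vocabulary on concatenated training text
spm.SentencePieceTrainer.train(
    input="train.src-tgt.txt",   # source and target sentences, one per line (placeholder path)
    model_prefix="spm_shared",
    vocab_size=16000,
    character_coverage=1.0,
)

# tokenize with the trained model
sp = spm.SentencePieceProcessor(model_file="spm_shared.model")
print(sp.encode("an example sentence", out_type=str))
```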