Causes and Cures for Interference in Multilingual Translation

Multilingual machine translation models can benefit from synergy between different language pairs, but also suffer from interference. While there is a growing number of sophisticated methods that aim to eliminate interference, our understanding of interference as a phenomenon is still limited. This work identifies the main factors that contribute to interference in multilingual machine translation. Through systematic experimentation, we find that interference (or synergy) is primarily determined by model size, data size, and the proportion of each language pair within the total dataset. We observe that substantial interference occurs mainly when the model is very small with respect to the available training data, and that using standard transformer configurations with fewer than one billion parameters largely alleviates interference and promotes synergy. Moreover, we show that tuning the sampling temperature to control the proportion of each language pair in the data is key to balancing the amount of interference between low- and high-resource language pairs effectively, and can lead to superior performance overall.


Introduction
Multilingual machine translation models can benefit from transfer between different language pairs (synergy), but may also suffer from interference (Ha et al., 2016; Firat et al., 2016; Aharoni et al., 2019; Arivazhagan et al., 2019). While there are methods to reduce interference and achieve better performance (Wang et al., 2020a; Kreutzer et al., 2021), such approaches are often compute intensive and do not always work (Xin et al., 2022). In this work, we demonstrate that interference in multilingual translation largely occurs when the model is very small compared to the abundance of training data, and that the simple, principled approach of enlarging the model and tuning the data sampling temperature provides a consistent solution to the interference problem that can even promote synergy.
This work methodically deduces the simplest ways of reducing interference in multilingual translation. We begin by asking which factors dominate the learning of a particular language pair of focus, s → t, in the context of learning a multilingual translation model with many different language pairs. Controlled experiments show that besides model size and the number of s → t training examples, the main factor that correlates with the level of interference is the proportion of focus-pair examples (s → t) observed out of the total number of examples (across all language pairs) seen at each training step on average. Surprisingly, aspects like language similarity or the number of translation directions have a much smaller effect.
In model and data scaling experiments, we observe that interference mainly occurs in extreme parameter poverty, when the language pair of focus is data-rich but has to "share" a crowded parameter space with large quantities of other data. Enlarging the model to standard sizes from the machine translation literature alleviates interference and even facilitates synergy. For context, given a language pair with 15M sentence pairs that accounts for 20% of the total training data (75M), we observe severe levels of interference with 11M- and 44M-parameter transformers, but no interference when scaling the model to 176M parameters (the "big" model of Vaswani et al. (2017)), and significant synergy with 705M parameters. Interestingly, when the model is large enough, we find that increasing the amount of non-focus data up to a certain point can further increase synergy.
Finally, given the evidence that data sizes and ratios strongly correlate with interference, we experiment with a natural lever that controls the proportion of each dataset in the overall mix in the simplest way: the sampling temperature. Indeed, we find that calibrating the distribution of language pairs via temperature can substantially reduce the amount of interference for both high- and low-resource language pairs. Our results demonstrate the importance of tuning the temperature hyperparameter in multitask training, and suggest that previously reported accounts of severe interference in multilingual translation models might stem from suboptimal hyperparameter configurations.

Measuring Interference
We assume a common multilingual translation setup that involves L language pairs s → t, where the source is always the same language s (English) and the target language t varies (English-to-many), or vice versa (many-to-English). The overall training data is a union of these training subsets, whose sizes we denote by D_{s→t}. Sampling a training example x follows the distribution:

P(x ∈ s → t) = D_{s→t}^{1/T} / Σ_{s'→t'} D_{s'→t'}^{1/T}    (1)

where T is the temperature hyperparameter (Devlin et al., 2019; Arivazhagan et al., 2019). T = 1 maintains the original data proportions, 0 < T < 1 starves low-resource language pairs, and T > 1 increases their representation in the training distribution. We mostly focus on the English-to-many setting, in which interference is more apparent.

We define interference as a negative interaction between different translation directions in a multilingual translation model. It is measured for a specific translation direction s → t as the relative difference in performance (test-set cross-entropy loss) between a bilingual model trained to translate only from s to t (L^bi_{s→t}) and a multilingual counterpart that is additionally trained to translate other directions (L^multi_{s→t}):

I_{s→t} = (L^bi_{s→t} - L^multi_{s→t}) / L^bi_{s→t}    (2)

Negative values of I_{s→t} indicate interference, while positive values indicate synergy.
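Both quantities defined in this section can be computed directly. The sketch below (with illustrative dataset sizes, not the paper's actual data) shows the temperature-scaled sampling distribution and the relative-loss interference measure:

```python
def sampling_probs(sizes, T=1.0):
    """Temperature-scaled sampling distribution over language pairs.

    sizes: dict mapping pair name -> number of training examples D_{s->t}.
    T=1 keeps raw proportions; T>1 flattens the distribution toward uniform.
    """
    weights = {pair: d ** (1.0 / T) for pair, d in sizes.items()}
    total = sum(weights.values())
    return {pair: w / total for pair, w in weights.items()}

def interference(loss_bi, loss_multi):
    """Relative loss difference I_{s->t}: negative = interference, positive = synergy."""
    return (loss_bi - loss_multi) / loss_bi

# Illustrative sizes (not the paper's data).
sizes = {"en-es": 8_000_000, "en-et": 1_000_000, "en-gu": 100_000}
p1 = sampling_probs(sizes, T=1.0)
p5 = sampling_probs(sizes, T=5.0)
# Higher temperature boosts the low-resource pair's share.
assert p5["en-gu"] > p1["en-gu"]
# Multilingual loss lower than bilingual loss -> positive value (synergy).
assert interference(loss_bi=2.0, loss_multi=1.8) > 0
```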

Experimental Setup
Models We train encoder-decoder Transformer models (Vaswani et al., 2017) in four sizes, referred to below as XS (11M parameters), S (44M), M (176M, the "big" configuration of Vaswani et al. (2017)), and L (705M).
Data Our benchmark is based on WMT data and covers 15 languages paired with English, with roughly 200M training examples overall. Table 2 provides additional dataset statistics.
Tokenization We build a shared vocabulary of 64K BPE tokens with sentencepiece (Kudo and Richardson, 2018) using a sampling temperature of 5 to increase the lower resource languages' representation. We use this vocabulary for all our experiments. We also add language ID tokens to our vocabulary, which are prepended to each source and target sequence to indicate the target language (Johnson et al., 2017).
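As a minimal illustration of the tagging scheme, the sketch below prepends a target-language ID token to each source and target sequence in the style of Johnson et al. (2017); the `__xx__` token spelling is an assumption for illustration, not necessarily the exact special tokens used here:

```python
def add_lang_tokens(src, tgt, tgt_lang):
    """Prepend a target-language ID token to both source and target sequences.

    The "__xx__" token format is a hypothetical choice for illustration; the
    actual spelling of the special tokens depends on vocabulary construction.
    """
    tag = f"__{tgt_lang}__"
    return f"{tag} {src}", f"{tag} {tgt}"

src, tgt = add_lang_tokens("Hello world", "Bonjour le monde", "fr")
assert src == "__fr__ Hello world"
assert tgt == "__fr__ Bonjour le monde"
```

In this scheme a single shared model can be steered toward any target language at inference time simply by choosing the prepended tag.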
Training We use Fairseq (Ott et al., 2019) to train transformer models with the Adam optimizer (Kingma and Ba, 2015) for up to 100K steps, with a dropout rate of 0.1, an inverse square root learning rate schedule with a maximum learning rate of 0.004, 8K warmup steps, and a batch size of 256K tokens. We choose the best checkpoint according to the average validation loss over all language pairs.


What Impacts Interference in Multilingual Translation?
We consider five factors that may potentially impact the performance of a given language pair s → t in the multilingual translation setting: (1) model size; (2) the s → t training data size, D_{s→t}; (3) the proportion of s → t examples observed during training, P(x ∈ s → t); (4) the total number of languages L; and (5) the similarity between s → t and other pairs. In the experiments we describe next, we provide empirical evidence that the last two factors do not actually have a significant effect on the level of interference, and can therefore be pruned away. Subsequent experiments reveal that interference is indeed a function of model size, data size, and data proportion. Most striking is the fact that, across various data settings, enlarging the model to standard sizes consistently alleviates interference and may even promote synergy.

Does Language Similarity Matter?
Intuitively, data from languages that humans perceive as similar (e.g. languages that have some degree of mutual intelligibility, exhibit similar linguistic properties, or have shared vocabularies) should have a more positive effect on translation quality compared to data from distinct languages (Lin et al., 2019; Wang et al., 2020b). To test this, we fix a focus language and train trilingual models to translate from English to two languages: the focus language and an additional interfering language. We then look at interference trends as we vary the interfering language. Table 3 provides an overview of the language similarity experiments.
Results Figure 1a shows the interference rate for every model size when Spanish has only 118K parallel examples (left) and when using the full English-Spanish dataset (right). The variance in results somewhat correlates with language similarity when the dataset is very small, which aligns with previous work (Lin et al., 2019); French seems to help Spanish more than other languages when the model is big enough, while Chinese helps less. However, when training with the full dataset, the differences between interfering languages diminish for all model sizes. We observe similar trends when Estonian is the focus language. Figure 1b shows that when Estonian has only 118K training examples, combining with Finnish data seems to have some positive effect. However, this effect also shrinks when using all of the English-Estonian training set (only 2.2M examples, compared to the 15.2M of English-Spanish) and a model that is not too small.

Does the Number of Languages Matter?
Do we get more interference when training with one interfering language pair or with fourteen? We train models with varying numbers of language pairs while controlling for the overall number of interfering examples. We find that splitting the interfering data across more language pairs has a mild positive effect, which diminishes as the amount of focus-language data and/or model parameters scales up. Concretely, we add interfering examples up to a fixed budget of 15.2M examples, distributed as evenly as possible among the different languages. We repeat these experiments with Estonian as the focus language and an interfering example budget of 6.6M. Table 4 provides an overview of these experiments.
Results Figure 2a shows that having more than one interfering language pair somewhat helps when English-Spanish has few training examples, but this effect largely disappears with the full training set and with larger models. We see similar trends for Estonian in Figure 2b, even though its full training set has only 2.2M examples. This phenomenon might be related to the fact that when the data distribution is sharp (i.e., one high-resource pair combined with one very low-resource pair), there is not enough incentive for the model to pay attention to the focus language's identifier token, compared to when the distribution is much more uniform. This result also corroborates similar findings for pretrained multilingual models (Conneau et al., 2020), although those experiments did not control the total quantity of data as ours do (see Figure 6 in Appendix A for absolute BLEU scores).

The Impact of Model and Data Size
Seeing that language similarity and the number of interfering languages have only a limited effect on interference, we design a controlled setup to measure interference as a function of the remaining three factors: model size, focus language data size, and its proportion in the total amount of data seen during training.
Setup We train models using all of the available 15.2M English-Spanish examples, with an increasing example budget for interfering language pairs, ranging from 1/8 (1.9M) to 8 times (122M) the English-Spanish data, divided as evenly as possible between French, Czech, Russian, and Chinese. To observe trends across D_{s→t} sizes, we rerun these experiments with a quarter (3.8M) of the English-Spanish data, while keeping the ratios with the rest of the data similar. Finally, we also conduct these experiments in the many-to-English setting.
Results Figures 3a and 3b show the interference and synergy for English-Spanish with a varying number of interfering examples. For smaller models (XS and S), increasing the amount of interfering data (i.e., decreasing the proportion of focus data) exacerbates interference. However, larger models appear to benefit from significant quantities of interfering examples; for instance, when training with D_{s→t} = 3.8M, a large model (L) can gain over 10% relative loss improvement when there is 32 times more interfering data than focus data (P(x ∈ s → t) ≈ 3%). Interestingly, we also observe that interference is sensitive to the ratio between model parameters and focus data. Overall, scaling up the model not only alleviates interference, but also introduces substantial gains from synergy. Our results align with trends observed in cross-lingual transfer when scaling pretrained multilingual models to 3.5 and 10 billion parameters (Goyal et al., 2021).
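As a quick sanity check of the proportion quoted above: with 3.8M focus examples and 32 times as much interfering data, the focus share at T = 1 works out to exactly 1/33, i.e. about 3%:

```python
# Worked check: 3.8M focus examples plus 32x as much interfering data
# gives a focus-pair proportion of 1/33 at temperature T = 1.
focus = 3.8e6
interfering = 32 * focus
p_focus = focus / (focus + interfering)
assert abs(p_focus - 1 / 33) < 1e-12
assert round(100 * p_focus, 1) == 3.0  # ~3% of examples per step on average
```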

Tuning Interference with Temperature
In the previous sections we demonstrated that the dominant factors impacting interference are the model size, the amount of focus language pair data D s→t , and the proportion of focus pair examples observed during training P (x ∈ s → t). In a practical situation where both model size and multilingual data are fixed, how can one control the level of interference? Recalling Equation 1, we observe that the proportion of focus pair examples P (x ∈ s → t) is controlled via the temperature hyperparameter T . Although previous literature has largely used a value of T = 5 following Arivazhagan et al. (2019), our systematic experiments with different temperatures across three different data distributions and four model sizes suggest that this value can be sub-optimal and induce a substantial amount of interference, especially for model sizes that alleviate significant amounts of interference (M and L). Conversely, tuning the temperature shows that lower values (T = 1, 2) are typically able to reduce high-resource interference without harming low-resource synergy in our standard multilingual translation setting.
Setup We train models of four sizes with temperatures ranging from 1 to 5 on three training distributions: (1) all available training data; (2) discarding three high-resource languages (Czech, French, and Russian); (3) discarding four low-resource languages (Latvian, Lithuanian, Romanian, and Hindi). When illustrating the results, we assign languages to the high- or low-resource group according to whether their relative data proportion decreases or increases when going from T = 1 to T = 2.
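The assignment rule above is easy to make concrete: a pair counts as high resource exactly when its sampled share shrinks as T goes from 1 to 2. A small sketch, using illustrative sizes rather than the paper's actual distribution:

```python
def sampling_probs(sizes, T):
    """Temperature-scaled sampling distribution over language pairs."""
    weights = {k: d ** (1.0 / T) for k, d in sizes.items()}
    total = sum(weights.values())
    return {k: w / total for k, w in weights.items()}

def split_high_low(sizes):
    """Label pairs whose share shrinks when going from T=1 to T=2 as high resource."""
    p1, p2 = sampling_probs(sizes, 1.0), sampling_probs(sizes, 2.0)
    high = {k for k in sizes if p2[k] < p1[k]}
    return high, set(sizes) - high

# Illustrative sizes (not the paper's data).
sizes = {"en-fr": 30e6, "en-es": 15e6, "en-et": 2e6, "en-gu": 0.1e6}
high, low = split_high_low(sizes)
assert "en-fr" in high and "en-gu" in low
```

Note that the split depends on the full distribution, not on a fixed size threshold: raising T redistributes probability mass from the largest pairs toward the smallest ones.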
Results Figure 4 shows the trade-offs between the lower and higher resource languages, as defined above. First, we can see a clear trade-off for the smaller models (XS and S) from T = 1 to T = 4 in most cases: increasing T promotes synergy for the low-resource languages at the cost of increased interference for the high-resource languages. However, the larger models (M and L) clearly degrade when using T ≥ 3; in fact, values of T = 1 and T = 2 are often better for both high- and low-resource language pairs than the commonly-used T = 5. These results align with recent work by Xin et al. (2022) showing that tuned scalarization is key to achieving strong bilingual baselines that often outperform more complicated multitask optimization methods (see Table 5 in Appendix A for absolute BLEU scores).

Related Work
Scaling Laws in Machine Translation Previous work has also looked at scaling trends of data and model sizes for machine translation. Gordon et al. (2021) proposed scaling laws in data and model parameters and demonstrated their ability to predict the validation loss of bilingual translation models from Russian, Chinese, and German to English. Ghorbani et al. (2022) found scaling laws for different encoder and decoder configurations, independently varying the number of layers in each. Bansal et al. (2022) examined different architectures and described data size scaling laws for machine translation at large scale for English-to-German and English-to-Chinese. While all of these works focused on the bilingual setting, we unveil trends for multilingual translation, which has increased complexity. Concurrently to our work, Fernandes et al. (2023) proposed scaling laws for multilingual machine translation, focusing on trilingual models trained on English-German paired with English-Chinese or English-French.

Multitask Methods for Multilingual Machine Translation Multitask methods have been proposed extensively to enhance the performance of multilingual translation models. Some utilize validation-based signals to determine which language pairs should be prioritized throughout training, either with adaptive scheduling, gradient similarities to the validation set (Wang et al., 2020a), or a multi-armed bandit model (Kreutzer et al., 2021). Zhu et al. (2021) added dedicated embedding and layer adapter modules to the Transformer, and Lin et al. (2021) suggested learning a binary mask for every model parameter and every language pair; both require further training after the base multilingual model converges.
Li and Gong (2021) used per-language gradient geometry to rescale the gradients of different language pairs and improve performance on low-resource languages. Others extended PCGrad (Yu et al., 2020) to create Gradient Vaccine, a method that attempts to deconflict the gradients of different language pairs by replacing them with vectors that are more similar in terms of cosine similarity. While the motivation for these methods is clear and intuitive, they are usually more complex and computationally expensive than the baseline. Moreover, their efficacy is often demonstrated using relatively small models (Transformer-base or big from Vaswani et al. (2017)), while modestly increasing the model size can both strengthen the bilingual baselines and reduce the interference problem significantly.

Critical Takes on Multitask Optimization Methods Multitask optimization methods have recently come under scrutiny. Kurin et al. (2022) experimented with many of them on image classification and reinforcement learning problems, and found that none consistently outperformed a well-tuned baseline with proper use of known regularization techniques. Similarly, Xin et al. (2022) showed that despite their increased complexity, no popular multitask method was superior to a sweep over scalarization weights for a baseline trilingual translation model. This work complements that line of research by examining multilingual translation models and how modest scale and a calibrated temperature can reduce problems associated with multitasking.

Conclusion
This work examines the dominant factors that influence interference in multilingual machine translation: the model size, the amount of parallel data for the focus language pair, and the proportion of focus-pair examples relative to the total data seen during training. While specialized multitask techniques are sometimes demonstrated on small transformer models, we find that a standard baseline model of 176M parameters reduces the interference problem significantly, and further scaling up results in synergy among the different language pairs. We further demonstrate the importance of tuning the temperature at which different language pairs are sampled during training; while the existing literature largely relies on high temperatures, which indeed improve low-resource performance in parameter-poor settings, larger models benefit from a more natural distribution that reflects the raw training data. These simple strategies for addressing interference call into question the necessity, and perhaps even the validity, of recently-proposed complex anti-interference methods, and reaffirm the tried-and-true approach of increasing model capacity to accommodate greater data diversity.

Limitations
One limitation of this work is its focus on English-to-many and many-to-English settings, while previous studies have also gone beyond English-centric translation (Freitag and Firat, 2020; Fan et al., 2022). Second, we experiment with a WMT-based benchmark with a total of 15 languages and 200M training examples, whereas translation models have also been trained on larger datasets (Aharoni et al., 2019; Arivazhagan et al., 2019; NLLB Team et al., 2022).
We leave questions about the amount of scale that will be required to effectively mitigate interference in massively (many-to-many, billions of parallel sequences) multilingual settings for future work.
Additionally, the data collected for high-resource languages may be of higher quality than that collected for low-resource languages. Further research is needed to determine the impact of low-quality training data on interference and synergy. Finally, while we explore trends when scaling model width, deeper models (Ghorbani et al., 2022) might help mitigate interference even further.

A BLEU Scores
Throughout the paper we measure interference in terms of test loss values. Here we additionally provide the test BLEU scores achieved by our models. We generate translations using beam search with 5 beams and no length penalty, and use SacreBLEU (Post, 2018) to calculate BLEU (Papineni et al., 2002) scores on the test sets.
Language similarities Figure 5 shows BLEU scores of models from experiments in Section 4.1.
They reflect similar trends: the variance between different interfering languages, visible when the focus language has only 118K examples, diminishes when a decent amount of training data is available.
Number of languages Figure 6 shows BLEU scores of models from experiments in Section 4.2. They also demonstrate that low-resource pairs benefit when there are more interfering languages, but this effect disappears with a decent amount of training data.