Towards a Better Understanding of Variations in Zero-Shot Neural Machine Translation Performance

Multilingual Neural Machine Translation (MNMT) facilitates knowledge sharing but often suffers from poor zero-shot (ZS) translation quality. While prior work has explored the causes of overall low ZS performance, our work introduces a fresh perspective: the presence of high variations in ZS performance. This suggests that MNMT does not uniformly exhibit poor ZS capability; instead, certain translation directions yield reasonable results. Through systematic experimentation involving 1,560 language directions spanning 40 languages, we identify three key factors contributing to high variations in ZS NMT performance: 1) target side translation capability, 2) vocabulary overlap, and 3) linguistic properties. Our findings highlight that target side translation quality is the most influential factor, with vocabulary overlap consistently impacting ZS performance. Additionally, linguistic properties, such as language family and writing system, play a role, particularly with smaller models. Furthermore, we suggest that the off-target issue is a symptom of inadequate ZS performance, emphasizing that zero-shot translation challenges extend beyond addressing the off-target problem. We release the data and models, which serve as a benchmark for future research on zero-shot NMT, at https://github.com/Smu-Tan/ZS-NMT-Variations


Introduction
Multilingual Neural Machine Translation (MNMT) has shown great potential in transferring knowledge across languages, but often struggles to achieve satisfactory performance in zero-shot (ZS) translation. Prior efforts have focused on investigating causes of overall poor zero-shot performance, such as the impact of model capacity (Zhang et al., 2020), initialization (Chen et al., 2022; Gu et al., 2019; Tang et al., 2021; Wang et al., 2021), and how a model's forgetting of language labels can affect ZS performance (Wu et al., 2021; Raganato et al., 2021).
Figure 1: SpBleu distribution for English-centric and zero-shot directions. The y-axis denotes the percentage of directions whose performance surpasses the value on the x-axis. Condition-1 refers to resource-rich ZS directions where the source and target language share linguistic properties, while condition-2 includes all other ZS directions.
In contrast, our work introduces a fresh perspective within zero-shot NMT: the presence of high variations in zero-shot performance. This phenomenon suggests that certain ZS translation directions can closely match their supervised counterparts, while others exhibit substantial performance gaps. We show that this phenomenon holds both for English-centric systems and for systems going beyond English-centric data (e.g., m2m-100 models). This raises the question: which factors contribute to variations in zero-shot translation quality?
Through systematic and comprehensive experimentation involving 1,560 language directions spanning 40 languages, we identify three key factors contributing to pronounced variations in zero-shot NMT performance: 1) target side translation capability, 2) vocabulary overlap, and 3) linguistic properties. More importantly, our findings are general regardless of the resource level of the languages involved and hold consistently across various evaluation metrics, spanning word, sub-word, character, and representation levels. Drawing from our findings, we offer potential insights to enhance zero-shot NMT.
Our investigation begins by assessing the impact of supervised translation capability on zero-shot performance variations. This is achieved by decomposing the unseen ZS direction (Src→Tgt) into two seen supervised directions via the pivot language English. For instance, in English-centric systems, Src→Tgt can be decomposed into Src→En (source side translation) and En→Tgt (target side translation). Our findings show that target side translation quality significantly impacts ZS performance and explains the variations best. Surprisingly, source side translation quality has a very limited impact.
Moreover, our analysis demonstrates the substantial impact of linguistic properties, i.e., language family and writing system, in elucidating the variations in zero-shot performance. Figure 1 highlights this conclusion by showing the much stronger ZS performance of resource-rich ZS directions with similar linguistic properties compared to other directions. Intriguingly, our investigation also shows that the impact of linguistic properties is more pronounced for smaller models. This suggests that larger models rely less on linguistic similarity when engaging in ZS translation, extending prior research on the impact of model capacity on ZS NMT (Zhang et al., 2020). Furthermore, we find that language pairs with higher vocabulary overlap consistently yield better zero-shot capabilities, suggesting a promising avenue for improving ZS NMT. In addition, while Zhang et al. (2020) assert that the off-target issue is a primary cause impairing zero-shot capability, we conclude that the off-target issue is more likely a symptom of poor zero-shot translation quality rather than the root cause. This is evidenced by the fact that small off-target rates (below 5%) do not necessarily result in high ZS capabilities.
Lastly, we argue that prior research on zero-shot NMT is limited by focusing on only 1% of all possible ZS combinations (Aharoni et al., 2019; Pan et al., 2021; Tang et al., 2021) or by prioritizing resource-rich language pairs (Yang et al., 2021; Raganato et al., 2021; Chen et al., 2022; Zhang et al., 2020; Qu and Watanabe, 2022). To overcome these limitations, we create the EC40 MNMT dataset for training purposes and utilize multi-parallel test sets for fair and comprehensive evaluations. Our dataset is the first of its kind to consider real-world data distribution and diverse linguistic characteristics, serving as a benchmark to study ZS NMT.

MNMT corpus
Current MNMT studies mainly utilize two types of datasets: English-centric (Arivazhagan et al., 2019b; Yang et al., 2021), which is by far the most common approach, and, more rarely, non-English-centric (Fan et al., 2021; Costa-jussà et al., 2022). English-centric datasets rely on bitext where English is either the source or target language, while non-English-centric ones sample from all available language pairs, resulting in a much larger number of non-English directions. For instance, the OPUS100 dataset contains 100 languages with 99 language pairs for training, while the M2M-100 dataset comprises 100 languages covering 1,100 language pairs (2,200 translation directions).
Non-English-centric approaches primarily enhance the translation quality in non-English directions by incorporating additional data. However, constructing such datasets is challenging due to data scarcity in non-English language pairs. In addition, training becomes more computationally expensive as more data is included compared to English-centric approaches. Furthermore, Fan et al. (2021) demonstrate that English-centric approaches can match the performance of non-English-centric settings in supervised directions using only 26% of the entire data collection. This suggests that the data boost between non-English pairs has limited impact on the supervised directions, highlighting the promise of improving the zero-shot performance of English-centric systems. Therefore, in this study, we focus on the English-centric setting, as it offers a practical solution by avoiding extensive data collection efforts for numerous language pairs.

Understanding Zero-shot NMT
Previous studies have primarily focused on investigating the main causes of overall poor zero-shot performance, such as the impact of model capacity, initialization, and the off-target issue on zero-shot translation. Zhang et al. (2020) found that increasing the modeling capacity improves zero-shot translation and enhances overall robustness. In addition, Wu et al. (2021) show that the same MNMT system with different language tag strategies performs significantly differently on zero-shot directions while retaining the same performance on supervised directions. Furthermore, Gu et al. (2019); Tang et al. (2021); Wang et al. (2021) suggest that model initialization impacts zero-shot translation quality. Lastly, Gu et al. (2019) demonstrate that MNMT systems are likely to capture spurious correlations and indicate that this tendency can result in poor zero-shot performance. This is also reflected in work indicating that MNMT models are prone to forgetting language labels (Wu et al., 2021; Raganato et al., 2021).
Attention has also been paid to examining the relationship between off-target translation and zero-shot performance. Off-target translation refers to the issue where an MNMT model incorrectly translates into a different language (Arivazhagan et al., 2019a). Zhang et al. (2020) identify the off-target problem as a significant factor contributing to inferior zero-shot performance. Furthermore, several studies (Gu and Feng, 2022; Pan et al., 2021) have observed zero-shot performance improvements when the off-target rate drops.
Our work complements prior studies in two key aspects: 1) Unlike previous analyses that focus on limited zero-shot directions, we examine a broader range of language pairs to gain a more comprehensive understanding of zero-shot NMT. 2) We aim to investigate the reasons behind the variations in zero-shot performance among different language pairs and provide insights for improving zero-shot NMT systems across diverse languages.

EC40 Dataset
Current MNMT datasets pose significant challenges for analyzing and studying zero-shot translation behavior. We identify key shortcomings in existing datasets: 1) They are limited in the quantity of training sentences. For instance, the OPUS100 (Zhang et al., 2020) dataset covers 100 languages but is capped at a maximum of 1 million parallel sentences for any language pair. 2) Datasets like PC32 (Lin et al., 2020) fail to accurately reflect the real-world distribution of data, with high-resource languages like French and German disproportionately represented by 40 million and 4 million sentences, respectively. 3) Linguistic diversity, a critical factor, is often overlooked in datasets such as Europarl (Koehn, 2005) and MultiUN (Chen and Eisele, 2012). 4) Lastly, systematic zero-shot NMT evaluations are rarely found in existing MNMT datasets, either missing entirely or covering less than 1% of possible zero-shot combinations (Aharoni et al., 2019; Pan et al., 2021; Tang et al., 2021).
To this end, we introduce the EC40 dataset to address these limitations. The EC40 dataset uses and expands OPUS (Tiedemann, 2012) and consists of over 66 million bilingual sentences, encompassing 40 non-English languages from five language families with diverse writing systems. To maintain consistency and make further analysis more comprehensive, we carefully balanced the dataset across resources and languages by ensuring that each resource group contains five language families and that each family consists of eight representative languages.
EC40 covers a wide spectrum of resource availability, ranging from High (5M) to Medium (1M), Low (100K), and extremely-Low (50K) resources. In total, there are 80 English-centric directions for training and 1,640 directions (including all supervised and ZS directions) for evaluation. To the best of our knowledge, EC40 is the first of its kind for MNMT, serving as a benchmark to study zero-shot NMT. For more details, see Appendix A.1.
As for evaluation, we specifically chose Ntrex-128 (Federmann et al., 2022) and Flores-200 (Costa-jussà et al., 2022) as our validation and test datasets, respectively, because of their unique multi-parallel characteristics. We combine the Flores-200 dev and devtest sets to create our test set. We do not include any zero-shot pairs in the validation set. These datasets provide multiple parallel translations for the same source text, allowing for fairer evaluation and analysis.

Experimental Setups
Pre-processing To handle data in various languages and writing systems, we carefully apply data pre-processing before the experiments. Following similar steps as prior studies (Fan et al., 2021; Baziotis et al., 2020; Pan et al., 2021), our dataset is first normalized on punctuation and then tokenized using the Moses tokenizer. In addition, we filtered out pairs whose length ratio exceeds 1.5 and performed de-duplication after all pre-processing steps. All cleaning steps were performed on the OPUS corpus, and EC40 was constructed by sampling from this cleaned dataset.
We then learn a 64k joint subword vocabulary using SentencePiece (Kudo and Richardson, 2018). Following Fan et al. (2021); Arivazhagan et al. (2019b), we performed temperature sampling (T = 5) when learning SentencePiece subwords to overcome possible drawbacks of overrepresenting high-resource languages; the same sampling is used during the training phase.
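As a rough sketch, temperature sampling with T = 5 flattens the raw data distribution so that low-resource languages are sampled more often than their raw share would suggest. The per-language corpus sizes below are illustrative, not the actual EC40 statistics:

```python
def temperature_sampling_probs(sizes, T=5.0):
    """Turn per-language corpus sizes into sampling probabilities.

    Each language l is sampled with probability proportional to
    (n_l / N) ** (1 / T), which flattens the raw data distribution
    and up-weights low-resource languages.
    """
    total = sum(sizes.values())
    weights = {lang: (n / total) ** (1.0 / T) for lang, n in sizes.items()}
    z = sum(weights.values())
    return {lang: w / z for lang, w in weights.items()}

# Illustrative high / medium / extremely-low resource languages.
sizes = {"de": 5_000_000, "uk": 1_000_000, "kab": 50_000}
probs = temperature_sampling_probs(sizes)
```

Note that T = 1 recovers proportional sampling, while T → ∞ approaches the uniform distribution over languages.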
Models Prior research has suggested that zero-shot performance can be influenced by both model capacity (Zhang et al., 2020) and decoder pre-training (Gu et al., 2019; Lin et al., 2020; Wang et al., 2021). To provide an extensive analysis, we conducted experiments using three different models: Transformer-big, Transformer-large (Vaswani et al., 2017), and fine-tuned mBART50. Additionally, we evaluated m2m-100 models directly in our evaluations without any fine-tuning.
As suggested by Johnson et al. (2017); Wu et al. (2021), we prepend target language tags to the source side, e.g., '<2de>' denotes translating into German. Moreover, we follow the mBART50 MNMT fine-tuning hyper-parameter settings (Tang et al., 2021) in our experiments. More training and model specifications can be found in Appendix A.2.
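The tagging convention can be illustrated with a minimal helper (the function name is hypothetical; in practice this step is part of the preprocessing pipeline):

```python
def add_target_tag(source: str, tgt_lang: str) -> str:
    """Prepend a target-language tag (Johnson et al., 2017) so the
    model knows which language to translate into."""
    return f"<2{tgt_lang}> {source}"

tagged = add_target_tag("The cat sat on the mat.", "de")
# tagged == "<2de> The cat sat on the mat."
```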
While we acknowledge that metrics like Sacrebleu may have limitations when comparing translation quality across language pairs, we believe that consistent findings across all these metrics provide more reliable and robust evaluation results across languages with diverse linguistic properties. For Comet scores, we evaluate on the 35 out of 41 languages that are supported. As for beam search, we use a beam size of 5 and a length penalty of 1.0.

Variations in Zero-Shot NMT
Table 1 presents the overall performance of three models for both English-centric and zero-shot directions on four metrics. It is evident that all models exhibit a substantial performance gap between the supervised and zero-shot directions. Specifically, for our best model, the zero-shot performance in Sacrebleu and SpBleu is less than one-third of the performance in supervised directions, which highlights the challenging nature of zero-shot translation. In addition, comparing the results of mT-big and mT-large, we observe that increasing the model size can benefit zero-shot translation, which aligns with previous research (Zhang et al., 2020). Furthermore, we show that while the mBART50 fine-tuning approach shows superior performance in Src→En directions, it consistently lags behind in En→Tgt and zero-shot directions. Prior studies have shown that pre-trained seq2seq language models can help alleviate the issue of forgetting language IDs often observed in Transformer models trained from scratch, leading to improvements in zero-shot performance. However, our results show an interesting finding: when the MNMT model size matches that of the pre-trained model, the benefits of pre-training on zero-shot NMT become less prominent. This result is consistent for both seen and unseen languages regarding mBART50; see Appendix A.4. Our observation aligns with previous claims that the mBART model weights can easily be washed out when fine-tuning with large-scale data on supervised directions (Liu et al., 2020; Lee et al., 2022).
Quantifying variation We identify that higher variations exist in zero-shot translation performance than in supervised directions by measuring the Coefficient of Variation (CV = σ/µ) (Everitt and Skrondal, 2010). The CV metric is defined as the ratio of the standard deviation σ to the mean µ of performance, which is more useful than the standard deviation alone when comparing groups with vastly different mean values.
As shown in Table 1, we find substantially higher CV scores in the zero-shot directions compared to the supervised ones, with an average increase of around 100% across all models and metrics. This observation highlights that zero-shot performance is much more prone to variations than the performance of supervised directions. This raises the question: what factors contribute to the significant variations observed in zero-shot performance?
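The CV computation itself is straightforward; the sketch below uses hypothetical SpBleu scores (not the paper's actual results) to illustrate why zero-shot directions yield much higher CV values than supervised ones:

```python
from statistics import mean, pstdev

def coefficient_of_variation(scores):
    """CV = sigma / mu: relative spread, comparable across groups
    with very different means."""
    return pstdev(scores) / mean(scores)

supervised = [30.1, 28.4, 27.9, 31.2, 29.5]  # hypothetical SpBleu scores
zero_shot = [22.3, 3.1, 15.8, 0.9, 10.4]     # hypothetical SpBleu scores

# Zero-shot directions show a far larger relative spread.
assert coefficient_of_variation(zero_shot) > coefficient_of_variation(supervised)
```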
Exploring the Role of Non-English-Centric Systems Training with non-English language pairs has shown promise in improving zero-shot performance (Fan et al., 2021). To delve deeper into this aspect, we evaluate m2m-100 models directly on our benchmark test set without further fine-tuning, because our goal is to investigate whether the phenomenon of high variations in zero-shot performance also holds for non-English-centric models.
Our analysis covers English-centric (54), supervised (546), and zero-shot (860) directions, which are determined by the training settings of m2m-100. The results in Table 2 yield two important observations. Firstly, significant performance gaps exist between supervised and zero-shot directions, suggesting that the challenges of zero-shot translation persist even in non-English-centric systems. More importantly, our finding of considerable variations in zero-shot NMT also holds for non-English-centric systems.

Factors in the Zero-Shot Variations
We investigate factors that might contribute to variations in zero-shot directions: 1) English translation capability, 2) vocabulary overlap, 3) linguistic properties, and 4) off-target issues. For consistency, we use our best model (mT-large) in the following analyses, unless mentioned otherwise. For simplicity, we denote the zero-shot direction as Src→Tgt throughout the following discussion, where Src and Tgt represent the source and target language, respectively. In this section, we present all analysis results using SpBleu and provide results based on other metrics in the Appendix for reference.
Table 3: Resource-level analysis based on SpBleu (we provide analyses based on other metrics in Appendix A.5.1). We include both English-centric (shaded blue) and zero-shot (shaded red) directions. Avg→Avg denotes the averaged zero-shot SpBleu score.

English translation capability
We first hypothesize that data size, which is known to play a crucial role in supervised training, may also impact zero-shot capabilities. We categorize data resource levels into four classes and examine their performance among each other, as shown in Table 3. English-centric results are also included for comparison. Our findings indicate that the resource level of the target language has a stronger effect on zero-shot translations than that of the source side. This is evident from the larger drop in zero-shot performance (from 14.18 to 3.70) observed when the target data size decreases, as opposed to the source side (from 11.38 to 8.41).
Setup To further quantify this observation, we conducted correlation and regression analyses (see Table 4), following Lauscher et al. (2020), to analyze the effect of data size and English-centric performance. Specifically, we compute both Pearson and Spearman coefficients for correlation, and use mean absolute error (MAE) and root mean square error (RMSE) for regression. We use the data size after temperature sampling in the Src→En and En→Tgt directions, as well as the corresponding performances, as features.
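As an illustration of this setup, the sketch below computes a Pearson correlation and the MAE/RMSE of a one-feature least-squares fit using only the standard library; the feature and score values are hypothetical, not taken from the paper's results:

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    vy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (vx * vy)

def fit_and_errors(xs, ys):
    """Least-squares line y = a*x + b; returns (MAE, RMSE) on xs/ys."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum(
        (x - mx) ** 2 for x in xs
    )
    b = my - a * mx
    residuals = [y - (a * x + b) for x, y in zip(xs, ys)]
    mae = sum(abs(r) for r in residuals) / n
    rmse = math.sqrt(sum(r * r for r in residuals) / n)
    return mae, rmse

en_tgt = [34.2, 28.7, 21.5, 12.3, 6.8]  # hypothetical En->Tgt SpBleu (feature)
zs = [14.1, 11.9, 8.2, 4.5, 2.1]        # hypothetical Src->Tgt zero-shot SpBleu
r = pearson(en_tgt, zs)
mae, rmse = fit_and_errors(en_tgt, zs)
```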

Results
Three key observations can be made from these results: 1) The factors on the target side consistently exhibit stronger correlations with the zero-shot performance, reinforcing our conclusions from the resource-level analysis in Table 3.
2) The English-centric performance feature yields a greater R-squared than the data size. This conclusion can guide future work on improving out-of-English translation quality; we expand on it in Section 6. 3) We also observe, by visualizing the correlation (Figure 2), that correlation alone does not provide a comprehensive explanation for the underlying variations observed in zero-shot performance.

The importance of Vocabulary Overlap
Vocabulary overlap between languages is often considered to measure potential linguistic connections such as word order (Tran and Bisazza, 2019), making it a more basic measure of similarity in surface forms compared to other linguistic measurements such as language family and typology distance (Philippy et al., 2023). Stap et al. (2023) also identify vocabulary overlap as one of the most important predictors of cross-lingual transfer in MNMT. In our study, we investigate the impact of vocabulary sharing on zero-shot NMT. We build upon the measurement of vocabulary overlap proposed by Wang and Neubig (2019) and modify it as follows: overlap(Src→Tgt) = |V_Src ∩ V_Tgt| / |V_Tgt|, where V_Src and V_Tgt represent the vocabularies of the source (Src) and target (Tgt) languages, respectively. This measurement quantifies the proportion of subwords in the target language that are shared with the source language in the zero-shot translation direction (see Appendix A.4 for more details).
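A minimal sketch of this overlap measure over toy subword vocabularies (the underscore prefix stands in for SentencePiece's word-boundary marker; the vocabularies are hypothetical):

```python
def vocab_overlap(v_src: set, v_tgt: set) -> float:
    """|V_Src ∩ V_Tgt| / |V_Tgt|: fraction of target-language subwords
    that also appear in the source-language vocabulary."""
    return len(v_src & v_tgt) / len(v_tgt)

v_de = {"_der", "_die", "_haus", "_in", "tion"}  # toy German subwords
v_nl = {"_de", "_het", "_huis", "_in", "tion"}   # toy Dutch subwords

# Two of the five Dutch subwords ("_in", "tion") also occur in German.
assert vocab_overlap(v_de, v_nl) == 0.4
```

Note that the measure is asymmetric: normalizing by the target vocabulary reflects how much of the target side the source side already "knows".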
Setup We first investigate the correlation between vocabulary overlap and zero-shot performance. As noted by Philippy et al. (2023), vocabulary overlap alone is often considered insufficient to fully explain transfer in multilingual systems. We share this view, particularly in the context of multilingual translation, where relying solely on vocabulary overlap to predict zero-shot translation quality presents challenges. Hence, we incorporate the vocabulary overlap factor into our regression analysis together with the English translation performance from Section 5.1.
Results As shown in Table 5, the results indicate that considering the degree of overlap between the source and target languages further contributes to explaining the variations in zero-shot performance. Importantly, this pattern holds true across different model capacities, and it shows more consistent results than linguistic features such as script and family. We believe this conclusion can motivate future investigation into how to improve zero-shot NMT, for example by encouraging greater cross-lingual transfer via better vocabulary sharing, leveraging multilingual dictionaries, or implicitly learning multilingual word alignments via multi-source translation; we leave these to future work.

The impact of Linguistic Properties
Previous work on cross-lingual transfer in NLU tasks, such as NER and POS tagging, highlights that transfer is more successful for languages with high lexical overlap and typological similarity (Pires et al., 2019) and when languages are more syntactically or phonologically similar (Lauscher et al., 2020). In the context of multilingual machine translation, although limited to validating only four ZS directions, Aharoni et al. (2019) empirically demonstrated that the zero-shot capability between close language pairs can benefit more than that between distant ones when incorporating more languages. Accordingly, we extend this line of investigation by examining linguistic factors that may impact zero-shot performance in MNMT. We measure the role of two representative linguistic properties, namely language family and writing system, in determining zero-shot performance. Specific information on the linguistic properties of each language can be found in Appendix A.1.
Setup To examine the impact of linguistic properties on zero-shot performance, we specifically evaluate the performance in cases where: 1) source and target language belong to the same or different family, and 2) source and target language use the same or different writing system. This direct comparison allows us to assess how linguistic similarities between languages influence the effectiveness of zero-shot translation.

Results
To provide a fine-grained analysis, we examine this phenomenon across different resource levels for the target languages, as shown in Table 6. The results reveal a significant increase in zero-shot performance when the source and target languages share the same writing system, irrespective of the resource levels. Additionally, we observe that the language family feature exhibits relatively weaker significance, as shown in Appendix A.5.3. To further quantify the effect of these linguistic properties on ZS NMT, we conduct a regression analysis (see Table 5). Our findings highlight their critical roles in explaining the variations of zero-shot performance. Furthermore, our analysis reveals interesting findings regarding the effect of linguistic properties with respect to model size. As shown in Table 5, the contribution of linguistic features is more pronounced for the smaller model, i.e., mT-big, while the larger model places more emphasis on English-centric performance. This suggests that smaller models are more susceptible to the influence of linguistic features, potentially due to their limited capacity and generalization ability. In contrast, larger models exhibit better generalization capabilities, allowing them to rely less on specific linguistic properties.

The role of Off-Target Translations
Previous work (Gu and Feng, 2022; Pan et al., 2021) has demonstrated a consistent trend, where stronger MNMT systems generally exhibit lower off-target rates and simultaneously achieve better zero-shot BLEU scores. To further investigate this, we analyze the relationship between off-target rates and different levels of zero-shot performance.
Setup We adopt the off-target rate measurement from Yang et al. (2021) and Costa-jussà et al. (2022), using fasttext (Joulin et al., 2016) to detect whether a sentence is translated into the correct language.
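The off-target rate computation can be sketched as follows; in the actual setup, `detect_language` would be a fasttext language-identification model, whereas the toy stub below is purely illustrative:

```python
def off_target_rate(translations, expected_lang, detect_language):
    """Fraction of output sentences not identified as the target language."""
    off = sum(1 for sent in translations if detect_language(sent) != expected_lang)
    return off / len(translations)

def toy_lid(sentence):
    """Hypothetical LID stub: pretend anything containing 'the' is English."""
    return "en" if "the" in sentence.lower() else "de"

outputs = ["Der Hund schläft.", "The dog sleeps.", "Die Katze sitzt."]
rate = off_target_rate(outputs, "de", toy_lid)  # one of three is off-target
```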
Results Zhang et al. (2020) identify the off-target issue as a crucial factor contributing to poor zero-shot results. However, our analysis, as illustrated in Figure 3, suggests that the reasons for poor zero-shot performance go beyond the off-target issue. Even when the off-target rate is very low, e.g., less than 5% of sentences being off-target, we still observe a wide variation in zero-shot performance, ranging from very poor (0.1 SpBleu) to relatively good (34.6 SpBleu) scores. Based on these findings, we conclude that the off-target issue is more likely a symptom of poor zero-shot translation rather than the root cause. This emphasizes that translating into the correct language cannot guarantee decent performance.

From Causes to Potential Remedies
In this section, we summarize our findings and offer insights building upon the previous observations.

Enhance target side translation We identified that the quality of target side translation (En→Tgt) strongly influences the overall zero-shot performance in an English-centric system. To this end, future research should explore more reliable approaches to enhance target side translation capability. One promising practical direction is the use of back-translation (Sennrich et al., 2016) with a stronger focus on improving out-of-English translations. Similarly, approaches like multilingual regularization, sampling, and denoising are worth exploring to boost the zero-shot translation directions.
Focus more on distant pairs We recognize that distant language pairs constitute a significant percentage of all zero-shot directions, with 61% involving different scripts and 81% involving different language families in our evaluations. Our analysis reveals that, especially with smaller models, distant pairs exhibit notably lower zero-shot performance compared to closer ones. Consequently, enhancing zero-shot performance for distant pairs is a key strategy to improve overall capability. An unexplored avenue for consideration involves multi-source training (Sun et al., 2022) using Romanization (Amrhein and Sennrich, 2020), with a gradual reduction in the impact of the Romanized language.
Encourage cross-lingual transfer via vocabulary sharing Furthermore, we have consistently observed that vocabulary overlap plays a significant role in explaining zero-shot variation. Encouraging greater cross-lingual transfer and knowledge sharing via better vocabulary sharing has the potential to enhance zero-shot translations. Previous studies (Wu and Monz, 2023; Maurya et al., 2023) have shown promising results in improving multilingual translations by augmenting multilingual vocabulary sharing. Additionally, cross-lingual pre-training methods utilizing multi-parallel dictionaries have demonstrated improvements in word alignment and translation quality (Ji et al., 2020; Pan et al., 2021).

Conclusion
In this work, we introduce a fresh perspective within zero-shot NMT: the presence of high variations in zero-shot performance. We recognize that our investigation of high variations in zero-shot performance adds an important layer of insight to the discourse surrounding zero-shot NMT, providing a perspective complementary to understanding the root causes of overall poor performance in zero-shot scenarios.
We first show that target side translation quality impacts zero-shot performance the most, while the source side has a limited impact. Furthermore, we conclude that higher vocabulary overlap consistently yields better zero-shot performance, indicating a promising avenue for improving zero-shot NMT. Moreover, linguistic features can significantly affect variations in ZS performance, especially for smaller models. Additionally, we emphasize that zero-shot translation challenges extend beyond addressing the off-target problem.
We release the EC40 MNMT dataset and model checkpoints for future studies, which serve as a benchmark to study zero-shot NMT. In the future, we aim to investigate zero-shot NMT from other perspectives, such as analyzing discrepancies at the representation level.

Limitations
One limitation of this study is the overrepresentation of Indo-European languages in our dataset, including languages in the Germanic, Romance, and Slavic sub-families. This could result in non-Indo-European languages being less well represented in our analysis. Additionally, due to data scarcity, we were only able to include 5 million parallel sentences for high-resource languages. As a result, the difference in data size between high- and medium-resource languages is relatively small compared to the tenfold difference between medium- and low-resource languages. To address these limitations, we plan to expand the EC40 dataset in the future, incorporating more non-Indo-European languages and increasing the data size for high-resource languages.

Impact
We collected a new multilingual dataset (EC40) from OPUS, which holds potential implications for the field of multilingual machine translation. The EC40 dataset encompasses a diverse range of languages and language pairs, offering researchers and developers an expanded pool of data for training and evaluating translation models. It also serves as a benchmark enabling fair comparisons and fostering advancements in multilingual translation research. Recognizing the inherent risks of mistranslation in machine translation data, we have made efforts to prioritize the incorporation of high-quality data, such as the MultiUN (Chen and Eisele, 2012) dataset (translated documents from the United Nations), to enhance the accuracy and reliability of the EC40 dataset. By sharing the EC40 dataset, we aim to contribute to the promotion of transparency and responsible use of machine translation data, facilitating collaboration and driving further progress in multilingual machine translation research.

A.1 Dataset Statistics
We list the details of the EC40 dataset in Table 7. Overall, EC40 is an English-centric multilingual machine translation dataset containing over 66 million sentences and covering 41 languages (including English). EC40 stands out in its total number of languages and in its balance of language families and writing systems. Specifically, for each language family, we include 8 representative languages across different resource levels.
Moreover, we use the same number of sentences for each resource level, e.g., all high-resource languages have 5M sentences. Note that we list precise numbers in the table instead of approximate ones; for instance, 5M denotes exactly 5,000,000 sentences after preprocessing. We use ISO 639-1 codes in this table. We follow Flores-200 in labeling the writing system classes and use WALS (Dryer and Haspelmath, 2013) to label the language family of each language in our dataset. We also show the model specifications of mTransformer-big, mTransformer-large, and mBart50 fine-tuning in Table 8. It is worth noting that mBart50 uses a larger vocabulary than our trained-from-scratch models. Furthermore, we follow Vaswani et al. (2017) in setting the learning rate to 5e-4 with 4000 warmup steps and label smoothing of 0.1.

A.2 Training and Model specification
To keep the learned SentencePiece vocabulary consistent, we also used temperature sampling (T = 5) when training all models. We trained all models (including mBart50 FT) on 4 NVIDIA A6000 GPUs for a maximum of 200k updates. For larger models, we set the total max tokens to 215,040, using gradient accumulation to simulate the large-batch training in Tang et al. (2021).
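The temperature-sampling scheme above can be sketched as follows. This is a minimal illustration, not the authors' code: each language's raw corpus share is raised to the power 1/T and renormalized, so that T > 1 upsamples low-resource languages. The corpus sizes in the example are made up and are not the EC40 numbers.

```python
def temperature_sampling_probs(sizes, T=5.0):
    """Compute per-language sampling probabilities.

    Each language's raw probability (its share of the corpus) is raised
    to 1/T and renormalized; T > 1 flattens the distribution, boosting
    low-resource languages relative to proportional sampling.
    """
    total = sum(sizes.values())
    weights = {lang: (n / total) ** (1.0 / T) for lang, n in sizes.items()}
    z = sum(weights.values())
    return {lang: w / z for lang, w in weights.items()}

# Illustrative corpus sizes (hypothetical, 5M vs. 100k sentences).
probs = temperature_sampling_probs({"de": 5_000_000, "af": 100_000}, T=5.0)
```

With T = 5, the 50:1 size ratio in this toy example shrinks to roughly 50^(1/5) ≈ 2.2:1 in the sampling distribution.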

A.3 Validation of Spurious Correlation
To ensure that our model does not inadvertently capture spurious correlations during training, we conduct a validation process by visualizing the perplexity curves for both English-centric and zero-shot directions, as proposed by Gu et al. (2019). It is important to note that these curves are used solely for visualization and not as criteria for early stopping. Our early stopping criterion is based solely on the validation perplexity, and we only consider English-centric directions in this regard.
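The early-stopping criterion described above can be sketched as a standard patience check on the English-centric validation perplexity. This is a hypothetical illustration (the paper does not specify the patience value; `should_stop` and the perplexity numbers are made up for the example):

```python
def should_stop(history, patience=10):
    """Patience-based early stopping on validation perplexity.

    `history` is the sequence of validation perplexities (English-centric
    directions only); training stops once the best value has not improved
    within the last `patience` evaluations.
    """
    if len(history) <= patience:
        return False
    best = min(history)
    # Stop if the best perplexity occurred more than `patience` evals ago.
    return min(history[-patience:]) > best

# Example: validation perplexity plateaus after the fourth evaluation.
ppl = [12.0, 8.5, 7.9, 7.8, 7.81, 7.82, 7.85]
stop = should_stop(ppl, patience=3)
```

Note that the zero-shot perplexity curves are deliberately excluded from this check; they serve only as a visual sanity check against spurious correlations.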
In Figure 4 and Figure 5, we present the perplexity curves for English-centric and zero-shot directions, respectively. We observe that the perplexities for the zero-shot directions gradually decrease during training, indicating that the model is learning and improving its translation performance on those directions. Importantly, no significant overfitting patterns are observed in the zero-shot perplexity curves. Instead, the decreasing perplexities on zero-shot directions suggest that the model is effectively learning the underlying patterns and generalizing its translation capabilities to unseen language pairs.

Table 9: Resource-Based Translation Performance Analysis of mT-large based on Sacrebleu. We include both English-centric and zero-shot directions.

A.5.2 The impact of data and English-centric performance

A.5.3 The effect of Linguistic properties
We investigate how zero-shot performance changes when the source and target languages are linguistically more similar, considering both language family and writing system.

A.5.4 Overall Correlation Analysis using all factors
Our findings hold consistently across various evaluation metrics, spanning word, sub-word, character, and representation levels. For the analyses in Sections 5.2 and 5.3, we show additional results based on Sacrebleu, Chrf++, and Comet in Table 16.
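The correlation analysis underlying these results can be sketched as follows: for each target language, the En→Tgt score is paired with the zero-shot All↔Tgt score, and the Pearson coefficient is computed over all pairs. This is a minimal stdlib sketch; the scores below are toy numbers, not values from the paper.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Toy scores (illustrative): En->Tgt SpBleu vs. mean zero-shot SpBleu
# into the same target language.
en_tgt = [35.2, 28.1, 12.4, 5.3]
zs_tgt = [18.0, 14.5, 6.2, 2.1]
r = pearson(en_tgt, zs_tgt)
```

In practice, a library routine such as `scipy.stats.pearsonr` (or `spearmanr` for rank correlation, as used for the off-target analysis) gives the same coefficient along with a p-value.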

A.5.5 The role of Off-Target Issue
We utilized SpBleu in Section 5.4 to align with the setup employed by Zhang et al. (2020).

Figure 2: Correlation between En→Tgt SpBleu and zero-shot (All↔Tgt) SpBleu. Each faded blue point denotes the performance of a single zero-shot direction based on SpBleu. R=0.68 indicates the Pearson correlation coefficient (see Table 4 for more details).

Figure 3: Correlation between off-target rate and zero-shot performance (SpBleu). R represents the Spearman correlation coefficient. We focus on directions where the off-target rate is considerably low (less than 5%). Results based on other metrics can be found in A.5.5.

Figure 4: Perplexity curves on English-centric directions on our test set.

Figure 5: Perplexity curves on zero-shot directions on our test set. h, m, l, e denote High, Medium, Low, and extremely-Low resource levels, respectively.

Figure 9: Zero-shot performance of mTransformer-large on 1560 directions for Sacrebleu and Chrf++.

Figure 10: Zero-shot performance of mTransformer-large on 1560 directions for SpBleu and Comet.

Table 1: Average performance scores and coefficient of variation on English-centric and zero-shot (ZS) directions. The table includes four metrics: Sacrebleu, Chrf++, SpBleu, and Comet. The best performance scores (higher is better) are highlighted in bold based on values before rounding, while the highest CV scores in the coefficient-of-variation section (higher means more variability) are underlined to highlight high variations.

Table 4: Analysis of zero-shot performance considering data size and English-centric performance based on SpBleu. Data-size† is after temperature sampling, as it represents the actual size of the training set.

Table 5: Prediction of zero-shot performance using English-centric performance, vocabulary overlap, and linguistic properties. We present the results based on SpBleu in this table.

Table 6: The impact of linguistic properties on zero-shot performance (we use mT-large and SpBleu here as an example). We conduct Welch's t-test to validate whether one group is significantly better than another. The detailed table, including the impact of X resource, can be found in Appendix A.5.3.
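The Welch's t-test used for these group comparisons does not assume equal variances between the two groups. A minimal stdlib sketch of the statistic and its Welch-Satterthwaite degrees of freedom is below; the SpBleu scores are purely illustrative, not values from Table 6.

```python
from math import sqrt

def welch_t(a, b):
    """Welch's t statistic for two samples with unequal variances.

    Returns (t, df), where df is the Welch-Satterthwaite degrees of
    freedom; a large positive t supports group `a` scoring higher.
    """
    na, nb = len(a), len(b)
    ma, mb = sum(a) / na, sum(b) / nb
    # Sample variances (Bessel-corrected).
    va = sum((x - ma) ** 2 for x in a) / (na - 1)
    vb = sum((x - mb) ** 2 for x in b) / (nb - 1)
    se2 = va / na + vb / nb
    t = (ma - mb) / sqrt(se2)
    df = se2 ** 2 / ((va / na) ** 2 / (na - 1) + (vb / nb) ** 2 / (nb - 1))
    return t, df

# Toy SpBleu scores for two groups of zero-shot directions (hypothetical).
same_family = [14.2, 15.1, 13.8, 16.0]
diff_family = [9.5, 10.2, 8.8, 11.0]
t, df = welch_t(same_family, diff_family)
```

This is equivalent to `scipy.stats.ttest_ind(a, b, equal_var=False)`, which additionally returns the p-value used for the significance decision.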

Table 8: Model specifications.

Table 10: Resource-Based Translation Performance Analysis of mT-large based on Chrf++. We include both English-centric and zero-shot directions.

Table 11: Resource-Based Translation Performance Analysis of mT-large based on Comet. We include both English-centric and zero-shot directions.

Tables 12, 13, and 14 show the impact of data and English-centric performance for the mTransformer-large model across three different metrics. Combined with Table 4, they verify that our conclusions in Section 5.1 are consistent across all four metrics.

Table 12: Analysis of zero-shot performance considering data size and English-centric performance based on Sacrebleu.

Table 13: Analysis of zero-shot performance considering data size and English-centric performance based on Chrf++.

Table 14: Analysis of zero-shot performance considering data size and English-centric performance based on Comet.

Table 15: The impact of linguistic properties on zero-shot performance. To investigate this in depth, we analyze it at a fine-grained level by observing different resource levels of Y. We also conducted Welch's t-test to validate whether one group is significantly better than another.