Building Multilingual Machine Translation Systems That Serve Arbitrary XY Translations

Multilingual Neural Machine Translation (MNMT) enables one system to translate sentences from multiple source languages to multiple target languages, greatly reducing deployment costs compared with conventional bilingual systems. The MNMT training benefit, however, is often limited to many-to-one directions: the model performs poorly in one-to-many directions and in many-to-many directions with a zero-shot setup. To address this issue, this paper discusses how to practically build MNMT systems that serve arbitrary X-Y translation directions while leveraging multilinguality with a two-stage training strategy of pretraining and finetuning. Experimenting with the WMT'21 multilingual translation task, we demonstrate that our systems outperform the conventional baselines of direct bilingual models and pivot translation models for most directions, giving +6.0 and +4.1 BLEU on average, without the need for architecture change or extra data collection. Moreover, we also examine our proposed approach in an extremely large-scale data setting to accommodate practical deployment scenarios.


Introduction
Multilingual Neural Machine Translation (MNMT), which enables one system to serve translation for multiple directions, has attracted much attention in the machine translation area (Zoph and Knight, 2016; Firat et al., 2016). Because the multilingual capability greatly reduces the deployment cost at training and inference, MNMT has been actively employed as a machine translation system backbone in recent years (Johnson et al., 2017; Hassan et al., 2018).
Most MNMT systems are trained with multiple English-centric data for both directions (e.g., English → {French, Chinese} (En-X) and {French, Chinese} → English (X-En)). Recent work (Gu et al., 2019; Zhang et al., 2020; Yang et al., 2021b) pointed out that such MNMT systems suffer severely from an off-target translation issue, especially in translations from a non-English language X to another non-English language Y. Meanwhile, Freitag and Firat (2020) have extended data resources with multi-way aligned data and reported that one complete many-to-many MNMT can be fully supervised, achieving competitive translation performance for all X-Y directions. In our preliminary experiments, we observed that the complete many-to-many training is still as challenging as one-to-many training (Johnson et al., 2017; Wang et al., 2020), since we have introduced more one-to-many translation tasks into the training. As similarly reported for many-to-many training with a zero-shot setup (Gu et al., 2019; Yang et al., 2021b), the complete MNMT model also struggles to capture correlations in the data for all the X-Y directions in a single model training, due to highly imbalanced data.
* Equal contributions.
In this paper, we propose a two-stage training for complete MNMT systems that serve arbitrary X-Y translations by 1) pretraining a complete multilingual many-to-many model and 2) finetuning the model to effectively transfer knowledge from pretraining to task-specific multilingual systems. Considering that MNMT is a multi-task learner of translation tasks with "multiple languages", the complete multilingual model learns more diverse and general multilingual representations. We transfer the representations to a specifically targeted task via many-to-one multilingual finetuning, and eventually build multiple many-to-one MNMT models that cover all X-Y directions. The experimental results on the WMT'21 multilingual translation task show that our systems substantially improve over conventional bilingual approaches and many-to-one multilingual approaches for most directions. In addition, we discuss our proposal in the light of feasible deployment scenarios and show that the proposed approach also works well in an extremely large-scale data setting.
Figure 1: Average translation performance of our systems in the WMT'21 large-scale multilingual translation Task 1 (top) and Task 2 (bottom), with the respective average improvement of (12E6D, 12E6D+FT, 24E12D, 24E12D+FT) = (+3.6, +4.7, +5.0, +6.0) and (+2.0, +2.9, +3.2, +4.1) against the bilingual baseline ("Bi") and the pivot translation baselines ("Pivot"). "12E6D/24E12D" denote our two settings, with "+FT" suffix for finetuned systems.

Two-Stage Training for MNMT Models
To support all possible translations with |L| languages (including English), we first train a complete MNMT system on all available parallel data for |L| × (|L| − 1) directions. We assume that there exist data for (|L| − 1) English-centric language pairs and the remaining (|L| − 1) × (|L| − 2) / 2 non-English-centric language pairs, which lets the system learn multilingual representations across all |L| languages. Usually, the volume of English-centric data is much greater than that of non-English-centric data. Then, we transfer the multilingual representations to one target language L by finetuning the system on a subset of training data for many-to-L directions (i.e., multilingual many-to-one finetuning). This step steers the decoder towards the specifically targeted language L rather than multiple languages. As a result, we obtain |L| multilingual many-to-one systems to serve all X-Y translation directions. We experiment with our proposed approach in the following two settings: 1) WMT'21 large-scale multilingual translation data with 972M sentence pairs and 2) our in-house production-scale dataset with 4.1B sentence pairs.
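The direction counts and the per-target finetuning subsets described above can be illustrated with a minimal sketch. The function names are hypothetical and the language list is the Task 1 set used later in the paper; this is only an enumeration of directions, not the authors' data pipeline.

```python
from itertools import permutations

def all_directions(languages):
    """Every ordered (source, target) pair, i.e. |L| * (|L| - 1) directions."""
    return list(permutations(languages, 2))

def many_to_one_subsets(languages):
    """Group directions by target language: one finetuning subset per target."""
    subsets = {target: [] for target in languages}
    for source, target in permutations(languages, 2):
        subsets[target].append((source, target))
    return subsets

# English plus the five Task 1 languages (|L| = 6).
task1_languages = ["en", "hr", "hu", "et", "sr", "mk"]
directions = all_directions(task1_languages)
subsets = many_to_one_subsets(task1_languages)

n = len(task1_languages)
english_centric_pairs = n - 1                 # unordered pairs involving English
non_english_pairs = (n - 1) * (n - 2) // 2    # unordered pairs without English

print(len(directions), english_centric_pairs, non_english_pairs)  # 30 5 10
```

Each unordered pair contributes two directions, so 2 × (5 + 10) = 30 matches |L| × (|L| − 1), and each of the |L| finetuned systems covers the |L| − 1 directions that share its target language.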

WMT'21 Multilingual Translation Task
We experiment with two small tasks of the WMT'21 large-scale multilingual translation task. The tasks provide multilingual multi-way parallel corpora from the Flores 101 data (Wenzek et al., 2021). The parallel sentences are provided among English (en), five Central and East European languages {Croatian (hr), Hungarian (hu), Estonian (et), Serbian (sr), Macedonian (mk)} for Task 1, and five Southeast Asian languages {Javanese (jv), Indonesian (id), Malay (ms), Tagalog (tl), Tamil (ta)} for Task 2. We removed sentence pairs in which either side is an empty line, and eventually collected data with (English-centric, non-English-centric) = (321M, 651M) sentence pairs in total. The data size per direction ranges from 0.07M to 83.9M. To balance the data distribution across languages (Kudugunta et al., 2019), we up-sample the low-resource languages with temperature=5. We append language ID tokens at the end of source sentences to specify a target language (Johnson et al., 2017). We tokenize the data with SentencePiece (Kudo and Richardson, 2018) and build a shared vocabulary with 64k tokens. We train Transformer models (Vaswani et al., 2017) consisting of an m-layer encoder and an n-layer decoder with (hidden dim., ffn dim.) = (768, 3072) in a complete multilingual many-to-many fashion. We have two settings, (m, n) = (12, 6) for "12E6D" and (24, 12) for "24E12D", to learn from the diverse multilingual dataset. The model parameters are optimized by using RAdam (Liu et al., 2020) with an initial learning rate of 0.025, and warm-up steps of 10k and 30k for the 12E6D and 24E12D model training, respectively. The systems are pretrained on 64 V100 GPUs with a mini-batch size of 3072 tokens and gradient accumulation of 16. After the pretraining, the models are finetuned on a subset of X-L training data. We finetune the model parameters gently on 8 V100 GPUs with the same mini-batch size, gradient accumulation, and optimizer, but with a different learning rate schedule of (init_lr, warm-up steps) = ({1e-4, 1e-5, 1e-6}, 8k). The best checkpoints are selected based on development loss. The translations are obtained by beam search decoding with a beam size of 4, unless otherwise stated.
Table 1: Average sacreBLEU scores for many-to-L directions on both Task 1 and 2, and the data statistics of the corresponding L-centric training data (L={en, hu, hr, sr, et, id, ms, tl, mk, jv, ta}). All the multilingual systems including many-to-one baselines and the proposed model are 12E6D. Note that the "Pivot-based" system for many-to-English directions is identical to "Bilingual".
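The temperature-based up-sampling and the target-language tagging described above can be sketched as follows. The paper does not spell out its sampling formula, so this assumes the commonly used form in which each direction is sampled with probability proportional to its data share raised to the power 1/T. The tag format `<2xx>` and the direction names are hypothetical; only the sizes 0.07M and 83.9M and T = 5 come from the text.

```python
def temperature_sampling_probs(sizes, temperature=5.0):
    """Sampling probability per direction: p_i proportional to (n_i / N) ** (1 / T).

    T = 1 reproduces the raw data distribution; larger T flattens it,
    which up-samples low-resource directions.
    """
    total = sum(sizes.values())
    scaled = {name: (n / total) ** (1.0 / temperature) for name, n in sizes.items()}
    norm = sum(scaled.values())
    return {name: value / norm for name, value in scaled.items()}

def add_target_tag(source_sentence, target_lang):
    """Append a target-language ID token at the end of the source sentence.

    The "<2xx>" token format is an assumption made for illustration.
    """
    return f"{source_sentence} <2{target_lang}>"

# Two hypothetical directions at the extremes of the reported size range.
sizes = {"high_resource": 83.9e6, "low_resource": 0.07e6}
probs = temperature_sampling_probs(sizes, temperature=5.0)
raw_share = sizes["low_resource"] / sum(sizes.values())

print(f"raw share: {raw_share:.4f}, sampled share: {probs['low_resource']:.4f}")
print(add_target_tag("Dobar dan", "hu"))
```

With T = 5, the low-resource direction is sampled far more often than its raw share of under 0.1% would suggest, while the high-resource direction still dominates.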
Baselines For system comparison, we build three different baselines: 1) direct bilingual systems, 2) pivot translation systems via English (only applicable for non-English X-Y evaluation) (Utiyama and Isahara, 2007), and 3) many-to-one multilingual systems with the 12E6D architecture. The bilingual and pivot-based baselines employ the Transformer base architecture. The embedding dimension is set to 256 for jv, ms, ta, and tl, because of the training data scarcity. For the X-Y pivot translation, a source sentence in language X is translated to English with a beam size of 5 by the X-En model, then the best output is translated to the final target language Y by the En-Y model.
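The pivot baseline can be sketched as a two-step pipeline. The `Translator` callable type and the toy lookup translators are hypothetical stand-ins for the X-En and En-Y bilingual models. The first-step beam size of 5 is from the text; the second-step beam size is not stated for the pivot setting, so the default of 4 is an assumption.

```python
from typing import Callable, List

# Hypothetical interface: (sentence, beam_size) -> hypotheses ranked best first.
Translator = Callable[[str, int], List[str]]

def pivot_translate(sentence: str,
                    x_to_en: Translator,
                    en_to_y: Translator,
                    first_beam: int = 5,
                    second_beam: int = 4) -> str:
    """Translate X -> English, then feed the single best English output to En -> Y."""
    best_english = x_to_en(sentence, first_beam)[0]
    return en_to_y(best_english, second_beam)[0]

# Toy stand-ins for real models, used only to show the data flow.
def toy_hr_to_en(sentence: str, beam_size: int) -> List[str]:
    return [{"dobar dan": "good day"}[sentence]]

def toy_en_to_hu(sentence: str, beam_size: int) -> List[str]:
    return [{"good day": "jó napot"}[sentence]]

print(pivot_translate("dobar dan", toy_hr_to_en, toy_en_to_hu))
```

Because only the single best English hypothesis is passed on, any error in the first step propagates to the second, and every X-Y request costs two full decoding passes.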

Results
All results on the test sets are displayed in Figure 1 and Table 1, where we report the case-sensitive sacreBLEU score (Post, 2018) for translation accuracy. Overall, our best systems ("24E12D+FT") are significantly better, by ≥ +0.5 sacreBLEU, for 83% and 88% of the directions against the bilingual baselines and the pivot translation baselines, respectively. In Table 1, we present the average sacreBLEU scores for many-to-L directions, showing that our proposed approach achieved the best performance in most targeted languages. Compared to the many-to-one multilingual baselines, the proposed approach of utilizing the complete MNMT model transfers multilingual representations more effectively to the targeted translation directions as the L-centric data size becomes smaller. We also note that the winning system of the shared task achieved (task1, task2) = (37.6, 33.9) BLEU with a 36-layer encoder and 12-layer decoder model (Yang et al., 2021a) that is pretrained on extra language data including parallel and monolingual data, while our best system with a 24-layer encoder and 12-layer decoder obtained (task1, task2) = (25.7, 22.8) sacreBLEU, without using those extra data.
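The "better by ≥ +0.5 sacreBLEU for a given share of directions" statistic above can be computed as in the sketch below. The scores are made-up placeholder values chosen only to exercise the function; they are not results from the paper, and the function name is hypothetical.

```python
def fraction_improved(system_scores, baseline_scores, margin=0.5):
    """Share of directions where the system beats the baseline by at least `margin`."""
    shared = [d for d in system_scores if d in baseline_scores]
    wins = sum(1 for d in shared if system_scores[d] - baseline_scores[d] >= margin)
    return wins / len(shared)

# Placeholder scores for four directions (illustrative only, not from the paper).
system_scores = {"hr-hu": 20.0, "hu-hr": 21.0, "sr-mk": 30.0, "mk-sr": 29.2}
baseline_scores = {"hr-hu": 18.0, "hu-hr": 20.8, "sr-mk": 27.5, "mk-sr": 28.0}

print(fraction_improved(system_scores, baseline_scores))  # 0.75
```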

In-house Extremely Large-Scale Setting
Deploying ever larger models is not always feasible. Computational resources at inference time are often limited, which leads to a trade-off between translation performance and the decoding cost imposed by the model architecture. In this section, we validate our proposed approach in an extremely large-scale data setting and also discuss how we can build lighter NMT models without performance loss by distilling the proposed MNMT systems (Kim and Rush, 2016). We briefly touch on the following three topics: 1) multi-way multilingual data collection, 2) English-centric vs. multi-centric pretraining for X-Y translations, and 3) a lighter NMT model that addresses the trade-off between performance and latency. Then, we report the experimental results in the extremely large-scale setting.

Multilingual Data Collection
We build an extremely large-scale data set using our in-house English-centric data set, consisting of 10 European languages, ranging from 24M to 192M sentences per language. This contains available parallel data and back-translated data between English and {German (de), French (fr), Spanish (es), Italian (it), Polish (pl), Greek (el), Dutch (nl), Portuguese (pt), and Romanian (ro)}. From these English-centric corpora, we extract multi-way multilingual X-Y data by aligning En-X and En-Y data via pivoting on English. Specifically, we extracted {de, fr, es, it, pl}-centric data and concatenated them to the existing direct X-Y data, providing 78M-279M sentence pairs per direction. As in Section 3, we build a shared SentencePiece vocabulary, here with 128k tokens to address the large-scale setting.
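The extraction of X-Y pairs by pivoting on English can be sketched as a join on the English side. The paper does not describe its alignment procedure in detail, so the exact-match criterion and the function name below are assumptions, and the sentences are toy examples.

```python
from collections import defaultdict

def extract_multiway(en_x_pairs, en_y_pairs):
    """Join En-X and En-Y corpora on identical English sentences to obtain X-Y pairs.

    Each input is an iterable of (english, foreign) tuples. Exact string
    matching on the English side is an assumption made for illustration.
    """
    x_by_english = defaultdict(list)
    for english, x_sentence in en_x_pairs:
        x_by_english[english].append(x_sentence)

    xy_pairs = []
    for english, y_sentence in en_y_pairs:
        for x_sentence in x_by_english.get(english, []):
            xy_pairs.append((x_sentence, y_sentence))
    return xy_pairs

en_de = [("Good morning.", "Guten Morgen."), ("Thank you.", "Danke.")]
en_fr = [("Good morning.", "Bonjour."), ("See you soon.", "À bientôt.")]

print(extract_multiway(en_de, en_fr))  # [('Guten Morgen.', 'Bonjour.')]
```

Only English sentences that occur in both corpora yield an X-Y pair, so the size of the extracted data depends on how much the English sides overlap.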

En-centric vs Multi-centric Pretraining
In a large-scale data setting, a question might come up: which pretrained model provides generalized multilingual representations that achieve better X-Y translation quality? Considering that English often dominates text data, e.g., 70% of the tasks are English-centric in the WMT'21 news translation task, a model supervised on English-centric corpora might learn representations that are sufficient to transfer to X-Y translations. To investigate the usefulness of the multi-centric data training, we pretrain our Transformer models with the deeper 24-12 layers described in Section 3 on the English-centric data and on the L-centric data (L={en,de,fr}), individually. After pretraining, we apply the multilingual many-to-one finetuning with a subset of the training data and evaluate each system for the fully supervised X-Y directions, i.e., xx-{en,de,fr}, and the partially supervised X-Y directions, i.e., xx-{es,it,pl}. We followed the same training and finetuning settings as described in Section 3, unless otherwise stated.
MNMT with Light Decoder At the practical level, one drawback of large-scale models is latency at inference time. This is mostly caused by the high computational cost in the decoder layers, due to auto-regressive decoding and the extra cross-attention network in each block of the decoder. Recent studies (Kasai et al., 2021; Hsu et al., 2020; Li et al., 2021) have experimentally shown that models with a deep encoder and a shallow decoder can address the issue without losing much performance. Fortunately, such an architecture also satisfies the demands of many-to-one MNMT training, which requires the encoder networks to be more complex to handle various source languages. To examine the light NMT model architecture, we train the Transformer base architecture modified with 9-3 layers (E9D3) in a bilingual setting and compare it with a standard Transformer base model with 6-6 layers (E6D6) as a baseline. Additionally, we report direct X-Y translation performance when distilling the best large-scale MNMT models, with the light NMT models as student models. More specifically, following Kim and Rush (2016), we train light NMT student models (E9D3) that serve many-to-L translations (L={de, fr, es, it, pl}).
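The distillation step can be sketched as building a synthetic training corpus from teacher outputs. This assumes the sequence-level variant of Kim and Rush (2016), in which the student is trained on the teacher's decoded translations; the text does not state which variant was used. The `teacher_translate` callable and the toy teacher are hypothetical stand-ins for the finetuned MNMT model.

```python
from typing import Callable, Iterable, List, Tuple

# Hypothetical teacher interface: (sentence, source_lang, target_lang) -> translation.
Teacher = Callable[[str, str, str], str]

def build_distillation_corpus(sources: Iterable[Tuple[str, str]],
                              teacher_translate: Teacher,
                              target_lang: str) -> List[Tuple[str, str, str]]:
    """Pair each source sentence with the teacher's translation into `target_lang`.

    `sources` yields (source_lang, sentence). The result is the many-to-L
    training set for one light student model.
    """
    corpus = []
    for source_lang, sentence in sources:
        if source_lang == target_lang:
            continue  # skip same-language inputs: there is nothing to translate
        hypothesis = teacher_translate(sentence, source_lang, target_lang)
        corpus.append((source_lang, sentence, hypothesis))
    return corpus

# Toy teacher used only to show the data flow.
def toy_teacher(sentence: str, source_lang: str, target_lang: str) -> str:
    return f"[{source_lang}->{target_lang}] {sentence}"

sources = [("fr", "Bonjour."), ("es", "Gracias."), ("de", "Danke.")]
corpus = build_distillation_corpus(sources, toy_teacher, target_lang="de")
print(corpus)
```

One such corpus is built per target language L, which mirrors the many-to-L structure of the finetuned teachers.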
Results Table 2 reports average sacreBLEU scores for many-to-one directions in our in-house X-Y test sets. For the xx-{de,fr} directions, the proposed finetuning helps both English-centric and multi-centric pretrained models to improve the accuracy. Overall, the finetuned multi-centric models achieved the best results, largely outperforming the English pivot-based baselines by +2.6 and +2.8 points. For the comparison among the multilingual systems, the multi-centric model without finetuning already surpasses the finetuned English-centric systems with a large margin of +0.9 and +0. that, by pretraining a model on more multi-centric data, the model learns better multilinguality to transfer. For the xx-{es,it,pl} directions, the finetuned multi-centric systems gain a similar accuracy improvement, outperforming the conventional pivot-based baselines on average. Figure 2 shows the effectiveness of our light NMT model architecture for five bilingual En-X directions, reporting the translation performance in sacreBLEU scores and the latency measured on CPUs. Our light NMT model (E9D3) achieves an almost 2x speed-up without much performance drop in all directions. Employing this light model architecture as a student model, we report the distilled many-to-one model performance in Table 3, measured by sacreBLEU and COMET scores (Rei et al., 2020). For consistent comparison, we also built English bilingual baselines (E6D6) that are distilled from the bilingual teachers, and used them to obtain the English pivot-based translation performance. For all the many-to-L directions (L={de,fr,es,it,pl}), the light NMT models that are distilled from the best MNMT models show the best performance in both metrics. We also note that our direct X-Y light NMT systems save 75% of the decoding cost compared with pivot translation.

Table 3: Average direct X-Y translation performance of our proposed light NMT models (columns: Models, BLEU, COMET). All "Our" NMT systems employ the light models (E9D3) that are distilled from the best systems reported in Table 2.

Conclusion
This paper proposes a simple but effective two-stage training strategy for MNMT systems that serve arbitrary X-Y translations. To support translations across languages, we first pretrain a complete multilingual many-to-many model, then transfer the representations by finetuning the model in a many-to-one multilingual fashion. In the WMT'21 translation task, we experimentally showed that the proposed approach substantially improves translation accuracy for most X-Y directions against the strong conventional baselines of bilingual systems, pivot translation systems, and many-to-one multilingual systems. We also examined the proposed approach in the extremely large-scale setting, addressing practical questions such as multi-way parallel data collection, the usefulness of multilinguality during pretraining and finetuning, and how to reduce the decoding cost while achieving better X-Y quality.