Lego-MT: Learning Detachable Models for Massively Multilingual Machine Translation

Multilingual neural machine translation (MNMT) aims to build a unified model for many language directions. Existing monolithic models for MNMT encounter two challenges: parameter interference among languages and inefficient inference for large models. In this paper, we revisit the classic multi-way structures and develop a detachable model by assigning each language (or group of languages) to an individual branch that supports plug-and-play training and inference. To address the needs of learning representations for all languages in a unified space, we propose a novel efficient training recipe, upon which we build an effective detachable model, Lego-MT. For a fair comparison, we collect data from OPUS and build a translation benchmark covering 433 languages and 1.3B parallel data. Experiments show that Lego-MT with 1.2B parameters brings an average gain of 3.2 spBLEU. It even outperforms M2M-100 with 12B parameters. The proposed training recipe brings a 28.2$\times$ speedup over the conventional multi-way training method.\footnote{ \url{https://github.com/CONE-MT/Lego-MT}.}


Introduction
Multilingual neural machine translation (MNMT) translates languages by mapping a source sentence to a unified representation space and decoding a target sentence from this space (Johnson et al., 2017; Gu et al., 2018; Neubig and Hu, 2018; Aharoni et al., 2019; Zhang et al., 2020). Traditional MNMT models use a shared network to align representations in different languages. Recently, scaling up the size of MNMT models has brought significant quantitative improvements and new qualitative capabilities (M2M-100, Fan et al. 2021; NLLB-200, Costa-jussà et al. 2022; inter alia). Beyond MNMT, recent large-scale language models (e.g., ChatGPT) also show promising results on zero-shot (or few-shot) translation, especially for language-to-English translation. Despite this great potential, there is still a large gap between LLMs and existing MNMT models on massive translation directions.
Simply using a shared model for massive MNMT brings new effectiveness and efficiency issues. First, memorizing multilingual knowledge within finite parameters causes parameter interference (Ha et al., 2016a), especially between high-resource and low-resource languages (Li and Gong, 2021), which leads to significant performance degradation. Second, the centralized structure requires all parameters to be included in the computation graph during the inference stage, resulting in heavy computational overhead (Song et al., 2021). Common fixes for these issues include adapter-based approaches (Zhu et al., 2021), which handle parameter interference by fine-tuning new parameters to fit bilingual translation, and mixture-of-expert (MoE) models, which support dynamic activation. These methods either fail to adapt to massive translation directions or require all parameters to be loaded into memory, and thus remain unsatisfactory considering the efficiency of training and inference.
To find the best recipe for massive multilingual translation, we revisit the classic multi-way (or multi-branch) architecture (Dong et al., 2015; Firat et al., 2016), whose philosophy is to allocate an individual encoder and decoder for each language (or group of languages), as shown in Figure 1. The immediate benefits of this structure are: 1) the utilization of individual modules for specific languages mitigates parameter interference; 2) each branch can be loaded independently during inference, significantly reducing computational costs and decreasing inference latency.
Despite its appeal, there remain two big challenges when training multi-way structures: representation alignment between different languages due to the lack of shared parameters, and low GPU efficiency during training because unused parameters occupy GPU memory without contributing any computation. Furthermore, the random mixture of languages within a batch makes it infeasible to use an online-loading method (i.e., loading during usage) to accelerate training, since it would incur impractical IO communication costs for batch switching (between CPU and GPU).
To address these challenges, we propose a novel training recipe, which results in our new detachable model, Lego-MT. First, we classify the training data into different language-centric groups, so that we only need to load specific branches into GPU memory, eliminating the need to constantly swap modules; the language-centric groups are trained in sequential order. Second, during each language-centric training phase, we introduce a multilingual branch and propose a new triple-flow method to help the model learn to map to and translate from a unified space. Specifically, a unified space is a type of representation space rather than a module; it creates a common representation of language that can be used across multiple language tasks.
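As a rough illustration of the grouping step, the following Python sketch (not the released implementation; the record format, `CORE_LANGS` set, and helper names are assumptions) shows how parallel data could be organized into language-centric groups so that each training phase only needs the branches of one centric language plus the multilingual branch.

```python
from collections import defaultdict

# Hypothetical record format: (src_lang, tgt_lang, src_text, tgt_text).
CORE_LANGS = {"en", "zh", "de", "ar", "ne", "az", "ceb"}  # the 7 centric languages

def build_language_centric_groups(parallel_pairs):
    """Group parallel data so that each training phase only touches the
    branches of a single centric language (plus the multilingual branch)."""
    groups = defaultdict(list)
    for src, tgt, src_text, tgt_text in parallel_pairs:
        if src in CORE_LANGS:
            groups[src].append((src, tgt, src_text, tgt_text))  # lg -> * (Enc-Flow data)
        if tgt in CORE_LANGS:
            groups[tgt].append((src, tgt, src_text, tgt_text))  # * -> lg (Dec-Flow data)
    return groups

# The groups are then visited in a fixed sequential order; only the centric
# language's encoder/decoder and the multilingual branch stay in GPU memory.
```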
To evaluate our training recipe for massive MNMT, we construct a many-to-many translation dataset covering 7 language-centric groups and 433 languages, based on the open-source website OPUS (Tiedemann, 2012).
The proposed training recipe brings a 28.2$\times$ speedup compared with the conventional multi-way training method. We also conduct comprehensive experiments on branch combinations, thanks to the detachable nature of the model. We find that low-resource languages prefer multilingual branches while high-resource languages prefer language-specific branches. In addition, we observe that the unseen combination of a high-resource language encoder and a high-resource language decoder can achieve better performance, showing that Lego-MT can align different branches into a unified space effectively. The main contributions can be summarized as follows: • We build an effective detachable model, Lego-MT, for multilingual machine translation.

Related Work
In this part, we review recent multilingual machine translation models. We classify them into three categories: fully-shared, group-shared (Dabre et al., 2020), and Mixture-of-Expert (MoE) models.
The fully-shared model is the most prevalent model in multilingual neural machine translation (MNMT). It employs a single architecture to translate in all directions (Ha et al., 2016b; Johnson et al., 2017; Bapna et al., 2019; Lin et al., 2020; Liu et al., 2020; Pan et al., 2021; Sun et al., 2021) and has demonstrated efficacy in aiding low-resource directions. However, fully-shared models are often subject to capacity bottlenecks and trade-offs between translation quality and the number of languages (Aharoni et al., 2019; Zhang et al., 2020; Ha et al., 2016a). Group-shared models incorporate individual parameters for each group and represent a popular solution for sharing language-specific encoders or decoders (Lee et al., 2017; Zoph and Knight, 2016). Lee et al. (2017); Sachan and Neubig (2018); Ji et al. (2020); Lyu et al. (2020) proposed MNMT models with shared language-specific modules. LaSS (Lin et al., 2021) learns language-specific sub-networks for each language direction for multilingual translation. Adapter methods (Bapna and Firat, 2019; Zhu et al., 2021) add additional side networks for each language direction on top of the main multilingual Transformer encoder-decoder. While these studies can alleviate the capacity bottleneck to some extent, challenges remain when handling larger-scale languages.
Mixture-of-Expert (MoE) models have recently emerged as a prominent research direction (Jacobs et al., 1991; Shazeer et al., 2017; Lepikhin et al., 2020; Fedus et al., 2021; Du et al., 2022; Fan et al., 2021; Costa-jussà et al., 2022). They are sparsely activated, with each inference activating only a subset of parameters. Researchers have applied MoE to massively multilingual translation and introduced various regularization strategies to enhance performance (Dai et al., 2022; Costa-jussà et al., 2022). Despite promising results, MoE's objective differs from ours, as it still requires the entire structure to be stored in GPU memory during inference.
The encoder-decoder structure has also demonstrated considerable flexibility through Lego-NN (Dalmia et al., 2022). Lego-NN can be applied to various tasks with detachable decoder modules; in contrast, the Lego-MT design supports massively multilingual NMT with all modules being detachable.

Overview
This paper aims to build a detachable multi-branch model with a language (or group)-specific encoder and a language (or group)-specific decoder. As shown in Figure 2, the detachable structure provides an effective mechanism to load only a subset of modules during training and inference.
During training, we introduce a new training method by classifying multilingual data into language-centric groups. During each training phase, only language-centric data and the related branches are loaded. All language-centric groups are trained in a sequential way. We empirically found that the training order contributes little to the final performance, so we fix the order for simplicity in the remainder of the paper.
During each language-centric training phase, we introduce a multilingual branch to help language-specific branches learn to encode to a unified space and decode from a unified space. The unified space is a concept that aims to map all languages into a unified representation space without any parameters. This concept is used in natural language processing and machine learning to create a common representation of language (Lyu et al., 2020; Fan et al., 2021) that can be used across different languages.
The training maintains a triple flow: Enc-Flow (language-specific encoder + multilingual decoder) to train the specific encoder, Dec-Flow (multilingual encoder + language-specific decoder) to train the language-specific decoder, and Mix-Flow (multilingual encoder + multilingual decoder) to avoid overfitting of the multilingual encoder and decoder to each language-centric training set. Surprisingly, we find that Dec-Flow cannot be trained together with Mix/Enc-Flow, as this results in catastrophic forgetting in the multilingual encoder (detailed discussion in Section 5). Therefore, the training process can be briefly divided into two stages: the Mix/Enc-Flow phase and the Dec-Flow phase.
During inference, Lego-MT offers three alternative flows for language-centric translation ("Inference Stage" in Figure 2). As shown in Figure 2, users can choose which path to use for inference.

Triple-Flow Training
Given a multilingual dataset with $N$ languages, each $D_{s_i \to t_j}$ contains parallel data from the source language $s_i$ to the target language $t_j$, where $s_i$ refers to the $i$-th ($i \in [1, N]$) language being translated from and $t_j$ refers to the $j$-th ($j \in [1, N]$) language being translated into. One-to-many multilingual data for a specific language (lg) can be expressed as $D_{lg \to \bullet} = \{D_{lg \to t_1}, ..., D_{lg \to t_j}, ..., D_{lg \to t_N}\}$. Similarly, many-to-one multilingual data for a specific language (lg) can be denoted as $D_{\bullet \to lg} = \{D_{s_1 \to lg}, ..., D_{s_i \to lg}, ..., D_{s_N \to lg}\}$. Every input sequence is preceded by a special tag (called the language tag) to indicate the source and target languages. During each training phase, the triple flows play different roles: Mix-Flow, Enc-Flow, and Dec-Flow.
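To make the tagging convention concrete, a minimal sketch is shown below; the exact tag format (e.g., `__en__`) is an assumption rather than the model's actual special-token vocabulary.

```python
def add_language_tags(src_text, tgt_text, src_lang, tgt_lang):
    """Prepend language tags to a parallel pair, marking the source language
    on the encoder side and the target language on the decoder side."""
    tagged_src = f"__{src_lang}__ {src_text}"
    tagged_tgt = f"__{tgt_lang}__ {tgt_text}"
    return tagged_src, tagged_tgt

# add_language_tags("Hello world", "Hallo Welt", "en", "de")
# -> ("__en__ Hello world", "__de__ Hallo Welt")
```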

Mix-Flow
Mix-Flow is built upon a multilingual encoder branch and a multilingual decoder branch. It is trained on multilingual-to-multilingual data, where all language data is mixed together. This flow learns a mapping function $f$ from a sentence in any language to another language. The input source sequence is preceded by a special tag (the language tag) to indicate the source language; following traditional methods, we also add a target language tag on the decoder side. The training loss for Mix-Flow is:
$$l_m = \sum_{(x, y) \sim D_{multi}} -\log P_{\theta_m}(y \mid x), \quad (1)$$
where $(x, y)$ is a pair sampled from the multilingual training data $D_{multi}$ and $\theta_m$ denotes the parameters of the multilingual branches. Mix-Flow is used to avoid over-fitting to language-specific data in Enc-Flow and Dec-Flow.

Enc-Flow
Enc-Flow includes a language-specific encoder and a multilingual decoder. It is trained with one-to-many multilingual data. This design is natural for language-specific encoder training: the encoder input comes from the same source language lg, while the decoder handles multilingual data. The language tag is added on both the encoder and decoder sides. The training loss for the language-specific Enc-Flow is:
$$l_e = \sum_{(x, y) \sim D_{lg \to \bullet}} -\log P_{\theta_e}(y \mid x), \quad (2)$$
where $(x, y)$ is a pair sampled from the one-to-many training data $D_{lg \to \bullet}$ and $\theta_e$ denotes the parameters used in Enc-Flow.

Dec-Flow
Dec-Flow includes a multilingual encoder and a language-specific decoder. It is trained with many-to-one translation data. We separate the training of Dec-Flow from the training of Enc-Flow and Mix-Flow.
The parameters used for training Dec-Flow are initialized with the latest model trained by Mix-Flow and Enc-Flow. The language tag is added on both the encoder and decoder sides. Given a many-to-one dataset $D_{\bullet \to lg}$, the training loss is:
$$l_d = \sum_{(x, y) \sim D_{\bullet \to lg}} -\log P_{\theta_d}(y \mid x), \quad (3)$$
where $(x, y)$ is a pair sampled from the many-to-one training data and $\theta_d$ denotes the parameters used in Dec-Flow.
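All three flows share the same negative log-likelihood form and differ only in which encoder/decoder branches and which data they use. The PyTorch-style sketch below illustrates Equations 1-3 under simplified, assumed interfaces (it is not the fairseq implementation).

```python
import torch.nn.functional as F

def flow_loss(encoder, decoder, batch, pad_index=0):
    """Negative log-likelihood sum over a batch, i.e. -log P(y|x), as in Eq. 1-3.
    Which branches are passed in determines the flow:
      Mix-Flow: multilingual encoder + multilingual decoder on multilingual data
      Enc-Flow: language-specific encoder + multilingual decoder on lg->* data
      Dec-Flow: multilingual encoder + language-specific decoder on *->lg data
    Tensor shapes and batch keys are illustrative assumptions."""
    hidden = encoder(batch["src_tokens"])               # map source into the unified space
    logits = decoder(batch["prev_tgt_tokens"], hidden)  # (batch, tgt_len, vocab)
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        batch["tgt_tokens"].reshape(-1),
        ignore_index=pad_index,
        reduction="sum",
    )
```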

Training Algorithm
Algorithm 1 shows the whole training procedure.
We will go into the effects of the two-stage design in Section 5. In the first stage, we initialize each module of the Lego-MT model with a pre-trained MT model $\theta_0$. After initialization, we shuffle the one-to-many dataset to obtain a new training sequence for Enc-Flow training. In the second stage, we fix the encoder parameters of Mix-Flow, $\theta_m$, and learn the Dec-Flow decoder $\theta_d$. The iteration keeps running for $L$ epochs. During inference, users can decide which flow to load. We also evaluate the gap between these inference flows in the experiments.
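The two-stage schedule of Algorithm 1 can be sketched as follows, reusing the `flow_loss` helper from the previous sketch; module and data-loader interfaces are simplified assumptions rather than the released fairseq training code.

```python
import itertools
import torch

def train_lego_mt(enc_lg, dec_lg, enc_multi, dec_multi,
                  data_lg2any, data_any2lg, data_multi, epochs, lr=1e-4):
    # Stage 1: jointly update Mix-Flow (multilingual branches) and Enc-Flow
    # (language-specific encoder); the multilingual branches receive gradients
    # from both losses, the language-specific encoder only from the Enc-Flow loss.
    stage1_params = itertools.chain(enc_multi.parameters(),
                                    dec_multi.parameters(),
                                    enc_lg.parameters())
    opt1 = torch.optim.Adam(stage1_params, lr=lr, betas=(0.9, 0.999))
    multi_batches = itertools.cycle(data_multi)
    for _ in range(epochs):
        for batch_e in data_lg2any:                                       # lg -> * batches
            loss_e = flow_loss(enc_lg, dec_multi, batch_e)                # Eq. 2
            loss_m = flow_loss(enc_multi, dec_multi, next(multi_batches)) # Eq. 1
            opt1.zero_grad()
            (loss_m + loss_e).backward()
            opt1.step()

    # Stage 2: freeze the multilingual encoder and train only the
    # language-specific decoder (Dec-Flow), initialized from stage 1.
    for p in enc_multi.parameters():
        p.requires_grad_(False)
    opt2 = torch.optim.Adam(dec_lg.parameters(), lr=lr, betas=(0.9, 0.999))
    for _ in range(epochs):
        for batch_d in data_any2lg:                                       # * -> lg batches
            loss_d = flow_loss(enc_multi, dec_lg, batch_d)                # Eq. 3
            opt2.zero_grad()
            loss_d.backward()
            opt2.step()
```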

Experiments
While Lego-MT is generic, we focus the experiments on M2M-100-1.2B as the backbone model, since M2M-100 is a leading MT model.

Dataset
Training Data We create a Many-to-Many dataset from OPUS, covering 7 language-centric groups and 433 languages. The 7 core languages are En, Zh, De, Ar, Ne, Az, and Ceb. The construction process is delineated in Appendix A, and all training pairs have been deduplicated against Flores-101.
Evaluation Data We use Flores-101 (Fan et al., 2021) as the evaluation set, which provides human-written translation pairs covering 101 languages.

Baselines
We conduct experiments using a pre-trained multilingual machine translation model, M2M-100-1.2B (Fan et al., 2021), as initialization. We build 7 language-specific encoders and 7 language-specific decoders to model the 7 core languages. We compare Lego-MT with the following baselines.
Flores-175MB / 615MB The 175MB and 615MB translation models released with Flores-101 (Goyal et al., 2022).
LG-Centric Fine-Tuning To build a fair comparison, we also use the constructed dataset to fine-tune M2M-100-1.2B. We follow the standard fine-tuning paradigm, which uses a Transformer initialized with M2M-100-1.2B. In this baseline, we only use LG-centric data to train the model; we simply merge all translation pairs related to language LG to obtain the mixed training data. As in our model, we also add language codes in the encoder and decoder parts.
M2M-100-1.2B w. Multilingual Fine-Tuning To establish an equitable comparison, the constructed dataset is also used to fine-tune M2M-100-1.2B with all translation data merged. Correspondingly, language codes are incorporated in both the encoder and decoder components, as in our model.

Settings and Metric
Training Details The training code is built on the fairseq code repository. Each flow is initialized with a pre-trained M2M-100-1.2B model. We train all models using the Adam optimizer with $\beta_1 = 0.9$, $\beta_2 = 0.999$, a learning rate of 1e-4, and a maximum token number of 8,000. The training of all centric languages is conducted in a random order: En, De, Ne, Az, Ceb, Ar, Zh. We split the whole dataset into 70 shards, and the whole training process takes around 15 days on 32 A100 GPUs.
Metric We use the same evaluation metric (spBLEU) as the Flores-101 dataset. Before computing BLEU, we de-tokenize all data and then apply SentencePiece tokenization for each language, which facilitates a more accurate assessment of model quality on the long tail of low-resource languages.
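For reference, spBLEU can be computed along these lines; relying on sacrebleu's built-in 'flores101' SentencePiece tokenizer is an assumption (it ships with recent sacrebleu versions), and the same score can be reproduced by applying the Flores SPM model manually and scoring with tokenize='none'.

```python
# pip install sacrebleu  (recent versions bundle a "flores101" SPM tokenizer)
import sacrebleu

def spbleu(hypotheses, references):
    """spBLEU: BLEU computed over SentencePiece pieces of the Flores-101 model."""
    bleu = sacrebleu.metrics.BLEU(tokenize="flores101")
    return bleu.corpus_score(hypotheses, [references]).score

# Example (toy strings, not Flores data):
# spbleu(["The cat sat on the mat."], ["The cat sat on the mat."])  # ~100.0
```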

Results
Lego-MT is an efficient translation model, outperforming M2M-100-12B with only 10% of the inference parameters. Table 1 shows the experimental results on the Flores-101 devtest set. As we can see, Lego-MT achieves large performance improvements over M2M-100-1.2B, with 7.4 spBLEU improvements on many-to-one translation and 5.8 spBLEU improvements on one-to-many translation. It even outperforms M2M-100-12B, especially in the many-to-one setting, with a gain of 5.0 spBLEU. As a comparison, with the same training data, the shared model M2M-100-1.2B only obtains slight performance improvements: 4.3 spBLEU on many-to-one translation and 2.5 spBLEU on one-to-many translation.
These results demonstrate that Lego-MT provides an effective solution, achieving better results with fewer inference parameters.
Compared with high-resource translation, low-resource translation benefits more from multi-way architectures. We observe that the improvements achieved by Lego-MT are not equally contributed by different languages. In real-world applications with unlimited data, inference costs are more critical than training costs. The advantage of Lego-MT is that it largely improves translation performance without incurring additional inference costs.

Analysis on Lego-MT
Ablation studies on triple-flow training We design three flows in Lego-MT: Mix-Flow, Enc-Flow, and Dec-Flow. Mix-Flow contains a multilingual encoder and a multilingual decoder, which is essential in regularizing language-specific training.
We start from Mix-Flow and examine how Enc-Flow and Dec-Flow affect the final performance, which gives more insight into the design of our framework.
For simplicity, we use Chinese-centric data in the top-10 shards and select 7 Zh→X and X→Zh translation pairs as a small training set, which includes high-resource languages (Be, De, Fa, Jv) and low-resource languages (Ne, Pa, Sw). We train Lego-MT on the selected set and report the results in Table 3. We can see that jointly training Enc-Flow and Mix-Flow boosts the performance in most directions. In contrast, jointly training Dec-Flow and Mix-Flow causes large performance degradation.
This is mainly because the language-specific decoder may cause a large distribution shift in the multilingual encoder, resulting in catastrophic forgetting.
That is why we split the training into two stages and keep Dec-Flow in the second stage.
Analysis on inference paths Due to the plug-and-play feature, there are several possible inference paths for a single translation direction. At the inference stage, there are three alternative solutions for language-centric translation: Mix-Flow, Enc-Flow, and Dec-Flow. Figure 2 shows the comparison between these inference paths. For low-resource languages (e.g., Ceb, Az, Ne), Mix-Flow (M-encoder + M-decoder) works better than either Enc-Flow (E-encoder + M-decoder) or Dec-Flow (M-encoder + D-decoder). High-resource languages (e.g., En, De, Zh, Ar) prefer language-specific branches: Dec-Flow (a multilingual encoder and a language-specific decoder) achieves the best performance among these paths. This demonstrates that specific parameters are more important when the amount of data in a language is huge. In summary, Mix-Flow (M-encoder + M-decoder) is recommended for inference on low-resource languages, and Dec-Flow (M-encoder + D-decoder) is more appropriate for high-resource languages.
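The heuristic above can be sketched as a simple branch-selection rule; the `load_branch` helper and the low-resource grouping below are illustrative assumptions.

```python
LOW_RESOURCE = {"ceb", "az", "ne"}  # assumed low-resource grouping from the analysis

def pick_inference_branches(src_lang, tgt_lang, load_branch):
    """Return the (encoder, decoder) pair to load for one direction:
    Mix-Flow for low-resource directions, Dec-Flow otherwise."""
    if src_lang in LOW_RESOURCE or tgt_lang in LOW_RESOURCE:
        return load_branch("enc", "multi"), load_branch("dec", "multi")  # Mix-Flow
    return load_branch("enc", "multi"), load_branch("dec", tgt_lang)     # Dec-Flow
```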

Lego-MT can learn to align different branches into a unified space. During training, we propose a triple-flow method to train Lego-MT; the three flows are Mix-Flow, Dec-Flow, and Enc-Flow.

Figure 4: The spBLEU gap between Mix-Flow (a multilingual encoder and a multilingual decoder) and the unseen language-specific Flow (the combination of a language-specific encoder and a language-specific decoder). Positive numbers mean that the language-specific Flow outperforms Mix-Flow. The unseen language-specific Flow achieves better results on 9 out of 12 directions, demonstrating that Lego-MT can learn the alignment of different branches.
To evaluate the quality of the hidden representations, we conduct experiments by directly using a language-specific encoder and a language-specific decoder for inference. Since such combinations do not occur in the training phase, they can evaluate the quality of the unified hidden space. We randomly combine the language-specific encoders and decoders of four high-resource languages (En, De, Zh, Ar), yielding 12 translation directions. Figure 4 shows the performance of directly combining language-specific encoders and decoders. We find that such unseen combinations achieve better results in most translation directions (9 out of 12). These results show that Lego-MT can effectively map all languages into a unified space. In addition, they suggest that the performance of high-resource languages still has room for improvement through language-specific parameters.
Lego-MT achieves promising results in unseen directions. We also conduct experiments on unseen directions to evaluate Lego-MT's performance in these scenarios, as demonstrated in Table 4. Unseen translation directions can involve two scenarios: 1) The training data set lacks a specific translation direction. In this case, we start with the low-resource Ceb language and identify translation directions not included in our constructed data set. 2) The training data set lacks a direct translation between two languages. For instance, our training corpus may contain translations from Ast to En and from En to Es, but not a direct translation from Ast to Es. To address this, we randomly select four languages (Ast, Da, Hu, Lo) and evaluate the average performance on the Flores-101 devtest with one-to-many and many-to-one settings. Across all experimental results, Lego-MT significantly surpasses the Multilingual FT baseline and is on par with M2M-100-12B.
Lego-MT's performance is independent of pre-trained model initialization, and it converges faster than existing pre-training pipelines. To evaluate the necessity of pre-trained model initialization, we compare Lego-MT with the traditional multilingual pre-training pipeline that uses a single encoder-decoder model for all languages. We conduct experiments on a subset of our constructed corpus, which contains parallel data for 433 languages. We randomly initialize both models and train them on only 1/7 of the data, then measure their performance on Flores-101. As shown in Table 5, our experimental results demonstrate that Lego-MT is independent of pre-trained model initialization and achieves faster convergence than the traditional multilingual pre-training pipeline. Moreover, Lego-MT outperforms the traditional multilingual pre-training pipeline on most machine translation tasks, showing its superior generalization and adaptation ability.
Lego-MT surpasses ChatGPT in the En→X direction and is on par with ChatGPT in the X→En direction. A comparative analysis between ChatGPT and Lego-MT, as shown in Table 6, reveals that in zero-shot performance, ChatGPT lags behind Lego-MT. However, with eight shots, ChatGPT surpasses Lego-MT in the X→En direction but still falls short in the En→X direction. The prompts used for ChatGPT are "You are a helpful assistant that translates {SOURCE_LANG} to {TARGET_LANG}." for the system and "Translate the following {SOURCE_LANG} text to {TARGET_LANG}: {SOURCE_TEXT}." for the user.
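For concreteness, the prompts can be assembled into chat messages as sketched below; how the eight-shot demonstrations are packed into the conversation is an assumption, since the paper only specifies the system and user templates.

```python
def build_chat_messages(source_lang, target_lang, source_text, examples=()):
    """Build zero-/few-shot translation messages from the templates above."""
    messages = [{
        "role": "system",
        "content": f"You are a helpful assistant that translates "
                   f"{source_lang} to {target_lang}.",
    }]
    for demo_src, demo_tgt in examples:  # optional few-shot demonstrations
        messages.append({"role": "user",
                         "content": f"Translate the following {source_lang} text "
                                    f"to {target_lang}: {demo_src}"})
        messages.append({"role": "assistant", "content": demo_tgt})
    messages.append({"role": "user",
                     "content": f"Translate the following {source_lang} text "
                                f"to {target_lang}: {source_text}"})
    return messages
```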

Conclusion
With the increasing scale of languages, using a single model to translate all directions brings new challenges in practice. This paper proposes an efficient training recipe, which results in a detachable multilingual translation model, Lego-MT. To validate the effectiveness of our algorithm, we develop a massive MNMT translation dataset covering 433 languages. Results on Flores-101 show that Lego-MT-1.2B achieves large performance improvements over strong baselines under a fair comparison. It even outperforms M2M-100-12B, with a gain of 4 BLEU on many-to-one translation.

Limitation
Despite promising results, we also notice several limitations in this paper. First, we find that low-resource translation is not boosted by language-specific decoders and language-specific encoders, which requires more exploration of the trade-off between parameter sharing and parameter tension. Second, the evaluation of few-shot languages remains a large problem. Although the final training dataset covers 433 languages, we only evaluate translation performance on the available evaluation set, which covers 86 languages, since the baselines do not support more languages. More standard benchmarks are required for evaluation.

A Dataset construction
In this section, we describe the construction details of the Many-to-Many dataset. As shown in Figure 5, the pipeline mainly consists of six steps.
Step 1: Data Collection The raw data is collected from OPUS, an open corpus that collects numerous parallel sentences from the web and covers a large number of domains, from legislative to religious texts.
Step 2: Data Unification Since OPUS includes datasets from different sources, two significant issues arise.
1) Different language codes: some languages in OPUS have several corresponding language codes. One reason is that different corpora use different standards for language codes, including ISO 639-1, ISO 639-2, ISO 639-3, or self-defined codes. Another reason is that some corpora append region ids to the end of language codes to distinguish the same language used in different regions. To unify language codes, we replace ISO 639-2 and ISO 639-3 codes with ISO 639-1 codes if the codes from ISO 639-1, ISO 639-2, and ISO 639-3 share the same language name in the code set published by SIL International (formerly known as the Summer Institute of Linguistics), and we remove the region id if a language code ends with one (a sketch of this mapping follows below). All replaced language codes are shown in Table 7. The language codes outside the ISO 639 series are listed together with their source corpora in Table 9, and Table 10 reports all language codes used in our dataset with the full names of their corresponding languages.
2) Inconsistent operation: some datasets in OPUS are pre-tokenized, especially for Chinese and Japanese, so we detokenize all sentences by removing white space to unify our texts.
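A toy sketch of the unification step is shown below; the mapping is a tiny illustrative subset rather than the SIL code table, and the region-id separator is an assumption.

```python
# Illustrative subset only; the real mapping comes from the SIL code set.
ISO_TO_639_1 = {"eng": "en", "deu": "de", "ger": "de", "zho": "zh",
                "chi": "zh", "ara": "ar", "nep": "ne", "aze": "az"}

def unify_language_code(code):
    """Map ISO 639-2/3 codes to ISO 639-1 when possible and strip region ids."""
    base = code.lower().split("_")[0]   # e.g. "zh_CN" -> "zh" (separator assumed)
    return ISO_TO_639_1.get(base, base)

# unify_language_code("deu") -> "de";  unify_language_code("zh_CN") -> "zh"
```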
Step 3: Data Merging After data unification, the parallel data is merged with the same language code pair from a different corpus.
Step 4: Data Cleaning The OPUS corpus, collected from the web, contains some poor-quality data. The main problems are as follows (a sketch of the resulting filters is given after the list):
1) Duplication: we use the deduplication script from fairseq to remove all duplicated sentence pairs for each language pair.
2) Missing Translation: we remove sentences without a corresponding translation or whose translation simply repeats the source.
3) Length Mismatching: after segmenting sentences by white space for most languages, or into individual characters for Chinese and Japanese, we apply a filtering script from the Moses decoder to remove sentences longer than 250 words or with more than a three-fold length difference between source and target.
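A minimal sketch of these filters (excluding deduplication, which the fairseq script handles) is given below; the thresholds follow the text, while the function interface is an assumption.

```python
def keep_pair(src, tgt, is_cjk=False, max_len=250, max_ratio=3.0):
    """Return True if a sentence pair survives the Step 4 cleaning rules."""
    if not src.strip() or not tgt.strip() or src.strip() == tgt.strip():
        return False                    # missing translation or copied source
    src_units = list(src) if is_cjk else src.split()   # chars for Chinese/Japanese
    tgt_units = list(tgt) if is_cjk else tgt.split()   # whitespace tokens otherwise
    if len(src_units) > max_len or len(tgt_units) > max_len:
        return False                    # overlong sentence
    ratio = max(len(src_units), len(tgt_units)) / min(len(src_units), len(tgt_units))
    return ratio <= max_ratio           # length-mismatch filter
```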
Step 5: Train-Dev-Test Split Different train-dev-test split schemes are adopted based on the data quantity; a short sketch of the scheme follows the list.
1) Parallel data with more than 6,000 sentence pairs: we randomly sample about 2,000 sentence pairs each as validation and test sets, and the rest is the training set.
2) Parallel data with fewer than 6,000 sentence pairs: we take 80%, 10%, and 10% of all samples as train, validation, and test sets, respectively.
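A short sketch of this split scheme, with the random seed and exact rounding as assumptions:

```python
import random

def split_pairs(pairs, seed=0):
    """Train/dev/test split following Step 5."""
    rng = random.Random(seed)
    pairs = list(pairs)
    rng.shuffle(pairs)
    if len(pairs) > 6000:
        dev, test, train = pairs[:2000], pairs[2000:4000], pairs[4000:]
    else:                                # 80% / 10% / 10% split
        n = len(pairs)
        train = pairs[:int(0.8 * n)]
        dev = pairs[int(0.8 * n):int(0.9 * n)]
        test = pairs[int(0.9 * n):]
    return train, dev, test
```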
To avoid any overlap between our training data and the benchmark test data, we filter out from our training and validation sets all sentences that appear in common benchmarks (WMT, Flores-101).
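The overlap filter can be sketched as follows; the normalization used for matching is an assumption.

```python
def remove_benchmark_overlap(pairs, benchmark_sentences):
    """Drop train/validation pairs whose source or target appears in a
    benchmark test set (e.g. WMT or Flores-101)."""
    bench = {s.strip().lower() for s in benchmark_sentences}
    return [(src, tgt) for src, tgt in pairs
            if src.strip().lower() not in bench and tgt.strip().lower() not in bench]
```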
Step 6: Data Preprocessing The data preprocessing consists of two main steps: 1) Sampling: because the full dataset is huge, we sample some data for our training. The final dataset contains 1,307,143,514 sentence pairs, 433 languages, and 1,922 training pairs.
2) Preprocessing: the data is preprocessed using the SentencePiece tokenizer provided by Fan et al. (2021), with a shared vocabulary of size 128,112.

Figure 1: Multi-way architecture. (1) The monolithic model is a fully-shared model for all translation directions; (2) Lego-MT is a multi-way structure that includes both multilingual (denoted as M) and language-specific encoders and decoders for English (denoted as E), Chinese (denoted as C), and Nepali (denoted as N). The architecture is detachable at inference time, where only a specific encoder and decoder are needed. U (unified space) represents the hidden representations generated by encoders.

Figure 2: Overview of Lego-MT and the training recipe. During training, we introduce an efficient training method by classifying multilingual data into language-centric groups. The language-centric groups are trained in a sequential way, and during each training phase only language-specific parameters are loaded into GPU memory. The training maintains three flows: Enc-Flow (language-specific encoder + multilingual decoder) for training the specific encoder, Dec-Flow (multilingual encoder + language-specific decoder) for training the language-specific decoder, and Mix-Flow (multilingual encoder + multilingual decoder) to avoid overfitting of the multilingual encoder and decoder to each language-centric training set. U denotes the unified space, i.e., the hidden representations generated by encoders.

Figure 5: The construction pipeline for the Many-to-Many dataset.
Algorithm 1: Triple-flow training of Lego-MT.
Input: pre-trained model θ_0 (Fan et al., 2021); epoch number L; training data for Mix-Flow, Enc-Flow, and Dec-Flow: D_multi = {D_{s_1→t_1}, ..., D_{s_i→t_j}, ..., D_{s_N→t_N}}, D_{lg→•} = {D_{lg→t_1}, ..., D_{lg→t_j}, ..., D_{lg→t_N}}, and D_{•→lg} = {D_{s_1→lg}, ..., D_{s_i→lg}, ..., D_{s_N→lg}}, respectively. The parameters for Mix-Flow and Enc-Flow are initialized as θ_m = θ_0 and θ_e = θ_0; the parameters for Dec-Flow are initialized as θ_d = θ_m after the training of Mix-Flow and Enc-Flow.
for epoch l = 1 to L do
    Shuffle D_{lg→•} to obtain a new training sequence.
    for each batch D_e ∈ D_{lg→•} do
        Evaluate the objective of Equation 2 on D_e: l_e = Σ_{(x,y)∼D_e} −log P_{θ_e}(y|x)
        Get a minibatch of multilingual data D_m ∈ D_multi.
        Evaluate the objective of Equation 1 on D_m: l_m = Σ_{(x,y)∼D_m} −log P_{θ_m}(y|x)
        Update θ_m and θ_e by: θ_m ← θ_m − η ∇_{θ_m}(l_m + l_e) and θ_e ← θ_e − η ∇_{θ_e} l_e
    end
end
for epoch l = 1 to L do
    Shuffle D_{•→lg} to obtain a new training sequence.
    for each batch D_d ∈ D_{•→lg} do
        Evaluate the objective of Equation 3 on D_d: l_d = Σ_{(x,y)∼D_d} −log P_{θ_d}(y|x)
        Update θ_d by: θ_d ← θ_d − η ∇_{θ_d} l_d
    end
end

Table 1: Translation results on Flores-101. The top group shows the results of many-to-one translation and the bottom group shows the results of one-to-many settings. We display spBLEU on the devtest of Flores-101. Each cell represents the average performance of translating from the remaining languages. "Param." represents the number of required parameters during inference. Baseline 7 has exactly the same training data as Lego-MT. For a fair comparison, we use Mix-Flow in Lego-MT for all translation pairs. Lego-MT outperforms M2M-100-1.2B w. multilingual fine-tuning by a large margin, with an average gain of 3.2 spBLEU (3.1 on many-to-one translation and 3.3 on one-to-many translation).

Table 1 also shows the gains from training language-specific parameters. Since the language-specific branch has the same size as the multilingual branch, the training costs only double. We believe that the training costs for such a model are reasonable, given its one-time training feature.

Table 3: Ablation studies on triple-flow training. →zh refers to the results of translating into Zh, and zh→ refers to the results of translating from Zh. Dec-Flow brings a large performance drop.

Table 4: Results on unseen directions that are not covered by the constructed dataset. "M" means M2M-100, and "M-FT" means M2M-100-1.2B w. Multilingual FT. Lego-MT shows the best generalization results.

Table 5: Starting from random initialization, models are trained on 1/7 of the data and the performance gap between Multilingual-FT and Lego-MT is evaluated (columns: Ast→X, Hu→X, Da→X, Lo→X, En→X, De→X, Ar→X, Az→X, Ceb→X, Ne→X, Zh→X, AVG). The findings indicate that the mean performance of Lego-MT across all translation directions surpasses that of Multilingual-FT, with a more rapid convergence rate.

Table 6: Comparison of ChatGPT and Lego-MT: zero-shot and eight-shot results. While ChatGPT lags behind Lego-MT in zero-shot performance, it outperforms Lego-MT in the X→En direction with eight shots. However, in the En→X direction, ChatGPT falls behind Lego-MT even with eight shots.

Table 9: Unknown language codes, which are outside the ISO 639 series; we cannot confirm their full names.

Table 10: List of languages. Our dataset mainly uses the ISO 639 series as language codes. For traditional Chinese, we define "zhtrad" as the code.