Multilingual Neural Machine Translation with Deep Encoder and Multiple Shallow Decoders

Recent work in multilingual translation advances translation quality surpassing bilingual baselines using deep transformer models with increased capacity. However, the extra latency and memory costs introduced by this approach may make it unacceptable for efficiency-constrained applications. It has recently been shown for bilingual translation that using a deep encoder and shallow decoder (DESD) can reduce inference latency while maintaining translation quality, so we study similar speed-accuracy trade-offs for multilingual translation. We find that for many-to-one translation we can indeed increase decoder speed without sacrificing quality using this approach, but for one-to-many translation, shallow decoders cause a clear quality drop. To ameliorate this drop, we propose a deep encoder with multiple shallow decoders (DEMSD) where each shallow decoder is responsible for a disjoint subset of target languages. Specifically, the DEMSD model with 2-layer decoders is able to obtain a 1.8x speedup on average compared to a standard transformer model with no drop in translation quality.


Introduction
Encoder-decoder based neural machine translation (NMT) systems have achieved great success on bilingual translation tasks (Sutskever et al., 2014; Gehring et al., 2017; Vaswani et al., 2017). Recently, multilingual neural machine translation (MNMT) has also attracted much attention because of its ease of deployment, knowledge transfer among languages and the potential to enable zero-shot translation (Dong et al., 2015; Firat et al., 2016; Ha et al., 2016; Johnson et al., 2017; Arivazhagan et al., 2019; Zhang et al., 2020). While MNMT can support translation in several directions, not all directions outperform their corresponding bilingual models. Suspecting that poor performance in some directions is due to limited model capacity, many prior works adopt deeper encoders and decoders (Zhang et al., 2019; Wang et al., 2019; Zhang et al., 2020). However, increasing the number of layers, especially in the decoder, increases translation latency and memory costs. Recently, Kasai et al. (2020) showed that, given a fixed capacity budget as measured by the number of layers, models with a deep encoder and a shallow decoder (DESD) are faster at inference time than standard models with an equal number of encoder and decoder layers, while maintaining translation quality.
* Work done at Facebook AI.
Inspired by the findings of Kasai et al. (2020), in this work we explore the speed-accuracy trade-off in multilingual machine translation systems. Given the same model capacity budget, we experiment with various layer allocation strategies. We analyze multilingual models in the one-to-many (O2M) setting and the many-to-one (M2O) setting. In the one-to-many setting, a single source language (limited to English in this study) is translated into numerous target languages; in the many-to-one setting, several possible source languages are translated into a single target language (again, English in this study).
In the many-to-one scenario, we find that allocating more capacity to the encoder reduces the latency while achieving comparable performance. We hypothesize that a deeper encoder helps the model accommodate multiple source languages, while a shallow decoder is sufficient to support a single target language.
However, in the one-to-many translation setting, the speed-accuracy trade-off is more complicated. We observe a performance drop as the decoder depth is reduced. We hypothesize that the shallow decoder can no longer model several different target languages adequately. With the goal of obtaining low latency while maintaining translation quality, we propose using multiple shallow decoders where each decoder is responsible for a subset of the target languages. Clearly, the introduction of multiple shallow decoders increases the size of our model. However, at inference time only one shallow decoder is used, so no latency or memory cost is added. With multiple target languages and decoders, one natural question is how to assign each target language to one of these decoders, and we investigate several methods for doing so. More details are given in Section 3. Experimental results on three multilingual translation corpora show that our method improves translation accuracy while also lowering latency.
Our main contributions are summarized as follows:
• We extend the speed-accuracy trade-off study of DESD models from bilingual to multilingual machine translation tasks with various layer allocations.
• We show that on many-to-one translation, multilingual DESD models enable a 1.8x speedup on average without sacrificing performance compared to the baseline (equal model capacity).
• We further propose a shared deep encoder with multiple shallow decoders (DEMSD) for the one-to-many setting, again achieving a 1.8x speedup in decoding while preserving translation quality.
Deep encoder and shallow decoder (DESD) for multilingual NMT
Background The transformer-based NMT model (Vaswani et al., 2017) achieves state-of-the-art performance on many translation tasks. It consists of an encoder and a decoder, each of which contains several stacked layers. Since the transformer relies entirely on the attention mechanism, it allows more parallelization than recurrent neural networks. Specifically, at training time, the computation can be parallelized both in the encoder and decoder. At inference time, due to the auto-regressive property, the decoder needs to generate tokens one by one.
However, the computation in the encoder is still parallelized given the source sentence. Therefore, most of the transformer's inference latency arises in the decoder, especially when translating long sentences. Recently, Kasai et al. (2020) found that, on bilingual machine translation tasks, allocating more of the transformer's capacity to the encoder substantially reduces decoding time while maintaining performance.
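To make the latency argument concrete, the sketch below counts the layer calls that must run one after another during decoding. It is a toy cost model rather than a measurement: it assumes every layer call costs one unit, that the encoder stack runs once per sentence, and that the decoder stack runs once per generated token. The function name `sequential_layer_calls` is hypothetical.

```python
def sequential_layer_calls(enc_layers: int, dec_layers: int, target_len: int) -> int:
    """Toy count of layer calls that run sequentially when translating one sentence."""
    # The encoder sees the whole source sentence at once: one pass over its stack.
    encoder_calls = enc_layers
    # The decoder is auto-regressive: its full stack runs once per output token.
    decoder_calls = dec_layers * target_len
    return encoder_calls + decoder_calls


if __name__ == "__main__":
    # Layer allocations with the same total depth (encoder-decoder).
    for enc, dec in [(6, 6), (10, 2), (11, 1)]:
        calls = sequential_layer_calls(enc, dec, target_len=30)
        print(f"{enc}-{dec}: {calls} sequential layer calls")
```

Under these assumptions a 10-2 allocation needs fewer sequential calls than 6-6 for any output longer than one token. Measured speedups can be smaller than this count suggests, since costs such as the output softmax do not depend on decoder depth.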
Because this deep encoder and shallow decoder model achieves a superior speed-accuracy trade-off on bilingual translation tasks, in this section, we try to understand the layer allocations of transformer on the multilingual neural machine translation task given the same capacity budget which is measured by the number of layers in the encoder and decoder. We first experiment with three multilingual translation corpora.
• ML50 (Tang et al., 2020)
• TED8-Related
• TED8-Diverse
Instead of just trying the shallowest possible decoder (1-layer), we train models with various layer allocations on each of these three corpora. Other than the layer allocation, all hyperparameters and model configurations are the same across these models, and the same training procedure is applied (model and training details are listed in Appendix A). To understand the speed-accuracy trade-off of the layer allocation, two metrics are reported:
• BLEU: the average tokenized BLEU score (Papineni et al., 2002) over all directions.
• DS: the decoding speed, measured by the number of tokens the system translates per second.
The results are shown in Figure 1. Models with fewer decoder layers obtain higher decoding speed.
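The two metrics can be sketched as follows. This is a minimal illustration, not the evaluation code used for the reported numbers: `translate` is a hypothetical callable standing in for a real system, the speed is computed over generated tokens (an assumption), and the per-direction BLEU scores are taken as given.

```python
import time
from typing import Callable, Dict, List, Sequence


def decoding_speed(translate: Callable[[str], List[str]], sentences: Sequence[str]) -> float:
    """Generated tokens per second, translating one sentence at a time."""
    total_tokens = 0
    start = time.perf_counter()
    for sentence in sentences:
        # Count the tokens produced for this sentence (assumed definition of DS).
        total_tokens += len(translate(sentence))
    elapsed = time.perf_counter() - start
    return total_tokens / elapsed if elapsed > 0 else float("inf")


def average_bleu(bleu_by_direction: Dict[str, float]) -> float:
    """Unweighted mean of per-direction BLEU scores."""
    return sum(bleu_by_direction.values()) / len(bleu_by_direction)
```

For example, `average_bleu({"en-az": 10.0, "en-tr": 20.0})` averages two hypothetical direction scores with equal weight, regardless of test-set size.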

Many-to-one translation
In M2O translation, there is no significant performance difference among these layer allocations. We hypothesize that this is because the deeper encoder learns better representations from a large number of source languages, while on the decoder side only one language needs to be modeled. Therefore, given a more robust representation of the source languages, the shallow decoder is able to generate high-quality translations. For example, the model with 10 encoder layers and 2 decoder layers obtains slightly better performance and a 1.8x speedup at the same time.

One-to-many translation
However, in the O2M translation setting, although models with shallower decoders have lower latency than the standard transformer (6-6), there is a clear drop in translation accuracy, especially for models with just 1 or 2 decoder layers. We attribute this to the shallow decoder not having enough capacity to model a large number of target languages.

Deep Encoder and Multiple Shallow Decoders (DEMSD)
We have seen that in one-to-many translation, DESD models have a performance drop compared to the standard transformer. In order to preserve translation quality and low latency at the same time, we propose a model with a shared encoder and multiple shallow decoders (DEMSD), each of which is used to decode a subset of target languages. Although this will introduce more parameters, at inference time only one shallow decoder is needed for a given translation (since the output language is fixed) thus the model incurs no extra latency or memory costs. One natural question that arises when using this multiple-decoder approach is how to assign output languages to each of the decoders.
In this section, we explore several language assignment methods to assign each target language (or language group) to one of these multiple decoders. As a result, each decoder only needs to handle a disjoint subset of target languages.
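The routing idea can be sketched in a few lines: one shared encoder, several decoders, and a fixed map from target language to decoder. The class and its names (`DEMSDRouter`, `assignment`) are hypothetical, and the encoder and decoders are plain callables standing in for real networks.

```python
from typing import Callable, Dict, List

Encoder = Callable[[List[str]], List[float]]
Decoder = Callable[[List[float]], List[str]]


class DEMSDRouter:
    """Shared encoder plus multiple decoders; each target language uses one decoder."""

    def __init__(self, encoder: Encoder, decoders: Dict[str, Decoder],
                 assignment: Dict[str, str]) -> None:
        # Every language must point at a decoder that actually exists.
        missing = set(assignment.values()) - set(decoders)
        if missing:
            raise ValueError(f"unknown decoder ids: {sorted(missing)}")
        self.encoder = encoder
        self.decoders = decoders
        self.assignment = assignment

    def translate(self, source_tokens: List[str], target_lang: str) -> List[str]:
        # Only the decoder assigned to the target language is run,
        # so the other decoders add no work for this sentence.
        decoder = self.decoders[self.assignment[target_lang]]
        return decoder(self.encoder(source_tokens))
```

Because the assignment is a function from language to decoder, the subsets handled by different decoders are disjoint by construction.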

One language per decoder (EACH)
The simplest way is to use a separate decoder for each output language. As a result, we will have as many decoders as the number of target languages and each decoder only needs to model one language.

Random language set per decoder (RAND)
In this method, we assign a random set of languages to each decoder. As the performance of the model may vary significantly with the random assignment, we repeat this scheme with three different random assignments and report the average results. Rather than grouping languages completely at random, we let each decoder handle the same number of languages, with the languages in each decoder chosen randomly.
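One way to produce such a balanced random grouping is to shuffle the languages and deal them out round-robin. This is a sketch under that assumption; the function name and the use of a seed to obtain different assignments are illustrative choices, not details given in the text.

```python
import random
from typing import Dict, List, Optional


def random_balanced_assignment(languages: List[str], num_decoders: int,
                               seed: Optional[int] = None) -> Dict[int, List[str]]:
    """Randomly split languages into groups whose sizes differ by at most one."""
    rng = random.Random(seed)
    shuffled = list(languages)  # copy so the caller's list is left untouched
    rng.shuffle(shuffled)
    groups: Dict[int, List[str]] = {d: [] for d in range(num_decoders)}
    for index, lang in enumerate(shuffled):
        # Dealing round-robin keeps the group sizes balanced.
        groups[index % num_decoders].append(lang)
    return groups


if __name__ == "__main__":
    ted8_related = ["az", "be", "gl", "sk", "tr", "ru", "pt", "cs"]
    for seed in (0, 1, 2):  # three different random assignments
        print(seed, random_balanced_assignment(ted8_related, 4, seed=seed))
```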

One language family per decoder (FAM)
Another intuitive approach to language assignment is to use linguistic features (Comrie, 1989; Lewis, 2009; Dryer and Haspelmath, 2013), such as language family and typology. In this method, we are guided by the intuition that languages from the same linguistic family share similar features, which might be captured by a single decoder and result in better performance. Thus, we group languages into several sets based on their linguistic families and assign one family of languages to each decoder. As a result, we have as many decoders as there are language families among the target languages. We expect better knowledge transfer among languages that share a decoder because they belong to the same family. For example, in the TED8-Related corpus, the 8 target languages are split into 4 language families: TURKIC, SLAVIC, ROMANCE and CZECH-SLOVAK. The details are shown in Table 1. The language family-based assignment results on the other corpora are shown in Appendix B.
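Family-based assignment amounts to grouping languages by a family label. In the sketch below the language-to-family map for TED8-Related is an assumption, since Table 1 is not reproduced here; it pairs each low resource language with a related higher resource one under the four family names given above.

```python
from collections import defaultdict
from typing import Dict, List

# Assumed mapping for TED8-Related (Table 1 is not reproduced in this text).
TED8_RELATED_FAMILIES: Dict[str, str] = {
    "az": "TURKIC", "tr": "TURKIC",
    "be": "SLAVIC", "ru": "SLAVIC",
    "gl": "ROMANCE", "pt": "ROMANCE",
    "sk": "CZECH-SLOVAK", "cs": "CZECH-SLOVAK",
}


def group_by_family(lang_to_family: Dict[str, str]) -> Dict[str, List[str]]:
    """Return one language group per family; each group maps to one decoder."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for lang, family in sorted(lang_to_family.items()):
        groups[family].append(lang)
    return dict(groups)


if __name__ == "__main__":
    for family, langs in group_by_family(TED8_RELATED_FAMILIES).items():
        print(family, langs)
```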

Pre-trained language embedding based assignment (EMB)
Following Johnson et al. (2017), a common way to indicate the target language is to prepend a target language token to the source sentence. The embeddings of these tokens are trained end-to-end with source-target sentence pairs, with the goal of capturing information about the languages they represent. We refer to these embeddings as language embeddings. According to Johnson et al. (2017), these language embeddings are able to capture target language features in their training data. Therefore, we first extract them from a well-trained model and group the target languages according to them. Finally, each group is assigned to one of the decoders.
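The text does not say which grouping algorithm is applied to the language embeddings, so the sketch below swaps in plain k-means with a deterministic farthest-point initialisation as one possible choice. The function name `cluster_languages` and the toy 2-dimensional embeddings in the usage note are hypothetical.

```python
import math
from typing import Dict, List, Sequence


def _dist(a: Sequence[float], b: Sequence[float]) -> float:
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cluster_languages(embeddings: Dict[str, Sequence[float]], k: int,
                      iters: int = 20) -> Dict[int, List[str]]:
    """Group languages by k-means over their embedding vectors (assumed method)."""
    langs = sorted(embeddings)
    points = [list(embeddings[lang]) for lang in langs]
    # Farthest-point initialisation: start from the first point, then repeatedly
    # add the point that is farthest from all centroids chosen so far.
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(max(points, key=lambda p: min(_dist(p, c) for c in centroids)))
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: each language goes to its nearest centroid.
        labels = [min(range(k), key=lambda j: _dist(p, centroids[j])) for p in points]
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    groups: Dict[int, List[str]] = {j: [] for j in range(k)}
    for lang, lab in zip(langs, labels):
        groups[lab].append(lang)
    return groups
```

With well-separated toy vectors such as `{"az": (0, 0), "tr": (0.1, 0), "be": (5, 5), "ru": (5.1, 5)}` and `k=2`, the two nearby pairs end up in separate groups. Real language embeddings are high-dimensional and the clusters may be far less clean.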

Self-taught assignment (ST)
One disadvantage of the pre-trained language embedding based grouping method is that it needs a pre-trained machine translation model. It would be better if the model could assign each target language to one of the multiple shallow decoders automatically during training. We expect that, given a fixed number of decoders and target languages, the model is capable of choosing the most appropriate decoder for each language. Specifically, our model consists of a shared encoder, $E$, and $N$ decoders, $D = [D_1, D_2, ..., D_N]$. Given a language $L$, the model chooses a decoder $D_i$ for training and translation, so that the log probability of the output sequence $y$ given the input sequence $x$ is $\log p(y|x, E, D_i)$, where $i = \arg\max_j p(j|L_e)$ and $p(\cdot|L_e)$ is the probability of each decoder being chosen given the language $L$ and its language embedding vector $L_e$. Intuitively, our model learns the distribution over decoders given a language and chooses the one with the highest probability. However, the arg max operation is non-differentiable, so during training we use the Gumbel-Softmax (Jang et al., 2016), a differentiable approximation of the arg max operation.
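In Gumbel-Softmax, $p(j|L_e)$ is modeled as:

$$p(j|L_e) = \frac{\exp\left((l_j + g_j)/\tau\right)}{\sum_{k=1}^{N} \exp\left((l_k + g_k)/\tau\right)}$$

where $l_j$ is the logit for decoder $j$, $g_j = -\log(-\log(u_j))$ with $u_j \sim U(0, 1)$, and $\tau$ is the temperature. In the forward pass, the differentiable approximation of the arg max operation is used to choose the decoder for the input language, and in the backward pass, the true gradient of the Straight-Through Gumbel-Softmax outputs is used.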
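The sampling step can be illustrated without any deep learning library. The sketch below draws Gumbel noise from uniform samples as defined above, adds it to per-decoder logits, and applies a temperature-scaled softmax; it is the standard Gumbel-Softmax sample of Jang et al. (2016) and is not taken from the authors' code. Plain Python has no automatic differentiation, so only the forward computation is shown, and the function names are hypothetical.

```python
import math
import random
from typing import List, Optional


def gumbel_softmax_sample(logits: List[float], tau: float,
                          rng: Optional[random.Random] = None) -> List[float]:
    """Draw one relaxed (soft) one-hot sample over decoders."""
    rng = rng or random.Random()
    noisy = []
    for logit in logits:
        # Clamp u away from 0 and 1 so both logarithms stay finite.
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        g = -math.log(-math.log(u))  # Gumbel(0, 1) noise
        noisy.append((logit + g) / tau)
    # Numerically stable softmax over the noisy, temperature-scaled logits.
    peak = max(noisy)
    exps = [math.exp(v - peak) for v in noisy]
    total = sum(exps)
    return [e / total for e in exps]


def hard_choice(probs: List[float]) -> int:
    """Index of the most probable decoder (the discrete choice)."""
    return max(range(len(probs)), key=lambda n: probs[n])


if __name__ == "__main__":
    sample = gumbel_softmax_sample([1.0, 0.5, -0.2], tau=0.5, rng=random.Random(0))
    print(sample, hard_choice(sample))
```

Lower temperatures push the sample toward a one-hot vector, while higher temperatures spread it more evenly across decoders.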
In our experiments, the temperature $\tau$ is linearly reduced from 5 to 0.5. Finally, during training, the probability of the target sequence $y$ given the source sentence $x$ and multiple decoders $D$ is:

$$p(y|x, D) = \sum_{n=1}^{N} p(n|L_e)\, p_n(y|x)$$

where $p_n(y|x)$ is the probability of $y$ given $x$ in the $n$-th decoder and $p(n|L_e)$ is the probability of the $n$-th decoder being sampled given the embedding of the language $L$. During inference, only the decoder with the highest probability is used to decode the input sentences.
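The temperature schedule and the inference-time choice can be sketched as below. The endpoints 5 and 0.5 come from the text; the number of steps over which the temperature is annealed is not stated, so `total_steps` is an assumed parameter, and both function names are hypothetical.

```python
from typing import List


def linear_temperature(step: int, total_steps: int,
                       start: float = 5.0, end: float = 0.5) -> float:
    """Linearly anneal the Gumbel-Softmax temperature from start to end."""
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    # Clamp so the temperature stays at `end` once annealing has finished.
    fraction = min(max(step / total_steps, 0.0), 1.0)
    return start + (end - start) * fraction


def select_decoder_for_inference(decoder_probs: List[float]) -> int:
    """At inference time, use only the decoder with the highest probability."""
    return max(range(len(decoder_probs)), key=lambda n: decoder_probs[n])


if __name__ == "__main__":
    for step in (0, 50, 100):
        print(step, linear_temperature(step, total_steps=100))
    print(select_decoder_for_inference([0.1, 0.7, 0.2]))
```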
Table 2 (a): The abbreviations of language assignment methods.
Method                              Abbreviation
One language per decoder            EACH
Random language set                 RAND
One language family per decoder     FAM
Pre-trained language embedding      EMB
Self-taught                         ST

Table 2 (b): The abbreviations of metrics.
Metric                              Abbreviation
# parameters at training time       #TP
# parameters at inference time      #DP
Decoding speed                      DS
# decoders                          #DEC

ML50 (Tang et al., 2020) is an English-central translation benchmark of 50 languages with publicly available training and evaluation sets, including high, mid, and extremely low resource directions. Following Tang et al. (2020), we adopt the 250k SentencePiece model (Kudo and Richardson, 2018) used in XLM-R (Conneau et al., 2019) to tokenize the dataset so that all languages share the same vocabulary. For the TED8-Related and TED8-Diverse corpora, we follow the preprocessing steps in Wang et al. (2020).
Hyperparameters On ML50, we follow most of the standard hyperparameters of transformer-base (Vaswani et al., 2017): 8 attention heads per layer, 512 model dimensions, 2048 hidden dimensions and 0.1 dropout. We train with batches of 64k tokens using Adam (Kingma and Ba, 2014) with $\beta = (0.9, 0.98)$ and $\epsilon = 10^{-6}$, and 0.1 label smoothing. The learning rate is warmed up to 1e−3 within 4,000 steps and then decays with the inverse square-root schedule. All models are trained for 100,000 steps. Furthermore, to mitigate the training data imbalance issue, we adopt the temperature sampling method (Arivazhagan et al., 2019), with the temperature set to 5 in all experiments.
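Temperature-based data sampling rescales each language's share of the training data by the exponent 1/T before renormalising, which raises the sampling probability of low resource languages. The sketch below follows that common formulation (Arivazhagan et al., 2019); the function name and the corpus sizes in the usage note are hypothetical.

```python
from typing import Dict


def sampling_probabilities(sizes: Dict[str, int], temperature: float = 5.0) -> Dict[str, float]:
    """Per-language sampling probability, proportional to (n_l / sum n)^(1/T)."""
    total = sum(sizes.values())
    # T = 1 keeps the original data proportions; larger T flattens them.
    weights = {lang: (n / total) ** (1.0 / temperature) for lang, n in sizes.items()}
    norm = sum(weights.values())
    return {lang: w / norm for lang, w in weights.items()}


if __name__ == "__main__":
    toy_sizes = {"high": 1_000_000, "low": 10_000}  # made-up corpus sizes
    print(sampling_probabilities(toy_sizes, temperature=1.0))
    print(sampling_probabilities(toy_sizes, temperature=5.0))
```

With these made-up sizes, the low resource language is sampled about 1% of the time at T = 1 and noticeably more often at T = 5.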
On the TED8 corpora, a smaller transformer model with 512 model dimensions, 1024 hidden dimensions and 0.3 dropout is adopted. All models are trained for 40k steps with batches of 16k tokens and a smaller learning rate of 2e−4. The rest of the training procedure is the same as for ML50.
Evaluation metrics For all models, we evaluate the checkpoint with the best validation loss and use beam size 5 and length penalty 1.0 in decoding. Besides reporting the average BLEU score over all languages, on ML50 we divide languages into high (> 1M), mid (100K to 1M) and low (< 100K) resource groups according to their data sizes, and also compute the average BLEU score for each group. The decoding speed, DS, is measured by the number of tokens the system translates per second given one sentence at a time on a single GPU.
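The resource grouping and per-group averages can be sketched as follows. The thresholds come from the text; treating a language with exactly 100K pairs as mid resource is an assumption, and the function names are hypothetical.

```python
from typing import Dict, List


def resource_bucket(num_pairs: int) -> str:
    """Classify a language by its training data size."""
    if num_pairs > 1_000_000:
        return "high"
    if num_pairs >= 100_000:  # assumption: exactly 100K counts as mid
        return "mid"
    return "low"


def bleu_by_bucket(bleu: Dict[str, float], sizes: Dict[str, int]) -> Dict[str, float]:
    """Average BLEU separately over high, mid and low resource languages."""
    scores: Dict[str, List[float]] = {}
    for lang, score in bleu.items():
        scores.setdefault(resource_bucket(sizes[lang]), []).append(score)
    return {bucket: sum(vals) / len(vals) for bucket, vals in scores.items()}
```

For instance, `resource_bucket(61_500)` returns `"low"`, and two high resource languages with BLEU 30 and 20 average to 25 in the `"high"` group.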

Results
From Figure 1, we find that for O2M translation, models with 1- or 2-layer decoders show a clear performance drop compared to the standard transformer (6-6). Therefore, our main experiments adopt multiple shallow decoders with 1 and 2 decoder layers. Results on the ML50 and TED8 corpora are shown in Tables 3 and 4, respectively. For simplicity, we list the abbreviations of the language assignment methods and evaluation metrics in Table 2.
One language per decoder (EACH) With this assignment method, models obtain superior performance on high and mid resource languages but poor results on low resource languages. On ML50, if each language has its own decoder, the model achieves strong results on high resource languages (BLEU_H in rows 2 vs. 3 and 8 vs. 9 in Table 3). We think that, given enough training data, a shallow decoder has enough capacity to model one language. However, it performs worse than the baseline on low resource languages (BLEU_L in rows 2 vs. 3 and 8 vs. 9 in Table 3). To further understand this assignment method, we also show the BLEU score differences between models 10-2 and 10-2-EACH on TED8-Related in Figure 2. The left three languages are relatively low resourced and their performance is lower than that of the baseline model in which all languages share one decoder (see footnote 1). This also demonstrates that their decoders are not able to learn robust representations given a limited amount of training data. In contrast, decoders trained on high resource languages generate higher quality translations; we attribute this to sufficient training data and the absence of negative transfer when training without other languages (Arivazhagan et al., 2019).

Table 4: Translation speed and accuracy trade-off on TED8-Related and TED8-Diverse corpora. Notation information can be found in Table 2.

Footnote 1: Although sk is defined as a low resource language in this dataset, it still obtains a slightly better result because it has 61.5k training sentence pairs, whereas the other three low resource languages (az, be, gl) have fewer than 10k training sentence pairs.

Random language set assignment (RAND)
We find that random language set assignment improves performance over the baseline only slightly, which we attribute to sub-optimal knowledge transfer among languages in the same decoder. When each decoder handles a similar number of languages, performance is slightly better than that of the model with one shared decoder (BLEU scores in rows 2 vs. 4 and 8 vs. 10 in Tables 3 and 4). We attribute this to the fact that a shallow decoder performs better when it handles fewer languages. This also demonstrates that one shallow decoder does not have enough capacity to model a large number of languages. However, compared to the language family and embedding assignment methods, the random language set method yields lower translation quality, showing that how target languages are assigned to decoders is also crucial.
One language family per decoder (FAM) We group all languages according to their language families and assign each family to one shallow decoder. As a result, we have 15, 4 and 5 language families in the ML50, TED8-Related and TED8-Diverse corpora, respectively. The comparison between rows 2 vs. 5 and 8 vs. 11 in Tables 3 and 4 shows clearly that language family-based decoders achieve better accuracy while maintaining low latency. Furthermore, models with multiple 2-layer decoders achieve performance comparable to the 6-6 model and obtain around a 1.8x speedup at inference time. We think the improvement comes mainly from better knowledge transfer among similar languages (within one language family). To understand this further, we plot the BLEU score difference between models 10-2-EACH and 10-2-FAM on TED8-Related in Figure 3. We find that the major improvement of model 10-2-FAM over 10-2-EACH comes from the low resource languages, which means that high resource languages effectively help their related low resource languages.

Language embedding-based assignment (EMB)
For a fair comparison, languages are grouped into the same number of groups as there are language families, according to the language embeddings from the well-trained baseline model 6-6. The grouping results are listed in Appendix C. We first find that the language embedding-based grouping method is able to group similar languages together, showing that language embeddings effectively capture language characteristics during training. For example, on TED8-Related, the language embeddings yield the same grouping as the language family-based one shown in Table 1. The language embedding-based assignment method achieves results similar to the language family-based one and effectively improves on the performance of the baseline model.
Self-taught language assignment (ST) In this method, the model assigns target languages to multiple decoders automatically, without requiring any prior knowledge (linguistic families) or well-trained models (language embeddings). Comparing rows 7 vs. 2 and 13 vs. 8 in Tables 3 and 4, our self-taught method improves over the baseline by around 1 BLEU point. It also achieves results similar to the language family-based and embedding-based assignment methods, demonstrating the effectiveness of this method.

Multiple decoders for various layer allocations
In our main experiments, we use multiple very shallow decoders (i.e., 1- and 2-layer decoders) for two reasons: a single decoder of this depth shows a clear performance drop on one-to-many translation compared with the standard transformer (6-6), and, compared to deeper decoders, multiple 1- or 2-layer decoders keep the number of parameters manageable at training time. Nevertheless, it is meaningful to explore the effect of multiple decoders under various layer allocations. Considering model size and training time, we only conduct these experiments on the TED8 corpora; the results are shown in Figure 4. On each line (the same language assignment method), deeper decoders achieve better performance and shallower decoders have lower latency. Moreover, comparing language family-based assignment with the baseline models, the former consistently improves performance at the same decoding speed at inference time. And with the similar performance, e.g., 10-2-FAM and 6-6, our best multiple shallow decoder models have much

Speed-accuracy trade-off in multilingual machine translation
From the above experiments and findings, in many-to-one translation, the DESD framework obtains a superior speed-accuracy trade-off. For example, the model with 10 encoder layers and 2 decoder layers obtains slightly better accuracy and a 1.8x speedup. Under the one-to-many setting, multiple shallow decoders are needed to mitigate the performance drop of the DESD model. The crucial part is to group languages with similar features into one decoder to obtain better knowledge transfer among languages (our FAM, EMB and ST methods). With this, our DEMSD model with multiple 2-layer decoders is capable of achieving similar performance and a 1.8x speedup compared to the standard transformer.

Related Work
Speed and accuracy are two important metrics for evaluating a machine translation system. In this work, we mainly discuss the transformer architecture (Vaswani et al., 2017). A number of works have explored various ways to improve its inference speed. Kim et al. (2019) adopt a shallow decoder and layer tying to speed up inference on CPUs. Shi and Knight (2017) and Senellart et al. (2018) employ vocabulary reduction to speed up the softmax layer. Li et al. (2020) employ a latent depth transformer model which prunes layers at inference time to reduce the inference cost. There are also works that optimize attention computations to speed up inference (Kitaev et al., 2020; Katharopoulos et al., 2020; Chelba et al., 2020). Recently, Kasai et al. (2020) place more capacity on the encoder side and keep an extremely shallow (one-layer) decoder to achieve a superior speed-accuracy trade-off.
Multilingual neural machine translation (MNMT) has recently attracted considerable attention (Firat et al., 2016; Ha et al., 2016; Johnson et al., 2017) because it employs one model to translate more than one language pair, even including pairs unseen during training (zero-shot translation). Knowledge transfer among languages boosts the performance of low-resource languages. However, many works (Arivazhagan et al., 2019; Zhang et al., 2020; Aharoni et al., 2019) have shown a capacity bottleneck when modeling many languages. Before simply stacking more layers in the encoder and decoder, it is therefore crucial to first understand how to balance speed and accuracy given a fixed capacity budget. In this work, we study various capacity allocations to achieve the best speed-accuracy trade-off.

Conclusion
In this work, we study speed-accuracy trade-offs using various layer configurations for multilingual neural machine translation. We find that for many-to-one translation, deep encoder and shallow decoder (DESD) models improve decoding speed while maintaining translation quality with the same model capacity. However, for one-to-many translation we do observe a drop in quality when the decoder depth is reduced. To mitigate the performance drop of DESD models in one-to-many translation, we propose using a shared encoder and multiple shallow decoders (DEMSD). Our best DEMSD models with 2-layer decoders speed up decoding by 1.8 times while achieving the same quality as a standard transformer.
Our work can be combined with techniques mentioned in Section 6 such as optimized attention computation, vocabulary reduction and knowledge distillation. We expect that these combinations will further improve decoding speed and yield a better speed-accuracy trade-off. This work can also be extended to other encoder-decoder applications beyond translation, such as question answering and dialogue. We will explore these directions in future work.

A Training details of DESD model
In order to explore how DESD models work on multilingual machine translation, we train transformer-based models with various layer allocations on three multilingual machine translation corpora: ML50, TED8-Related and TED8-Diverse. For a fair comparison, the training process is the same across all models. On ML50, we employ the standard transformer-base model: 8 attention heads per layer, 512 model dimensions, 2048 hidden dimensions and 0.1 dropout. All models are trained for 100,000 steps with batches of 64k tokens using Adam and 0.1 label smoothing. The learning rate is warmed up to 1e−3 within 4,000 steps, and then decays with the inverse square-root schedule.
On the TED8 corpora, following Wang et al. (2020), a smaller transformer model is adopted, i.e., 4 attention heads per layer, 512 model dimensions, 1024 hidden dimensions and 0.3 dropout. All models are trained for 40,000 steps with batches of 16k tokens using Adam and 0.1 label smoothing. The learning rate is warmed up to 2e−4 within 4,000 steps, and then decays with the inverse square-root schedule.
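The learning rate schedule described in this appendix can be sketched as a function of the update step. The peak rate and the 4,000 warmup steps come from the text; a linear warmup starting from zero is an assumption, and the function name is hypothetical.

```python
def inverse_sqrt_lr(step: int, peak_lr: float = 2e-4, warmup_steps: int = 4000) -> float:
    """Warm up to peak_lr, then decay proportionally to 1 / sqrt(step)."""
    if step < 1:
        raise ValueError("step must be >= 1")
    if step <= warmup_steps:
        # Assumed linear warmup from zero to the peak learning rate.
        return peak_lr * step / warmup_steps
    # After warmup the rate shrinks with the inverse square root of the step.
    return peak_lr * (warmup_steps / step) ** 0.5


if __name__ == "__main__":
    # TED8 setting (peak 2e-4) and ML50 setting (peak 1e-3).
    for step in (1000, 4000, 16000, 40000):
        print(step, inverse_sqrt_lr(step), inverse_sqrt_lr(step, peak_lr=1e-3))
```

The decay is slow: four times as many steps as the warmup length only halves the peak rate.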

B Language family assignment results
In Table 5, we show the language family-based assignment result on TED8-Diverse. Since this corpus was collected without considering relatedness, some groups contain just one language. Nevertheless, the multiple-decoder model still improves accuracy, showing the effectiveness of this method.
The language families in ML50 are shown in Table 6.

C Language embedding assignment results
On TED8-Related, we obtain the same language assignment results as the language family-based one. On TED8-Diverse, the result of language embedding assignment is very similar to the language family assignment result (Table 7). The only difference is that language bg is grouped with language mk. We think this is because the language embedding captures not only linguistic features but also features of the data.