Multilingual Machine Translation: Closing the Gap between Shared and Language-specific Encoder-Decoders

State-of-the-art multilingual machine translation relies on a universal encoder-decoder, which requires retraining the entire system to add new languages. In this paper, we propose an alternative approach that is based on language-specific encoder-decoders, and can thus be more easily extended to new languages by learning their corresponding modules. To encourage a common interlingua representation, we simultaneously train the N initial languages. Our experiments show that the proposed approach outperforms the universal encoder-decoder by 3.28 BLEU points on average, while allowing new languages to be added without retraining the rest of the modules. All in all, our work closes the gap between shared and language-specific encoder-decoders, advancing toward modular multilingual machine translation systems that can be flexibly extended in lifelong learning settings.


Introduction
Multilingual machine translation is the ability to generate translations automatically across a (large) number of languages. Research in this area has recently attracted a lot of attention from both the scientific and industrial communities. Under the neural machine translation paradigm (Bahdanau et al., 2015), the opportunities for improving this area have dramatically expanded. Thanks to the encoder-decoder architecture, there are viable alternatives to expensive pairwise translation based on classic paradigms.
The main proposal in this direction is the universal encoder-decoder (Johnson et al., 2017) with massive multilingual enhancements (Arivazhagan et al., 2019). While this approach enables zero-shot translation and is beneficial for low-resource languages, it has multiple drawbacks: (i) the entire system has to be retrained when adding new languages or data; (ii) the quality of translation drops when adding too many languages or for those with the most resources (Arivazhagan et al., 2019); and (iii) the shared vocabulary grows dramatically when adding a large number of languages (especially when they do not share alphabets). Other limitations include the difficulty of incorporating additional modalities such as image or speech.
In this paper, we propose a new framework that can be incrementally extended to new languages without the aforementioned limitations ( §3). Our proposal is based on language-specific encoders and decoders that rely on a common intermediate representation space. For that purpose, we simultaneously train the initial N languages in all translation directions. New languages are naturally added to the system by training a new module coupled with any of the existing ones, while new data can be easily added by retraining only the module for the corresponding language.
We evaluate our proposal on three experimental configurations: translation for the initial languages, translation when adding a new language, and zero-shot translation ( §4). Our results show that the proposed method is better in the first two configurations, outperforming the universal system by 0.40 BLEU points on average in the initial training and by 3.28 BLEU points on average when adding new languages. However, our proposed system still lags behind the universal encoder-decoder in zero-shot translation.

Related Work
Multilingual neural machine translation can refer to translating from one-to-many languages (Dong et al., 2015), from many-to-one (Zoph and Knight, 2016), or from many-to-many (Johnson et al., 2017). Within the many-to-many paradigm, existing approaches can be further divided into shared or language-specific encoder-decoders.
Shared Encoder-Decoder. Johnson et al. (2017) feed a single encoder and decoder with multiple input and output languages. Given a set of languages, a shared architecture has a universal encoder and a universal decoder that are trained on all initial language pairs at once. The model shares parameters, vocabulary and tokenization among languages to ensure that no additional ambiguity is introduced in the representation. This architecture provides a simple framework to develop multilingual systems because it does not require modifications of a standard neural machine translation model, and information is easily shared among the different languages through common parameters. Despite the model's advantages in transfer learning, the use of a shared vocabulary and embedding representation forces the model to employ a vocabulary that includes tokens from all the alphabets used. Additionally, recent work (Arivazhagan et al., 2019), which imposes representational invariance across languages, shows that increasing the number of languages affects the quality of the languages already in the system (generally enhancing low-resource pairs but being detrimental for high-resource pairs).
Language-specific Encoder-Decoders. These approaches may or may not share parameters at some point.
Sharing parameters. Firat et al. (2016b) proposed extending the bilingual recurrent neural machine translation architecture (Bahdanau et al., 2015) to the multilingual case (Vázquez et al., 2019; Lu et al., 2018) by designing a shared attention-based mechanism between the language-specific encoders and decoders to create a language-independent representation. Since the language-specific components rely on the shared modules, modifying those components to add a new language or further data to the system would require retraining the whole system (similarly to the shared approach above). Lakew et al. (2018) propose adding new languages to an already trained system through vocabulary adaptation and transfer learning. While more lightweight than full retraining, this still requires some retraining to adapt the model to the new task.
No sharing. The system proposed by Escolano et al. (2019) trains language-specific encoders and decoders jointly, without parameter or vocabulary sharing, while enforcing a compatible representation between the jointly trained languages. The advantage of this approach is that it does not require retraining to add new languages, and increasing the number of languages does not vary the quality of the languages already in the system. However, the system has to be trained on a multi-parallel corpus, and it does not scale well to a large number of languages in the initial system, since all encoders and decoders have to be trained simultaneously.

Proposed Method
Our proposed approach trains a separate encoder and decoder for each of the N languages available, without requiring a multi-parallel corpus. We do not share any parameters across these modules, which makes it possible to add new languages incrementally without retraining the entire system.

Definitions
We next define the notation that we will use when describing our approach. We denote the encoder and the decoder for the i-th language in the system as e_i and d_i, respectively. In the language-specific scenario, both the encoder and the decoder are considered independent modules that can be freely interchanged to work in all translation directions.
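To make the modularity concrete, the following is a minimal illustrative sketch (the Encoder and Decoder stubs and the translate helper are our own toy stand-ins, not the paper's Transformer modules): any encoder e_i can be paired with any decoder d_j to cover the direction i→j.

```python
# Toy sketch of the modular setup: one encoder e_i and one decoder d_i
# per language, freely recombined per translation direction.

class Encoder:
    def __init__(self, lang):
        self.lang = lang

    def __call__(self, sentence):
        # Stand-in for encoding into the common intermediate space.
        return ("repr", self.lang, sentence)

class Decoder:
    def __init__(self, lang):
        self.lang = lang

    def __call__(self, representation):
        # Stand-in for decoding from the intermediate representation.
        _, src_lang, sentence = representation
        return f"[{self.lang} translation of {src_lang}: {sentence}]"

langs = ["de", "fr", "es", "en"]
encoders = {l: Encoder(l) for l in langs}   # e_i
decoders = {l: Decoder(l) for l in langs}   # d_i

def translate(src, tgt, sentence):
    # N encoders + N decoders cover all N * (N - 1) directions.
    return decoders[tgt](encoders[src](sentence))
```

Because no parameters are shared, swapping in a new encoder or decoder leaves all other modules untouched.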

Language-Specific Proposed Procedure
In what follows, we describe the proposed training procedure in two steps: joint training and adding new languages.

Joint Training
The straightforward approach is to train independent encoders and decoders for each language. The main difference from standard pairwise training is that, in this case, there is only one encoder and one decoder for each language, which will be used for all translation directions involving that language. The training algorithm for this language-specific procedure is described in Algorithm 1.
For each translation direction s i,j in the training schedule S with language i as source and language j as target, the system is trained using the language-specific encoder e i and decoder d j .
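The training loop just described can be sketched in Python as follows; `get_parallel_batch` and `train_step` are assumed helpers standing in for the paper's data loading and optimization steps.

```python
# Minimal sketch of the language-specific joint training schedule:
# round-robin over all directions s_{i,j} in the schedule S, always
# reusing the single encoder e_i and decoder d_j per language.

def joint_training(N, schedule, encoders, decoders,
                   get_parallel_batch, train_step, epochs=1):
    for _ in range(epochs):
        for i in range(N):
            for j in range(N):
                if (i, j) in schedule:
                    src_batch, tgt_batch = get_parallel_batch(i, j)
                    train_step(encoders[i], decoders[j],
                               src_batch, tgt_batch)
```

Since e_i participates in every direction with language i as source, it is pushed toward representations that all decoders can consume, which is what encourages the common intermediate space.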
Algorithm 1: Language-specific joint training.

    for i ← 0 to N do
        for j ← 0 to N do
            if s_i,j ∈ S then
                l_i, l_j = get_parallel_batch(i, j)
                train(s_i,j(e_i, d_j), l_i, l_j)

Adding New Languages Since parameters are not shared between the independent encoders and decoders, the joint training enables the addition of new languages without the need to retrain the existing modules. Let us say we want to add language N+1. To do so, we must have parallel data between language N+1 and any language already in the system. For illustration, let us assume that we have L_N+1 - L_i parallel data. Then, we can set up a new bilingual system with language L_N+1 as source and language L_i as target. To ensure that the representation produced by this new pair is compatible with the previously jointly trained system, we use the existing L_i decoder (d_i) as the decoder of the new L_N+1 - L_i system and freeze it. During training, we optimize the cross-entropy between the generated tokens and the L_i reference data, but update only the parameters of the L_N+1 encoder (e_N+1). By doing so, we train e_N+1 not only to produce good-quality translations but also to produce representations similar to those of the already trained languages. Following the same principles, the L_N+1 decoder can also be trained as a bilingual system by freezing the L_i encoder and training the decoder of the L_i - L_N+1 system, optimizing the cross-entropy with the L_N+1 reference data.
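The encoder-only update can be illustrated with a toy optimizer; this is our own sketch, not the paper's implementation (a real system would freeze the decoder with framework-level flags such as PyTorch's requires_grad = False), and all parameter names are made up.

```python
# Toy sketch of training a new encoder against a frozen decoder:
# only unfrozen modules receive gradient updates, so the frozen
# decoder keeps anchoring the new encoder to the existing
# intermediate representation space.

class Module:
    def __init__(self, params, frozen=False):
        self.params = dict(params)
        self.frozen = frozen

def sgd_update(module, grads, lr=0.1):
    # Skip frozen modules entirely: their parameters never change.
    if module.frozen:
        return
    for name, g in grads.items():
        module.params[name] -= lr * g

new_encoder = Module({"w": 1.0})                   # e_{N+1}, trainable
frozen_decoder = Module({"w": 5.0}, frozen=True)   # d_i, frozen

# One illustrative optimization step with dummy gradients:
sgd_update(new_encoder, {"w": 2.0})
sgd_update(frozen_decoder, {"w": 2.0})
```

After the step, only the new encoder's parameters have moved; the decoder is bit-for-bit the module trained in the initial joint phase.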

Experiments in Multilingual Machine Translation
In this section, we report machine translation experiments in different settings. Since the main difference between the shared and the language-specific encoder-decoders lies in whether the entire system must be retrained when adding new languages, we design our experiments to compare the systems under this condition.

Data and Implementation
We used 2 million sentences from the EuroParl corpus (Koehn, 2005) in German, French, Spanish and English as training data, with parallel sentences among all combinations of these four languages (without being multi-parallel). For Russian-English, we used 1 million training sentences from the Yandex corpus. As validation and test sets, we used newstest2012 and newstest2013 from WMT, which are multi-parallel across all the above languages. All data were preprocessed using standard Moses scripts (Koehn et al., 2007).

We evaluate our approach in three different settings: (i) the initial training, covering all combinations of German, French, Spanish and English; (ii) adding new languages, tested with Russian-English in both directions; and (iii) zero-shot translation, covering all combinations between Russian and the rest of the languages. Additionally, we compare two configurations: non-tied and tied embeddings. In the language-specific approach, tied embeddings means language-wise word embeddings: the encoder and the decoder of a given language use the same word embeddings. In the non-tied case, the encoder and the decoder of each language have different word embeddings. In the shared system, tied embeddings means that the encoder and the decoder share the same word embeddings.
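The difference between the two configurations reduces to object sharing, as in this toy sketch (a dict stands in for a trainable embedding matrix; real systems tie nn.Embedding weights instead):

```python
# Toy sketch of tied vs. non-tied embeddings in the
# language-specific setup.

langs = ["de", "fr", "es", "en"]

def make_table(lang):
    # Stand-in for a trainable embedding matrix for one language.
    return {"lang": lang}

# Tied (language-wise): the encoder and decoder of a language share
# the *same* table object, so any update is seen by both.
tied = {l: make_table(l) for l in langs}
enc_emb_tied = {l: tied[l] for l in langs}
dec_emb_tied = {l: tied[l] for l in langs}

# Non-tied: each encoder and each decoder owns a separate table.
enc_emb = {l: make_table(l) for l in langs}
dec_emb = {l: make_table(l) for l in langs}
```

Tying halves the number of embedding parameters per language and forces the encoder and decoder of a language to agree on one lexical space.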
All experiments were done using the Transformer implementation provided by Fairseq (Ott et al., 2019). We used 6 layers, each with 8 attention heads, an embedding size of 512 dimensions, and a vocabulary size of 32k subword tokens with Byte Pair Encoding (Sennrich et al., 2016) (in total for the shared encoder-decoder and per pair for the language-specific encoder-decoders). Dropout was 0.1 for the shared approach and 0.3 for the language-specific encoder-decoders. Both approaches were trained with an effective batch size of 32k tokens for approximately 200k updates, using the validation loss for early stopping. We used Adam (Kingma and Ba, 2015) as the optimizer, with a learning rate of 0.001 and 4000 warmup steps.

Tables 1 and 2 show comparisons between the shared and language-specific encoder-decoders.
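For reference, a hypothetical fairseq-train invocation consistent with the stated hyperparameters might look as follows; the data path, save directory and the max-tokens/update-freq split of the 32k-token effective batch are our assumptions, not taken from the paper.

```shell
# Hypothetical single-pair training command (language-specific setting);
# data-bin/de-en, checkpoints/de-en and --update-freq are illustrative.
fairseq-train data-bin/de-en \
    --arch transformer \
    --encoder-layers 6 --decoder-layers 6 \
    --encoder-attention-heads 8 --decoder-attention-heads 8 \
    --encoder-embed-dim 512 --decoder-embed-dim 512 \
    --dropout 0.3 \
    --optimizer adam --lr 0.001 \
    --lr-scheduler inverse_sqrt --warmup-updates 4000 \
    --max-tokens 4096 --update-freq 8 \
    --save-dir checkpoints/de-en
```

Here 4096 tokens per step with 8 gradient-accumulation steps approximates the 32k-token effective batch on a single GPU.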
Initial Training Table 1 shows that the language-specific encoder-decoders outperform the shared approach in all cases, by 0.40 BLEU points on average.
Adding New Languages Table 2 shows that, when adding a new language to the system, the language-specific encoder-decoders outperform the shared architecture by 2.92 BLEU points for Russian-to-English and by 3.64 BLEU points in the opposite direction. It is also worth mentioning that the Russian data come from a different domain than the data used to train the frozen English modules (the Yandex corpus and EuroParl, respectively). As such, the language-specific encoder-decoders are able to outperform the shared architecture when adding both a new language and a new domain by exploiting the information stored in the frozen modules. Additionally, note that retraining the shared encoder-decoder to add a new language took an entire week, whereas the incremental training with the language-specific encoder-decoders was performed in only one day.
Zero-shot The shared encoder-decoder clearly outperforms the language-specific encoder-decoders in zero-shot translation. Note that employing tied embeddings has a larger impact on the language-specific architecture than on the shared one. In fact, it has been key to closing the performance gap between the language-specific and shared architectures.
In our case, we believe that we do not suffer from the scalability problem of previous language-specific approaches because, within an initial set of N languages, we train N x (N - 1) systems using pairwise corpora (without requiring a multi-parallel corpus as in previous works (Escolano et al., 2019)). Thanks to our proposed joint training, once the initial system is trained, we end up with only N encoders and N decoders (2N modules in total).
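The module arithmetic can be spelled out with a quick computation; the helper names are ours.

```python
# N languages induce N * (N - 1) pairwise translation directions,
# but the modular system only keeps one encoder and one decoder
# per language, i.e. 2 * N modules.

def directions(n):
    return n * (n - 1)

def modules(n):
    return 2 * n
```

For the four initial languages (de, fr, es, en) this gives 12 training directions but only 8 language-specific modules; the gap widens quadratically as languages are added.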

Conclusions
In this paper, we presented a novel method to train language-specific encoder-decoders without sharing any parameters at all. More relevantly, our system allows new languages to be incrementally added without having to retrain the existing modules and without varying the translation quality of the initial languages in the system. When adding a new language, the language-specific encoder-decoders outperform the shared ones by 3.28 BLEU points on average and, notably, training the new language modules took only one day, as opposed to the week required by the shared system.