Multilingual Machine Translation Systems at WAT 2021: One-to-Many and Many-to-One Transformer based NMT

In this paper, we present the details of the systems that we have submitted for the WAT 2021 MultiIndicMT: An Indic Language Multilingual Task. We have submitted two separate multilingual NMT models: one for English to 10 Indic languages and another for 10 Indic languages to English. We discuss the implementation details of two separate multilingual NMT approaches, namely one-to-many and many-to-one, which make use of a shared decoder and a shared encoder, respectively. From our experiments, we observe that the multilingual NMT systems outperform the bilingual baseline MT systems for each of the language pairs under consideration.


Introduction
In recent years, Neural Machine Translation (NMT) systems (Vaswani et al., 2017; Sutskever et al., 2014) have consistently outperformed Statistical Machine Translation (SMT) (Koehn, 2009) systems. One of the major problems with NMT systems is that they are data hungry: they require a large amount of parallel data to give good performance. This becomes very challenging when working with low-resource language pairs, for which only a small amount of parallel data is available. Multilingual NMT (MNMT) systems (Dong et al., 2015; Johnson et al., 2017) alleviate this issue through transfer learning among related languages, i.e., languages that are related by genetic and contact relationships. Kunchukuttan and Bhattacharyya (2020) have shown that the lexical and orthographic similarity among languages can be utilized to improve translation quality between Indic languages when limited parallel data is available. Another advantage of MNMT systems is that they support zero-shot translation, that is, translation between two languages for which no parallel corpus is available during training.
An MNMT system can also drastically reduce the total number of models required for a large-scale translation system by using a single many-to-many MNMT model instead of training a separate translation system for each language pair. This reduces the amount of computation and time required for training. Among the various MNMT approaches, using a single shared encoder and decoder further reduces the number of parameters and allows related languages to share vocabulary. In this paper, we describe the two MNMT systems that we have submitted for the WAT 2021 MultiIndicMT: An Indic Language Multilingual Task (Nakazawa et al., 2021) as team 'CFILT', namely one-to-many for English to Indic languages and many-to-one for Indic languages to English. This task covers 10 Indic languages: Bengali, Gujarati, Hindi, Kannada, Malayalam, Marathi, Oriya, Punjabi, Tamil and Telugu.


Related Work
Dong et al. (2015) were the first to introduce MNMT. The authors used a one-to-many model in which a separate decoder and attention mechanism were used for each target language. Firat et al. (2016) extended this to a many-to-many setting using a shared attention mechanism. In Zoph and Knight (2016), a multi-source translation approach was proposed in which multiple encoders were used, each with a separate attention mechanism. Lee et al. (2017) proposed a CNN-based character-level approach where a single encoder was shared across all the source languages.
A second line of work on MNMT uses a single shared encoder and decoder (Ha et al., 2016; Johnson et al., 2017), irrespective of the number of languages on the source or the target side. We follow Johnson et al. (2017)'s approach, wherein for one-to-many and many-to-many models a language-specific token is prepended to the input sentence to indicate the target language that the model should translate to. We use the transformer (Vaswani et al., 2017) architecture, which has proven to give superior performance over RNN-based models (Sutskever et al., 2014).

Our Approach
The two types of multilingual models that we have implemented, one-to-many and many-to-one, are each discussed below.

One-to-Many
In a one-to-many multilingual model, the translation task involves a single source language and two or more target languages. One way to achieve this is to use a single encoder for the source language and a separate decoder for each target language. The disadvantage of this method is that, as there are multiple decoders, the size of the model increases. Another way is to use a single encoder and a single shared decoder. An advantage of this method is that the representations learnt for one language pair can be further utilized by another language pair. For example, the representations learnt while training on the English-Hindi language pair can help the training of the English-Marathi language pair. Also, in this approach, a language-specific token is prepended to the input sentence to indicate to the model which target language the input sentence should be translated into.
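The token-prepending step can be sketched as follows. This is a minimal illustration, not the exact preprocessing script used for the submission; the `<2xx>` token format and the example sentences are assumptions for the sake of the example.

```python
def add_target_token(source_sentence: str, target_lang: str) -> str:
    """Prepend a target-language token (e.g. <2hi>) to a source sentence,
    in the style of Johnson et al. (2017). The <2xx> format is an
    illustrative choice; any reserved token per target language works."""
    return f"<2{target_lang}> {source_sentence}"

# One English sentence paired with two target languages in the
# one-to-many training data (illustrative examples).
corpus = [
    ("The sun rises in the east.", "hi"),  # English -> Hindi
    ("The sun rises in the east.", "mr"),  # English -> Marathi
]
tagged = [add_target_token(src, tgt) for src, tgt in corpus]
print(tagged[0])  # <2hi> The sun rises in the east.
```

At inference time, the same token on the input is all that tells the shared decoder which language to produce.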

Many-to-One
This approach is similar to the one-to-many approach. The major point of difference is that there are multiple source languages and a single target language. As a result, here we use a single shared encoder and a single decoder. Also, as the target language is the same for all the source languages, prepending a token to the input sentence is optional, unlike in the one-to-many approach, which has multiple target languages for a given source language.

Experiments
In this section, we discuss the details of the system architecture, dataset, preprocessing, models and the training setup. Table 4 lists the details of the transformer architecture used for all the experiments.

Data
The dataset provided for the shared task by WAT 2021 was used for all the experiments. We did not use any additional data to train the models. Table 1 lists the datasets used for each of the English-Indic language pairs along with the number of parallel sentences. The validation and test sets have 1,000 and 2,390 sentences, respectively, and are 11-way parallel.

Preprocessing
We used Byte Pair Encoding (BPE) (Sennrich et al., 2016) for data segmentation, that is, breaking up words into sub-words. This technique is especially helpful for Indic languages as they are morphologically rich. Separate vocabularies are used for the source-side and target-side languages. For training the one-to-many and many-to-one models, the data of all 10 Indic languages is combined before learning the BPE codes. We use 48,000, 48,000 and 8,000 merge operations for learning the BPE codes of the one-to-many, many-to-one and bilingual baseline models, respectively.
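The core of BPE learning is repeatedly merging the most frequent adjacent symbol pair. The following toy sketch illustrates the idea on a tiny made-up word-frequency table; production tools (e.g. subword-nmt, which the paper's setup presumably relies on) additionally handle end-of-word markers and tie-breaking, which are omitted here.

```python
from collections import Counter

def learn_bpe(word_freqs, num_merges):
    """Minimal BPE sketch (Sennrich et al., 2016): repeatedly merge the
    most frequent adjacent symbol pair across the vocabulary.
    `word_freqs` maps each word to its corpus frequency."""
    vocab = {tuple(w): f for w, f in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        # Count all adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Apply the merge to every word in the vocabulary.
        new_vocab = {}
        for symbols, freq in vocab.items():
            out, i = [], 0
            while i < len(symbols):
                if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            new_vocab[tuple(out)] = freq
        vocab = new_vocab
    return merges

# Illustrative corpus: with 2 merges, 'l'+'o' then 'lo'+'w' are learnt.
merges = learn_bpe({"low": 5, "lower": 2, "lowest": 2}, num_merges=2)
print(merges)
```

In the actual experiments the merge tables are far larger (48,000 operations for the multilingual models), and the Indic-side data is concatenated first so that related languages share sub-word units.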

Baseline Models
The baseline MT models are bilingual MT models based on the vanilla transformer architecture. We have trained 20 separate bilingual MT models, 10 for English to each Indic language and 10 for each Indic language to English.

Models and Training
For this task, we built two separate MNMT systems: a one (English) to many (10 Indic languages) model and a many (10 Indic languages) to one (English) model. In our one-to-many model, we used the transformer architecture with a single encoder and a single shared decoder. The encoder used the English vocabulary and the decoder used a shared vocabulary of all the Indic languages. In our many-to-one model, we used the transformer architecture with a single shared encoder and a single decoder.
Here the encoder used a shared vocabulary of all the Indic languages and the English vocabulary was used for the decoder. In both of these MNMT models, we prepended a language-specific token to the input sentence.
We used the fairseq (Ott et al., 2019) library for implementing the multilingual systems. For training, we used the Adam optimizer with betas (0.9, 0.98). The initial learning rate was 0.0005, and the inverse square root learning rate scheduler was used with 4,000 warm-up updates. The dropout probability was 0.3, and the criterion was label-smoothed cross entropy with a label smoothing of 0.1. We used an update frequency, that is, the number of batches over which gradients are accumulated before a parameter update is performed, of 8 for the multilingual models and 4 for the bilingual baseline models.
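The inverse square root schedule with these settings can be sketched as below. This mirrors fairseq's `inverse_sqrt` scheduler under the assumption that the warm-up starts from a learning rate of 0 (fairseq's initial warm-up value is configurable, and the exact default may differ).

```python
def inverse_sqrt_lr(step, peak_lr=5e-4, warmup_updates=4000, warmup_init_lr=0.0):
    """Sketch of an inverse-sqrt learning rate schedule with the paper's
    settings (peak lr 0.0005, 4000 warm-up updates). The learning rate
    rises linearly during warm-up, then decays proportionally to the
    inverse square root of the update number."""
    if step <= warmup_updates:
        # Linear warm-up from warmup_init_lr to peak_lr.
        return warmup_init_lr + (peak_lr - warmup_init_lr) * step / warmup_updates
    # After warm-up: lr = peak_lr * sqrt(warmup_updates / step).
    return peak_lr * (warmup_updates ** 0.5) / (step ** 0.5)

print(inverse_sqrt_lr(4000))   # peak: 0.0005
print(inverse_sqrt_lr(16000))  # 0.0005 * sqrt(4000/16000) = 0.00025
```

The decay keeps updates large early in training and progressively smaller as the model converges, which is the standard recipe for transformer training (Vaswani et al., 2017).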
During decoding, we used the beam search algorithm with a beam size of 5 and a length penalty of 1. The many-to-one model was trained for 160 epochs and the one-to-many model was trained for 145 epochs. The checkpoint with the best average BLEU score was chosen as the final model. The average BLEU score for an MNMT model was calculated by averaging the BLEU scores obtained across all the language pairs.
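The checkpoint-selection criterion above amounts to a simple mean over per-pair validation scores. The sketch below makes this explicit; the language pairs and score values are illustrative, not the paper's results.

```python
def average_bleu(scores_by_pair):
    """Checkpoint-selection criterion: the mean of the validation BLEU
    scores over all language pairs covered by the MNMT model."""
    return sum(scores_by_pair.values()) / len(scores_by_pair)

# Hypothetical validation BLEU scores for one checkpoint.
checkpoint_scores = {"en-hi": 20.0, "en-mr": 15.0, "en-ta": 10.0}
print(average_bleu(checkpoint_scores))  # 15.0
```

Averaging across pairs avoids selecting a checkpoint that overfits to a single high-resource direction at the expense of the others.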

Results and Analysis
The Bilingual Evaluation Understudy (BLEU) (Papineni et al., 2002) metric, the Rank-based Intuitive Bilingual Evaluation Score (RIBES) (Isozaki et al., 2010) metric and the Adequacy-Fluency Metrics (AMFM) (Banchs et al., 2015) are used to report the results. Tables 2 and 3 list the results for all our experiments.
The baseline results were obtained by training bilingual models and evaluating them with the same automatic evaluation procedures as those used in WAT 2021. The one-to-many and many-to-one results are those reported by WAT 2021 on our submitted translation files.
We observe that for all language pairs in both translation directions, the MNMT models give superior performance compared to the bilingual NMT models. For relatively high-resource language pairs like English-Hindi and English-Bengali, the increase in BLEU score is smaller, while for relatively low-resource language pairs like English-Kannada and English-Oriya the increase is substantial. From this observation, it follows that low-resource language pairs benefit much more from multilingual training than high-resource language pairs. An increase of up to 8.93 BLEU points (for Kannada to English) is observed using the MNMT systems over the bilingual baseline NMT systems.

Conclusion
In this paper, we have discussed our submission to the WAT 2021 MultiIndicMT: An Indic Language Multilingual Task. We have submitted two separate MNMT models: a one-to-many (English to 10 Indic languages) model and a many-to-one (10 Indic languages to English) model. We evaluated our models using BLEU and RIBES scores and observed that the MNMT models outperform the separately trained bilingual NMT models across all the language pairs. We also observe that the improvement in performance is much larger for the lower-resource language pairs than for the higher-resource language pairs.