IITP-MT at WAT2021: Indic-English Multilingual Neural Machine Translation using Romanized Vocabulary

This paper describes the systems submitted to the WAT 2021 MultiIndicMT shared task by the IITP-MT team. We submit two multilingual Neural Machine Translation (NMT) systems (Indic-to-English and English-to-Indic). We romanize all Indic data and create a subword vocabulary that is shared across all Indic languages. We use a back-translation approach to generate synthetic data, which is appended to the parallel corpus and used to train our models. The models are evaluated using BLEU, RIBES and AMFM scores, with the Indic-to-English model achieving 40.08 BLEU for the Hindi-English pair and the English-to-Indic model achieving 34.48 BLEU for the English-Hindi pair. However, we observe that the shared romanized subword vocabulary does not help the English-to-Indic model at generation time, leading it to produce poor quality translations for the English to Tamil, Telugu and Malayalam pairs, with BLEU scores of 8.51, 6.25 and 3.79, respectively.


Introduction
In this paper, we describe our submission to the MultiIndicMT shared task at the 8th Workshop on Asian Translation (WAT 2021) (Nakazawa et al., 2021). The objective of this shared task is to build Machine Translation (MT) models between 10 Indic languages (Bengali, Gujarati, Hindi, Kannada, Malayalam, Marathi, Odia, Punjabi, Tamil, Telugu) and English. We submit two Multilingual Neural Machine Translation (MNMT) models: one for XX → EN and one for EN → XX (here XX denotes the set of all 10 Indic languages).
Multilingual machine translation (Dong et al., 2015; Firat et al., 2016; Johnson et al., 2017; Aharoni et al., 2019; Freitag and Firat, 2020) has gained popularity in recent times due to the ability to train a single model that is capable of translating between multiple language pairs. The main benefit of a multilingual model is transfer learning: when a low resource language pair is trained together with a high resource pair, the translation quality of the low resource pair may improve (Zoph et al., 2016; Nguyen and Chiang, 2017). This method of training is particularly suitable for Indic languages, as they are similar to each other (Dabre et al., 2017, 2020) and relatively under-resourced compared with European languages (Sen et al., 2018). Romanization is the process of converting text written in various scripts into Latin script. Amrhein and Sennrich (2020) showed that, in a transfer learning setting, romanization improves transfer between related languages that use different scripts. We train two MNMT models, which translate between Indic languages and English, with all Indic data romanized. The models are evaluated using the BLEU (Papineni et al., 2002), RIBES (Isozaki et al., 2010) and AMFM (Banchs et al., 2015) metrics.
The paper is organized as follows. In Section 2, we briefly review notable work on multilingual NMT and romanized NMT. In Section 3, we describe the submitted systems along with the preprocessing and romanization of Indic data. Results are discussed in Section 4. Finally, the work is concluded in Section 5.

Related Works
Multilingual machine translation makes it possible to deploy a single model for multiple language pairs without training separate models. Dong et al. (2015) propose a multi-task learning framework that translates one source language into multiple target languages by adding language specific decoders. Their method shows improvements over baseline models trained for individual language pairs. Firat et al. (2016) propose a many-to-many model for multi-way, multilingual translation using shared attention and language specific encoders and decoders. However, with this setting, the number of model parameters grows as the number of languages increases. Johnson et al. (2017) use a shared encoder-decoder model in which multiple languages share the encoder, the decoder and the attention module. This is achieved by combining the data of multiple language pairs into a single corpus and adding a language tag to every source sentence to specify its target language. This method enables zero-shot translation, in which the model can translate between a language pair that is not seen at training time. Aharoni et al. (2019) show that multilingual NMT models are capable of handling a large number of language pairs. Freitag and Firat (2020) propose that using multi-way alignment information improves the translation quality of language pairs for which training data is scarce in multilingual settings.
Improving the quality of NMT models with monolingual data is a common approach nowadays, especially in low resource settings. Back-translation (Sennrich et al., 2016) is an effective approach to make use of target-side monolingual data: with the help of an existing target-to-source MT system, target sentences are translated into the source language, and the resulting synthetic parallel corpus is combined with the clean corpus and used to train the source-to-target NMT system. The multi-task learning framework (Zhang and Zong, 2016; Domhan and Hieber, 2017) is another way to utilize monolingual data to improve the performance of NMT.
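The back-translation procedure described above can be sketched in a few lines. The `translate_to_source` function below is a hypothetical stand-in for an existing target-to-source MT system, not part of the submitted system:

```python
# Illustrative sketch of back-translation data augmentation.
# translate_to_source is a placeholder: a real pipeline would run a
# trained target-to-source NMT model here.
def translate_to_source(target_sentence):
    return "<synthetic source for: %s>" % target_sentence

def back_translate(target_monolingual, clean_parallel):
    """Pair each monolingual target sentence with a machine-generated
    source sentence, then append the synthetic pairs to the clean corpus."""
    synthetic = [(translate_to_source(t), t) for t in target_monolingual]
    return clean_parallel + synthetic

clean = [("namaste duniya", "hello world")]
mono = ["good morning", "good night"]
combined = back_translate(mono, clean)
print(len(combined))  # 3
```

The key property is that the target side of every synthetic pair is clean human text, so the decoder still learns from fluent target sentences even though the source side is machine-generated.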
Recent studies (Du and Way, 2017; Gheini and May, 2019; Briakou and Carpuat, 2019) show that romanization improves the performance of NMT systems. However, these approaches apply romanization on the source side only. Amrhein and Sennrich (2020) showed that romanization can also be applied on the target side, followed by an additional, learned deromanization step.
In this work, we follow the method of Johnson et al. (2017) to train multilingual NMT models. We romanize the Indic data and use it to train our models. We also follow the back-translation approach (Sennrich et al., 2016) to create synthetic parallel data. We report the results of the models trained on the combined synthetic and clean parallel corpus.

System Description
This section describes the datasets, preprocessing and experimental setup of our models.

Datasets
We use the MultiIndicMT parallel corpus, consisting of the following languages: Bengali, Gujarati, Hindi, Kannada, Malayalam, Marathi, Odia, Punjabi, Tamil, Telugu and English. It contains parallel corpora for the 10 Indic languages translated into English. We also use the PMI monolingual corpus to generate synthetic data with back-translation (Sennrich et al., 2016).

Preprocessing and Romanization
We use a Python based transliteration tool to romanize all Indic language data. This tool supports all the Indic scripts used in our experiments. It also supports deromanization, which maps Latin script back into the various Indic scripts. We romanize all Indic language data (Amrhein and Sennrich, 2020), both the parallel and the monolingual corpora, and merge all parallel corpora into a single corpus. This combined parallel corpus is used to train the baseline models.
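To make the round trip concrete, here is a toy, lossless romanizer for a handful of Devanagari characters. The submitted system used a full transliteration tool; this character map and its space-separated output format are illustrative assumptions, covering just enough characters for the demo:

```python
# Toy romanization/deromanization for a few Devanagari characters.
# A real tool covers all Indic scripts; this map is for illustration only.
TO_LATIN = {"न": "na", "म": "ma", "स": "sa", "त": "ta", "्": "-", "े": "e"}
TO_INDIC = {v: k for k, v in TO_LATIN.items()}

def romanize(text):
    # One Latin token per Indic character, so the mapping stays reversible.
    return " ".join(TO_LATIN.get(ch, ch) for ch in text)

def deromanize(text):
    return "".join(TO_INDIC.get(tok, tok) for tok in text.split())

word = "नमस्ते"  # "namaste"
roman = romanize(word)
assert deromanize(roman) == word  # the round trip is lossless
print(roman)  # na ma sa - ta e
```

Losslessness matters here because the EN → XX model generates romanized Indic text, which must be deromanized back into the original script before evaluation.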
We follow the back-translation approach (Sennrich et al., 2016) to generate synthetic parallel corpora. We merge the monolingual corpora of all Indic languages and generate synthetic English data using the baseline XX → EN model. The resulting synthetic English-clean Indic parallel corpus is merged with the clean English-Indic parallel corpus and used to further train the baseline EN → XX model. We also generate synthetic Indic data from monolingual English data: we duplicate the monolingual English data 10 times and use the baseline EN → XX model to generate synthetic data for each Indic language. The English data is duplicated so that the synthetic parallel corpus has equal size for all Indic languages. The resulting synthetic Indic-clean English parallel corpus is merged with the clean Indic-English parallel corpus and used to further train the baseline XX → EN model.
For the training of the EN → XX model, we add a language tag to the start of every source sentence (Johnson et al., 2017) to denote the language the source should be translated into. We do not use language tags for the XX → EN model, as the target is always English. All training data is shuffled before being fed to the models. The training corpus statistics are shown in Table 2. The combined development set contains 10,000 sentences and is the same for all models. Table 3 shows the contribution of each language pair to the combined training corpus: Hindi-English is the most represented pair with almost 30%, and Odia-English the least with 3.3%, in both directions.
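The language tagging step can be sketched as follows. The `<2xx>` tag format is an assumption for illustration; the paper does not specify the exact tag strings used:

```python
# Sketch of target-language tagging for the one-to-many (EN -> XX) model,
# following Johnson et al. (2017): prepend a tag naming the target language.
# The "<2xx>" tag format is a hypothetical convention, not the paper's own.
def tag_source(english_sentence, target_lang):
    return "<2%s> %s" % (target_lang, english_sentence)

# The same English sentence can appear once per target language,
# distinguished only by its tag.
pairs = [("thank you", "hi"), ("thank you", "ta")]
tagged = [tag_source(src, lang) for src, lang in pairs]
print(tagged[0])  # <2hi> thank you
```

Because the tag is just another source token, a single shared encoder-decoder learns to condition its output language on it, with no architectural changes.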

Experimental Setup
We train two multilingual models, namely XX → EN (Indic languages to English) and EN → XX (English to Indic languages). The models are trained with 8,000 warm-up steps and an initial learning rate of 2. We split the training data of the baseline models into subwords with the unigram language model (Kudo, 2018) using the SentencePiece (Kudo and Richardson, 2018) implementation. We create two subword vocabularies, one for English and one for all romanized Indic data (all Indic language data is merged after romanization and the subword vocabulary is created on the combined corpus). The English subword vocabulary size is 60K and the Indic vocabulary size is 100K, for both models. We use the OpenNMT toolkit (Klein et al., 2017) (https://github.com/OpenNMT/OpenNMT-py/tree/1.2.0) to train our models with a batch size of 2,048 tokens. Models are evaluated on the development sets after every 10,000 steps and checkpoints are created. The baseline models are trained for 100,000 steps, and the last checkpoint is used to create a synthetic corpus with the back-translation approach described in Section 3.2. After creating the synthetic parallel corpora, the baseline models are further trained for another 200,000 steps on the combined synthetic and clean parallel corpora (see Table 2); we stop training when there is no further improvement in model perplexity on the training data. Finally, all checkpoints created during training on the combined corpora are averaged (OpenNMT-py provides a script to average model weights), and the averaged parameters are used to test our models.
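Checkpoint averaging, the last step above, simply takes the element-wise mean of each parameter across saved checkpoints. OpenNMT-py ships a script for this; the dict-of-floats representation below is a deliberate simplification of real parameter tensors:

```python
# Minimal sketch of checkpoint averaging. Real checkpoints hold tensors;
# plain floats keyed by parameter name are used here for clarity.
def average_checkpoints(checkpoints):
    """Return a new parameter dict whose every entry is the mean of that
    entry across all given checkpoints."""
    keys = checkpoints[0].keys()
    return {k: sum(c[k] for c in checkpoints) / len(checkpoints) for k in keys}

ckpts = [
    {"w": 1.0, "b": 0.0},
    {"w": 3.0, "b": 1.0},
]
avg = average_checkpoints(ckpts)
print(avg)  # {'w': 2.0, 'b': 0.5}
```

Averaging the final checkpoints tends to smooth out step-to-step noise in the parameters, which often yields slightly better test performance than any single checkpoint.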

Results and Analysis
The official BLEU (Papineni et al., 2002), RIBES (Isozaki et al., 2010) and AMFM (Banchs et al., 2015) scores of the multilingual models are shown in Table 4. We observe that the performance of the XX → EN model is consistent across all language pairs in terms of all three metrics. HI-EN, the most represented pair (see Table 3), achieves a BLEU score of 40.08 points, and even the pair with the least amount of data (OR-EN) yields a BLEU score of 31.19 points. However, we do not observe the same with the EN → XX model: its performance is inconsistent, ranging from a high BLEU score of 34.48 points (EN-HI) down to 3.79 (EN-ML). We observe the same pattern in the RIBES scores. The AMFM scores of the EN → XX model, however, are fairly consistent despite the low BLEU and RIBES scores for some language pairs. Sen et al. (2018) observe that, in a multilingual setting where a single decoder has to handle information about several languages (7 in their case), the performance of the model is limited by the differing vocabularies and linguistic features. In our case, we romanize all data before feeding it to the model, and still the EN → XX model is unable to produce good quality translations. We believe the main reason for such low quality translations is that the romanized subword vocabulary, shared across 10 different languages, does not help the decoder at generation time. There are two possible ways to fix this issue: one is to use a larger target vocabulary, as the 100K subword vocabulary does not give good results in our case; another is to create a separate vocabulary for each language instead of a joint vocabulary, while keeping the data romanized.

Conclusion
In this paper, we described our submission to the MultiIndicMT shared task at WAT 2021. We submit two multilingual NMT models: many-to-one (10 Indic languages to English) and one-to-many (English to 10 Indic languages). We romanize all Indic language data to convert the tokens of all languages into Latin script. We also generate synthetic data using the back-translation approach and train our models on romanized data sets that combine the clean corpora and the synthetic back-translated corpora. We evaluate our models using BLEU, RIBES and AMFM scores and observe that the many-to-one model achieves its highest BLEU score of 40.08 for the Hindi-English pair, while the one-to-many model achieves its highest BLEU score of 34.48 for the English-Hindi pair. However, the shared subword vocabulary on the target side leads to poor performance of the one-to-many model, especially for the English to Tamil, Telugu and Malayalam pairs, with BLEU scores of 8.51, 6.25 and 3.79, respectively.