IIIT Hyderabad Submission To WAT 2021: Efficient Multilingual NMT systems for Indian languages

This paper describes the work and the systems submitted by the IIIT-Hyderabad team in the WAT 2021 MultiIndicMT shared task. The task covers 10 major languages of the Indian subcontinent. For the scope of this task, we built multilingual systems for 20 translation directions, namely English-Indic (one-to-many) and Indic-English (many-to-one). Individually, Indian languages are resource-poor, which hampers translation quality, but by leveraging multilingualism and abundant monolingual corpora, translation quality can be substantially boosted. However, multilingual systems are highly complex in terms of both time and computational resources. Therefore, we train our systems by efficiently selecting the data that contributes most to the learning process. Furthermore, we also exploit the relatedness found among Indian languages. All comparisons were made using the BLEU score, and we found that our final multilingual system significantly outperforms the baselines by an average of 11.3 and 19.6 BLEU points for the English-Indic (en-xx) and Indic-English (xx-en) directions, respectively.


Introduction
Good translation systems are an important requirement due to the substantial government, business and social communication among people speaking different languages. Neural machine translation (NMT) (Sutskever et al., 2014; Bahdanau et al., 2014; Vaswani et al., 2017) is the current state-of-the-art approach for machine translation in both academia and industry. The success of NMT heavily relies on substantial amounts of parallel sentences as training data (Koehn and Knowles, 2017), which are hard to obtain for low-resource languages such as Indian languages (Philip et al., 2021). Many techniques have been devised to improve the translation quality of low-resource languages, such as back translation (Sennrich et al., 2015), dual learning (Xia et al., 2016) and transfer learning (Zoph et al., 2016; Kocmi and Bojar, 2018). Moreover, with the traditional approaches, one would still need to train a separate model for each translation direction. Building multilingual neural machine translation models that share parameters with high-resource languages is therefore a common practice to improve the performance of low-resource language pairs (Firat et al., 2017; Johnson et al., 2017; Ha et al., 2016). Low-resource language pairs perform better when combined than when their models are trained separately, owing to the sharing of parameters. Multilingual modelling also enables training a single model that supports translation from multiple source languages to a single target language, or from a single source language to multiple target languages. This approach usually works by combining all the parallel data in hand, which makes the training process quite complex in terms of both time and computational resources (Arivazhagan et al., 2019). Therefore, we train our systems by efficiently selecting the data that contributes most to the learning process. Learning can also be hindered by language pairs that show no relatedness to one another.
Indian languages, on the other hand, exhibit many lexical and structural similarities on account of sharing a common ancestry (Kunchukuttan and Bhattacharyya, 2020). Therefore, in this work, we exploit the lexical similarity of these related languages to build efficient multilingual NMT systems. This paper describes our work in the WAT 2021 MultiIndicMT shared task (cite). The task covers 10 Indian languages (Bengali, Gujarati, Hindi, Kannada, Malayalam, Marathi, Oriya, Punjabi, Tamil and Telugu) and English. The objective of this shared task is to build translation models for 20 translation directions (English-Indic and Indic-English). This paper is organized as follows. Section 2 describes the methodology behind our experiments. Section 3 covers the experimental details, such as dataset pre-processing and training. Results and analysis are discussed in Section 4, followed by the conclusion in Section 5.

Exploiting Language Relatedness
India is one of the most linguistically diverse countries in the world, but underlying this vast diversity are many commonalities. Indian languages exhibit lexical and structural similarities on account of sharing a common ancestry or being in contact for a long period of time (Bhattacharyya et al., 2016). They share many cognates, and it is therefore important to utilize this lexical similarity to build good-quality multilingual NMT systems. To do so, we use two approaches, Unified Transliteration and Subword Segmentation, proposed by Goyal et al. (2020).

Unified Transliteration
The major Indian languages have long written traditions and use a variety of scripts, but since these scripts are all derived from the ancient Brahmi script, correspondences can be established between equivalent characters across them. To exploit this, we transliterate all the Indian languages into a common script, Devanagari (the script of Hindi, in our case), so that they share the same surface form. This unified transliteration is a string homomorphism, mapping the characters of every language into the single desired script.
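As a minimal, self-contained sketch (for illustration only; the actual pipeline uses the Indic NLP library for this step), the homomorphism can be implemented as a fixed Unicode codepoint offset, since the Brahmi-derived scripts occupy aligned 128-codepoint Unicode blocks:

```python
# Sketch of unified transliteration as a string homomorphism.
# Brahmi-derived scripts sit in aligned 128-codepoint Unicode blocks,
# so mapping a character into Devanagari is a fixed codepoint offset.
# NOTE: a few codepoints have no exact counterpart across scripts;
# this illustrative sketch ignores those exceptions.

DEVANAGARI_START = 0x0900

# Start of each script's Unicode block (language code -> block start).
BLOCK_START = {
    "bn": 0x0980,  # Bengali
    "pa": 0x0A00,  # Gurmukhi (Punjabi)
    "gu": 0x0A80,  # Gujarati
    "or": 0x0B00,  # Oriya
    "ta": 0x0B80,  # Tamil
    "te": 0x0C00,  # Telugu
    "kn": 0x0C80,  # Kannada
    "ml": 0x0D00,  # Malayalam
}

def to_devanagari(text: str, src: str) -> str:
    """Map every character of `text` from script `src` into Devanagari."""
    start = BLOCK_START[src]
    out = []
    for ch in text:
        cp = ord(ch)
        if start <= cp < start + 0x80:   # character inside the source block
            out.append(chr(DEVANAGARI_START + (cp - start)))
        else:                            # punctuation, digits, Latin, etc.
            out.append(ch)
    return "".join(out)
```

For example, Bengali KA (U+0995) maps to Devanagari KA (U+0915); a production transliterator additionally needs language-specific exception handling that this sketch omits.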

Subword Segmentation
Despite sharing many cognates, Indian languages do not share many words at the non-root level. A more effective approach is therefore to model Indian languages at the subword level, which ensures greater vocabulary overlap. We segment every word into subwords using the well-known Byte Pair Encoding (BPE) technique (Sennrich et al., 2015). BPE is applied after the unified transliteration, which ensures that the languages share the same surface form (script). BPE units are variable-length units that provide appropriate context for translation systems involving related languages. Since their vocabularies are much smaller than those of morpheme- and word-level models, data sparsity is also not a problem. In a multilingual scenario, jointly learning BPE merge rules not only finds the common subwords between multiple languages but also ensures consistency of segmentation for each considered language pair.
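The merge-rule learning described above can be sketched as follows (a toy re-implementation of the algorithm of Sennrich et al. (2015), not the actual subword tooling used in our pipeline):

```python
import re
from collections import Counter

def get_pair_stats(vocab):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for i in range(len(symbols) - 1):
            pairs[(symbols[i], symbols[i + 1])] += freq
    return pairs

def merge_pair(pair, vocab):
    """Apply one merge rule to every word in the vocabulary."""
    bigram = re.escape(" ".join(pair))
    pattern = re.compile(r"(?<!\S)" + bigram + r"(?!\S)")
    return {pattern.sub("".join(pair), w): f for w, f in vocab.items()}

def learn_bpe(word_freqs, num_merges):
    """Learn BPE merge rules from a {word: frequency} dictionary."""
    # Represent each word as space-separated characters plus an end marker.
    vocab = {" ".join(w) + " </w>": f for w, f in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        pairs = get_pair_stats(vocab)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent adjacent pair
        vocab = merge_pair(best, vocab)
        merges.append(best)
    return merges, vocab
```

For joint BPE, the `{word: frequency}` dictionary is simply built over the concatenated (transliterated) corpora of all languages, so frequent shared subwords are merged identically in every language.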

Data Selection Strategy
The traditional approach of training a multilingual system simply combines all the parallel data in hand, which is infeasible in terms of both time and computational resources. Therefore, in order to select only the relevant domains, we incrementally add domains in decreasing order of their vocabulary overlap with the PMI domain (Haddow and Kirefu, 2020). A drop in the BLEU score (Papineni et al., 2002) is the stopping criterion for our strategy. The vocabulary overlap between any two domains is calculated using the formula shown below:

Overlap(d1, d2) = (|Vocab_d1 ∩ Vocab_d2| / |Vocab_d1 ∪ Vocab_d2|) × 100

Here, Vocab_d1 and Vocab_d2 represent the vocabularies of domain 1 and domain 2, respectively. The vocabulary overlap of each domain with PMI is shown in Table 1.
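A sketch of one way to compute such an overlap between two domain corpora (a Jaccard-style ratio of shared to combined vocabulary; the exact normalization used in the experiments may differ):

```python
def vocab_overlap(corpus_d1, corpus_d2):
    """Percentage vocabulary overlap between two domains, computed as a
    Jaccard-style ratio: |V1 & V2| / |V1 | V2| * 100.
    Each corpus is a list of (tokenized) sentences."""
    v1 = {tok for sent in corpus_d1 for tok in sent.split()}
    v2 = {tok for sent in corpus_d2 for tok in sent.split()}
    return 100.0 * len(v1 & v2) / len(v1 | v2)

def rank_domains_by_overlap(pmi_corpus, domain_corpora):
    """Order candidate domains by decreasing vocab overlap with PMI,
    i.e. the order in which they are incrementally added to training."""
    return sorted(domain_corpora,
                  key=lambda name: vocab_overlap(pmi_corpus,
                                                 domain_corpora[name]),
                  reverse=True)
```

In the incremental loop, one would retrain after adding each ranked domain and stop as soon as the development-set BLEU drops.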

Back Translation
Back translation (Sennrich et al., 2015) is a widely used data augmentation method in which a model trained in the reverse direction translates target-side monolingual data into the source language. This synthetic parallel data is combined with the actual parallel data to re-train the model, leading to better language modelling on the target side, regularization and target-domain adaptation. Back translation is particularly useful for low-resource languages. We use back translation to augment our multilingual models. Since the back-translated data is generated by multilingual models in the reverse direction, some implicit multilingual transfer is incorporated into it as well. For the scope of this paper, we used the PMI monolingual data provided on the WAT website.
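Schematically, the augmentation step looks as follows, where `reverse_model` is a stand-in for any trained target-to-source multilingual model (a hypothetical callable, for illustration):

```python
def back_translate(monolingual_tgt, reverse_model):
    """Create synthetic (source, target) pairs from target-side
    monolingual sentences using a reverse-direction translator.
    `reverse_model` is any callable: target sentence -> source sentence."""
    synthetic = []
    for tgt_sent in monolingual_tgt:
        src_sent = reverse_model(tgt_sent)        # synthetic source side
        synthetic.append((src_sent, tgt_sent))    # target side is real text
    return synthetic

def augment(parallel_pairs, synthetic_pairs):
    """Combine real and synthetic parallel data for re-training."""
    return parallel_pairs + synthetic_pairs
```

The target side of every synthetic pair is genuine text, which is what improves target-side language modelling.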

Multilingual NMT and Fine-tuning
A multilingual model enables us to translate to and from multiple languages using a shared word-piece vocabulary, which is significantly simpler than training a different model for each language pair. We used the technique proposed by Johnson et al. (2017), which introduces a "language flag" based approach that shares the attention mechanism and a single encoder-decoder network across languages. A language flag, or token, is added to the input sequence to indicate which direction to translate to, and the decoder learns to generate the target given this input. This approach has been shown to be simple and effective, and it forces the model to generalize across language boundaries during training. It has also been observed that when language pairs with little available data and language pairs with abundant data are mixed into a single model, translation quality on the low-resource pair improves significantly. Furthermore, we fine-tune our multilingual system on the PMI (multilingual) domain by means of transfer learning between the parent and the child model.
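A minimal sketch of the language-flag mechanism (the `<2xx>` token format is one common convention, assumed here for illustration):

```python
def add_language_flag(src_sentence: str, tgt_lang: str) -> str:
    """Prepend a target-language token so a single multilingual model
    knows which direction to translate (Johnson et al., 2017)."""
    return f"<2{tgt_lang}> {src_sentence}"

def build_multilingual_corpus(corpora):
    """Mix direction-specific corpora {(src_lang, tgt_lang): [(src, tgt)]}
    into one flagged training set for a single encoder-decoder model."""
    examples = []
    for (src_lang, tgt_lang), pairs in corpora.items():
        for src, tgt in pairs:
            examples.append((add_language_flag(src, tgt_lang), tgt))
    return examples
```

The flag token is just another vocabulary item, so no architectural change is needed; high- and low-resource directions share all parameters of the resulting model.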

Dataset and Preprocessing
We use the dataset provided in the WAT 2021 shared task. Our experiments mainly use the PMI (Haddow and Kirefu, 2020), CVIT (Siripragada et al., 2020) and IIT-B (Kunchukuttan et al., 2017) parallel datasets, along with the PMI monolingual data for further improvements (see Table 2). We used the Moses toolkit (Koehn et al., 2007) for tokenization and cleaning of English, and the Indic NLP library (Kunchukuttan, 2020) for normalization, tokenization and transliteration of all Indian languages. For our bilingual models we used BPE segmentation with 16K merge operations, and for multilingual models we learned joint BPE on the source and target sides with 16K merges (Sennrich et al., 2015).

Training
For all of our experiments, we use the OpenNMT-py toolkit (Klein et al., 2017) to train the NMT systems. We used the Transformer model with 6 layers in both the encoder and decoder, each with 512 hidden units. The word embedding size is set to 512 with 8 attention heads. Training is done in batches of at most 4096 tokens with dropout set to 0.3. We use the Adam optimizer (Kingma and Ba, 2014) to optimize model parameters. We validate the model every 5,000 steps via BLEU (Papineni et al., 2002) and perplexity on the development set, and train all of our models with an early-stopping criterion based on validation-set accuracy. During testing, we rejoin translated BPE segments and convert the translated sentences back to their original scripts. Finally, we evaluate the accuracy of our translation models using BLEU.

Results and Analysis
We report BLEU scores on the test set provided in the WAT 2021 MultiIndicMT shared task. Table 3 and Table 4 present the results of the different experiments we performed for the En-XX and XX-En directions, respectively. The rows corresponding to PMI + CVIT + Back Translation + Fine-tuning on PMI multilingual constitute our final system submitted for this shared task (the BLEU scores shown in the tables are from the automatic evaluation system). We observe that the multilingual system trained on PMI outperforms the bilingual PMI baseline by significant margins. The reason for this is the ability to induce learning from multiple languages; there is also an increase in vocabulary overlap due to our technique of exploiting language relatedness. We then tried to improve the system further by incrementally adding relevant domains based on vocabulary overlap. We observed a decrease in BLEU score after adding the IIT-B corpus and therefore stopped our incremental training at that point. Finally, our multilingual model using back translation and fine-tuning outperforms all other systems. Our submission was also evaluated with AMFM scores, which can be found on the WAT 2021 evaluation website.

Conclusion
This paper presents the submissions by IIIT Hyderabad to the WAT 2021 MultiIndicMT shared task. We performed experiments combining different pre-processing and training techniques in series to achieve competitive results, and demonstrated the effectiveness of each technique. Our final submission was able to achieve the second rank in this task according to the automatic evaluation.