Language Relatedness and Lexical Closeness can help Improve Multilingual NMT: IITBombay@MultiIndicNMT WAT2021

Multilingual Neural Machine Translation has achieved remarkable performance by training a single translation model for multiple languages. This paper describes our submission (Team ID: CFILT-IITB) for the MultiIndicMT: An Indic Language Multilingual Task at WAT 2021. We train multilingual NMT systems by sharing encoder and decoder parameters, with a language embedding associated with each token in both the encoder and the decoder. Furthermore, we demonstrate the use of transliteration (script conversion) for Indic languages to reduce the lexical gap when training a multilingual NMT system. Finally, we show improved performance when a multilingual NMT system is trained using languages of the same family, i.e., related languages.


Introduction
Neural Machine Translation (Sutskever et al., 2014; Bahdanau et al., 2015; Wu et al., 2016) has become the de-facto standard for automatic translation between language pairs. NMT systems with Transformer-based architectures (Vaswani et al., 2017) have achieved competitive accuracy on data-rich language pairs like English-French. However, NMT systems are data-hungry, and only a few language pairs have abundant parallel data. In low-resource settings, techniques like transfer learning (Zoph et al., 2016) and utilization of monolingual data in an unsupervised setting (Artetxe et al., 2018; Lample et al., 2017, 2018) have been shown to improve translation accuracy. Multilingual Neural Machine Translation is an ideal setting for low-resource MT (Lakew et al., 2018) since it allows sharing of encoder-decoder parameters, word embeddings, and joint or separate vocabularies. It also enables zero-shot translation, i.e., translating between language pairs that were not seen during training (Johnson et al., 2017a).
To summarize our approach and contributions, we (i) present a multilingual NMT system with a shared encoder-decoder framework, (ii) show results on many-to-one translation, (iii) use transliteration to a common script to handle the lexical gap between languages, (iv) show how grouping languages according to their language family helps multilingual NMT, and (v) use language embeddings with each token in both the encoder and the decoder.
Related work

Neural Machine Translation
Neural Machine Translation architectures consist of encoder layers, attention layers, and decoder layers. The NMT framework takes a sequence of words as input; the encoder generates an intermediate representation, conditioned on which the decoder generates an output sequence. The decoder also attends to the encoder states. Bahdanau et al. (2015) introduced encoder-decoder attention to allow the decoder to soft-search parts of the source sentence when predicting the next token. The encoder-decoder can be an LSTM framework (Sutskever et al., 2014; Wu et al., 2016), a CNN (Gehring et al., 2017), or Transformer layers (Vaswani et al., 2017). A Transformer layer comprises a self-attention sub-layer, which combines the input sequence with positional encodings, followed by a feed-forward neural network, layer normalization, and residual connections. The decoder in the Transformer has an additional encoder-attention layer that attends to the output states of the Transformer encoder.
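The scaled dot-product attention at the heart of a Transformer layer can be sketched in a few lines. The following is a minimal pure-Python sketch for exposition only, not the authors' implementation; real systems use batched tensor operations in a deep-learning framework.

```python
# Minimal sketch of scaled dot-product self-attention over a sequence
# of token vectors. Plain Python lists stand in for tensors.
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def attention(queries, keys, values):
    """Each argument is a list of equal-length vectors, one per token."""
    d_k = len(keys[0])
    outputs = []
    for q in queries:
        # Dot-product score of this query against every key, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
                  for k in keys]
        weights = softmax(scores)
        # Output is the attention-weighted sum of the value vectors.
        out = [sum(w * v[i] for w, v in zip(weights, values))
               for i in range(len(values[0]))]
        outputs.append(out)
    return outputs
```

In self-attention, queries, keys, and values are all projections of the same input sequence; in the decoder's encoder-attention layer, keys and values come from the encoder states instead.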
NMT is data-hungry, and only a few language pairs have abundant parallel data. In recent years, NMT has been accompanied by several techniques to improve the performance of both low- and high-resource language pairs. Back-translation (Sennrich et al., 2016b) augments the parallel data with synthetic parallel data, generated by passing monolingual datasets through previously trained models. Current NMT systems can also perform on-the-fly back-translation, generating the synthetic data while training. Tokenization methods like Byte Pair Encoding (Sennrich et al., 2016a) are used in almost all NMT models. Pivoting (Cheng et al., 2017) and transfer learning (Zoph et al., 2016) have leveraged language relatedness by indirectly providing the model with more parallel data from related language pairs.
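To make the BPE idea concrete, the toy sketch below learns merge operations by repeatedly merging the most frequent adjacent symbol pair. The word list is illustrative; production tools such as fastBPE operate on word-frequency tables over the full training corpus.

```python
# Toy sketch of BPE merge learning (Sennrich et al., 2016a).
from collections import Counter

def learn_bpe(words, num_merges):
    # Represent each word as a tuple of symbols, starting from characters.
    vocab = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair, weighted by word frequency.
        pairs = Counter()
        for word, freq in vocab.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        merged = best[0] + best[1]
        # Rewrite every word with the new merged symbol.
        new_vocab = Counter()
        for word, freq in vocab.items():
            out, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                    out.append(merged)
                    i += 2
                else:
                    out.append(word[i])
                    i += 1
            new_vocab[tuple(out)] += freq
        vocab = new_vocab
    return merges

merges = learn_bpe(["lower", "lowest", "low"], 2)
```

Each learned merge becomes a subword-segmentation rule applied to the training and test data, which keeps the vocabulary fixed-size while still covering rare words.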

Multilingual Neural Machine Translation
Multilingual NMT trains a single model utilizing data from multiple language pairs to improve performance. There are different approaches to incorporating multiple language pairs in a single system, like multi-way NMT, pivot-based NMT, transfer learning, multi-source NMT, and multilingual NMT (Dabre et al., 2020). Multilingual NMT came into the picture because many languages share a certain amount of vocabulary and some structural similarity; such languages can together be utilized to improve the performance of NMT systems. In this paper, our focus is to analyze the performance of multi-source NMT. The simplest approach is to share the parameters of the NMT model across multiple language pairs; such systems work better if the languages are related to each other. In Johnson et al. (2017b), the encoder, decoder, and attention are shared while training on multiple language pairs, and a token naming the target language is prepended to the input so the model knows which language to decode into. Firat et al. (2016) utilize a shared attention mechanism to train multilingual models. Recently, many approaches have been proposed where monolingual data of multiple languages is utilized to pre-train a single model using objectives like masked language modeling and denoising (Lample and Conneau, 2019; Song et al., 2019). Multilingual pre-training followed by multilingual finetuning has also proven beneficial (Tang et al., 2020).
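The target-language token scheme of Johnson et al. (2017b) amounts to a one-line preprocessing step; the exact token format (`<2xx>`) below is an illustrative assumption, not one mandated by that paper.

```python
# Prepend a target-language tag so a single shared model knows which
# language to translate into. The "<2xx>" format is illustrative.
def tag_source(sentence, target_lang):
    return f"<2{target_lang}> {sentence}"

tagged = tag_source("यह एक उदाहरण है", "en")
```

At training time every source sentence is tagged this way, which is also what makes zero-shot translation possible: any tag seen on the target side can be requested for any source language.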

Language Relatedness
Telugu, Tamil, Kannada, and Malayalam are Dravidian languages whose speakers are predominantly found in South India, with some speakers in Sri Lanka and a few pockets of speakers in North India. The speakers of these languages constitute around 20% of the Indian population (Kunchukuttan and Bhattacharyya, 2020). Dravidian languages are agglutinative, i.e., long and complex words are formed by stringing together morphemes without changes in spelling or phonetics. Most Dravidian languages have a clusivity distinction. Hindi, Bengali, Marathi, Gujarati, Oriya, and Punjabi are Indo-Aryan languages and are primarily spoken in North and Central India and the neighboring countries of Pakistan, Nepal, and Bangladesh. The speakers of these languages constitute around 75% of the Indian population. Both the Dravidian and Indo-Aryan language families follow the Subject(S)-Object(O)-Verb(V) order.
Grouping languages according to their families has inherent advantages because each group is closely related, with several linguistic phenomena shared among its members. Indo-Aryan languages are morphologically rich compared to English and have strong similarities with one another. Languages in a group also share vocabulary at both the word and character level: they contain similarly spelled words derived from the same root.

Transliteration
Indic languages share a lot of vocabulary, but most of them use different scripts. Nevertheless, these scripts have substantial phoneme overlap and can be converted easily from one to another using a simple rule-based system. To convert all Indic language data into the same script, we use IndicNLP 1, which maps between the Unicode ranges of the different scripts. Converting all Indic language scripts to the same script improves vocabulary sharing and leads to a smaller subword vocabulary (Ramesh et al., 2021).
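The Unicode-range mapping works because the major Brahmi-derived blocks are laid out in parallel, so many characters convert by a fixed codepoint offset. The sketch below illustrates the idea only; it ignores script-specific exceptions that a library such as IndicNLP handles properly.

```python
# Illustrative rule-based script conversion to Devanagari via
# parallel Unicode block offsets (subset of scripts shown).
SCRIPT_BASE = {
    "devanagari": 0x0900,
    "bengali":    0x0980,
    "tamil":      0x0B80,
    "telugu":     0x0C00,
    "kannada":    0x0C80,
    "malayalam":  0x0D00,
}

def to_devanagari(text, source_script):
    base = SCRIPT_BASE[source_script]
    out = []
    for ch in text:
        cp = ord(ch)
        if base <= cp < base + 0x80:  # inside the source script's block
            # Same offset within the Devanagari block.
            out.append(chr(SCRIPT_BASE["devanagari"] + (cp - base)))
        else:
            out.append(ch)  # punctuation, digits, etc. pass through
    return "".join(out)
```

For example, Bengali ka (U+0995) maps to Devanagari ka (U+0915), since both sit at the same offset within their respective blocks.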

System overview
In this section, we describe the details of the systems submitted to the MultiIndicMT task at WAT2021. We report results for four types of models:
• Bilingual: Trained using only the parallel data for a particular language pair.
• All-En: Multilingual many-to-one system trained using all available parallel data of all language pairs.
• IA-En: Multilingual many-to-one system trained using Indo-Aryan languages from the provided parallel data.
• DR-En: Multilingual many-to-one system trained using Dravidian languages from the provided parallel data.
To train our multilingual models, we use a shared encoder-decoder Transformer architecture. To handle the lexical gap between Indic languages in multilingual models, we convert the data of all Indic languages to a common script; we choose Devanagari as the common script (an arbitrary choice). We also perform a comparative study of systems in which the encoder and decoder are shared only between related languages. For this comparative study, we split the provided languages into two groups based on the language families they belong to, i.e., one system is trained from the Indo-Aryan group to English and another from the Dravidian group to English. Indo-Aryan-to-English covers Bengali, Gujarati, Hindi, Marathi, Oriya, and Punjabi to English, and Dravidian-to-English covers Kannada, Malayalam, Tamil, and Telugu to English. We use a shared subword vocabulary of the languages involved while training multilingual models, and a common vocabulary of the source and target languages to train bilingual models.
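The grouping described above can be written as a small configuration (ISO 639-1 codes; the grouping follows the paper, the dictionary layout itself is illustrative):

```python
# Language groups used in the comparative study.
LANGUAGE_GROUPS = {
    "IA-En": ["bn", "gu", "hi", "mr", "or", "pa"],  # Indo-Aryan -> English
    "DR-En": ["kn", "ml", "ta", "te"],              # Dravidian -> English
}

# The All-En system trains on every language pair at once.
ALL_EN = LANGUAGE_GROUPS["IA-En"] + LANGUAGE_GROUPS["DR-En"]
```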

Experimental details

Dataset
Our models are trained using only the parallel data provided for the task. The size of the available parallel data and its sources are summarized in Table 1. The validation and test data provided in the task are n-way parallel, with 1000 sentences in the validation set and 2390 sentences in the test set.

Data preprocessing
We tokenize English data using the Moses tokenizer (Koehn et al., 2007) and Indian language data using the IndicNLP 2 library. For multilingual models, we transliterate (script mapping) all Indic language data into the Devanagari script using the IndicNLP library. Our aim is to convert the data of all languages into the same script, hence the choice of Devanagari as the common script is arbitrary. We use fastBPE 3 to learn BPE (Byte Pair Encoding) (Bojanowski et al., 2017). For bilingual models, we use 60000 BPE codes over the combined tokenized data of both languages. The number of BPE codes is set to 100000 for All-En, and 80000 for DR-En and IA-En.

Experimental Setup
We use six layers in the encoder, six layers in the decoder, 8 attention heads in both encoder and decoder, and an embedding dimension of 1024. The encoder and decoder are trained using the Adam optimizer (Kingma and Ba, 2015) with an inverse square root learning rate schedule. For the warmup phase, we use the same setting as Song et al. (2019): the learning rate is increased linearly from 1e-7 to 0.0001 over the first 4000 steps. We use minibatches of 2000 tokens and set the dropout to 0.1 (Gal and Ghahramani, 2016). The maximum sentence length is set to 100 after applying BPE. At decoding time, we use greedy decoding. For our experiments, we use the MT steps from the MASS 4 codebase. Our models are trained using only the parallel data provided in the task; we do not use any pretraining objective. We train bilingual models for 100 epochs and multilingual models for 150 epochs, with the epoch size set to 200000 sentences. Due to resource constraints, we train our models for a fixed number of epochs, which does not guarantee convergence. Similar to MASS (Song et al., 2019), language embeddings are added to each token in the encoder and decoder to distinguish between languages; these language embeddings are learnt during training.
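The learning rate schedule described above can be sketched as follows; the constants match the ones stated in this section (linear warmup from 1e-7 to 1e-4 over 4000 steps, then inverse square root decay), while the exact decay form is a common convention and an assumption on our part.

```python
# Inverse-square-root schedule with linear warmup.
import math

WARMUP_STEPS = 4000
PEAK_LR = 1e-4
INIT_LR = 1e-7

def learning_rate(step):
    if step < WARMUP_STEPS:
        # Linear increase from INIT_LR to PEAK_LR during warmup.
        return INIT_LR + (PEAK_LR - INIT_LR) * step / WARMUP_STEPS
    # After warmup, decay proportionally to 1/sqrt(step).
    return PEAK_LR * math.sqrt(WARMUP_STEPS / step)
```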

Results and Discussion
We report BLEU scores for our four settings: bilingual, All-En (multilingual many-to-one), IA-En (multilingual many-to-one Indo-Aryan to English), and DR-En (multilingual many-to-one Dravidian to English) in Table 2. We use multi-bleu.perl 5 to calculate BLEU scores of the baseline models. BLEU scores are calculated on the tokenized reference and hypothesis files, following the organizers' evaluation procedure.
The BLEU scores in Table 2 highlight that the multilingual models outperform the simpler bilingual models. Although we did not submit bilingual models to the shared task, we use them here as baselines against the multilingual models. Moreover, upon grouping languages based on their language families, a significant improvement in BLEU scores is observed, owing to less confusion and better learning of language representations in the shared encoder-decoder architecture. We observe that the BLEU score increases by 14 percent on average when the languages are grouped based on their families (IA-En & DR-En), and by 7 percent when all languages are combined in a single multilingual model (All-En), compared to the bilingual models. The IA-En and DR-En BLEU scores being better than both the bilingual and the combined multilingual (All-En) models encourages the exploitation of linguistic insights like language relatedness and lexical closeness among language families. Table 3 shows the percentage of vocabulary overlap between pairs of languages. We obtain the vocabulary of each language from the source-language part of the BPE-processed parallel training files, as used in the All-En experiment; the vocabulary size differs per language. The value in each cell is calculated as overlap(lang1, lang2) = |V1 ∩ V2| / |V1| (Equation 1), where V1 and V2 are the vocabularies of lang1 and lang2 respectively: the numerator is the size of the intersection of the two vocabularies and the denominator is the size of the vocabulary of lang1.
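The overlap measure used for Table 3 is straightforward to compute; the token lists below are illustrative stand-ins for the BPE-processed training data.

```python
# Vocabulary overlap: |V1 & V2| / |V1|.
def vocab_overlap(tokens1, tokens2):
    v1, v2 = set(tokens1), set(tokens2)
    return len(v1 & v2) / len(v1)
```

Note that the measure is asymmetric: because the denominator is the size of lang1's vocabulary, overlap(lang1, lang2) need not equal overlap(lang2, lang1), which is why Table 3 is not a symmetric matrix.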
Almost all Indic languages provided in the task (bn, gu, hi, mr, or, pa, kn, ml, ta, te) use different scripts; the exceptions are hi and mr, which both use the Devanagari script. It is clear from Table 3 that transliteration to a common script increases the shared vocabulary and helps the model leverage the lexical similarity between languages.

Conclusion
In this paper, we study the influence of sharing encoder-decoder parameters between related languages in multilingual NMT by performing experiments with groupings of languages based on language family. Furthermore, we perform multilingual NMT experiments with all Indic language data converted to the same script, which helps the model learn better translations by exploiting the increased shared vocabulary.
In the future, we plan to utilize monolingual data from (Kakwani et al., 2020) to improve multilingual NMT further.