ANVITA Machine Translation System for WAT 2021 MultiIndicMT Shared Task

This paper describes the ANVITA-1.0 MT system, architected for submission to the WAT 2021 MultiIndicMT shared task by the mcairt team, where the team participated in 20 translation directions: English→Indic and Indic→English, with the Indic set comprising 10 Indian languages. The ANVITA-1.0 MT system comprised two multilingual NMT models with shared encoder-decoder, one for the English→Indic directions and the other for the Indic→English directions, catering to 10 language pairs and 20 translation directions. The base models were built on the Transformer architecture and trained on the MultiIndicMT WAT 2021 corpora; back translation and transliteration were further employed for selective data augmentation, and model ensembling for better generalization. Additionally, the MultiIndicMT WAT 2021 corpora were distilled using a series of filtering operations before training. ANVITA-1.0 achieved the highest AM-FM score for English→Bengali, 2nd for English→Tamil, and 3rd for English→Hindi and Bengali→English on the official test set. In general, the performance achieved by ANVITA for the Indic→English directions is relatively better than that for the English→Indic directions across all 10 language pairs when evaluated using BLEU and RIBES, although the same trend is not observed consistently under AM-FM based evaluation. Compared to BLEU, RIBES and AM-FM based scoring placed ANVITA relatively better among all the task participants.

Developing quality machine translation systems for the Indian languages remains a major challenge, as a large number of Indian languages are individually resource poor, which greatly impacts translation quality. However, recent developments show that with careful utilization of multilingualism and/or monolingual corpora, translation quality can be boosted (Johnson et al., 2017; Sennrich et al., 2015). The purpose of the WAT 2021 MultiIndicMT shared task is to validate the utility of MT techniques that focus on multilingualism and/or monolingual data in the context of Indian languages.
ANVITA-1.0 is realized as a Multilingual Neural Machine Translation (MNMT) system based on the Transformer architecture (Vaswani et al., 2017). As the Transformer is sensitive to training noise (Liu et al., 2018), we rigorously cleaned up the training corpus by applying a set of heuristics. For better transfer of translation knowledge among the language pairs, ANVITA-1.0 used a multilingual NMT approach and trained two models, one for English→Indic and one for Indic→English, with shared encoder-decoder similar to the MNMT models described by Johnson et al. (2017). Additionally, we employed back-translation (Sennrich et al., 2015) and transliteration between related languages (Li et al., 2019) for selective data augmentation, followed by model ensembling for better generalization. As Indian languages are morphologically rich, instead of word-level tokenization ANVITA-1.0 employed sub-word level tokenization using SentencePiece (Kudo and Richardson, 2018) before training. Details are given in the subsequent sections.
ANVITA-1.0 achieved the highest AM-FM score for English→Bengali, 2nd for English→Tamil, and 3rd for English→Hindi and Bengali→English on the official WAT 2021 MultiIndicMT test set. Overall, RIBES and Adequacy-Fluency based scoring placed us relatively better in the ranking chart than BLEU did.

Related Work
A comprehensive survey covering challenges, design choices and other aspects of Multilingual Neural Machine Translation (MNMT) was presented by Dabre et al. (2020). Liu et al. (2018) and Pinnis (2018) proposed heuristics for rigorous filtering of noise from parallel corpora. Li et al. (2019) proposed combining parallel corpora by transliteration between related languages (based on grammatical similarity), which improves performance. Back translation (Sennrich et al., 2015) is considered by many as one of the most effective mechanisms for enhancing MT performance.

Data sets
ANVITA-1.0 was primarily trained on the MultiIndicMT WAT 2021 corpora. Additionally, the AI4Bharat monolingual corpora were used for generating synthetic parallel data by back translation. No other corpora or linguistic resources were used in ANVITA-1.0.

System Overview
This section describes the ANVITA-1.0 MT system and its subsystems in reasonable detail.

Data Preprocessing
This section presents the set of preprocessing steps employed by ANVITA-1.0.

Data Filtering
Like most automatically curated corpora and corpora compiled from such sources, the MultiIndicMT WAT 2021 corpora are not free from noise. A quick glance through the corpora provided a rough assessment of the noise present and guided the set of heuristics used to filter out noisy sentence pairs. This is all the more critical as Transformer based models are sensitive to noise (Liu et al., 2018). Rigorous distillation of the training corpora was carried out by employing a set of heuristics similar to those described by Li et al. (2019). The heuristics applied for filtering noise from the MultiIndicMT WAT 2021 corpora are given below.
• Filter out a sentence pair if either the source or the target sentence is empty.
• Filter out a sentence pair if either the source or the target sentence is longer than 800 characters.
• Filter out a sentence pair if the ratio of source to target sentence length is greater than 2.5.
• Filter out a sentence pair if the ratio of source to target sentence length is less than 0.4.
• Filter out a sentence pair if the source sentence has at least 10 characters of another language.
• Filter out a sentence pair if at least 60% of the source sentence's characters belong to another language (UTF-8 ranges were used for other-language character identification).
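The heuristics above can be sketched as a single predicate over sentence pairs. This is a minimal illustration, not the paper's actual implementation: the thresholds come from the list above, while `is_foreign_char` is a simplified stand-in for the UTF-8 range checks (here assuming a Latin-script English source, so any non-ASCII letter counts as "another language").

```python
def is_noisy(src: str, tgt: str,
             max_chars: int = 800,
             max_ratio: float = 2.5,
             min_ratio: float = 0.4,
             foreign_abs: int = 10,
             foreign_frac: float = 0.60) -> bool:
    """Return True if the sentence pair should be filtered out."""

    def is_foreign_char(ch: str) -> bool:
        # Simplified stand-in for the UTF-8 range check: any
        # non-ASCII letter is treated as another language's script.
        return ch.isalpha() and not ch.isascii()

    # Empty source or target sentence
    if not src.strip() or not tgt.strip():
        return True
    # Over-long sentences (> 800 characters)
    if len(src) > max_chars or len(tgt) > max_chars:
        return True
    # Source/target length-ratio bounds
    ratio = len(src) / len(tgt)
    if ratio > max_ratio or ratio < min_ratio:
        return True
    # Foreign-script characters in the source sentence
    foreign = sum(1 for ch in src if is_foreign_char(ch))
    if foreign >= foreign_abs or foreign / len(src) >= foreign_frac:
        return True
    return False
```

A corpus pass would then keep only pairs for which `is_noisy(src, tgt)` is false.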
Approximately 15% of the total sentence pairs, amounting to 1.5 million, were tagged as noisy by the above heuristics and were filtered out from the MultiIndicMT WAT 2021 training corpora. Detailed corpus statistics after the filtering operation are given in Table-2. The final training data size after filtering was 8,731,036 sentence pairs. Data filtering improved both translation performance and convergence rate.

Tokenization at Sub-word Level
To effectively handle the morphological richness of Indian languages, sub-word level tokenization was employed instead of word- or character-level tokenization.

English→Indic:
The SentencePiece tokenizer (Kudo and Richardson, 2018) was used with an 80K joint vocabulary for the 10 target Indic languages, a 16K vocabulary for English, and a character coverage of 1.0.
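Training such vocabularies with the SentencePiece command-line tool could look roughly as follows; the corpus file names are hypothetical, and only the vocabulary sizes and character coverage are taken from the setup described above.

```
# Joint 80K vocabulary over the 10 Indic target languages
spm_train --input=train.indic.all \
          --model_prefix=indic_spm \
          --vocab_size=80000 \
          --character_coverage=1.0

# Separate 16K vocabulary for English
spm_train --input=train.en \
          --model_prefix=en_spm \
          --vocab_size=16000 \
          --character_coverage=1.0
```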

Tagging of Source Sentences
To better guide the input-output sequence mapping task under the multilingual setting, all sentences on the source side were tagged with language pair information using special tokens placed at the beginning of each source sentence (Johnson et al., 2017). Each language token consisted of 4 characters, all of them special symbols. Special symbols were used to avoid overlap of language tokens with data tokens, and the token length was chosen as the minimum number of characters required to tag the 10 language pairs distinctly. Language tokens were used only on the source side during training of both the Indic→English and English→Indic models. Table-3 lists the language tokens used.
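The tagging step itself is a simple prefix operation on each source sentence. The sketch below uses hypothetical placeholder tokens (the actual 4-character tokens are the ones listed in Table-3):

```python
# Hypothetical language-pair tokens; the real 4-character tokens
# built entirely from special symbols are listed in Table-3.
LANG_TOKENS = {
    ("en", "hi"): "<2hi",
    ("en", "bn"): "<2bn",
    ("hi", "en"): "<2en",
}

def tag_source(sentence: str, src_lang: str, tgt_lang: str) -> str:
    """Prepend the language-pair token to a source-side sentence."""
    token = LANG_TOKENS[(src_lang, tgt_lang)]
    return f"{token} {sentence}"
```

Only the source side is tagged; target sentences are left unchanged during training.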

Data Augmentation
Data augmentation has become a de facto step for low resource MT. The following strategies were applied for augmenting data in ANVITA-1.0.

Related Language Transliteration
As most of the languages fall under the low resource category, we employed a related-language transliteration strategy for the top three low resource languages. Relatedness is decided based on similarities between languages (Li et al., 2019). The top three low resource languages in the MultiIndicMT WAT 2021 corpora are Oriya (or), Kannada (kn), and Gujarati (gu). To the best of our knowledge, the related languages of these three low resource Indian languages are as listed in Table-4. Training data of a relatively high resource related language was transliterated into the low resource language using the transliteration method described by Bhat et al. (2014) and added to the low resource language's training data. For instance, Bengali sentences were transliterated into Oriya and augmented with the Oriya training data. As Marathi and Hindi share the same script, in order to avoid script overlap we mapped the characters of Marathi sentences to the Unicode block 0D80-0DFF. This seems to have reduced sharing of translation knowledge and impacted results; however, this needs to be verified further through experimentation.
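The script remapping for Marathi can be realized as a fixed code-point offset: Devanagari occupies U+0900-U+097F, and both it and the target block U+0D80-U+0DFF span 128 code points, so a one-to-one shift is possible. The exact mapping used in ANVITA-1.0 is not spelled out above; this is a plausible sketch under that offset assumption.

```python
DEVANAGARI_START, DEVANAGARI_END = 0x0900, 0x097F
TARGET_START = 0x0D80  # the Unicode block 0D80-0DFF mentioned above

def remap_script(text: str) -> str:
    """Shift Devanagari characters into the 0D80-0DFF block by a
    fixed offset; all other characters pass through unchanged."""
    out = []
    for ch in text:
        cp = ord(ch)
        if DEVANAGARI_START <= cp <= DEVANAGARI_END:
            out.append(chr(cp - DEVANAGARI_START + TARGET_START))
        else:
            out.append(ch)
    return "".join(out)
```

Applying `remap_script` to every Marathi sentence keeps Hindi and Marathi in disjoint script ranges within the shared vocabulary.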

Back Translation
Back translation (Sennrich et al., 2015) is considered one of the most effective mechanisms for enhancing MT performance, especially for low resource languages. As most of the languages involved in the task are low resource, back translation was applied for the top four low resource languages in the MultiIndicMT WAT 2021 corpora, namely Oriya, Kannada, Punjabi, and Gujarati. For this purpose, we extracted monolingual corpora of 6 lakh (600,000) sentences for each of the four low resource languages from the AI4Bharat (Kakwani et al., 2020) corpora. Statistics of the final training corpora after data augmentation are shown in Table-5.
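The back-translation pipeline can be summarized as: translate monolingual target-language text into the source language with a reverse-direction model, then pair the synthetic sources with the authentic targets. In this sketch `translate_back` is a stand-in for the trained reverse model, not an actual API of the system.

```python
def back_translate(monolingual_tgt, translate_back):
    """Create synthetic (source, target) training pairs from
    monolingual target-language sentences.

    `translate_back` stands in for a trained target->source MT
    model (e.g. an Indic->English model used to synthesize English
    sources for Oriya/Kannada/Punjabi/Gujarati monolingual text).
    """
    synthetic_pairs = []
    for tgt_sentence in monolingual_tgt:
        src_sentence = translate_back(tgt_sentence)  # synthetic side
        # The authentic sentence stays on the target side.
        synthetic_pairs.append((src_sentence, tgt_sentence))
    return synthetic_pairs
```

The resulting pairs are simply concatenated with the genuine parallel data before training.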

Model Training
ANVITA-1.0 was trained based on the Transformer architecture, and for better sharing of knowledge among Indian languages, especially the resource poor ones, two multilingual models were trained: (a) in One-to-Many fashion for English→Indic and (b) in Many-to-One fashion for Indic→English, with shared encoder-decoder, similar to the approach described by Johnson et al. (2017).
Ensembling multiple models that are diverse in nature has been shown to improve translation performance and generalization (Li et al., 2019). Due to time and resource limitations, we could not experiment with diverse models. However, we ensembled the last 5 checkpoints (560,000-600,000 iterations).
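In OpenNMT-py, checkpoint ensembling of this kind can be done at decode time by passing several model files to the translation command; their output distributions are combined when generating each token. The checkpoint and data file names below are illustrative, not the system's actual paths.

```
onmt_translate \
    -model model_step_560000.pt model_step_570000.pt \
           model_step_580000.pt model_step_590000.pt \
           model_step_600000.pt \
    -src test.src -output pred.txt -gpu 0
```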

Experimental Details
ANVITA-1.0 used the OpenNMT-py 2.0 (Klein et al., 2017) toolkit for training. The training configuration was: 600,000 steps for Indic→English and 440,000 steps for English→Indic, with a batch size of 4096, dropout 0.1, batch type tokens, the Adam optimizer, 8000 warmup steps, word embedding size 512, 6 encoder layers, 6 decoder layers, 8 attention heads, a feed-forward dimension of 2048, RNN size 512, and noam as the learning rate decay method. ANVITA-1.0 was trained on an NVIDIA DGX machine with 4 V100 GPU cards, each having 32 GB of GPU memory. Training took approximately 96 hours for Indic→English and approximately 72 hours for English→Indic.
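In OpenNMT-py 2.0 these hyperparameters are supplied through a YAML configuration; the fragment below restates only the values listed above (data paths and any options not mentioned in the text are omitted or hypothetical).

```
# Indic->English model (values as stated above)
train_steps: 600000      # 440000 for English->Indic
batch_size: 4096
batch_type: tokens
dropout: [0.1]
optim: adam
warmup_steps: 8000
decay_method: noam
word_vec_size: 512
enc_layers: 6
dec_layers: 6
heads: 8
transformer_ff: 2048
rnn_size: 512
world_size: 4            # hypothetical: 4 V100 GPUs as described
gpu_ranks: [0, 1, 2, 3]
```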

Evaluation and Results
Translation quality of ANVITA-1.0 was assessed by the organizer (Nakazawa et al., 2021) on the official WAT 2021 MultiIndicMT test set using BLEU, RIBES (Isozaki et al., 2010) and Adequacy-Fluency (AM-FM) (Banchs et al., 2015) based metrics. The official evaluation results, as declared by the organizer for all 20 translation directions, are shown in Table-6 and Table-7. Across translation directions, training data size seems to be positively correlated with translation performance. The exceptions are possibly due to implicit transfer of translation knowledge among the related languages.

Conclusion and Future Directions
The overall translation performance achieved by ANVITA-1.0 for the Indic→English directions is encouraging. Data augmentation largely aided the relatively lower resource languages. Transfer of translation knowledge through the shared encoder-decoder seems to have aided the related languages, and data filtering improved the overall performance. RIBES and AM-FM based scoring placed us relatively better than BLEU.
Translation performance figures for the Indic→English directions achieved by ANVITA-1.0 are relatively better than those for the English→Indic directions for all language pairs when evaluated using BLEU and RIBES, though the same trend is not observed consistently under AM-FM based evaluation. A potential reason could be that a One-to-Many mapping is harder to learn than a Many-to-One mapping with a shared decoder. One future direction would be to closely investigate whether having a shared encoder but separate decoders helps One-to-Many models in the Indic context. Though we applied a large number of data filtering heuristics, we noticed that the training data was still not free from noise, so another potential future direction would be to explore more effective data filtering techniques and their impact on MT performance. Exploration of additional data augmentation strategies, effective transfer of translation knowledge, and their respective shares in improving MT performance would be a critical direction for handling low resource languages. More diverse parallel corpora for the Indian languages would help Indic MT tasks, and automated methods for compiling large and diverse Indic corpora are much needed.