Samsung R&D Institute Poland submission to WAT 2021 Indic Language Multilingual Task

This paper describes the submission of Samsung R&D Institute Poland to the WAT 2021 Indic Language Multilingual Task. The task covered translation between 10 Indic languages (Bengali, Gujarati, Hindi, Kannada, Malayalam, Marathi, Oriya, Punjabi, Tamil and Telugu) and English. We combined a variety of techniques: transliteration, filtering, backtranslation, domain adaptation, knowledge distillation and finally ensembling of NMT models. We applied an effective approach to low-resource training that consists of pretraining on backtranslations and tuning on parallel corpora. We experimented with two different domain-adaptation techniques which significantly improved translation quality when applied to monolingual corpora. We researched and applied a novel approach to finding the best hyperparameters for ensembling a number of translation models. All techniques combined gave a significant improvement of up to +8 BLEU over the baseline results. The quality of the models was confirmed by human evaluation, in which SRPOL models scored best for all 5 manually evaluated languages.


Introduction
The Samsung R&D Poland team researched effective techniques that worked especially well for low-resource languages: transliteration and iterative backtranslation followed by tuning on parallel corpora. We successfully applied these techniques in the WAT 2021 competition (Nakazawa et al., 2021). Specifically for the competition, we also applied custom domain-adaptation techniques which substantially improved the final results.
Most of the applied techniques and ideas are commonly used in work on machine translation of Indian languages (Chu and Wang, 2018; Dabre et al., 2020).
This document is structured as follows. In Section 2 we describe the sources of the corpora and the techniques used to prepare them for training. In Sections 3 and 4 we describe the model architecture and the techniques used in training, tuning and ensembling. Finally, Section 5 presents the results obtained at every stage of the training.
All trainings were performed on Transformer models with the standard Marian NMT v1.9 framework.

Multilingual trainings
Multilingual models trained for the competition use a target-language tag at the beginning of the sentence to select the direction of the translation.
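The mechanism can be sketched as follows; the exact surface form of the tag (here `<2xx>`) is an assumption, as the paper does not specify it:

```python
def add_target_tag(source_sentence: str, target_lang: str) -> str:
    """Prepend a target-language token so a single multilingual model
    can be steered to produce the requested output language."""
    return f"<2{target_lang}> {source_sentence}"

# The same English source, tagged for two different target languages:
tagged = [add_target_tag("this is a test", lang) for lang in ("hi", "ta")]
```

The model then learns during training to associate the leading token with the language of the reference translation.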

Transliteration
Indian languages use a variety of scripts. Transliteration between the scripts of similar languages may improve the quality of multilingual models, as described in (Bawden et al., 2019; Goyal and Sharma, 2019). Our transliteration replaced the letters of the various Indian scripts with their equivalents in the Devanagari script. We used the Indic NLP library to perform the transliteration.
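The Indic NLP library exploits the fact that the major Indic scripts occupy parallel 128-codepoint Unicode blocks laid out like Devanagari, so most characters map by a fixed offset. A minimal illustrative sketch of that principle (not the library's actual implementation, and ignoring the handful of codepoints that differ between blocks):

```python
# Start codepoints of the Unicode blocks for the non-Devanagari scripts
# used by the task languages; each block mirrors the Devanagari layout.
INDIC_BLOCKS = {
    "bn": 0x0980, "gu": 0x0A80, "pa": 0x0A00, "or": 0x0B00,
    "ta": 0x0B80, "te": 0x0C00, "kn": 0x0C80, "ml": 0x0D00,
}
DEVANAGARI = 0x0900

def to_devanagari(text: str, lang: str) -> str:
    """Map characters of the given script onto Devanagari by offset.
    Caveat: a few positions are unassigned or differ across blocks,
    so a production transliterator needs per-script exceptions."""
    start = INDIC_BLOCKS[lang]
    out = []
    for ch in text:
        cp = ord(ch)
        if start <= cp < start + 0x80:
            out.append(chr(cp - start + DEVANAGARI))
        else:
            out.append(ch)  # punctuation, digits, Latin, etc. pass through
    return "".join(out)

# Bengali letter KA (U+0995) maps to Devanagari KA (U+0915)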
In our previous experiments with Indian languages we noticed an overall quality improvement for multi-Indian models, so we used transliteration in all trainings. However, additional experiments on transliteration during the competition were not conclusive: the results of trainings on raw corpora, without transliteration, were similar (see Table 1).

Parallel Corpora Filtering
The base corpus for all trainings was the concatenation of the complete bilingual corpora provided by the organizers (further referenced as bitext), 11M lines in total. No filtering or preprocessing (apart from transliteration) was performed on this corpus. The corpus included parallel data from: CVIT-PIB, PMIndia, IITB 3.0, JW, NLPC, UFAL EnTam and Uka Tarsadia. We experimented with several filtering techniques on the parallel corpora, but none of them led to a quality improvement, which is why we decided to continue with the basic non-filtered corpora as the base for further trainings.

Backtranslation
Backtranslation of monolingual corpora is a commonly used technique for improving machine translation, especially for low-resource languages where only small bilingual corpora are available (Edunov et al., 2018). Training on backtranslations enriches the target-side language model, which improves the overall translation quality. The synthetic backtranslated corpus was joined with the original bilingual corpus for the trainings.
Using backtranslations of the full monolingual corpora improved the results in the Indian-to-English directions by 1.2 BLEU on average. There was no improvement in the opposite directions. See Tables 5 and 6.

Domain adaptation
We enriched the parallel training corpora with backtranslated monolingual data, selecting only sentences similar to the PMI domain in order to increase the rate of in-domain data in the training corpus. We used two different techniques to select the in-domain sentences for backtranslation, and with these techniques we trained two separate families of MT models.
Domain adaptation by fastText (FT) - We applied the domain adaptation described in (Yu et al., 2020). Following the hints from the paper, we trained a fastText (Joulin et al., 2017) model on a balanced corpus containing sentences from PMIndia labelled as in-domain and CCAligned sentences labelled as out-of-domain. Using the trained model we filtered both the parallel and the monolingual corpora.
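The filtering step amounts to keeping only sentences the classifier labels in-domain. A self-contained sketch of that step; the toy keyword scorer below merely stands in for the trained fastText supervised model (which, in the real setup, would come from `fasttext.train_supervised` on `__label__`-prefixed PMIndia/CCAligned sentences), and the cue words are hypothetical:

```python
# Hypothetical in-domain cue words; a real fastText model learns these
# distinctions from the labelled PMIndia vs CCAligned training data.
IN_DOMAIN_CUES = {"minister", "government", "scheme"}

def predict(sentence: str):
    """Stand-in for model.predict(): returns (label, confidence)."""
    words = set(sentence.lower().split())
    if words & IN_DOMAIN_CUES:
        return "__label__in", 0.9
    return "__label__out", 0.9

def filter_in_domain(sentences, min_conf=0.5):
    """Keep only sentences classified as in-domain with enough confidence."""
    return [s for s in sentences
            if predict(s) == ("__label__in", 0.9) or
               (predict(s)[0] == "__label__in" and predict(s)[1] >= min_conf)]

corpus = ["the prime minister announced a new scheme",
          "buy cheap flights online today"]
kept = filter_in_domain(corpus)  # only the first sentence survives
```

The same filter is applied to both the monolingual corpora (before backtranslation) and, experimentally, the parallel ones.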
Domain adaptation by language model (LM) - As the second approach to selecting a subset of the most PMI-like sentences from the monolingual general-domain AI4Bharat corpora available for the task, we used the approach described in (Axelrod et al., 2011). For each of the 10 Indian languages two RNN language models were built with the Marian toolkit: an in-domain model trained on the respective part of the PMI corpus, and an out-of-domain model trained on a similar number of lines drawn from a mix of all other corpora available for that language. All these models were regularized with exponential smoothing of 0.0001 and dropout of 0.2, along with source and target word-token dropout of 0.1. To rank the sentences of the AI4Bharat mono corpus, we used the cross-entropy difference between the scores of the two models, as suggested in (Axelrod et al., 2011), normalized by the line length. Only sentences with a score above an arbitrarily chosen threshold were selected for further processing. We noticed a significant influence of domain adaptation when selecting the mono corpora used for backtranslation (see Table 3).
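The selection criterion can be sketched as follows. Tiny character-bigram LMs stand in for the Marian RNN LMs of the actual system; the score is the length-normalized difference of in-domain and out-of-domain log-probability (i.e. the negative of the cross-entropy difference), so higher means more PMI-like, matching the "above threshold" selection described above:

```python
import math
from collections import Counter

def train_char_bigram_lm(text: str):
    """Add-one-smoothed character-bigram LM; a stand-in for the
    RNN language models trained with Marian in the real setup."""
    bigrams = Counter(zip(text, text[1:]))
    unigrams = Counter(text)
    vocab = len(set(text)) + 1
    def logprob(s: str) -> float:
        return sum(math.log((bigrams[(a, b)] + 1) / (unigrams[a] + vocab))
                   for a, b in zip(s, s[1:]))
    return logprob

def domain_score(sentence, lm_in, lm_out):
    """Length-normalized log-probability difference; higher = more in-domain."""
    return (lm_in(sentence) - lm_out(sentence)) / max(len(sentence), 1)

lm_in = train_char_bigram_lm("the minister met the ministry delegation")
lm_out = train_char_bigram_lm("zzz qqq xxx random noise tokens")
threshold = 0.0  # arbitrary, as in the paper
candidates = ["the minister spoke", "qqq zzz xxx"]
selected = [s for s in candidates
            if domain_score(s, lm_in, lm_out) > threshold]
```

Only the selected lines are then passed to backtranslation.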

Multi-Agent Dual Learning
For some of the trainings, we used the simplified version of Multi-Agent Dual Learning (MADL) (Wang et al., 2019) proposed in Kim et al. (2019) to generate additional training data from the parallel corpus. We generated n-best translations of both the source and the target sides of the parallel data with strong ensembles of, respectively, the forward and the backward models. Next, we picked the best translation from among the n candidates w.r.t. the sentence-level BLEU score. Thanks to these steps, we tripled the number of sentences by combining three types of datasets:
1. original source - original target,
2. original source - synthetic target,
3. synthetic source - original target,
where the synthetic target is the translation of the original source with the forward model, and the synthetic source is the translation of the original target with the backward model.
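Reducing an n-best list to a single synthetic sentence can be sketched as below; the simplified sentence-level BLEU (n-grams up to order 2, add-one smoothing, brevity penalty) is a self-contained stand-in for whatever BLEU implementation the real pipeline used:

```python
import math
from collections import Counter

def simple_sentence_bleu(hyp: str, ref: str, max_n: int = 2) -> float:
    """Simplified smoothed sentence-level BLEU with brevity penalty."""
    hyp_t, ref_t = hyp.split(), ref.split()
    if not hyp_t:
        return 0.0
    log_prec = 0.0
    for n in range(1, max_n + 1):
        h = Counter(tuple(hyp_t[i:i+n]) for i in range(len(hyp_t) - n + 1))
        r = Counter(tuple(ref_t[i:i+n]) for i in range(len(ref_t) - n + 1))
        overlap = sum(min(c, r[g]) for g, c in h.items())
        total = max(sum(h.values()), 1)
        log_prec += math.log((overlap + 1) / (total + 1)) / max_n
    bp = min(1.0, math.exp(1 - len(ref_t) / len(hyp_t)))  # penalize short hyps
    return bp * math.exp(log_prec)

def pick_best(candidates, reference):
    """Select the n-best candidate closest to the reference sentence."""
    return max(candidates, key=lambda c: simple_sentence_bleu(c, reference))

nbest = ["the cat sat on the mat", "a cat is on a mat", "dogs run fast"]
best = pick_best(nbest, "the cat sat on the mat")
```

Applied with the forward ensemble this yields the synthetic targets, and with the backward ensemble the synthetic sources.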

Postprocessing
Compared to our competitors, we noticed significantly weaker performance in the En-Or direction. After analysis we found that the generated corpora contained character sequences (U+0B2F-U+0B3C, U+0B5F) not present in the devset corpora. Replacing these sequences with the sequence (U+0B5F-U+0B3E) gave a significant improvement of about +4 BLEU for En-Or.
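The fix is a plain sequence substitution over the decoder output. A minimal sketch; the single table entry below follows one reading of the sequences described above and is an assumption, not the paper's exact mapping:

```python
# Assumed mapping: Oriya YA + NUKTA (U+0B2F U+0B3C), as emitted by the
# model, rewritten to the devset-style sequence U+0B5F U+0B3E.
REPLACEMENTS = {
    "\u0B2F\u0B3C": "\u0B5F\u0B3E",
}

def postprocess(line: str) -> str:
    """Normalize character sequences absent from the devset corpora."""
    for bad, good in REPLACEMENTS.items():
        line = line.replace(bad, good)
    return line
```

The same table-driven approach extends to any other script-specific cleanups discovered by comparing output and devset character inventories.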

NMT System Overview
All of our systems are trained with the Marian NMT (Junczys-Dowmunt et al., 2018) framework (github.com/marian-nmt/marian).

Baseline systems for preliminary experiments
The first experiments were performed with transformer models (Vaswani et al., 2017), which we will refer to as transformer-base. The only difference from the default configuration is that we used 8 encoder layers and 4 decoder layers instead of 6-6. The model has the default embedding dimension of 512 and a feed-forward layer dimension of 2048. We also used layer normalization (Ba et al., 2016) and tied the weights of the target-side embedding and the transpose of the output weight matrix, as well as the source- and target-side embeddings (Press and Wolf, 2017). Optimizer delay was used to simulate batches of size up to 200GB. Adam (Kingma and Ba, 2017) was used as the optimizer, with a learning rate of 0.0003 and linear warm-up for the initial 48,000 updates with subsequent inverse square root decay. No dropout was applied.
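The baseline above corresponds roughly to a Marian training configuration like the following sketch. Option names follow Marian's CLI; any value not stated in the text (e.g. the optimizer-delay factor) is an assumption:

```yaml
# Sketch of a Marian config for the transformer-base setup described above.
type: transformer
enc-depth: 8                 # 8 encoder layers instead of the default 6
dec-depth: 4                 # 4 decoder layers instead of the default 6
dim-emb: 512
transformer-dim-ffn: 2048
tied-embeddings-all: true    # tie source, target and output embeddings
optimizer: adam
learn-rate: 0.0003
lr-warmup: 48000             # linear warm-up
lr-decay-inv-sqrt: 48000     # inverse square root decay afterwards
optimizer-delay: 16          # assumed value; accumulates gradients to
                             # simulate very large batches
```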

Final configuration
After the first experiments, further trainings were performed on a transformer-big model. It has larger dimensions than the transformer-base: an embedding dimension of 1024 and a feed-forward layer dimension of 4096. The transformer-big trainings were regularized with a dropout of 0.1 between transformer layers and a label smoothing of 0.1, unlike the transformer-base, which was trained without dropout.

Preliminary trainings
During the preliminary trainings, we tested which of the filtering/backtranslation/MADL techniques work best for the task. The preliminary trainings were performed for all 20 directions on a single transformer-base model with no dropout. There was no clear answer as to which technique works best. Generally, adding the CCAligned corpus worsened the results; training only on the big CCAligned corpus (15M lines) gave results similar to training on the small PMIndia corpus (300k lines). For further trainings we decided to use the most promising techniques: filtered backtranslation (both the fastText and the language-model method) and MADL.
The preliminary training of one transformer-base model lasted 50 hours (13 epochs) on two V100 GPUs. A summary of the preliminary results is gathered in Table 1.

Pretraining with backtranslations
For the final trainings we prepared various corpora with backtranslations filtered with domain-transfer. We applied the two methods of domain-transfer described in the previous sections: fastText and language model. The trainings were performed on separate transformer-big models: one many-to-one model for the 10 to-English directions and one one-to-many model for the 10 from-English directions. The whole pretraining of one transformer-big model lasted 200 hours (8 epochs) on four V100 GPUs. Further tunings took an additional 20 hours of processing.

Tuning with bitext
The two best pretrained models with domain-transfer (LM-filtered and FT-filtered) were the baselines from which tuning with the parallel corpora started. During the bitext tuning we used all the bilingual data provided by the organizers except the CCAligned corpus, 11M sentences in total. Tuning the baselines with the original parallel corpora improved the average BLEU of the pretrained models by 0.97-1.85 BLEU (see Table 3).

Finetuning with PMIndia
We made several attempts to finetune the final models with different corpora; the resulting models were then taken into the process of mixing the best ensemble.

Ensembling
To further boost the translation quality, we used ensembles of models during decoding. Two separate ensembles were formed and tuned: one for transliterated Indian to English, the other for the opposite direction. Each ensemble consisted of a number of neural translation models derived from various stages of training and model tuning (up to as many as 9 NMT models were used during weight optimization) and a single neural language model, either English or common Indian (based on all languages, transliterated into Hindi), depending on the direction. The tuning of the ensemble weights was performed on the development set and consisted of the following stages:
• Expectation-Maximization of the posterior emission probability for a mixture of models (Kneser and Steinbiss, 1993), based on NMT log-scores of development sentence pairs obtained using marian-score; this procedure, as well as being fast due to not requiring actual decoding, also worked well in practice, despite being based on interpolation in the linear probability domain, as opposed to the log-domain interpolation used in Marian;
• tuning the individual weights of the ensemble (a bisectioning procedure, performed for a limited number of iterations, with the weights tuned in arbitrary order), based on BLEU scores of the translated development set (before normalization and tokenization);
• (only for Indian-to-English) a sweep of the normalization factor, also based on BLEU.
Individual tuning for the target languages of the English-to-Indian directions was originally planned but was eventually not used for the submission, mostly due to lack of time; moreover, visual inspection of the partial results showed that some weights varied wildly, so devset over-fitting could be suspected at this point. The normalization-factor optimization was planned to follow the aforementioned optimization, so it was consequently also skipped for the English-to-Indian directions. Post-submission tests showed an average improvement of ca. 0.2 BLEU when using tuning for individual Indian target languages, but the gain was strongly dominated by the improvement on a single direction (En→Hi).
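The first stage, estimating mixture weights by EM from per-sentence model scores without any decoding, can be sketched as follows (interpolation in the linear probability domain, as described above; the input scores are assumed to be probabilities, e.g. exponentiated length-normalized NMT log-scores):

```python
def em_mixture_weights(probs, iters=50):
    """probs[m][k]: probability of development sentence-pair m under
    model k. Returns mixture weights estimated by EM; each iteration
    computes posterior responsibilities and re-normalizes them."""
    n_models = len(probs[0])
    w = [1.0 / n_models] * n_models
    for _ in range(iters):
        counts = [0.0] * n_models
        for row in probs:
            denom = sum(w[k] * row[k] for k in range(n_models))
            for k in range(n_models):
                counts[k] += w[k] * row[k] / denom  # E-step: responsibilities
        w = [c / len(probs) for c in counts]        # M-step: re-estimate weights
    return w

# Toy scores: model 0 consistently assigns higher probability,
# so EM shifts the weight towards it.
scores = [[0.9, 0.3], [0.8, 0.2], [0.7, 0.4]]
weights = em_mixture_weights(scores)
```

The resulting weights then serve as the starting point for the BLEU-based bisectioning stage.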
We experimented with several beam sizes, increasing it up to 40. For the final submission we chose a size of 16; a larger beam gave little or no improvement at the cost of slowing down the decoding. For very large ensembles of 10 big models, decoding the whole devset for 10 directions (10k lines) takes about 25 minutes on a single V100 GPU. Table 4 presents the impact of tuning on the BLEU scores on both the devset and the testset, in relation to a few manually selected setups, namely the best single model, the uniform ensemble and the expert-selected "50-25-25%" ensemble. The final weight selection improved the translation of the Indian-to-En directions by ca. 0.5 BLEU compared to the expert ensemble, or ca. 1.2 BLEU compared to the best single model; in the En-to-Indian directions, the improvement was <0.1 BLEU or ca. 0.5 BLEU, respectively. The results on the testset differ slightly from our final submissions since, during the ensemble tuning, we used a simplified BLEU calculation algorithm (before normalization and tokenization).

Final Results
The detailed results of each stage of the best branch of trainings are gathered in Tables 5 and 6. The ensemble values are the submission evaluation results provided by the organizers.
Tables 7 and 8 contain the results of the models submitted by SRPOL compared with the best results of our competitors. The tables present the values provided by the WAT2021 organizers, calculated with 3 different metrics: BLEU (Papineni et al., 2002), RIBES (Isozaki et al., 2010) and AMFM (Banchs and Li, 2011). Figure 1 shows the results of the human evaluation. The figure presents the values provided by the WAT2021 organizers, showing a significant advantage over the competitors. In particular, the number of bad translations (scored 1-2) was significantly reduced.

English → Indian
The application of all the techniques to the En→In directions gave an overall improvement of 3.6 BLEU, from a baseline average of 18.8 to a final 22.4 BLEU. Adding non-filtered backtranslations gave no improvement, probably because the general Indian mono corpus is too different from the specific language used in PMIndia. However, after domain adaptation of the training corpus we gained an improvement of 1 BLEU. Most of the improvement came from finetuning on the parallel corpora (1.5 BLEU) and the PMI corpora (0.6 BLEU). The final ensembling process gave an average improvement of 0.5 BLEU.

Indian → English
The application of all the techniques to the In→En directions gave an overall improvement of 8 BLEU, from a baseline average of 31.8 to a final 39.8 BLEU. Adding non-filtered backtranslations gave a 1.2 BLEU improvement, but most of the improvement came from domain adaptation, which gave a surprisingly high gain of 3.6 BLEU. Further improvement came from finetuning on the parallel corpora (1.9 BLEU) and the PMI corpora (0.3 BLEU). The final ensembling process gave an additional improvement of 1.0 BLEU.

Conclusions
We presented an effective approach to low-resource training consisting of pretraining on backtranslations and tuning on parallel corpora. We successfully applied domain-adaptation techniques which significantly improved the translation quality as measured by BLEU. We also presented an effective approach to finding the best hyperparameters for ensembling a number of single translation models. We applied transliteration, but the final results did not confirm that this approach is effective, at least for this particular task.
We tried several filtering techniques for the parallel corpora, but the results showed no improvement. This may be a confirmation that the parallel corpora provided by the competition organizers are of a high quality that is hard to improve.
Probably for the same reason, domain adaptation on the parallel corpora did not improve the results. However, domain adaptation worked surprisingly well for monolingual corpora.