Rakuten’s Participation in WAT 2021: Examining the Effectiveness of Pre-trained Models for Multilingual and Multimodal Machine Translation

This paper introduces our neural machine translation systems' participation in the WAT 2021 shared translation tasks (team ID: sakura). We participated in the (i) NICT-SAP, (ii) Japanese-English multimodal translation, (iii) Multilingual Indic, and (iv) Myanmar-English translation tasks. Multilingual approaches such as mBART (Liu et al., 2020) pre-train a complete sequence-to-sequence model with multilingual denoising objectives, making them a strong starting point for building multilingual translation systems. Our main focus in this work is to investigate the effectiveness of multilingual finetuning of such a pre-trained model on various translation tasks, including low-resource, multimodal, and mixed-domain translation. We further explore a multimodal approach based on universal visual representation (Zhang et al., 2019) and compare its performance against a unimodal approach based on mBART alone.


Introduction
This paper introduces our neural machine translation (NMT) systems' participation in the 8th Workshop on Asian Translation (WAT-2021) shared translation tasks (Nakazawa et al., 2021). We participated in the (i) NICT-SAP's IT and Wikinews, (ii) Japanese-English multimodal translation, (iii) Multilingual Indic, and (iv) Myanmar-English translation tasks.
Recent advances in language model pre-training have been successful in advancing the state of the art in various natural language processing tasks. Multilingual approaches such as mBART (Liu et al., 2020) are capable of pre-training a full sequence-to-sequence model through multilingual denoising objectives, which leads to significant gains in downstream tasks such as machine translation. Building upon our success with mBART25 in the 2020 edition of WAT (Wang and Htun, 2020), we put more focus on multilingual and multimodal translation this year. In particular, instead of performing bilingual finetuning on mBART for each language pair, we train a single, multilingual NMT model that is capable of translating between multiple languages at once. As first proposed by Tang et al. (2020), we apply multilingual finetuning to mBART50 for the NICT-SAP task (involving 4 Asian languages) and the Multilingual Indic task (involving 10 Indic languages). Our findings show the remarkable effectiveness of mBART pre-training on these tasks. On the Japanese-English multimodal translation task, we compare a unimodal text-based model, initialized from mBART, with a multimodal approach based on universal visual representation (UVR) (Zhang et al., 2019). Last, we continue our work on Myanmar-English translation by experimenting with more extensive data augmentation approaches. Our main findings for each task are summarized in the following:
• NICT-SAP task: We exploited mBART50 to improve low-resource machine translation in the news and IT domains by finetuning it into a mixed-domain, multilingual NMT system.
• Multimodal translation: We investigated multimodal NMT based on UVR in the constrained setting, as well as a unimodal text-based approach with the pre-trained mBART model in the unconstrained setting.
• Multilingual Indic task: We used the pre-trained mBART50 models, extended them for additional Indic languages, and finetuned them on the entire training corpus, followed by finetuning on the PMI dataset.

• Myanmar-English translation: We designed contrastive experiments with different data combinations for Myanmar↔English translation and validated the effectiveness of data augmentation for low-resource translation tasks.

Table 1: Statistics of the NICT-SAP datasets (number of sentence pairs). Each language is paired with English; the ALT corpus has a single, common size across language pairs.

Split  Domain       hi        id        ms        th
Train  ALT                  18,088 (all pairs)
Train  IT       254,242   158,472   506,739    74,497
Dev    ALT                   1,000 (all pairs)
Dev    IT         2,016     2,023     2,050     2,049
Test   ALT                   1,018 (all pairs)
Test   IT         2,073     2,037     2,050     2,050

Task Description
This year, we participated in the NICT-SAP translation task, which involves two different domains: the IT domain (software documentation) and the Wikinews domain (ALT). Both are considered low-resource domains for machine translation, and the task additionally involves four low-resource Asian languages: Hindi (hi), Indonesian (id), Malay (ms), and Thai (th). For training, we use parallel corpora from the Asian Language Treebank (ALT) (Thu et al., 2016) for the Wikinews domain and from OPUS (GNOME, KDE4, and Ubuntu) for the IT domain. For development and evaluation, we use the datasets provided by the organizers: the SAP software documentation corpus (Buschbeck and Exel, 2020) and the ALT corpus. Table 1 shows the statistics of the datasets.

Data Processing
We tokenized our data using the 250,000-token SentencePiece model (Kudo and Richardson, 2018) from mBART (Liu et al., 2020), a joint vocabulary trained on monolingual data covering the 100 languages of XLM-R (Conneau et al., 2020). Moreover, we prepended each source sentence with a domain indicator token to distinguish the ALT (<2alt>) and IT (<2it>) domains.
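As a rough illustration of this preprocessing step, the sketch below tokenizes a sentence with mBART's SentencePiece model and prepends a domain indicator token; the model file name and the helper function are illustrative choices, not our exact preprocessing scripts.

```python
import sentencepiece as spm

# Load the 250k-vocabulary SentencePiece model released with mBART
# (assumed to be available locally as "sentence.bpe.model").
sp = spm.SentencePieceProcessor(model_file="sentence.bpe.model")

def preprocess_line(text: str, domain: str) -> str:
    """Tokenize a source sentence and prepend its domain indicator token."""
    pieces = sp.encode(text, out_type=str)              # subword tokenization
    domain_tag = "<2alt>" if domain == "alt" else "<2it>"
    return " ".join([domain_tag] + pieces)

print(preprocess_line("Click the Save button.", domain="it"))
```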
We collect parallel corpora from all the language pairs involved in this task, namely {hi,id,ms,th}↔en. Following mBART, we prepend source and target language tokens to each source and target sentence, respectively. The size of each dataset varies across language pairs. For instance, the Malay training corpus for the IT domain is roughly 5× larger than the Thai one. To address this data imbalance, we train our model with a temperature-based sampling function following Arivazhagan et al. (2019):

$$ p_{i,j} \propto \left( \frac{|B_{i,j}|}{\sum_{k,l} |B_{k,l}|} \right)^{1/T}, $$

where $B_{i,j}$ corresponds to the parallel corpus for a language pair $(i, j)$ and $T$ is the temperature for sampling.
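The following minimal sketch illustrates this sampling scheme with the IT-domain corpus sizes from Table 1 (the temperature value is illustrative): each pair is drawn with probability proportional to its corpus size raised to 1/T, so larger temperatures flatten the distribution toward the smaller corpora.

```python
import random

# IT-domain training sizes (sentence pairs), taken from Table 1.
corpus_sizes = {("hi", "en"): 254_242, ("id", "en"): 158_472,
                ("ms", "en"): 506_739, ("th", "en"): 74_497}

def sampling_probs(sizes: dict, temperature: float) -> dict:
    """p(i, j) proportional to |B_ij|^(1/T); T=1 gives proportional sampling,
    larger T moves toward uniform sampling over language pairs."""
    weights = {pair: n ** (1.0 / temperature) for pair, n in sizes.items()}
    total = sum(weights.values())
    return {pair: w / total for pair, w in weights.items()}

probs = sampling_probs(corpus_sizes, temperature=5.0)   # illustrative T
pairs, p = zip(*probs.items())
sampled_pair = random.choices(pairs, weights=p, k=1)[0]
print(probs, sampled_pair)
```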

Model
We use the pre-trained mBART50 model (Tang et al., 2020) as the starting point for finetuning our translation systems. Unlike the original mBART work, which performed bilingual finetuning (Liu et al., 2020), Tang et al. (2020) proposed multilingual finetuning, where the mBART model is finetuned on many directions at the same time, resulting in a single model capable of translating between many languages. Besides its training-efficiency and storage benefits, such an approach greatly helps low-resource language pairs for which little to no parallel data is available. Although mBART50 covers 50 languages, it does not include all languages involved in this task, in particular Malay. Following Tang et al. (2020), who extended mBART25 to create mBART50, we extended mBART50's embedding layers with one additional randomly initialized vector for the Malay language token. We use the same model architecture as mBART50, which is based on the Transformer (Vaswani et al., 2017). The model was finetuned for 40,000 steps with Adam (Kingma and Ba, 2015) using β1 = 0.9, β2 = 0.98, and ε = 1e-6. We used a maximum batch size of 512 tokens, and gradients were accumulated over 4 mini-batches on each GPU. We ran our experiments on 4 NVIDIA Quadro RTX 6000 GPUs. Table 2 shows the details of our experimental settings.
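The embedding extension can be sketched in PyTorch roughly as follows; the function name and toy dimensions are our own illustrative choices rather than the actual fairseq code.

```python
import torch
import torch.nn as nn

def extend_embedding(old_emb: nn.Embedding, num_new_tokens: int) -> nn.Embedding:
    """Return a copy of `old_emb` with extra randomly initialized rows appended,
    e.g. one new row for the Malay language token that mBART50 lacks."""
    old_vocab, dim = old_emb.weight.shape
    new_emb = nn.Embedding(old_vocab + num_new_tokens, dim,
                           padding_idx=old_emb.padding_idx)
    with torch.no_grad():
        new_emb.weight[:old_vocab] = old_emb.weight            # keep pre-trained rows
        nn.init.normal_(new_emb.weight[old_vocab:], std=0.02)  # new language token(s)
    return new_emb

# Toy example: a 10-token embedding grows by one row for the new language tag.
extended = extend_embedding(nn.Embedding(10, 16, padding_idx=1), num_new_tokens=1)
print(extended.weight.shape)  # torch.Size([11, 16])
```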

Results

Table 3 and Table 4 show our experimental results in terms of BLEU (Papineni et al., 2002) and AM-FM (Banchs et al., 2015) scores, respectively. We first report multilingual finetuning results on the released mBART50 model (Tang et al., 2020; available at https://github.com/pytorch/fairseq/tree/master/examples/multilingual), which was pre-trained as a denoising autoencoder on the monolingual data from XLM-R (Conneau et al., 2020) (mBART50 -pre-trained). Compared to a submission from the previous year's WAT by Dabre and Chakrabarty (2020), a multilingual many-to-many model without any pre-training, we observe significant improvements from multilingual finetuning across all language pairs in both domains. For instance, we obtain the largest improvement of 25.23 BLEU points for id→en in the ALT domain. These findings clearly show that multilingual models benefit greatly from pre-training compared to being trained from scratch, and more so for low-resource languages.

Second, Tang et al. (2020) released a many-to-many multilingual translation model that was finetuned from mBART on publicly available parallel data for 50 languages, covering all language pairs in this task except Malay. We adapt this model by further finetuning it on the NICT-SAP dataset (mBART50 -ft.nn). On average, this model further improves BLEU by 2.37 points on ALT and 1.98 points on IT.
Finally, we trained three independent models with different random seeds and performed ensemble decoding. This is our final submission, which achieved first place in AM-FM scores on this year's leaderboard for 7 translation directions on ALT (all except en→ms) and 6 directions on IT (all except en→hi and en→id).
For the human evaluation on the IT task, our systems obtained adequacy scores of 4.24 for en→id and 4.05 for en→ms, the highest among all participants this year. We refer readers to the overview paper (Nakazawa et al., 2021) for the complete evaluation results.

Task Description
Multimodal neural machine translation (MNMT) has recently received increasing attention in NLP research with the advent of visually grounded parallel corpora. The motivation of the Japanese↔English multimodal task is to improve translation performance with the aid of heterogeneous information (Nakazawa et al., 2020). In particular, we performed our experiments on the benchmark Flickr30kEnt-JP dataset, which adds manual Japanese translations to the Flickr30k Entities image captioning dataset (Plummer et al., 2015), consisting of 29,783 training images and 1,000 validation images. For each image, the original Flickr30k provides five English sentences, and the extended Flickr30kEnt-JP provides the corresponding Japanese translations in parallel.
In terms of input sources, this multimodal task is divided into four sub-tasks: constrained and unconstrained Japanese↔English translation. In the constrained setting, we investigated MNMT models with universal visual representation (UVR) (Zhang et al., 2019), obtained from a pre-trained bottom-up attention model (Anderson et al., 2018). In the unconstrained setting, we instead explored unimodal translation (i.e., text modality only), employing the pre-trained mBART25 model (Liu et al., 2020) as the external resource.

Data Processing
Text preparation For the constrained setting, we first tokenized the Japanese text with the Juman analyzer and the English text with the Moses tokenizer. We then trained a byte-pair encoding (BPE) model with subword-nmt (Sennrich et al., 2016), using a vocabulary size of 40,000; the source and target sentences were merged to train a joint vocabulary for the NMT systems. Under the unconstrained setting, we used the same 250,000-token vocabulary as the pre-trained mBART model for the text input to mBART finetuning, tokenized with its BPE-based SentencePiece model (Kudo and Richardson, 2018).
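A rough sketch of this step using the subword-nmt Python API is shown below; the file names are placeholders, and the exact training corpus and options we used may differ.

```python
from subword_nmt.learn_bpe import learn_bpe
from subword_nmt.apply_bpe import BPE

# Learn a joint BPE model on the concatenated, pre-tokenized Ja+En training text.
with open("train.tok.ja-en.txt", encoding="utf-8") as infile, \
     open("bpe.codes", "w", encoding="utf-8") as outfile:
    learn_bpe(infile, outfile, num_symbols=40000)

# Apply the learned merges to a tokenized sentence.
with open("bpe.codes", encoding="utf-8") as codes:
    bpe = BPE(codes)
print(bpe.process_line("a man is riding a bicycle down the street ."))
```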
Universal visual retrieval For the constrained setting, we extracted pre-computed global image features from the raw Flickr30k images using the bottom-up attention Faster R-CNN object detector pre-trained on the Visual Genome dataset (Krishna et al., 2017). Specifically, we adopted the pre-trained model to extract spatial image features for 36 bounding-box regions per image, which were then encoded into a single global image feature vector by global average pooling. In practice, we followed Zhang et al. (2019) and built the UVR from image-monolingual annotations (i.e., source sentences). To retrieve universal visual information for the source sentences, we transformed the sentence-image pairs of the Flickr30kEnt-JP dataset into two topic-image lookup tables, one for the Japanese→English task and one for the English→Japanese task. Note that our unconstrained models are text-only and learn no image information.
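The sketch below illustrates the two operations described above under our assumptions: averaging the 36 region features of an image into one global vector, and building a topic-image lookup table keyed by content words of the source captions (extract_topics is a stand-in for whatever keyword extraction is actually used).

```python
import numpy as np
from collections import defaultdict

def global_image_feature(region_feats: np.ndarray) -> np.ndarray:
    """Average pre-extracted bottom-up-attention features (36 regions x 2048 dims)
    into a single global image vector."""
    assert region_feats.shape == (36, 2048)
    return region_feats.mean(axis=0)

def build_topic_image_table(pairs, extract_topics):
    """Map each topic word in a source caption to the global features of the
    images it co-occurs with (the 'topic-image lookup table')."""
    table = defaultdict(list)
    for caption, region_feats in pairs:
        feat = global_image_feature(region_feats)
        for topic in extract_topics(caption):
            table[topic].append(feat)
    return table

# Toy usage with random features and a naive topic extractor (longer words only).
toy_pairs = [("a dog runs on the beach", np.random.rand(36, 2048))]
table = build_topic_image_table(toy_pairs, lambda s: [w for w in s.split() if len(w) > 3])
print(list(table.keys()))
```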

Model
In this section, we will elaborate on our proposed model architectures for the constrained and unconstrained tasks, respectively.

Multimodal model with UVR
Following Zhang et al. (2019), we built the multimodal models on the standard Transformer (Vaswani et al., 2017) with an additional cross-attention layer in the encoder, followed by a gating mechanism that fuses the visual and textual information. In particular, the visual representation retrieved from the topic-image lookup table is encoded by a self-attention network in parallel with the source sentence encoder. A cross-attention mechanism is then applied to attend from the text representation to the image representation. Using a learnable gating weight λ ∈ [0, 1), we obtain an aggregated multimodal representation that reflects the relative importance of each modality, which is used as input to the decoder for predicting the target translation. The hyper-parameter settings are shown in Table 4.

mBART25 finetuning
For the unconstrained setting, we also finetuned the pre-trained mBART25 model on the Flickr30kEnt-JP corpus, without any image representation. Following Liu et al. (2020), we used the released mBART25-large model and finetuned it for 40,000 steps with early stopping if the validation loss did not improve for 3 consecutive validations. We used a learning rate of 0.001 and a maximum of 4,000 tokens per batch, where the parameters were updated every 2 epochs. More details of the model hyper-parameters can be found in Table 4.
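A minimal PyTorch sketch of the gating step described above, under our reading of Zhang et al. (2019); the tensor shapes and the sigmoid parameterization of the gate are our assumptions, not the exact published implementation.

```python
import torch
import torch.nn as nn

class GatedVisualFusion(nn.Module):
    """Fuse text encoder states with attended image features via a learned gate."""
    def __init__(self, d_model: int, n_heads: int = 8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.gate = nn.Linear(2 * d_model, 1)

    def forward(self, text_states, image_states):
        # text_states: (batch, src_len, d_model); image_states: (batch, n_img, d_model)
        attended, _ = self.cross_attn(text_states, image_states, image_states)
        # One gating scalar per source position, squashed into (0, 1) by a sigmoid.
        lam = torch.sigmoid(self.gate(torch.cat([text_states, attended], dim=-1)))
        return (1 - lam) * text_states + lam * attended  # multimodal encoder output

fusion = GatedVisualFusion(d_model=512)
out = fusion(torch.randn(2, 7, 512), torch.randn(2, 4, 512))
print(out.shape)  # torch.Size([2, 7, 512])
```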
We trained the MNMT models and finetuned the mBART25 models using the Fairseq toolkit (Ott et al., 2019) on 4 V100 GPUs. Finally, the best performing models on the validation sets were selected and applied for decoding the test sets. Furthermore, we trained three independent models with different random seeds to perform ensemble decoding.

Results
In Table 5, we show the evaluation scores achieved by the multimodal NMT models with universal visual representation and by the mBART25 finetuning models. In the constrained setting (task (a)), the single-model decoding results (MNMT sin.) consistently lagged behind ensemble decoding (MNMT ens.) in both directions. Without any resources other than the pre-trained image features, our best MNMT submissions with UVR won first place in both BLEU and human adequacy scores on the WAT leaderboard for the Japanese→English task (a). Moreover, the MNMT ens. model outperforms the mBART25 finetuning model (mBART sin.), which uses external models/embeddings, by 0.17 BLEU in the English→Japanese task (a), which validates the effectiveness of exploiting visual information for machine translation.
Under the unconstrained setting, the text-only mBART sin. models achieved significant improvements.

Task Description
The Multilingual Indic task covers English (en) and 10 Indic (in) languages: Bengali (bn), Gujarati (gu), Hindi (hi), Kannada (kn), Malayalam (ml), Marathi (mr), Oriya (or), Punjabi (pa), Tamil (ta), and Telugu (te). Multilingual solutions spanning all 20 en↔in translation directions were encouraged, in the form of many2many, one2many, and many2one models. We train a one2many model for the en→in directions and a many2one model for the in→en directions. We use the parallel corpora provided by the organizer for training, validation, and evaluation. Table 6 shows the statistics of the entire training data and of the PMI dataset (Haddow and Kirefu, 2020).

Data Processing
We normalize all Indic-language data using the Indic NLP Library version 0.71. After that, we use the 250,000-token SentencePiece model from mBART and prepend source and target language tokens to each source and target sentence, respectively. We then binarize the data with the Fairseq (Ott et al., 2019) framework. Following Section 2.2, we also train with temperature-based sampling to address dataset imbalance.
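For illustration, the normalization step can be performed with the Indic NLP Library roughly as follows; the sample sentence and the choice of Hindi are our own, and this is a sketch rather than our exact preprocessing script.

```python
from indicnlp.normalize.indic_normalize import IndicNormalizerFactory

# One normalizer per Indic language; Hindi ("hi") shown here as an example.
factory = IndicNormalizerFactory()
normalizer = factory.get_normalizer("hi")

text = "यह एक उदाहरण वाक्य है।"
print(normalizer.normalize(text))  # canonicalizes Unicode variants before tokenization
```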

Model
Similar to our use of the pre-trained mBART50 model in Section 2.3, we apply multilingual finetuning and extend the model for Oriya, Punjabi, and Kannada using randomly initialized vectors. We use the same model architecture as mBART50 and run Adam optimization using β1 = 0.9, β2 = 0.98, and ε = 1e-6. We use a maximum batch size of 512 tokens, and gradients were accumulated every 4 mini-batches on each GPU. We ran our experiments on 8 NVIDIA V100 GPUs. Table 2 shows the details of our experimental settings.
We finetune a one2many model from the pre-trained mBART50 (mBART50 -ft.1n) for en→in on the entire training set for six epochs. We further adapt this model on the PMI dataset, given as part of the training set, for nine epochs. Similarly, we finetune a many2many model from the pre-trained mBART50 (mBART50 -ft.nn) for in→en on the entire training set for six epochs, with adaptation on the PMI dataset for one epoch.

Results

Table 7 shows our experimental results in terms of BLEU scores. As a baseline, we compare our models with the organizer's bilingual base Transformer model trained on the PMI dataset (ORGANIZER). We observe an average improvement of 7.4 BLEU points over this baseline across all en→in pairs by finetuning the mBART50 -ft.1n model for 6 epochs. Further adaptation on the PMI dataset for 12 epochs results in an average improvement of 1.6 BLEU points. For en→ml, however, we observe a drop from 10.94 to 8.13 BLEU on adaptation. Similarly, we observe an average improvement of 15.76 BLEU points over the baseline across all in→en pairs by finetuning the mBART50 -ft.nn model for 4 epochs. Further adaptation on the PMI dataset for a single epoch results in an additional average improvement of 0.88 BLEU points. Tables 8 and 9 show the official AM-FM and human evaluation results (top three systems for ten translation directions), respectively. Our systems ranked second 6 times out of the 10 directions for which human evaluation results are available, while SRPOL consistently outperformed all other systems. This demonstrates the efficacy of using mBART models for multilingual translation. Complete evaluation results are available in the overview paper (Nakazawa et al., 2021).

Myanmar-English Translation Task

Task Description
In the ALT+ tasks, we conducted experiments on the Myanmar-English parallel data provided by the organizers (http://lotus.kuee.kyoto-u.ac.jp/WAT/my-en-data/), which consists of two corpora: the ALT corpus (Ding et al., 2019, 2020) and the UCSY corpus.

Data Processing
For the ALT+ tasks, the ALT and UCSY training datasets were merged first. For cleaning, we removed redundant whitespace and double quotation marks. We tokenized English sentences using Moses (Koehn et al., 2007) and Myanmar sentences using the Pyidaungsu Myanmar tokenizer (https://github.com/kaunghtetsan275/pyidaungsu) with syllable- and word-level segmentation, which were then fed into a SentencePiece model to produce subword units. Slightly different from our previous approach (Wang and Htun, 2020), we generated three English datasets of different types: (i) original, (ii) clean, and (iii) clean and tokenized. For Myanmar, we have four types: (i) original, (ii) clean, (iii) word-level tokenized, and (iv) syllable-level tokenized. Table 10 describes the resulting datasets under the different preprocessing steps.

Model
For training, we generated multiple training datasets using different combinations of the datasets in Table 10, including:

• D3 = {P1, P3, P4, P6, P7}
• D4 = {P3, P4, P6, P7}

For both directions on each dataset, we trained individual Transformer models using the Marian toolkit. We created two different parameter configurations, as shown in Table 11. We used the first configuration (Config. 1) on D1 and the second configuration (Config. 2) on the rest (D2, D3, and D4). Note that the second configuration has a larger vocabulary size and increased regularization (dropout and label smoothing). All models in this task were trained on 3 GP104 machines with 4 GeForce GTX 1080 GPUs each; the experimental results are shown and analyzed in the following section.

Results

Table 12 presents the results of our experiments on the ALT test set for both directions. As our baseline, we trained on the original training set (D1) without further preprocessing, using the first model configuration. After applying data augmentation, we observed consistent improvements.

Conclusion
In this paper, we presented our submissions (team ID: sakura) to the WAT 2021 shared translation tasks. We showed the remarkable effectiveness of pre-trained models in improving multilingual and multimodal neural machine translation. On multilingual translation, models initialized from mBART50 achieved substantial performance gains on both the NICT-SAP and Multilingual Indic tasks. On multimodal translation, a text-only model with mBART25 pre-training improved upon an MNMT model based on UVR. Finally, we extended our data augmentation approaches for the Myanmar-English translation tasks and obtained further improvements.