NICT-2 Translation System at WAT-2021: Applying a Pretrained Multilingual Encoder-Decoder Model to Low-resource Language Pairs

In this paper, we present the NICT system (NICT-2) submitted to the NICT-SAP shared task at the 8th Workshop on Asian Translation (WAT-2021). A feature of our system is that we used a pretrained multilingual BART (Bidirectional and Auto-Regressive Transformer; mBART) model. Because publicly available models do not support some languages in the NICT-SAP task, we added these languages to the mBART model and then trained it using monolingual corpora extracted from Wikipedia. We fine-tuned the expanded mBART model using the parallel corpora specified by the NICT-SAP task. The BLEU scores improved greatly in comparison with those of systems that did not use the pretrained model, including for the added languages.


Introduction
In this paper, we present the NICT system (NICT-2) that we submitted to the NICT-SAP shared task at the 8th Workshop on Asian Translation (WAT-2021) (Nakazawa et al., 2021). Because the NICT-SAP task requires translation with little parallel data, we developed a system that improves translation quality by applying the following models and techniques.
Pretrained model: An encoder-decoder model pretrained using huge monolingual corpora was used.
We used a multilingual bidirectional and auto-regressive Transformer (mBART) model (i.e., a multilingual sequence-to-sequence denoising autoencoder; Liu et al., 2020), which supports 25 languages. Because it includes English and Hindi but does not include Indonesian, Malay, and Thai, we expanded it to include the unsupported languages and additionally pretrained it on these five languages. (Note: the mBART-50 model (Tang et al., 2020) supports 50 languages, including Indonesian and Thai. However, Malay is not supported by either the mBART model or the mBART-50 model. Therefore, we applied additional pretraining to the mBART model.)
Multilingual models: We tested multilingual models trained using multiple parallel corpora to increase the resources for training.
Domain adaptation: We tested two domain adaptation techniques. The first technique is training multi-domain models. Similar to multilingual models, this technique trains a model using the parallel corpora of multiple domains; the domains are identified by domain tags in the input sentences. The second technique is adaptation based on fine-tuning, which fine-tunes each domain model (using its domain corpus) from a model trained on a mixture of multi-domain corpora.
Our experimental results showed that the pretrained encoder-decoder model was effective for translating low-resource language pairs. However, the effects of the multilingual models and domain adaptation diminished when we applied the pretrained model.
The rest of this paper is organized as follows. We first summarize the NICT-SAP shared task in Section 2 and briefly review the pretrained mBART model in Section 3. Details of our system are explained in Section 4. In Section 5, we present experimental results. Finally, we conclude the paper in Section 6.

NICT-SAP Shared Task
The NICT-SAP shared task was to translate text between English and four languages, namely Hindi (Hi), Indonesian (Id), Malay (Ms), and Thai (Th), for which relatively little parallel data is available. The task covered two domains.
The data in the Asian Language Translation (ALT) domain (Thu et al., 2016) consisted of translations obtained from WikiNews. The ALT data is a multilingual parallel corpus, that is, it contains the same sentences in all languages. The training, development, and test sets were provided by the WAT organizers.
The data in the IT domain consisted of translations of software documents. The WAT organizers provided the development and test sets (Buschbeck and Exel, 2020). For the training set, we obtained GNOME, KDE, and Ubuntu sub-corpora from the OPUS corpus (Tiedemann, 2012). Therefore, the domains for the training and dev/test sets were not identical.
The data sizes are shown in Table 1. There were fewer than 20K training sentences in the ALT domain, and between 73K and 504K training sentences in the IT domain. Note that the training sets contained noisy sentence pairs. We filtered out pairs that were longer than 512 tokens, or in which one side was more than three times longer than the other when the longer side had over 20 tokens, as sketched below.
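The following is a minimal sketch of this filter under our reading of the rules above; whitespace tokenization is an assumption here, as the actual system may count subword tokens instead.

```python
def keep_pair(src: str, tgt: str,
              max_len: int = 512, max_ratio: float = 3.0,
              ratio_min_len: int = 20) -> bool:
    """Return True if the sentence pair passes the corpus filter."""
    n_src, n_tgt = len(src.split()), len(tgt.split())
    # Rule 1: drop pairs in which either side exceeds 512 tokens.
    if n_src > max_len or n_tgt > max_len:
        return False
    # Rule 2: drop pairs in which one side is more than three times
    # longer than the other, applied only when the longer side has
    # over 20 tokens.
    longer, shorter = max(n_src, n_tgt), min(n_src, n_tgt)
    if longer > ratio_min_len and longer > max_ratio * max(shorter, 1):
        return False
    return True

# Usage: filtered = [(s, t) for s, t in pairs if keep_pair(s, t)]
```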

mBART Model
In this section, we briefly review the pretrained mBART model (Liu et al., 2020).
The mBART model is a multilingual model of bidirectional and auto-regressive Transformers (BART; Lewis et al., 2020). The model is based on the encoder-decoder Transformer (Vaswani et al., 2017), in which the decoder uses an autoregressive method (Figure 1).
The mBART model is trained on two BART tasks. One is the token masking task, which restores masked tokens in input sentences. The other is the sentence permutation task, which predicts the original order of permuted sentences. Both tasks are learned using monolingual corpora.
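As a rough illustration of the two noising tasks, consider the toy sketch below. It is a simplification, not the actual implementation: mBART masks contiguous spans of tokens rather than independent tokens, and the masking ratio and span lengths follow its own scheme.

```python
import random

def make_noised_input(sentences, mask_token="<mask>", mask_prob=0.35):
    """Toy noising: sentence permutation plus per-token masking."""
    # Sentence permutation: shuffle the sentence order; the model
    # learns to restore the original order.
    permuted = list(sentences)
    random.shuffle(permuted)
    # Token masking: replace a fraction of tokens with a mask token;
    # the model learns to reconstruct the original tokens.
    noised = [" ".join(mask_token if random.random() < mask_prob else tok
                       for tok in sent.split())
              for sent in permuted]
    return " ".join(noised)

original = ["I like reading books .", "The library opens at nine ."]
encoder_input = make_noised_input(original)  # noised text for the encoder
decoder_target = " ".join(original)          # the model must reconstruct this
```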
To build multilingual models based on BART, mBART supplies language tags (as special tokens) at the tail of the encoder input and the head of the decoder input. Using these language tags, an mBART model can learn multiple languages, for example:
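The sketch below shows the tag placement as described above for an Indonesian-to-English pair; the tag names ("id_ID", "en_XX") follow the published mBART vocabulary, and the whitespace tokenization is illustrative only.

```python
src, tgt = "Saya suka membaca buku .", "I like reading books ."
encoder_input = f"{src} </s> [id_ID]"  # language tag at the tail of the encoder input
decoder_input = f"[en_XX] {tgt} </s>"  # language tag at the head of the decoder input
```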
The published pretrained mBART model consists of a 12-layer encoder and a 12-layer decoder with a model dimension of 1,024 and 16 attention heads. This model was trained on 25 languages in the Common Crawl corpus (Wenzek et al., 2019). Of the languages for the NICT-SAP task, English and Hindi are supported by the published mBART model, but Indonesian, Malay, and Thai are not.
The tokenizer for the mBART model uses byte-pair encoding (Sennrich et al., 2016) of the SentencePiece model (Kudo and Richardson, 2018). We expanded the mBART model to support these three languages, and additionally pretrained the model on the five languages in the NICT-SAP task. The corpus for additional pretraining was extracted from Wikipedia dump files as follows. Unlike the XLM models (Lample and Conneau, 2019), which were also pretrained using Wikipedia corpora, we divided each article into sentences in our corpus to train the sentence permutation task. Additionally, we applied sentence filtering to clean each language.

[Figure 1: Example of mBART pretraining and fine-tuning for the machine translation task from Indonesian to English (arranged from Liu et al., 2020).]

1. We extracted texts from the dump files using WikiExtractor (https://github.com/attardi/wikiextractor) while applying the NFKC normalization of Unicode.
2. Sentence splitting was performed based on sentence end marks, such as periods and question marks. However, because Thai does not have explicit sentence end marks, we applied a neural network-based sentence splitter (Wang et al., 2019), which was trained using in-house data.
3. We selected valid sentences, which we regarded as sentences that consisted of five to 1,024 letters, of which at least 80% were included in the character set of the target language. In the case of Hindi, for example, we regarded a sentence as valid if 80% of the letters were in the set of Devanagari code points, digits, and spaces. (Steps 2 and 3 are sketched in the code after this list.)
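The following sketch roughly illustrates steps 2 and 3: a rule-based splitter for languages with explicit sentence-end marks (Thai instead requires the neural splitter mentioned above) and the character-set validity check, using Hindi as the example. The exact splitting rules and character sets of the actual pipeline are assumptions here.

```python
import re

# Step 2 (sketch): split after explicit sentence-end marks, including
# the Devanagari danda "।" (U+0964) used in Hindi. Thai has no such
# marks and needs the neural splitter instead.
def split_sentences(paragraph: str) -> list:
    return [s for s in re.split(r"(?<=[.?!\u0964])\s+", paragraph.strip()) if s]

# Step 3 (sketch): a sentence is valid if it has 5..1024 letters and at
# least 80% of them fall in the target language's character set
# (Devanagari code points, digits, and spaces for Hindi).
def is_valid_hindi(sent: str, threshold: float = 0.8) -> bool:
    if not (5 <= len(sent) <= 1024):
        return False
    ok = sum(1 for ch in sent
             if "\u0900" <= ch <= "\u097f" or ch.isdigit() or ch.isspace())
    return ok / len(sent) >= threshold
```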
The number of sentences for the mBART additional pretraining is shown in Table 2. We sampled 7M English sentences to balance the sizes of the other languages because the number of English sentences was disproportionately large (about 150M sentences).
We first expanded the word embeddings of the published mBART large model, initializing the new entries randomly, and then trained it. This is similar to the training procedure for mBART-50 (Tang et al., 2020), except for the corpora and hyperparameters. The settings for the additional pretraining are shown in Table 3. A minimal sketch of the embedding expansion follows.
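The sketch below shows one way to expand an embedding matrix with randomly initialized rows for the new-language subwords; the function name and the initialization scale are our assumptions, not the actual Fairseq internals.

```python
import torch

def expand_embeddings(old_emb: torch.nn.Embedding, n_new: int) -> torch.nn.Embedding:
    """Grow the vocabulary by n_new entries, keeping pretrained rows."""
    old_vocab, dim = old_emb.weight.shape
    new_emb = torch.nn.Embedding(old_vocab + n_new, dim)
    with torch.no_grad():
        new_emb.weight[:old_vocab] = old_emb.weight             # keep pretrained rows
        new_emb.weight[old_vocab:].normal_(mean=0.0, std=0.02)  # random init for new tokens
    return new_emb
```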
We conducted the additional pretraining using the Fairseq toolkit (Ott et al., 2019).

Other Options
We fine-tuned the pretrained model using the NICT-SAP parallel corpora shown in Table 1. For comparison without the pretrained model, we also trained Transformer base models (six layers, a model dimension of 512, and 8 attention heads). In addition to the effect of the pretrained models, we investigated the effects of multilingual models and domain adaptation.

Multilingual Models
Similar to the multilingual training of mBART, the multilingual model translated all the language pairs with a single model by supplying source and target language tags to the parallel sentences.
By contrast, bilingual models were trained using the corpora of each language pair. When we used the mBART model, we supplied source and target language tags to the parallel sentences even for the bilingual models.

Domain Adaptation
We tested two domain adaptation methods: multi-domain models and fine-tuning-based adaptation. Both methods utilize parallel data from other domains.
Similar to the multilingual models, we trained the multi-domain models by supplying domain tags (here, <__WN__> for the ALT domain and <__IT__> for the IT domain) at the head of the source-language sentences, for example:
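The sketch below shows the tagging step; the tag strings are those given in the text, while the function name and tokenization details are illustrative.

```python
def add_domain_tag(src_sentence: str, domain: str) -> str:
    tag = {"ALT": "<__WN__>", "IT": "<__IT__>"}[domain]
    return f"{tag} {src_sentence}"  # the tag is prepended to the source side only

print(add_domain_tag("The patch fixes a kernel bug .", "IT"))
# -> <__IT__> The patch fixes a kernel bug .
```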
The fine-tuning method did not use domain tags. First, a mixture model was trained using a mixture of the data from multiple domains. Next, domain models were fine-tuned from the mixture model using each set of domain data. Therefore, we created as many domain models as the number of domains. The procedure is sketched below.
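This is a framework-agnostic sketch of the procedure; `train` stands in for a full training run that starts from the checkpoint `init` and returns the trained model, and all names are illustrative rather than Fairseq's.

```python
from typing import Callable, List, Tuple

def finetune_adaptation(train: Callable, pretrained,
                        alt_data: List, it_data: List) -> Tuple:
    # Step 1: train a single mixture model on all domain data (no domain tags).
    mixture = train(init=pretrained, data=alt_data + it_data)
    # Step 2: fine-tune one model per domain, starting from the mixture model.
    alt_model = train(init=mixture, data=alt_data)
    it_model = train(init=mixture, data=it_data)
    return alt_model, it_model  # as many domain models as domains
```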

Experiments
The models and methods described above were fine-tuned and tested using the hyperparameters in Table 4. Tables 5 and 6 show the official BLEU scores (Papineni et al., 2002) on the test sets in the ALT and IT domains, respectively. Similar results were obtained on the development sets, but they are omitted from this paper. We submitted the results obtained using the pretrained mBART model, which were good on the development sets on average.
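For reference, corpus-level BLEU can be computed with the sacrebleu package as in the sketch below; note that the official WAT evaluation applies its own language-specific tokenization (e.g., for Thai and Hindi), which this sketch does not reproduce.

```python
import sacrebleu  # pip install sacrebleu

hypotheses = ["this is a test translation ."]    # system outputs, one per line
references = [["this is a test translation ."]]  # one list per reference set
bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(f"BLEU = {bleu.score:.2f}")
```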
The results are summarized as follows:
• For all language pairs in both domains, the BLEU scores with our expanded mBART model were better than those under the same conditions without the pretrained models.
When we focus on Indonesian, Malay, and Thai, which were not supported by the original mBART model, the BLEU scores of the submitted results increased by over 8 points from the baseline results in the ALT domain. We conclude that language expansion and additional pretraining were effective for translating the new languages.
For verification, we compared the sentences in the test sets with the corpus for the pretrained model (cf. Table 2). In the ALT domain, there were no identical sentences in the two corpora. (Between 0% and 10% of the test sentences were included in the IT domain.) Therefore, these improvements were not caused by memorization of the test sentences in the pretrained model.
• The multilingual models were effective only without the pretrained models. For example, for English-to-Hindi translation in the ALT domain, the BLEU score improved from 12.26 to 22.31 when we used multilingual models without the pretrained model. However, it degraded from 34.97 to 33.43 when we used multilingual models with the pretrained model.
The multilingual models were effective under low-resource conditions because the amount of parallel data increased during training. However, they were ineffective when the model had already learned sufficiently in advance, as with the pretrained models.
• Regarding domain adaptation, the fine-tuning method was better than the multi-domain models in many cases without the pretrained model.

Conclusions
In this paper, we presented the NICT-2 system submitted to the NICT-SAP task at WAT-2021. A feature of our system is that it uses the pretrained mBART model. Because the published pretrained model does not support Indonesian, Malay, and Thai, we expanded it to support these languages through additional pretraining on Wikipedia corpora. Consequently, the expanded mBART model improved the BLEU scores, regardless of whether multilingual models or domain adaptation methods were applied.