TMU NMT System with Japanese BART for the Patent task of WAT 2021

In this paper, we introduce our TMU Neural Machine Translation (NMT) system submitted to the Patent task (Korean–Japanese and English–Japanese) of the 8th Workshop on Asian Translation (Nakazawa et al., 2021). Recently, several studies have proposed encoder-decoder models pre-trained on monolingual data. One of these models, BART (Lewis et al., 2020), was shown to improve translation accuracy via fine-tuning with bilingual data. However, the authors experimented only on Romanian→English translation using English BART. In this paper, we examine the effectiveness of Japanese BART using the Japan Patent Office Patent Corpus 2.0. Our experiments indicate that Japanese BART also improves translation accuracy in both Korean–Japanese and English–Japanese translations.


Introduction
Neural Machine Translation (NMT) has achieved high translation accuracy under large-scale data conditions. However, the translation accuracy of NMT drops when bilingual data are scarce (Koehn and Knowles, 2017). Several approaches, such as back-translation (Sennrich et al., 2016) and transfer learning (Zoph et al., 2016), address this problem. In addition to these methods, some approaches use models pre-trained on monolingual data alone.
BERT (Devlin et al., 2019), the most typical pre-trained model, can boost the accuracy of many downstream tasks, compared to models without BERT, via fine-tuning with task-specific training data. However, applying BERT to NMT by fine-tuning, as in other tasks, requires two-stage optimization and does not provide significant improvement (Imamura and Sumita, 2019). Recently, several studies have proposed encoder-decoder models pre-trained on monolingual data. Lewis et al. (2020) proposed BART, one such pre-trained encoder-decoder model. They demonstrated that BART works well not only for comprehension tasks such as GLUE (Wang et al., 2018) and SQuAD (Rajpurkar et al., 2016) but also for text generation tasks such as summarization and translation. However, they reported only the effect of English BART; they did not investigate BART trained on monolingual data of another language. Furthermore, in the translation task, they experimented only with Romanian→English translation, a language pair with subword overlap. Therefore, the effect on translation between language pairs without subword overlap is unclear. They also did not experiment with a translation direction in which the source language matches the language of the pre-trained model.
Additionally, we consider that fine-tuning pre-trained models such as BART on a translation task is similar to transfer learning (Zoph et al., 2016). Transfer learning in NMT trains a network on the parent language pair (the parent model) as the initial network and then fine-tunes it on the child language pair (the child model).
In the terminology of transfer learning, the pre-trained BART and the fine-tuned model are the parent and child models, respectively. Previous studies have shown that transfer learning works most efficiently when the source languages of the parent and child models are syntactically similar (Dabre et al., 2017; Nguyen and Chiang, 2017). Therefore, we hypothesize that BART is more effective when the language pair used for fine-tuning is syntactically similar to the pre-training language.
In this study, we examine the effects of Japanese BART on the translation task. We use the Korean–Japanese and English–Japanese bilingual data of the Japan Patent Office Patent Corpus 2.0 (JPO corpus) for fine-tuning. We experiment with both translation directions of Ko↔Ja and En↔Ja.

Related Work
Some approaches apply pre-trained encoder models such as BERT (Devlin et al., 2019) to the NMT task. Imamura and Sumita (2019) used BERT as an encoder and demonstrated the effectiveness of two-stage optimization, which first trains the parameters outside the BERT encoder and then fine-tunes all parameters. Zhu et al. (2020) used BERT representations as input embeddings and showed this to be more effective than using BERT as the encoder.
Recently, several studies have proposed pre-trained encoder-decoder models such as MASS (Song et al., 2019) and BART (Lewis et al., 2020), which can improve translation accuracy via fine-tuning with bilingual data. When applied to NMT, MASS uses monolingual data from both the source and target languages for pre-training. In contrast, BART uses only monolingual data of the target language. Liu et al. (2020) trained multilingual BART (mBART) using monolingual data of 25 languages and indicated that mBART initialization leads to significant gains in low-resource settings. However, Wang and Htun (2020) showed that mBART cannot obtain improvements on the Patent task.
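As background, BART's denoising pre-training objective can be sketched as follows. This is a deliberately simplified illustration in plain Python (actual BART samples multiple span lengths from a Poisson distribution and also permutes sentences); the function name and parameters are ours, not from any library:

```python
import random

def text_infilling(tokens, mask_ratio=0.3, mask_token="<mask>", seed=0):
    """Replace one contiguous span of tokens with a single mask token.
    The pre-training task is to reconstruct the original sequence
    from this corrupted input."""
    rng = random.Random(seed)
    n_mask = max(1, int(len(tokens) * mask_ratio))
    start = rng.randrange(len(tokens) - n_mask + 1)
    return tokens[:start] + [mask_token] + tokens[start + n_mask:]

src = "the patent application was filed in 2020".split()
corrupted = text_infilling(src)
# The training pair is (corrupted, src): the decoder must regenerate
# the full original sentence from the masked input.
```

Because both the corrupted input and the reconstruction target come from the same monolingual corpus, this objective needs no bilingual data, which is what makes it attractive as an initialization for low-resource translation.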

Implementation
In this study, we use Japanese BART base v1.1 (JaBART), trained on Japanese Wikipedia sentences (18M sentences). For fine-tuning, we do not use an additional encoder; instead, we add randomly initialized embeddings for each subword unknown to JaBART to both the encoder and the decoder. We share the embeddings of characters that match across languages, such as numbers and units. We also train baseline models with the same architecture as JaBART. We use the same hyperparameters, indicated in Table 2, for both fine-tuning JaBART and training the baseline models. We fine-tune and train the models using the fairseq implementation.
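The vocabulary-extension step above can be sketched as follows. This is a toy, framework-free illustration under our own assumptions (a vocabulary dict plus a list-of-lists embedding table standing in for fairseq's embedding matrix; function name and initialization scale are ours): subwords already present in the pre-trained vocabulary, such as digits and unit symbols that render identically in both languages, keep their pre-trained embedding, while truly unknown subwords receive a randomly initialized row.

```python
import random

def extend_vocab(vocab, emb, new_subwords, dim=4, seed=0):
    """Extend a pre-trained vocabulary for a new source language.
    Shared characters reuse their pre-trained embedding row;
    unknown subwords get a new randomly initialized row."""
    rng = random.Random(seed)
    vocab, emb = dict(vocab), [row[:] for row in emb]
    for sw in new_subwords:
        if sw in vocab:
            continue  # shared subword: keep the pre-trained embedding
        vocab[sw] = len(emb)
        emb.append([rng.gauss(0.0, 0.02) for _ in range(dim)])
    return vocab, emb

# Toy JaBART-style vocabulary: Japanese subwords plus digits/units
ja_vocab = {"特許": 0, "の": 1, "3": 2, "kg": 3}
ja_emb = [[0.1] * 4 for _ in range(4)]
# Korean source subwords: "3" and "kg" are shared, the rest are new
vocab, emb = extend_vocab(ja_vocab, ja_emb, ["특허", "3", "kg", "의"])
```

In the actual system this extension is applied to both the encoder and decoder embedding tables before fine-tuning begins.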

Data
To train and fine-tune the models, we use the Ko-Ja and En-Ja datasets of the JPO corpus. Korean and English have almost no subword overlap with Japanese, because these languages use Hangul, the Latin alphabet, and Hiragana/Katakana/Kanji characters, respectively. For Japanese pre-processing, we use the JaBART tokenizer. For Korean and English, we tokenize sentences using MeCab-ko and the Moses scripts, respectively. Then, we apply SentencePiece (Kudo and Richardson, 2018) with a 32k vocabulary size. Table 1 shows the training, development, and test data statistics. Table 3 shows the BLEU and RIBES scores of each single and ensemble model.
Table 3: BLEU / RIBES scores of each single model and ensemble of three models. The single scores are the averages of the three models. We indicate the best scores in bold. The ∆ scores indicate the gains of the fine-tuned JaBART's BLEU score over the baseline model.
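The two-stage pre-processing (word tokenization, then subword segmentation) can be sketched as follows. This is a minimal stand-in, not the actual pipeline: whitespace splitting replaces MeCab-ko/Moses, and a greedy longest-match segmenter over a toy subword set replaces SentencePiece's 32k unigram model; all names here are ours.

```python
def segment(word, subwords):
    """Greedy longest-match subword segmentation; a simple stand-in
    for the SentencePiece model used in the actual system. Falls
    back to single characters when no longer piece matches."""
    pieces, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):
            if word[i:j] in subwords or j == i + 1:
                pieces.append(word[i:j])
                i = j
                break
    return pieces

def preprocess(sentence, subwords):
    # 1) word tokenization (MeCab-ko / Moses / JaBART tokenizer in
    #    the actual system; whitespace split here for illustration)
    words = sentence.split()
    # 2) subword segmentation (32k SentencePiece model in the paper)
    return [p for w in words for p in segment(w, subwords)]

tokens = preprocess("patents filed", {"patent", "s", "fil", "ed"})
# → ["patent", "s", "fil", "ed"]
```

Segmenting both sides into subwords this way is also what makes the near-zero overlap between Hangul/Latin and Japanese subword inventories directly observable in the vocabularies.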

Results
For the single models, the fine-tuned JaBART achieves the highest scores on the dev and test data in both language pairs and translation directions of Ko↔Ja and En↔Ja. Specifically, the BLEU scores on the dev and test data reveal improvements of 0.440-1.350 and 1.013-1.250 over the baseline models, respectively. The RIBES scores also reveal improvements of 0.001-0.007, but there is no significant difference between the fine-tuned JaBART and the baseline models.
For the ensemble models, the fine-tuned JaBART improves the BLEU and RIBES scores by approximately 0.440-0.850 and 0.001-0.008, respectively, on the dev and test data of the Ko↔Ja and Ja→En translations. However, in the En→Ja translation, the BLEU score of the fine-tuned JaBART decreases by 0.090 on the dev data and improves by 0.240 on the test data. Thus, in the ensemble scenario, the fine-tuned JaBART can improve translation accuracy except in the En→Ja translation. Note that we use test-n data, a union of the test-n1, test-n2, and test-n3 data, for evaluation, and that we submitted the En↔Ja ensemble models as the targets for human evaluation.

Discussions
We hypothesize that JaBART is more effective when the language pair used for fine-tuning is syntactically similar to the pre-training language, as in transfer learning. In our experimental settings, Korean is syntactically similar to Japanese, whereas English is syntactically different. Therefore, we expected JaBART to be more effective in the Ko↔Ja translations than in the En↔Ja translations. However, Table 3 shows no significant differences in the ∆ scores between the Ko↔Ja and En↔Ja translations. These results indicate that syntactic similarity does not affect the gains in the final BLEU scores.

Conclusions
In this paper, we described our NMT system submitted to the Patent task (Ko↔Ja and En↔Ja) of the 8th Workshop on Asian Translation. We compared the baseline and fine-tuned JaBART models, and demonstrated that the fine-tuned JaBART achieves consistent BLEU improvements for language pairs with no subword overlap, irrespective of translation direction.
Contrary to our hypothesis, our experiments indicated no significant difference in translation accuracy depending on syntactic similarity. However, we consider that there may be differences in other aspects, such as the training process per epoch and the network representations. In future work, we will therefore analyze in detail BART models fine-tuned on language pairs with varying syntactic proximity.