Leveraging Highly Accurate Word Alignment for Low Resource Translation by Pretrained Multilingual Model

Jingyi Zhu, Minato Kondo, Takuya Tamura, Takehito Utsuro, Masaaki Nagata


Abstract
Recently, there has been growing interest in pretrained models in the field of natural language processing. Compared with models trained from scratch, pretrained models have been shown to produce superior results on low-resource translation tasks. In this paper, we introduced the use of pretrained seq2seq models for preordering and translation. We trained preordering on manual word alignment data and on word alignment data generated by an mBERT-based aligner, and compared the effectiveness of several mT5 and mBART variants for preordering. For the translation task, we chose mBART as our baseline model and evaluated several input schemes. Our approach was evaluated on the Asian Language Treebank dataset, which provides 20,000 parallel sentences in Japanese, English, and Hindi, with Japanese on either the source or the target side; we also used 3,000 in-house Chinese-Japanese parallel sentences. The results indicated that mT5-large trained with manual word alignment achieved a preordering performance exceeding 0.9 RIBES on the Ja-En and Ja-Zh pairs. Moreover, our proposed approach significantly outperformed the baseline model in most translation directions of the Ja-En, Ja-Zh, and Ja-Hi pairs on at least one of the BLEU and COMET scores.
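To illustrate the idea of preordering from word alignments, here is a minimal sketch (not the paper's exact method) of how reordered training targets can be derived: each source token is moved to the position of its aligned target word, so the preordering model learns to emit source words in target-language order. The `preorder` helper and its handling of unaligned tokens (pushed to the end) are simplifying assumptions for illustration.

```python
def preorder(src_tokens, alignment):
    """Reorder source tokens into target word order.

    src_tokens: list of source-language tokens.
    alignment: list of (src_idx, tgt_idx) word-alignment pairs,
        e.g. from manual annotation or an mBERT-based aligner.
    """
    # Map each source index to its smallest aligned target index.
    tgt_pos = {}
    for s, t in alignment:
        tgt_pos[s] = min(t, tgt_pos.get(s, t))
    # Stable sort by aligned target position; unaligned tokens
    # (no entry in tgt_pos) are simply appended at the end here.
    order = sorted(range(len(src_tokens)),
                   key=lambda i: (tgt_pos.get(i, float("inf")), i))
    return [src_tokens[i] for i in order]


# English source aligned to Japanese "私 は リンゴ を 食べ た" (SOV order):
# I->私(0), ate->食べ(4), an/apple->リンゴ(2).
print(preorder(["I", "ate", "an", "apple"],
               [(0, 0), (1, 4), (2, 2), (3, 2)]))
# → ['I', 'an', 'apple', 'ate']
```

A seq2seq model such as mT5 can then be fine-tuned on pairs of original and reordered source sentences, which is the preordering task the abstract describes.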
Anthology ID:
2023.mtsummit-research.28
Volume:
Proceedings of Machine Translation Summit XIX, Vol. 1: Research Track
Month:
September
Year:
2023
Address:
Macau SAR, China
Editors:
Masao Utiyama, Rui Wang
Venue:
MTSummit
Publisher:
Asia-Pacific Association for Machine Translation
Note:
Pages:
336–347
URL:
https://aclanthology.org/2023.mtsummit-research.28
Cite (ACL):
Jingyi Zhu, Minato Kondo, Takuya Tamura, Takehito Utsuro, and Masaaki Nagata. 2023. Leveraging Highly Accurate Word Alignment for Low Resource Translation by Pretrained Multilingual Model. In Proceedings of Machine Translation Summit XIX, Vol. 1: Research Track, pages 336–347, Macau SAR, China. Asia-Pacific Association for Machine Translation.
Cite (Informal):
Leveraging Highly Accurate Word Alignment for Low Resource Translation by Pretrained Multilingual Model (Zhu et al., MTSummit 2023)
PDF:
https://aclanthology.org/2023.mtsummit-research.28.pdf