Byte-based Multilingual NMT for Endangered Languages

Mengjiao Zhang; Jia Xu

Abstract

Multilingual neural machine translation (MNMT) jointly trains a single shared model to translate multiple language pairs. However, traditional subword-based MNMT approaches suffer from out-of-vocabulary (OOV) issues and a representation bottleneck, which often degrade translation performance on certain language pairs. While byte tokenization has been used to tackle OOV problems in neural machine translation (NMT), its capability has not yet been validated in MNMT. Additionally, to our knowledge, existing work has not studied how byte encoding can benefit endangered language translation. We propose a byte-based multilingual neural machine translation system (BMNMT) to alleviate the representation bottleneck and improve translation performance in endangered languages. Furthermore, we design a random byte mapping method with an ensemble prediction to enhance the robustness of our model. Experimental results show that our BMNMT consistently and significantly outperforms subword/word-based baselines on twelve language pairs, by up to +18.5 BLEU points, an 840% relative improvement.
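
As an illustration of why byte tokenization avoids OOV tokens, the following is a minimal sketch of plain UTF-8 byte tokenization. It is not the BMNMT implementation: the function names `byte_tokenize` and `byte_detokenize` are hypothetical, and the sketch does not cover the random byte mapping or the ensemble prediction mentioned in the abstract. It only shows that any string, in any script, maps onto a fixed vocabulary of 256 byte values.

```python
from typing import List


def byte_tokenize(text: str) -> List[int]:
    """Map a string to its UTF-8 byte values (hypothetical helper).

    Every value lies in 0..255, so the vocabulary is fixed at 256 symbols
    and no input can fall outside it.
    """
    return list(text.encode("utf-8"))


def byte_detokenize(ids: List[int]) -> str:
    """Invert byte_tokenize (hypothetical helper).

    Malformed byte sequences, which a model could emit, are replaced
    with U+FFFD rather than raising an error.
    """
    return bytes(ids).decode("utf-8", errors="replace")


# A non-ASCII character simply expands to several byte tokens.
ids = byte_tokenize("héllo")
print(ids)                   # [104, 195, 169, 108, 108, 111]
print(byte_detokenize(ids))  # héllo
```

The trade-off is sequence length: characters outside ASCII take two to four byte tokens each, so byte-level sequences are longer than their subword equivalents.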

Anthology ID: 2022.coling-1.388
Volume: Proceedings of the 29th International Conference on Computational Linguistics
Month: October
Year: 2022
Address: Gyeongju, Republic of Korea
Editors: Nicoletta Calzolari, Chu-Ren Huang, Hansaem Kim, James Pustejovsky, Leo Wanner, Key-Sun Choi, Pum-Mo Ryu, Hsin-Hsi Chen, Lucia Donatelli, Heng Ji, Sadao Kurohashi, Patrizia Paggio, Nianwen Xue, Seokhwan Kim, Younggyun Hahm, Zhong He, Tony Kyungil Lee, Enrico Santus, Francis Bond, Seung-Hoon Na
Venue: COLING
Publisher: International Committee on Computational Linguistics
Pages: 4407–4417
URL: https://aclanthology.org/2022.coling-1.388/
Cite (ACL): Mengjiao Zhang and Jia Xu. 2022. Byte-based Multilingual NMT for Endangered Languages. In Proceedings of the 29th International Conference on Computational Linguistics, pages 4407–4417, Gyeongju, Republic of Korea. International Committee on Computational Linguistics.
Cite (Informal): Byte-based Multilingual NMT for Endangered Languages (Zhang & Xu, COLING 2022)
PDF: https://aclanthology.org/2022.coling-1.388.pdf