Byte-based Multilingual NMT for Endangered Languages

Mengjiao Zhang, Jia Xu


Abstract
Multilingual neural machine translation (MNMT) jointly trains a single shared model to translate between multiple language pairs. However, traditional subword-based MNMT approaches suffer from out-of-vocabulary (OOV) issues and a representation bottleneck, which often degrade translation performance on certain language pairs. While byte tokenization has been used to tackle OOV problems in neural machine translation (NMT), its capability has not yet been validated in MNMT. Additionally, to our knowledge, existing work has not studied how byte encoding can benefit endangered language translation. We propose a byte-based multilingual neural machine translation system (BMNMT) to alleviate the representation bottleneck and improve translation performance on endangered languages. Furthermore, we design a random byte mapping method with ensemble prediction to enhance model robustness. Experimental results show that BMNMT consistently and significantly outperforms subword- and word-based baselines on twelve language pairs, by up to 18.5 BLEU points (an 840% relative improvement).
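To illustrate why byte tokenization sidesteps the OOV problem mentioned in the abstract, here is a minimal sketch of plain UTF-8 byte encoding. It is not the authors' BMNMT pipeline and does not implement their random byte mapping or ensemble prediction; the function names `byte_encode` and `byte_decode` and the constant `VOCAB_SIZE` are hypothetical, chosen only for this example.

```python
from typing import List

# Assumption for illustration: the token vocabulary is just the 256 possible
# byte values, so any string in any script maps to known token IDs.
VOCAB_SIZE = 256


def byte_encode(text: str) -> List[int]:
    """Turn a string into a sequence of UTF-8 byte IDs (each in 0..255)."""
    return list(text.encode("utf-8"))


def byte_decode(ids: List[int]) -> str:
    """Turn byte IDs back into text; invalid sequences become U+FFFD."""
    return bytes(ids).decode("utf-8", errors="replace")


ids = byte_encode("café")
print(ids)               # [99, 97, 102, 195, 169]
print(byte_decode(ids))  # café
```

Every ID stays below `VOCAB_SIZE` regardless of the input language, at the cost of longer sequences: the four-character string above becomes five tokens because "é" takes two bytes in UTF-8.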
Anthology ID:
2022.coling-1.388
Volume:
Proceedings of the 29th International Conference on Computational Linguistics
Month:
October
Year:
2022
Address:
Gyeongju, Republic of Korea
Venue:
COLING
Publisher:
International Committee on Computational Linguistics
Pages:
4407–4417
URL:
https://aclanthology.org/2022.coling-1.388
Cite (ACL):
Mengjiao Zhang and Jia Xu. 2022. Byte-based Multilingual NMT for Endangered Languages. In Proceedings of the 29th International Conference on Computational Linguistics, pages 4407–4417, Gyeongju, Republic of Korea. International Committee on Computational Linguistics.
Cite (Informal):
Byte-based Multilingual NMT for Endangered Languages (Zhang & Xu, COLING 2022)
PDF:
https://aclanthology.org/2022.coling-1.388.pdf
Code:
mengjiaozhang/byte-based-multilingual-nmt