BPE vs. Morphological Segmentation: A Case Study on Machine Translation of Four Polysynthetic Languages

Manuel Mager, Arturo Oncevay, Elisabeth Mager, Katharina Kann, Thang Vu


Abstract
Morphologically rich polysynthetic languages present a challenge for NLP systems due to data sparsity, and a common strategy for handling this issue is to apply subword segmentation. We investigate a wide variety of supervised and unsupervised morphological segmentation methods for four polysynthetic languages: Nahuatl, Raramuri, Shipibo-Konibo, and Wixarika. We then compare these morphologically inspired segmentation methods against Byte-Pair Encodings (BPEs) as inputs for machine translation (MT) when translating to and from Spanish. We show that, for all language pairs except Nahuatl, an unsupervised morphological segmentation algorithm consistently outperforms BPEs and that, although supervised methods achieve better segmentation scores, they underperform in MT. Finally, we contribute two new morphological segmentation datasets for Raramuri and Shipibo-Konibo, and a parallel corpus for Raramuri–Spanish.
Anthology ID:
2022.findings-acl.78
Volume:
Findings of the Association for Computational Linguistics: ACL 2022
Month:
May
Year:
2022
Address:
Dublin, Ireland
Venues:
ACL | Findings
Publisher:
Association for Computational Linguistics
Pages:
961–971
URL:
https://aclanthology.org/2022.findings-acl.78
DOI:
10.18653/v1/2022.findings-acl.78
Cite (ACL):
Manuel Mager, Arturo Oncevay, Elisabeth Mager, Katharina Kann, and Thang Vu. 2022. BPE vs. Morphological Segmentation: A Case Study on Machine Translation of Four Polysynthetic Languages. In Findings of the Association for Computational Linguistics: ACL 2022, pages 961–971, Dublin, Ireland. Association for Computational Linguistics.
Cite (Informal):
BPE vs. Morphological Segmentation: A Case Study on Machine Translation of Four Polysynthetic Languages (Mager et al., Findings 2022)
PDF:
https://aclanthology.org/2022.findings-acl.78.pdf