Optimal Word Segmentation for Neural Machine Translation into Dravidian Languages

Prajit Dhar, Arianna Bisazza, Gertjan van Noord


Abstract
Dravidian languages, such as Kannada and Tamil, are notoriously difficult for state-of-the-art neural models to translate. This difficulty stems from the fact that these languages are both morphologically very rich and low-resourced. In this paper, we focus on subword segmentation and evaluate Linguistically Motivated Vocabulary Reduction (LMVR) against the more commonly used SentencePiece (SP) for the task of translating from English into four different Dravidian languages. Additionally, we investigate the optimal subword vocabulary size for each language. We find that SP is the overall best choice for segmentation, and that larger dictionary sizes lead to higher translation quality.
Anthology ID:
2021.wat-1.21
Volume:
Proceedings of the 8th Workshop on Asian Translation (WAT2021)
Month:
August
Year:
2021
Address:
Online
Venues:
ACL | IJCNLP | WAT
Publisher:
Association for Computational Linguistics
Pages:
181–190
URL:
https://aclanthology.org/2021.wat-1.21
DOI:
10.18653/v1/2021.wat-1.21
Cite (ACL):
Prajit Dhar, Arianna Bisazza, and Gertjan van Noord. 2021. Optimal Word Segmentation for Neural Machine Translation into Dravidian Languages. In Proceedings of the 8th Workshop on Asian Translation (WAT2021), pages 181–190, Online. Association for Computational Linguistics.
Cite (Informal):
Optimal Word Segmentation for Neural Machine Translation into Dravidian Languages (Dhar et al., WAT 2021)
PDF:
https://aclanthology.org/2021.wat-1.21.pdf
Data
PMIndia