Exploring Low-resource Neural Machine Translation for Sinhala-Tamil Language Pair

Ashmari Pramodya


Abstract
Neural Machine Translation is currently a promising approach to machine translation. Transformer-based deep learning architectures in particular have substantially improved translation quality across many language pairs. However, because Neural Machine Translation is data-hungry, many low-resource language pairs still do not lend themselves well to it. In this article, we investigate methods of expanding the parallel corpus to enhance translation quality within a model training pipeline, from the initial collection of parallel data through the training of baseline models. Building on state-of-the-art Neural Machine Translation approaches such as hyper-parameter tuning and data augmentation with forward and backward translation, we define a set of best practices for improving Tamil-to-Sinhala machine translation and empirically validate our methods using standard evaluation metrics. Our results demonstrate that Neural Machine Translation models trained on larger amounts of back-translated data outperform models trained with other synthetic data generation approaches in Transformer base training settings. We further demonstrate that, even for language pairs with limited resources, Transformer models can be tuned to outperform existing state-of-the-art Statistical Machine Translation models by as much as 3.28 BLEU points in Tamil-to-Sinhala translation.
Anthology ID:
2023.ranlp-stud.10
Volume:
Proceedings of the 8th Student Research Workshop associated with the International Conference Recent Advances in Natural Language Processing
Month:
September
Year:
2023
Address:
Varna, Bulgaria
Editors:
Momchil Hardalov, Zara Kancheva, Boris Velichkov, Ivelina Nikolova-Koleva, Milena Slavcheva
Venue:
RANLP
Publisher:
INCOMA Ltd., Shoumen, Bulgaria
Pages:
87–97
URL:
https://aclanthology.org/2023.ranlp-stud.10
Cite (ACL):
Ashmari Pramodya. 2023. Exploring Low-resource Neural Machine Translation for Sinhala-Tamil Language Pair. In Proceedings of the 8th Student Research Workshop associated with the International Conference Recent Advances in Natural Language Processing, pages 87–97, Varna, Bulgaria. INCOMA Ltd., Shoumen, Bulgaria.
Cite (Informal):
Exploring Low-resource Neural Machine Translation for Sinhala-Tamil Language Pair (Pramodya, RANLP 2023)
PDF:
https://aclanthology.org/2023.ranlp-stud.10.pdf