Sinhala Transliteration: A Comparative Analysis Between Rule-based and Seq2Seq Approaches

Widanalage Mario Yomal De Mel, Kasun Imesha Wickramasinghe, Nisansa de Silva, Surangika Dayani Ranathunga


Abstract
For reasons of convenience and limited technical literacy, transliteration (i.e., Romanizing native scripts instead of using localization tools) is highly prevalent among low-resource languages such as Sinhala, which have their own writing scripts. This study focuses on Romanized Sinhala transliteration. We propose two methods to address the problem: a rule-based baseline, and a second method that treats transliteration as a sequence-to-sequence task akin to the established Neural Machine Translation (NMT) task. For the latter, we propose a Transformer-based Encoder-Decoder solution. We observed that the Transformer-based method could capture many ad-hoc patterns within the Romanized scripts, in contrast to the rule-based method.
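To make the rule-based side of the comparison concrete, the following is a minimal sketch of a greedy longest-match transliterator from Romanized input to Sinhala script. The rule table, the names `CONSONANTS`, `INDEPENDENT_VOWELS`, `VOWEL_SIGNS`, and `transliterate`, and the handful of letters covered are all illustrative assumptions; this is not the rule set or implementation used by the authors.

```python
# Hypothetical, deliberately tiny rule table; NOT the rule set from the paper.
CONSONANTS = {
    "k": "\u0d9a",  # ka
    "m": "\u0db8",  # ma
    "n": "\u0db1",  # na
    "r": "\u0dbb",  # ra
    "l": "\u0dbd",  # la
}
# Vowels written as standalone letters (word-initial or after another vowel).
INDEPENDENT_VOWELS = {"a": "\u0d85", "aa": "\u0d86", "i": "\u0d89", "u": "\u0d8b"}
# Vowels written as dependent signs on a preceding consonant.
# Short "a" is inherent in the consonant letter, so it adds nothing.
VOWEL_SIGNS = {"a": "", "aa": "\u0dcf", "i": "\u0dd2", "u": "\u0dd4"}
AL_LAKUNA = "\u0dca"  # marks a consonant that carries no vowel


def _longest_match(text, pos, table):
    """Return the longest key of `table` that starts at text[pos], or None."""
    for length in range(max(map(len, table)), 0, -1):
        key = text[pos:pos + length]
        if key in table:
            return key
    return None


def transliterate(romanized):
    """Map Romanized text to Sinhala script using the fixed rules above."""
    text = romanized.lower()
    out = []
    pos = 0
    while pos < len(text):
        cons = _longest_match(text, pos, CONSONANTS)
        if cons is not None:
            pos += len(cons)
            vowel = _longest_match(text, pos, VOWEL_SIGNS)
            if vowel is None:
                # Bare consonant: suppress the inherent vowel.
                out.append(CONSONANTS[cons] + AL_LAKUNA)
            else:
                out.append(CONSONANTS[cons] + VOWEL_SIGNS[vowel])
                pos += len(vowel)
            continue
        vowel = _longest_match(text, pos, INDEPENDENT_VOWELS)
        if vowel is not None:
            out.append(INDEPENDENT_VOWELS[vowel])
            pos += len(vowel)
        else:
            # Unknown character: pass it through unchanged.
            out.append(text[pos])
            pos += 1
    return "".join(out)


if __name__ == "__main__":
    for word in ["mal", "kiri", "maama"]:
        print(word, "->", transliterate(word))
```

A fixed table of this kind maps each Roman spelling to exactly one output, so it cannot recover information the writer left out, such as a long vowel typed as a single letter. That limitation is one plausible reading of the ad-hoc patterns the abstract says the sequence-to-sequence model handles better.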
Anthology ID:
2025.indonlp-1.19
Volume:
Proceedings of the First Workshop on Natural Language Processing for Indo-Aryan and Dravidian Languages
Month:
January
Year:
2025
Address:
Abu Dhabi
Editors:
Ruvan Weerasinghe, Isuri Anuradha, Deshan Sumanathilaka
Venues:
IndoNLP | WS
Publisher:
Association for Computational Linguistics
Pages:
166–173
URL:
https://aclanthology.org/2025.indonlp-1.19/
Cite (ACL):
Widanalage Mario Yomal De Mel, Kasun Imesha Wickramasinghe, Nisansa de Silva, and Surangika Dayani Ranathunga. 2025. Sinhala Transliteration: A Comparative Analysis Between Rule-based and Seq2Seq Approaches. In Proceedings of the First Workshop on Natural Language Processing for Indo-Aryan and Dravidian Languages, pages 166–173, Abu Dhabi. Association for Computational Linguistics.
Cite (Informal):
Sinhala Transliteration: A Comparative Analysis Between Rule-based and Seq2Seq Approaches (De Mel et al., IndoNLP 2025)
PDF:
https://aclanthology.org/2025.indonlp-1.19.pdf