2025
pdf
bib
abs
Machine Translation and Transliteration for Indo-Aryan Languages: A Systematic Review
Sandun Sameera Perera
|
Deshan Koshala Sumanathilaka
Proceedings of the First Workshop on Natural Language Processing for Indo-Aryan and Dravidian Languages
This systematic review paper provides an overview of recent machine translation and transliteration developments for Indo-Aryan languages spoken by a large population across South Asia. The paper examines advancements in translation and transliteration systems for a few language pairs which appear in recently published papers. The review summarizes the current state of these technologies, providing a worthful resource for anyone who is doing research in these fields to understand and find existing systems and techniques for translation and transliteration.
pdf
bib
abs
IndoNLP 2025 Shared Task: Romanized Sinhala to Sinhala Reverse Transliteration Using BERT
Sandun Sameera Perera
|
Lahiru Prabhath Jayakodi
|
Deshan Koshala Sumanathilaka
|
Isuri Anuradha
Proceedings of the First Workshop on Natural Language Processing for Indo-Aryan and Dravidian Languages
The Romanized text has become popular with the growth of digital communication platforms, largely due to the familiarity with English keyboards. In Sri Lanka, Romanized Sinhala, commonly referred to as “Singlish” is widely used in digital communications. This paper introduces a novel context-aware back-transliteration system designed to address the ad-hoc typing patterns and lexical ambiguity inherent in Singlish. The proposed system com bines dictionary-based mapping for Singlish words, a rule-based transliteration for out of-vocabulary words and a BERT-based language model for addressing lexical ambiguities. Evaluation results demonstrate the robustness of the proposed approach, achieving high BLEU scores along with low Word Error Rate (WER) and Character Error Rate (CER) across test datasets. This study provides an effective solution for Romanized Sinhala back-transliteration and establishes the foundation for improving NLP tools for similar low-resourced languages.
pdf
bib
abs
Evaluating Transliteration Ambiguity in Adhoc Romanized Sinhala: A Dataset for Transliteration Disambiguation
Sandun Sameera Perera
|
Deshan Koshala Sumanathilaka
Proceedings of the 15th International Conference on Recent Advances in Natural Language Processing - Natural Language Processing in the Generative AI Era
This paper introduces the first Transliteration disambiguation (TD) dataset for Romanized Sinhala, informally known as Singlish, developed to address the challenge of transliteration ambiguity in backwards transliteration tasks. The dataset covers 22 ambiguous Romanized Sinhala words, each mapping to two distinct Sinhala meanings, and provides 30 Romanized sentences per word: ten for each meaning individually and ten containing both meanings in context. Sentences were initially collected through web scraping and later post-processed using the Claude language model, which offers strong support for Sinhala, alongside a rule-based Romanization process to ensure linguistic quality and consistency. To demonstrate its applicability, the dataset was used to evaluate four existing back-transliteration systems, highlighting their performance in resolving context-sensitive ambiguities. Baseline evaluations confirm the dataset’s effectiveness in assessing transliteration systems’ ability to handle transliteration ambiguity, offering a valuable resource for advancing TD and transliteration research for Sinhala.