Evaluating Transliteration Ambiguity in Adhoc Romanized Sinhala: A Dataset for Transliteration Disambiguation

Sandun Sameera Perera, Deshan Koshala Sumanathilaka


Abstract
This paper introduces the first Transliteration disambiguation (TD) dataset for Romanized Sinhala, informally known as Singlish, developed to address the challenge of transliteration ambiguity in backwards transliteration tasks. The dataset covers 22 ambiguous Romanized Sinhala words, each mapping to two distinct Sinhala meanings, and provides 30 Romanized sentences per word: ten for each meaning individually and ten containing both meanings in context. Sentences were initially collected through web scraping and later post-processed using the Claude language model, which offers strong support for Sinhala, alongside a rule-based Romanization process to ensure linguistic quality and consistency. To demonstrate its applicability, the dataset was used to evaluate four existing back-transliteration systems, highlighting their performance in resolving context-sensitive ambiguities. Baseline evaluations confirm the dataset’s effectiveness in assessing transliteration systems’ ability to handle transliteration ambiguity, offering a valuable resource for advancing TD and transliteration research for Sinhala.
Anthology ID:
2025.ranlp-1.107
Volume:
Proceedings of the 15th International Conference on Recent Advances in Natural Language Processing - Natural Language Processing in the Generative AI Era
Month:
September
Year:
2025
Address:
Varna, Bulgaria
Editors:
Galia Angelova, Maria Kunilovskaya, Marie Escribe, Ruslan Mitkov
Venue:
RANLP
SIG:
Publisher:
INCOMA Ltd., Shoumen, Bulgaria
Note:
Pages:
934–942
Language:
URL:
https://aclanthology.org/2025.ranlp-1.107/
DOI:
Bibkey:
Cite (ACL):
Sandun Sameera Perera and Deshan Koshala Sumanathilaka. 2025. Evaluating Transliteration Ambiguity in Adhoc Romanized Sinhala: A Dataset for Transliteration Disambiguation. In Proceedings of the 15th International Conference on Recent Advances in Natural Language Processing - Natural Language Processing in the Generative AI Era, pages 934–942, Varna, Bulgaria. INCOMA Ltd., Shoumen, Bulgaria.
Cite (Informal):
Evaluating Transliteration Ambiguity in Adhoc Romanized Sinhala: A Dataset for Transliteration Disambiguation (Perera & Sumanathilaka, RANLP 2025)
Copy Citation:
PDF:
https://aclanthology.org/2025.ranlp-1.107.pdf