@inproceedings{benton-etal-2025-improving,
title = "Improving Informally {R}omanized Language Identification",
author = "Benton, Adrian and
Gutkin, Alexander and
Kirov, Christo and
Roark, Brian",
editor = "Christodoulopoulos, Christos and
Chakraborty, Tanmoy and
Rose, Carolyn and
Peng, Violet",
booktitle = "Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing",
month = nov,
year = "2025",
address = "Suzhou, China",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2025.emnlp-main.117/",
pages = "2318--2336",
ISBN = "979-8-89176-332-6",
abstract = "The Latin script is often used to informally write languages with non-Latin native scripts. In many cases (e.g., most languages in India), the lack of conventional spelling in the Latin script results in high spelling variability. Such romanization renders languages that are normally easily distinguished due to being written in different scripts {--} Hindi and Urdu, for example {--} highly confusable. In this work, we increase language identification (LID) accuracy for romanized text by improving the methods used to synthesize training sets. We find that training on synthetic samples which incorporate natural spelling variation yields higher LID system accuracy than including available naturally occurring examples in the training set, or even training higher capacity models. We demonstrate new state-of-the-art LID performance on romanized text from 20 Indic languages in the Bhasha-Abhijnaanam evaluation set (Madhani et al., 2023a), improving test F1 from the reported 74.7{\%} (using a pretrained neural model) to 85.4{\%} using a linear classifier trained solely on synthetic data and 88.2{\%} when also training on available harvested text."
}
<?xml version="1.0" encoding="UTF-8"?>
<modsCollection xmlns="http://www.loc.gov/mods/v3">
  <mods ID="benton-etal-2025-improving">
    <titleInfo>
      <title>Improving Informally Romanized Language Identification</title>
    </titleInfo>
    <name type="personal">
      <namePart type="given">Adrian</namePart>
      <namePart type="family">Benton</namePart>
      <role>
        <roleTerm authority="marcrelator" type="text">author</roleTerm>
      </role>
    </name>
    <name type="personal">
      <namePart type="given">Alexander</namePart>
      <namePart type="family">Gutkin</namePart>
      <role>
        <roleTerm authority="marcrelator" type="text">author</roleTerm>
      </role>
    </name>
    <name type="personal">
      <namePart type="given">Christo</namePart>
      <namePart type="family">Kirov</namePart>
      <role>
        <roleTerm authority="marcrelator" type="text">author</roleTerm>
      </role>
    </name>
    <name type="personal">
      <namePart type="given">Brian</namePart>
      <namePart type="family">Roark</namePart>
      <role>
        <roleTerm authority="marcrelator" type="text">author</roleTerm>
      </role>
    </name>
    <originInfo>
      <dateIssued>2025-11</dateIssued>
    </originInfo>
    <typeOfResource>text</typeOfResource>
    <relatedItem type="host">
      <titleInfo>
        <title>Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing</title>
      </titleInfo>
      <name type="personal">
        <namePart type="given">Christos</namePart>
        <namePart type="family">Christodoulopoulos</namePart>
        <role>
          <roleTerm authority="marcrelator" type="text">editor</roleTerm>
        </role>
      </name>
      <name type="personal">
        <namePart type="given">Tanmoy</namePart>
        <namePart type="family">Chakraborty</namePart>
        <role>
          <roleTerm authority="marcrelator" type="text">editor</roleTerm>
        </role>
      </name>
      <name type="personal">
        <namePart type="given">Carolyn</namePart>
        <namePart type="family">Rose</namePart>
        <role>
          <roleTerm authority="marcrelator" type="text">editor</roleTerm>
        </role>
      </name>
      <name type="personal">
        <namePart type="given">Violet</namePart>
        <namePart type="family">Peng</namePart>
        <role>
          <roleTerm authority="marcrelator" type="text">editor</roleTerm>
        </role>
      </name>
      <originInfo>
        <publisher>Association for Computational Linguistics</publisher>
        <place>
          <placeTerm type="text">Suzhou, China</placeTerm>
        </place>
      </originInfo>
      <genre authority="marcgt">conference publication</genre>
      <identifier type="isbn">979-8-89176-332-6</identifier>
    </relatedItem>
    <abstract>The Latin script is often used to informally write languages with non-Latin native scripts. In many cases (e.g., most languages in India), the lack of conventional spelling in the Latin script results in high spelling variability. Such romanization renders languages that are normally easily distinguished due to being written in different scripts – Hindi and Urdu, for example – highly confusable. In this work, we increase language identification (LID) accuracy for romanized text by improving the methods used to synthesize training sets. We find that training on synthetic samples which incorporate natural spelling variation yields higher LID system accuracy than including available naturally occurring examples in the training set, or even training higher capacity models. We demonstrate new state-of-the-art LID performance on romanized text from 20 Indic languages in the Bhasha-Abhijnaanam evaluation set (Madhani et al., 2023a), improving test F1 from the reported 74.7% (using a pretrained neural model) to 85.4% using a linear classifier trained solely on synthetic data and 88.2% when also training on available harvested text.</abstract>
    <identifier type="citekey">benton-etal-2025-improving</identifier>
    <location>
      <url>https://aclanthology.org/2025.emnlp-main.117/</url>
    </location>
    <part>
      <date>2025-11</date>
      <extent unit="page">
        <start>2318</start>
        <end>2336</end>
      </extent>
    </part>
  </mods>
</modsCollection>
%0 Conference Proceedings
%T Improving Informally Romanized Language Identification
%A Benton, Adrian
%A Gutkin, Alexander
%A Kirov, Christo
%A Roark, Brian
%Y Christodoulopoulos, Christos
%Y Chakraborty, Tanmoy
%Y Rose, Carolyn
%Y Peng, Violet
%S Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
%D 2025
%8 November
%I Association for Computational Linguistics
%C Suzhou, China
%@ 979-8-89176-332-6
%F benton-etal-2025-improving
%X The Latin script is often used to informally write languages with non-Latin native scripts. In many cases (e.g., most languages in India), the lack of conventional spelling in the Latin script results in high spelling variability. Such romanization renders languages that are normally easily distinguished due to being written in different scripts – Hindi and Urdu, for example – highly confusable. In this work, we increase language identification (LID) accuracy for romanized text by improving the methods used to synthesize training sets. We find that training on synthetic samples which incorporate natural spelling variation yields higher LID system accuracy than including available naturally occurring examples in the training set, or even training higher capacity models. We demonstrate new state-of-the-art LID performance on romanized text from 20 Indic languages in the Bhasha-Abhijnaanam evaluation set (Madhani et al., 2023a), improving test F1 from the reported 74.7% (using a pretrained neural model) to 85.4% using a linear classifier trained solely on synthetic data and 88.2% when also training on available harvested text.
%U https://aclanthology.org/2025.emnlp-main.117/
%P 2318-2336
Markdown (Informal)
[Improving Informally Romanized Language Identification](https://aclanthology.org/2025.emnlp-main.117/) (Benton et al., EMNLP 2025)
ACL
Adrian Benton, Alexander Gutkin, Christo Kirov, and Brian Roark. 2025. Improving Informally Romanized Language Identification. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 2318–2336, Suzhou, China. Association for Computational Linguistics.