Criteria for Useful Automatic Romanization in South Asian Languages

Isin Demirsahin, Cibu Johny, Alexander Gutkin, Brian Roark


Abstract
This paper presents a number of possible criteria for systems that transliterate South Asian languages from their native scripts into the Latin script, a process known as romanization. These criteria relate either to fidelity to human linguistic behavior (pronunciation transparency, naturalness, and conventionality) or to processing utility, both for people (ease of input) and under the hood in systems (invertibility and stability across languages and scripts). Addressing these differing criteria requires taking several linguistic considerations into account, such as the modeling of prominent phonological processes and their relation to orthography. We discuss these key linguistic details in the context of Brahmic scripts and languages that use them, such as Hindi and Malayalam. We then present the core features of several romanization algorithms, implemented in a finite-state transducer (FST) formalism, that address differing criteria. Implementations of these algorithms have been released as part of the Nisaba finite-state script processing library.
Anthology ID:
2022.lrec-1.718
Volume:
Proceedings of the Thirteenth Language Resources and Evaluation Conference
Month:
June
Year:
2022
Address:
Marseille, France
Venue:
LREC
Publisher:
European Language Resources Association
Pages:
6662–6673
URL:
https://aclanthology.org/2022.lrec-1.718
Cite (ACL):
Isin Demirsahin, Cibu Johny, Alexander Gutkin, and Brian Roark. 2022. Criteria for Useful Automatic Romanization in South Asian Languages. In Proceedings of the Thirteenth Language Resources and Evaluation Conference, pages 6662–6673, Marseille, France. European Language Resources Association.
Cite (Informal):
Criteria for Useful Automatic Romanization in South Asian Languages (Demirsahin et al., LREC 2022)
PDF:
https://aclanthology.org/2022.lrec-1.718.pdf