Extensions to Brahmic script processing within the Nisaba library: new scripts, languages and utilities

Alexander Gutkin, Cibu Johny, Raiomond Doctor, Lawrence Wolf-Sonkin, Brian Roark


Abstract
The Brahmic family of scripts is used to record some of the most spoken languages in the world and is arguably the most diverse family of writing systems. In this work, we present several substantial extensions to Brahmic script functionality within the open-source Nisaba library of finite-state script normalization and processing utilities (Johny et al., 2021). First, we extend coverage from the original ten scripts to an additional ten scripts of South Asia and beyond, including some used to record endangered languages such as Dogri. Second, we augment the language layer so that scripts used by multiple languages in distinct ways can be processed correctly for more languages, such as the Bengali script when used for the low-resource language Santali. We document key changes to the finite-state engine required to support these new languages and scripts. Finally, we add new script processing utilities, including lightweight script-level reading normalization that (unlike existing visual normalization) does not preserve visual invariance, and a fixed-input transliteration mechanism specifically tailored to Brahmic text entry with ASCII characters.
Anthology ID:
2022.lrec-1.692
Volume:
Proceedings of the Thirteenth Language Resources and Evaluation Conference
Month:
June
Year:
2022
Address:
Marseille, France
Editors:
Nicoletta Calzolari, Frédéric Béchet, Philippe Blache, Khalid Choukri, Christopher Cieri, Thierry Declerck, Sara Goggi, Hitoshi Isahara, Bente Maegaard, Joseph Mariani, Hélène Mazo, Jan Odijk, Stelios Piperidis
Venue:
LREC
Publisher:
European Language Resources Association
Pages:
6450–6460
URL:
https://aclanthology.org/2022.lrec-1.692
Cite (ACL):
Alexander Gutkin, Cibu Johny, Raiomond Doctor, Lawrence Wolf-Sonkin, and Brian Roark. 2022. Extensions to Brahmic script processing within the Nisaba library: new scripts, languages and utilities. In Proceedings of the Thirteenth Language Resources and Evaluation Conference, pages 6450–6460, Marseille, France. European Language Resources Association.
Cite (Informal):
Extensions to Brahmic script processing within the Nisaba library: new scripts, languages and utilities (Gutkin et al., LREC 2022)
PDF:
https://aclanthology.org/2022.lrec-1.692.pdf