Finite-state script normalization and processing utilities: The Nisaba Brahmic library

This paper presents an open-source library for efficient low-level processing of ten major South Asian Brahmic scripts. The library provides a flexible and extensible framework for supporting crucial operations on Brahmic scripts, such as NFC, visual normalization, reversible transliteration, and validity checks, implemented in Python within a finite-state transducer formalism. We survey some common Brahmic script issues that may adversely affect the performance of downstream NLP tasks, and provide the rationale for finite-state design and system implementation details.


Introduction
The Unicode Standard separates the representation of text from its specific graphical rendering: text is encoded as a sequence of characters, which, at presentation time, are collectively rendered into the appropriate sequence of glyphs for display. This can occasionally result in many-to-one mappings, where several distinctly-encoded strings result in identical display. For example, Latin-script letters with diacritics such as "é" can generally be encoded as either: (a) a pair of the base letter (e.g., "e", which is U+0065 from Unicode's Basic Latin block, corresponding to ASCII) and a diacritic (in this case U+0301 from the Combining Diacritical Marks block); or (b) a single character that represents the grapheme directly (U+00E9 from the Latin-1 Supplement block). Both encodings yield visually identical text, hence text is often normalized to a conventionalized normal form, such as the well-known Normalization Form C (NFC), so that visually identical words are mapped to a conventionalized representative of their equivalence class for downstream processing. Critically, NFC normalization falls far short of a complete handling of such many-to-one phenomena in Unicode.
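The equivalence above can be checked directly with Python's standard unicodedata module; this is a minimal illustration of NFC composition, not the library's own implementation (which is a transducer):

```python
import unicodedata

decomposed = "e\u0301"   # "e" + COMBINING ACUTE ACCENT (two code points)
composed = "\u00e9"      # "é" as a single code point

# The two encodings render identically but compare unequal as raw strings.
assert decomposed != composed

# NFC maps both to the composed representative of the equivalence class.
assert unicodedata.normalize("NFC", decomposed) == composed
assert unicodedata.normalize("NFC", composed) == composed
```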
In addition to such normalization issues, some scripts also have well-formedness constraints, i.e., not all strings of Unicode characters from a single script correspond to a valid (i.e., legible) grapheme sequence in the script. Such constraints do not apply in the basic Latin alphabet, where any permutation of letters can be rendered as a valid string (e.g., for use as an acronym). The Brahmic family of scripts, however, including the Devanagari script used to write Hindi, Marathi and many other South Asian languages, do have such constraints. These scripts are alphasyllabaries, meaning that they are structured around orthographic syllables (akṣara) as the basic unit. One or more Unicode characters combine when rendering one of thousands of legible akṣara, but many combinations do not correspond to any akṣara. Given a token in these scripts, one may want to (a) normalize it to a canonical form; and (b) check whether it is a well-formed sequence of akṣara.
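As a toy illustration of such well-formedness constraints (not the library's actual grammar, which covers many more character classes), the basic Devanagari akṣara structure can be approximated with a regular expression over a few simplified character ranges:

```python
import re

# Simplified Devanagari character classes (illustrative, not exhaustive).
C = "[\u0915-\u0939]"   # consonants ka..ha
V = "[\u0904-\u0914]"   # independent vowels
S = "[\u093E-\u094C]"   # dependent vowel signs
H = "\u094D"            # virama

# One aksara: a consonant cluster (consonants joined by virama), optionally
# followed by a dependent vowel sign or a final virama; or an independent vowel.
AKSARA = f"(?:(?:{C}{H})*{C}{S}?{H}?|{V})"
WORD = re.compile(f"{AKSARA}+\\Z")

def well_formed(text: str) -> bool:
    """True iff `text` parses as a sequence of (simplified) aksara."""
    return WORD.match(text) is not None
```

For example, this accepts नमस्ते but rejects a string that begins with a dependent vowel sign, which no consonant precedes.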
Brahmic scripts are heavily used across South Asia and have official status in India, Bangladesh, Nepal, Sri Lanka and beyond (Cardona and Jain, 2007; Steever, 2019). Despite evident progress in localization standards (Unicode Consortium, 2019) and improvements in associated technologies such as input methods (Hinkle et al., 2013) and character recognition (Pal et al., 2012), Brahmic script processing still poses important challenges due to the inherent differences between these writing systems and those which historically have been more dominant in information technology (Sinha, 2009; Bhattacharyya et al., 2019). In this paper, we present Nisaba, an open-source software library, which provides processing utilities for ten major Brahmic scripts of South Asia: Bengali, Devanagari, Gujarati, Gurmukhi, Kannada, Malayalam, Oriya (Odia), Sinhala, Tamil, and Telugu. In addition to string normalization and well-formedness processing, the library also includes utilities for the deterministic and reversible romanization of these scripts, i.e., transliteration from each script to and from the Latin script (Wellisch, 1978). While the resulting romanizations are standardized in a way that may or may not correspond to how native speakers tend to romanize the text in informal communication (see, e.g., Roark et al., 2020), such a default romanization permits easy inspection of an approximate version of the linguistic strings for those who read the Latin script but not the specific Brahmic script being examined.
As a whole, the library provides important utilities for language processing applications of South Asian languages using Brahmic scripts. The design is based on the observation that, while there are considerable superficial differences between these scripts, they follow the same encoding model in Unicode and maintain a very similar character repertoire, having evolved from the same source, the Brāhmī script (Salomon, 1996; Fedorova, 2012). This observation lends itself to the script-agnostic design (outlined in §4) that, unlike other approaches reviewed in §2, is based on the weighted finite-state transducer (WFST) formalism (Mohri, 2004). The details of our system are provided in §5.

Related Work
The computational processing of Brahmic scripts is not a new topic, with the first applications dating back to the early formal syntactic work by Datta (1984). With an increased focus on the South Asian languages within the NLP community, facilitated by advances in machine learning and the increased availability of relevant corpora, multiple script processing solutions have emerged. Some of these toolkits, such as the statistical machine translation-based BrahmiNet (Kunchukuttan et al., 2015), are model-based, while others, such as URoman (Hermjakob et al., 2018), IndicNLP (Kunchukuttan, 2020) and Aksharamukha (Rajan, 2020), employ rules. The main focus of these libraries is script conversion and romanization. In this capacity they were successfully employed in diverse downstream multilingual NLP tasks such as neural machine translation (Zhang et al., 2020; Amrhein and Sennrich, 2020), morphological analysis (Hauer et al., 2019; Murikinati et al., 2020), named entity recognition (Huang et al., 2019) and part-of-speech tagging (Cardenas et al., 2019).
Similar to the software mentioned above, our library does provide romanization, but unlike some of the packages, such as URoman, we guarantee reversibility from Latin back to the native script. Similar to others, we do not focus on faithful invertible transliteration of named entities, which typically requires model-based approaches (Sequiera et al., 2014). Unlike the IndicNLP package, our software does not provide morphological analysis, but instead offers significantly richer script normalization capabilities than other packages. These capabilities are functionally separated into normalization to Normalization Form C (NFC) and visual normalization. Additionally, our library provides extensive script-specific well-formedness grammars. Finally, in contrast to these other approaches, grammars in our library are maintained separately from the code for compilation and application, allowing for maintenance of existing scripts and languages plus extension to new ones without having to modify any code. This is particularly important given that Unicode standards do change over time and there remain many languages left to cover.
To the best of our knowledge, this is the first publicly available general finite-state grammar approach for low-level processing of multiple Brahmic scripts since the early formal syntactic work by Datta (1984), and the first such library designed based on an observation by Sproat (2003) that the fundamental organizing principles of the Brahmic scripts can be algebraically formalized. In particular, all the core components of our library (inverse romanization, normalization and well-formedness) are compactly and efficiently represented as finite-state transducers. Such formalization lends itself particularly well to runtime or offline integration with any finite-state processing pipeline, such as decoder components of input methods (Ouyang et al., 2017; Hellsten et al., 2017) and text normalization for automatic speech recognition and text-to-speech synthesis (Zhang et al., 2019), among other natural language and speech applications.

Brahmic Scripts: An Overview
The scripts of interest have evolved from the ancient Brāhmī writing system that was recorded from the 3rd century BCE and fell out of use by the 5th century CE (Salomon, 1996; Strauch, 2012; Fedorova, 2012). The main unit of linear graphemic representation in Brahmic scripts is known by its traditional Sanskrit-derived name akṣara. As Bright (1999) notes, it is often translated as "syllable", although it does not bear direct correspondence to a syllable of speech, but rather to an orthographic syllable. The structure, or "grammar", of an akṣara is based on the following common principles: an akṣara often consists of a consonant symbol, by default bearing an unmarked inherent vowel, or carrying an attached diacritic (dependent) vowel sign; but it may also be an independent vowel symbol, or a consonant symbol with its inherent vowel "muted" by a special virama diacritic. In any of these scenarios, the base consonant can be replaced by a consonant cluster where all but the last consonant lose their inherent vowel. When the individual component consonants of the cluster combine to form a composite form, precluding the use of an overt virama diacritic, this is known as a "consonant conjunct" (Fedorova, 2013; Bright, 1999; Coulmas, 1999; Share and Daniels, 2016). The elements of the akṣara grammar described above can be grouped into several natural classes. The sizes of the core classes are shown in Table 1 for each writing system and its corresponding ISO 15924 identifier in uppercase format (ISO, 2004). The major classes are the independent vowels (e.g., the Devanagari diphthong औ), the dependent vowel diacritics (e.g., the Gujarati ◌ી), and the consonants (e.g., the Gurmukhi ੜ). Another important class consists of the coda consonant symbols, like anusvara, chandrabindu, and visarga, which modify the akṣara as a whole (and follow any vowel signs in the memory representation). Finally, there is a class of special characters, such as the religious symbol Om ॐ, that behave like independent akṣara.

Unicode Normalization Unicode defines several normalization forms which are used for checking whether two Unicode strings are equivalent to each other (Unicode Consortium, 2019). In our library we support Normalization Form C (NFC), which is well suited for comparing visually identical strings. This normalization generally converts strings to the equivalent form that uses composite characters. Table 2 shows two examples of legacy sequences and their corresponding canonically equivalent forms for Devanagari.
Visual Normalization As was mentioned above, an akṣara may be represented by multiple Unicode character sequences, and the goal of NFC normalization is to convert them to their unique canonical form. However, there are many Unicode character sequences that fall outside the scope of the NFC algorithm. We provide visual normalization that, in addition to providing the NFC functionality, also supports transforming such legacy sequences. Some of the rules are provided as "Do Not Use" tables by the Unicode Consortium (2019), which recommend transformations from legacy sequences to their corresponding canonical form, such as Devanagari { अ (U+0905), ॅ (U+0945) } → ॲ (U+0972). We also include transformations for sequences that are visually identical under many implementations and are commonly found on the Web.

Well-formedness Check A well-formedness acceptor verifies whether the given text is readable in a particular script or not. It would be hard for a native reader to visually parse the text if the script rules are not followed. For example, the reader does not expect two vowel signs on a single consonant, and such a combination may not even be possible to reasonably draw. Furthermore, unlike the Latin script, acronyms are not written using arbitrary letter sequences; they are formed only as a sequence of akṣara. Our approach verifies whether the text is a sequence of well-formed akṣara using the grammar described above.

Some additions to the standard transliteration scheme are shown in Table 3. These additions are crucial because they allow us to reliably recover the original Brahmic strings from their romanizations. This property allows various data processing pipelines to use the romanized text as an internal representation and convert it back to the original native script at the output stage.
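Such "Do Not Use" transformations amount to longest-match replacement over a table of legacy-to-canonical mappings. A toy stdlib sketch follows, with the single rule quoted above as the table; the library itself compiles such rule tables into WFSTs:

```python
# Legacy Devanagari sequences mapped to canonical forms. One example rule
# from the Unicode "Do Not Use" recommendations; the real tables are larger.
DO_NOT_USE = {
    "\u0905\u0945": "\u0972",  # LETTER A + CANDRA E sign -> LETTER CANDRA A
}

def visual_normalize(text: str, table=DO_NOT_USE) -> str:
    """Apply legacy-to-canonical rewrites, preferring the longest match."""
    out, i = [], 0
    keys = sorted(table, key=len, reverse=True)
    while i < len(text):
        for k in keys:
            if text.startswith(k, i):
                out.append(table[k])
                i += len(k)
                break
        else:  # no rule applies at position i; copy the character through
            out.append(text[i])
            i += 1
    return "".join(out)
```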

Reversible ISO Transliteration
Language-specific Logic Several South Asian languages share the same script with some, often minor, language-specific differences. Our library supports language-specific customizations that can be combined with language-agnostic script logic. For example, the modern Bengali-Assamese script (Beng) is shared by both the Bengali and Assamese languages, among others (Brandt and Sohoni, 2018). For both of these languages our library provides customizations, such as the transformations required for visual normalization of Assamese that transform the Bengali letter ra into its Assamese equivalent when it participates in a consonant conjunct (which generally occurs when following or preceding a virama).

The Finite-State Approach
The Brahmic script manipulation operations described above have a natural interpretation grounded in formal language theory. We treat the text corpus in a given script as a set of strings over some finite alphabet Σ that defines a set of admissible script symbols. A set of zero or more strings is known as a language which, in its simplest (regular) form, can be succinctly described (or recognized) by a finite-state automaton (FSA), or acceptor (Yu, 1997). Two simple FSAs that represent the Gujarati word દસ are shown in Figure 1, where the top automaton represents the word over an alphabet of Unicode code points for Gujarati, while the bottom one represents the same string over the corresponding byte symbols in UTF-8 encoding (Unicode Consortium, 2019). Our library supports both representations.

The akṣara grammar outlined in the previous section can be expressed via elementary formal operations on the FSAs that describe grammar constituents. Such set-theoretic operations include union (∪), concatenation (+) and closure, where closure is defined as an arbitrary natural number of concatenations of a language over Σ with itself, either including the empty string or not, denoted * (Kleene star) and + (Kleene plus), respectively (Kuich and Salomaa, 1986). These operations represent non-trivial automata which are compiled offline, resulting in compact and efficient representations. A simplified process for constructing the automaton to perform the well-formedness check from the previous section is shown in Figure 2. In this simplified example, the paths through the automaton that define a legal consonant cluster (line 2 of the algorithm) are represented by a sub-automaton that recognizes the language consisting of strings formed from the consonant and virama symbols only, where each consonant, apart from the last one, must be followed by the virama that removes an inherent vowel.

Figure 3: Romanization of Sinhala words එක ("one") and ෙදක ("two") into ⟨eka⟩ and ⟨deka⟩, respectively.

The rest of the operations on the Brahmic scripts, namely normalization and transliteration, involve modifications of the Brahmic script inputs. Such operations are naturally expressed by finite-state transducers (FSTs), a generalization of the FSA concept used to encode string-to-string relations (or transductions), obtained by modifying the automata arcs to carry pairs of labels from input and output alphabets, instead of single labels. A trivial romanization in our representation of the two Sinhala words එක (⟨eka⟩, "one") and ෙදක (⟨deka⟩, "two") is shown in Figure 3. Note the "vocalization" of the final consonant by insertion of a schwa via an input transition. Also note that the path accepting the second word is longer: the word ෙදක consists of three Unicode characters and requires modification of the inherent vowel by the dependent vowel in order to produce ⟨de⟩.
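The romanization logic illustrated in Figure 3 can be mimicked in plain Python for just these symbols. This is a toy sketch covering only the characters of the two example words; the library encodes the full mapping as a WFST:

```python
# Minimal Sinhala symbol tables for the two example words (illustrative only).
CONSONANT = {"\u0D9A": "k", "\u0DAF": "d"}   # ka, da
IND_VOWEL = {"\u0D91": "e"}                   # independent vowel e
VOWEL_SIGN = {"\u0DD9": "e"}                  # dependent vowel sign e

def romanize(text: str) -> str:
    out = []
    for ch in text:
        if ch in CONSONANT:
            out.append(CONSONANT[ch] + "a")   # insert the inherent vowel
        elif ch in IND_VOWEL:
            out.append(IND_VOWEL[ch])
        elif ch in VOWEL_SIGN:
            # In logical (memory) order the sign follows its consonant,
            # even when it renders to the left; it replaces the inherent vowel.
            out[-1] = out[-1][:-1] + VOWEL_SIGN[ch]
    return "".join(out)
```

For example, the logical-order sequences for the two words romanize to "eka" and "deka".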
The basic operations on the FSAs outlined above also extend to the FST case and allow for similarly succinct final compiled representations (Mohri, 2000), such as the simplified construction of the ISO romanization transducer ℐ for converting from Brahmic scripts to the Latin alphabet, shown in Figure 4. An important extension of FSAs and FSTs are the weighted finite-state automata (WFSAs) and transducers (WFSTs) (Mohri, 2004, 2009) that equip each arc in the automaton or transducer with a weight, thus allowing optimization and search algorithms to compute the costs of distinct paths, which can be used to determine their relative importance. We use weights in some of our grammars to indicate the relative priority of a particular akṣara modification. For example, in Figure 4, the paths corresponding to consonants followed by dependent vowels (line 6) have priority over the akṣara-initial independent vowels (line 9).

The two remaining operations on akṣara, namely NFC and visual normalization, are represented in our library using the context-dependent rewrite rules from the formal approach popularized by Chomsky and Halle (1968). The normalization rules are represented as a sequence of rules {φ → ψ / λ __ ρ}, where the source φ is rewritten as ψ if its left and right contexts are λ and ρ. For an earlier example from §3, a single NFC normalization rule rewrites the Devanagari string φ = "न" (na, U+0928) + "़" (nukta sign, U+093C) into its canonical composition ψ = "ऩ" (nnna, U+0929). Kaplan and Kay (1994) proposed an algorithm for compiling such rule sequences into an FST. This approach was further improved and extended to WFSTs by Mohri and Sproat (1996), whose algorithm we use to compile the sequences of NFC and visual normalization rules into the corresponding transducers.
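A single context-dependent rule of this shape can be emulated with regular-expression lookaround. This is a stdlib stand-in for the Kaplan–Kay/Mohri–Sproat compilation into transducers, and the function names are ours:

```python
import re

def compile_rule(phi: str, psi: str, lam: str = "", rho: str = ""):
    """Build a function applying the rewrite rule phi -> psi / lam __ rho."""
    pattern = ""
    if lam:
        pattern += f"(?<={re.escape(lam)})"   # left context as lookbehind
    pattern += re.escape(phi)
    if rho:
        pattern += f"(?={re.escape(rho)})"    # right context as lookahead
    rx = re.compile(pattern)
    return lambda s: rx.sub(psi, s)

# The NFC rule from the text: na + nukta -> nnna (empty contexts).
nfc_na_nukta = compile_rule("\u0928\u093C", "\u0929")
```

Unlike FSTs, regex substitutions cannot be composed offline into a single machine, which is one motivation for the WFST formulation used in the library.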
Finally, the transducers representing language-specific customizations of a particular script operation are compiled by composing the generic language-agnostic transducer, such as the Devanagari visual normalizer, with the transducer representing transformations that capture language-specific use of the script, e.g., Devanagari for Nepali.

System Details and Demo
The core of the Nisaba Brahmic script manipulation library resides under the brahmic directory of the distribution. In this section we provide details on how to build and use the library, and also explore its application to visual normalization of Wikipedia-based text in 9 of these scripts.

Prerequisites
We use Bazel (Google, 2020) as the primary build environment. For compiling the grammars into WFSTs we use Pynini (Gorman, 2016), a Python library for finite-state grammar development.
In addition, the library depends on Thrax, an older relative of Pynini that provides a custom grammar manipulation language for WFSTs (Tai et al., 2011; Roark et al., 2012). Although Thrax has been mostly superseded by Pynini, we still rely on some of its utilities for unit testing and on its C++ runtime components. At their core, both Pynini and Thrax depend on the OpenFst library for the implementation of most WFST algorithms (Allauzen et al., 2007; Riley et al., 2009). The overall dependency diagram is shown on the left-hand side of Figure 5 (the minimal dependency on Thrax is indicated by a dotted arrow). At build time, Bazel pulls in these dependencies remotely from their respective repositories. Figure 6 presents the sequence of steps to compile the transducers, including downloading the repository (line 2), compiling the library and its artifacts (line 5) and running the unit tests (line 7). The artifacts are compiled by Bazel using Pynini and consist of finite-state archive (FAR) files that contain collections of WFSTs (Roark et al., 2012). For each of the four Brahmic script operations we generate two FAR files: one for WFSTs over the byte alphabet, and another over the Unicode code point alphabet. Each FAR file contains ten script-specific transducers whose names correspond to the uppercase ISO 15924 script codes. Since the transliteration operation is bidirectional, the name of each script-specific transliteration transducer has the prefix FROM_ for the native-to-Latin direction, and TO_ for the inverse. The numbers of states and arcs of the resulting transliteration (ℐ), NFC and visual normalization transducers and well-formedness acceptors for each script and alphabet type are shown in Table 4.

Offline and Online Usage
Once the transducers are compiled, they can be applied offline to input files using the rewrite-tester tool provided by Thrax, as shown in lines 8-13 of the example in Figure 6, where the visual normalization transducer for Kannada that resides in the visual_norm.far archive is applied to the words in the input file words.txt.

We provide lightweight runtime interfaces for both Python and C++, with their dependencies shown in the center and on the right-hand side of Figure 5, respectively. The Python interface is provided via several wrappers around the pynini.Fst abstraction, with a simple example shown in Figure 7. In addition to performing simple operations on individual strings, more WFST-specific operations, such as transducer composition, are provided by Pynini. The C++ interface is provided by the Grammar helper class, shown in Figure 8, which includes the necessary methods for initializing the WFSTs and performing rewrites (for transducers) and acceptance tests (for acceptors). In addition, many more operations on WFSTs are available through the OpenFst library, if required.

To gauge how frequently normalization applies to real text in these scripts, we normalized publicly available corpora and measured how frequently words in the samples were modified. The Dakshina dataset (Roark et al., 2020) includes (among other things) collections of monolingual Wikipedia sentences in 12 South Asian languages, 10 of which use Brahmic scripts. We applied visual normalization to the training partitions of the collections in these 10 languages, and Table 5 presents the percentage of both types and tokens that were changed by the normalization. Malayalam is the language with the highest percentage of both types and tokens changed by visual normalization, largely due to frequent conversion to chillu letters from alternative encodings. For example, the relatively frequent word തെന്റ ("yours") is normalized to the encoding with the chillu letter ൻ instead of ന.
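The type and token change percentages can be computed with a few lines of Python given any word-level normalizer; this is a sketch of the measurement, not the library's own tooling:

```python
def change_rates(tokens, normalize):
    """Fractions of tokens and of types altered by the `normalize` function."""
    changed_tokens = sum(1 for w in tokens if normalize(w) != w)
    types = set(tokens)
    changed_types = sum(1 for w in types if normalize(w) != w)
    return changed_types / len(types), changed_tokens / len(tokens)
```

Here `tokens` would be the whitespace-tokenized training partition and `normalize` the visual normalization rewrite for the relevant script.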

Conclusion and Future Work
We presented finite-state utilities for processing the major Brahmic scripts. The finite-state transducer formalism provides an efficient and scalable framework for expressing Brahmic script operations and is suitable for many NLP applications, such as those reported in Kumar et al. (2020) and Kakwani et al. (2020), which may benefit from the reduction in "noise" present in unnormalized text. In the future, we will continue to improve support for existing scripts and extend our work to other Brahmic scripts.