A Large-scale Evaluation of Neural Machine Transliteration for Indic Languages

We take up the task of large-scale evaluation of neural machine transliteration between English and Indic languages, with a focus on multilingual transliteration to utilize the orthographic similarity between Indian languages. We create a corpus of 600K word pairs mined from parallel translation corpora and monolingual corpora, the largest transliteration corpus for Indian languages mined from public sources. We perform a detailed analysis of multilingual transliteration and propose an improved multilingual training recipe for Indic languages. We analyze various factors affecting transliteration quality, such as language family, transliteration direction and word origin.


Introduction
Transliteration is an essential technology for multilingual and cross-lingual capabilities in NLP applications, e.g., to handle named entities and to support cross-script input methods. Transliteration between English and Indic languages is important since English is widely used in the Indian subcontinent. Indic languages are written in different scripts from various writing systems. We focus on languages using scripts derived from the ancient Brahmi script. Their character sets are very different from the Latin script, making transliteration non-trivial.
These scripts are abugida scripts, where the basic unit is the akshar, which consists of one or more consonants along with a vowel diacritic (Daniels and Bright, 1996). They exhibit a high degree of grapheme-to-phoneme correspondence. There is a large overlap in the logical character sets of these scripts, though the visual appearance of the characters varies. The languages utilizing these scripts are said to exhibit orthographic similarity on account of various shared characteristics (Kunchukuttan et al., 2018a).
We undertake a systematic, large-scale evaluation of neural machine transliteration for 10 major Indic languages from 2 major language families (Indo-Aryan and Dravidian), spoken by more than a billion speakers. Other than BrahmiNet (Kunchukuttan et al., 2015) and Dakshina (Roark et al., 2020), no previous work has explored a wide range of Indic languages; Dakshina only explores transliteration into Indic languages. Our major contributions are:
• For a large-scale evaluation, we mine 600K transliteration pairs across 10 languages from publicly available parallel and monolingual sources. This is much larger than existing corpora like MSR-NEWS (Banchs et al., 2015), BrahmiNet (Kunchukuttan et al., 2015), Dakshina (Roark et al., 2020) and other small datasets (Banchs et al., 2015; Kunchukuttan et al., 2018b; Gupta et al., 2012; Khapra et al., 2014). The BrahmiNet and Dakshina datasets span multiple languages; BrahmiNet is small and Dakshina by design consists mostly of Indian origin words.
• From the mined corpus, we create a high-quality, manually validated test set annotated with foreign and Indian origin words.
• We propose various improvements to the multilingual transliteration system proposed by Kunchukuttan et al. (2018a) for Indian languages, and suggest a recipe for building multilingual transliteration systems for Indic languages.
• We present an evaluation of transliteration systems according to various factors like language family, word origin and transliteration direction.

Mining transliteration corpus
This section explains our transliteration mining methods (from parallel and monolingual corpora) and presents an analysis of the mined corpus. We mine transliteration corpora from English to 10 Indic languages.

Mining from Parallel Corpora

We use the transliteration mining approach of Sajjad et al. (2012) to mine transliteration pairs from the parallel translation corpora, using the default settings.

Mining from Monolingual Corpora
Monolingual text corpora often contain words borrowed from other languages (particularly English). We mine such transliteration pairs using only the vocabularies of the source and target languages.

Method. We first train initial transliteration models using available data in both directions ($L_e \rightarrow L_x$, $L_x \rightarrow L_e$) and build vocabularies for both languages ($L_e$, $L_x$). Given words in $L_e$, we identify the most promising transliteration candidates from $L_x$ and then rescore these candidates. The scoring is based on the edit distance between Double Metaphone representations of the words, which we found works well in practice. We consider scores in $L_e$ as well as $L_x$. We use ITRANS conversion from Indic scripts to Latin in order to be able to compute Double Metaphone representations on the Indic language side. Note that the phonetic nature of Indic scripts enables a conversion from Indic scripts to Double Metaphone that is sufficient for transliteration mining. Thus, the score for a candidate pair is $s(e, x) = E(e, T_{XE}(x)) + E(x, T_{EX}(e))$, where $E$ is the edit distance function and $T_{XY}$ denotes transliteration from $X$ to $Y$.
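The following is a minimal sketch of this scoring step, assuming the `metaphone` PyPI package for Double Metaphone keys; `translit_xe` and `translit_ex` are hypothetical stand-ins for the initial transliteration models, and the Indic word is assumed to be already romanized via ITRANS.

```python
# Minimal sketch of the candidate rescoring function s(e, x).
# Assumptions: the `metaphone` package (pip install Metaphone) provides
# doublemetaphone(); translit_xe/translit_ex stand in for the initial
# transliteration models; x_rom is the Indic word in ITRANS romanization.
from metaphone import doublemetaphone

def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via a one-row dynamic program."""
    dp = list(range(len(b) + 1))
    for i in range(1, len(a) + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, len(b) + 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,                      # deletion
                        dp[j - 1] + 1,                  # insertion
                        prev + (a[i - 1] != b[j - 1]))  # substitution
            prev = cur
    return dp[-1]

def dm(word: str) -> str:
    """Primary Double Metaphone key of a romanized word."""
    return doublemetaphone(word)[0]

def score(e: str, x_rom: str, translit_xe, translit_ex) -> int:
    """s(e, x) = E(e, T_XE(x)) + E(x, T_EX(e)), with E computed over
    Double Metaphone keys of both sides. Lower scores are better."""
    return (edit_distance(dm(e), dm(translit_xe(x_rom))) +
            edit_distance(dm(x_rom), dm(translit_ex(e))))
```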

Characteristics of the Mined Corpora
Corpora Statistics. Across 10 languages, we mined ~373K and ~339K transliteration pairs from the parallel translation and monolingual corpora respectively. The final training set of 606K word pairs was created after de-duplication and creation of train, dev and test splits (see Table 1 for a summary of the mined corpus). We estimate that the training set has 55% non-Indian origin words and 45% Indian origin words.

Quality of the mined corpus. We evaluated the quality of mined transliterations via crowdsourcing. We used an internal, managed crowdsourcing platform to validate the test sets and retained the transliteration pairs judged as correct in the final test set. The test set for every language had transliterations for 1500 English words and all their mined transliterations. This manual evaluation also gave us an estimate of the transliteration mining quality. We asked native-speaker judges for each language to report whether a pair is a transliteration or not. Our guidelines specified that a pair should be marked as valid if it is phonetically equivalent and the spellings are canonical. In case no canonical spelling exists, the judges could mark pairs based on phonetic equivalence alone. To control for quality, we used 3 judges per pair and established the correctness of a transliteration pair by majority voting. We added honeypot pairs to the tasks to filter out judges spamming our task. Table 1 also shows the transliteration mining accuracies (average accuracy of 84.18%). An analysis of the errors revealed that an overwhelming majority involved wrong/missing/extra inflections (plurals and/or Indic case-markers). Such word pairs are partial transliterations, which are still useful for learning transliteration models.

Test Set Creation. The test and dev sets were created by selecting 1500 English words each that are common across all language corpora, along with their transliterations. We ensure that the test and dev sets do not have any overlap with the training set across languages. The test sets were verified via crowdsourcing. The test set contains 928 foreign origin words and 572 Indian origin words.

Study of orthographic similarity. Following Kunchukuttan and Bhattacharyya (2020), we estimate the orthographic similarity between languages using the n-way parallel test set. For every language pair, it is the average Longest Common Subsequence Ratio (LCSR) (Melamed, 1995) between word pairs in the test set (see Figure 1), and it follows linguistic genealogy. Tamil and Malayalam are the most divergent from other languages. Punjabi is also divergent from other languages, possibly on account of: (a) some of its special characters like tippi and addak, (b) little use of conjunct consonants, unlike other Indian languages.
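For concreteness, here is a minimal sketch of this similarity estimate, assuming the word pairs have already been mapped to a common script so that characters are directly comparable.

```python
# Minimal sketch: average LCSR (Melamed, 1995) over an n-way parallel
# test set, as used here to estimate orthographic similarity between a
# pair of languages. Assumes non-empty words in a common script.
def lcs_len(a: str, b: str) -> int:
    """Longest common subsequence length, O(|a|*|b|) dynamic program."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, ca in enumerate(a, 1):
        for j, cb in enumerate(b, 1):
            dp[i][j] = (dp[i - 1][j - 1] + 1 if ca == cb
                        else max(dp[i - 1][j], dp[i][j - 1]))
    return dp[len(a)][len(b)]

def lcsr(a: str, b: str) -> float:
    """LCSR: LCS length normalized by the longer word's length."""
    return lcs_len(a, b) / max(len(a), len(b))

def orthographic_similarity(word_pairs) -> float:
    """Average LCSR over aligned word pairs of two languages."""
    return sum(lcsr(a, b) for a, b in word_pairs) / len(word_pairs)
```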

Analysis: Multilingual Transliteration
We study multilingual transliteration models with the intent of identifying factors that improve multilingual models. First, we describe our baseline multilingual model and then introduce different variants to improve it.

Baseline Multilingual model (Kunchukuttan et al., 2018a). It is a character-level, attention-based encoder-decoder model with all the model components shared amongst all the languages. We train joint EX (multi-target, English to Indian languages) and XE (multi-source, Indian languages to English) models separately. For EX models, we append a special target language token to the input sequence (Johnson et al., 2017).

Language Partitioning. To understand the role of orthographic similarity, we investigate two language groupings: (a) all the Indic languages are jointly trained, (b) Indo-Aryan and Dravidian languages are trained separately.

Vocabulary. Indic languages use a variety of scripts with a high overlap in the logical character set, but the characters are assigned unique codepoints in Unicode. We investigate if transfer learning works better with a combined vocabulary, obtained by mapping logically equivalent characters across scripts. We use the IndicNLP Library (Kunchukuttan, 2020) to map all Indic scripts to the Devanagari script, thus combining the vocabularies of all languages. We experiment with two configurations: (a) disjoint vocabularies (i.e., different scripts), (b) combined vocabularies (i.e., same script). Combining the vocabularies reduces the vocabulary size significantly as the number of scripts reduces from 9 to 1.

Source language tag. In spite of the high degree of orthographic similarity between Indian languages, there are a few cases of language-specific variation. For instance, the Malayalam script overloads a few characters with multiple sounds, the Bengali pronunciation of the aa vowel differs, etc. To make the model sensitive to these language-specific variations in XE models, we add a special source language token to the input sequence. A sketch of this input construction is shown below.
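The following sketch illustrates how the joint models' inputs can be constructed; the tag format and the prepended position of the source tag are assumptions of this sketch, since the text only states that a target language token is appended for EX models and a source language token is added for XE models.

```python
# Illustrative input construction for the joint character-level models.
# The <hi>-style tag format and the prepended position of the source tag
# are assumptions; the target tag is appended per the description above.

def to_chars(word: str) -> str:
    """Character-level tokenization: one token per character."""
    return " ".join(word)

def ex_input(english_word: str, tgt_lang: str) -> str:
    """EX (multi-target): English characters plus an appended target
    language token (Johnson et al., 2017)."""
    return f"{to_chars(english_word)} <{tgt_lang}>"

def xe_input(indic_word: str, src_lang: str) -> str:
    """XE (multi-source): a source language token lets the shared model
    handle language-specific variations (e.g., Malayalam's overloaded
    characters) even when all inputs share one script."""
    return f"<{src_lang}> {to_chars(indic_word)}"

print(ex_input("india", "hi"))  # i n d i a <hi>
print(xe_input("भारत", "hi"))   # <hi> भ ा र त
```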
Addressing divergence between Tamil and other Indic scripts. The Tamil script is highly underspecified and has fewer characters than sounds in the English language (unlike other Indic scripts). When training a multilingual model, there is an inconsistency in the learnt mappings between Tamil and other Indic scripts. We address this issue by training a Tamil-specific multilingual model for Dravidian languages, where all characters from other scripts are mapped to the closest character in the Tamil script via deterministic rules using the IndicNLP library.
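Here is a minimal sketch of the two script-unification steps using the Indic NLP Library's rule-based script converter; the API usage reflects our reading of the library and should be verified against the installed version.

```python
# Script unification via the Indic NLP Library (Kunchukuttan, 2020).
# UnicodeIndicTransliterator maps logically equivalent characters across
# Brahmi-derived scripts; whether the Tamil mapping exactly collapses to
# the closest Tamil character should be checked against the library docs.
from indicnlp.transliterate.unicode_transliterate import UnicodeIndicTransliterator

def to_devanagari(word: str, lang: str) -> str:
    """Combined-vocabulary setting: render any Indic word in Devanagari."""
    return UnicodeIndicTransliterator.transliterate(word, lang, "hi")

def to_tamil(word: str, lang: str) -> str:
    """Tamil-specific multilingual model: map other Dravidian-language
    words into the (underspecified) Tamil script."""
    return UnicodeIndicTransliterator.transliterate(word, lang, "ta")

print(to_devanagari("ಕನ್ನಡ", "kn"))  # Kannada word rendered in Devanagari
```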

Experimental Setup
We use Marian (Junczys-Dowmunt et al., 2018) to train our transliteration models. We use 128 LSTM units for the encoder and decoder (1 layer for bilingual models and 2 layers for multilingual models). The encoder uses a bidirectional LSTM. The input embeddings are also 128 units in size. These hyperparameters were decided based on a parameter sweep on the dev set. We use a batch size of 100 sequences and early stopping with patience=100. We use beam search for decoding (beam size=4).
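For reference, here is a sketch of these settings expressed as Marian-style options; the option names follow Marian's command-line interface to the best of our knowledge and are an assumption of this sketch, not a verbatim configuration from the experiments.

```python
# Assumed Marian options mirroring the stated hyperparameters (verify
# option names against the installed Marian version before use).
marian_options = {
    "type": "s2s",                # attentional encoder-decoder
    "dim-emb": 128,               # input embedding size
    "dim-rnn": 128,               # LSTM units in encoder/decoder
    "enc-type": "bidirectional",  # bidirectional LSTM encoder
    "enc-depth": 2,               # 2 layers (multilingual); 1 for bilingual
    "dec-depth": 2,
    "mini-batch": 100,            # batch size of 100 sequences
    "early-stopping": 100,        # patience = 100
}
decoder_options = {"beam-size": 4}  # beam search at decoding time

# e.g., assemble a training command line:
cmd = ["marian"] + [f"--{k} {v}" for k, v in marian_options.items()]
print(" ".join(cmd))
```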

Conclusion
We present a study of transliteration between English and Indic languages. We mine a 600K parallel transliteration corpus with good coverage of Indian and non-Indian origin words, as well as create a manually validated test set. We recommend the following recipe for Indic multilingual transliteration: (a) all training data in the same script, (b) a source language token for XE models, and (c) separate handling for the underspecified Tamil script.

Table 3 shows the language-wise parallel corpora statistics and Table 4 lists the various parallel corpora used for transliteration mining. Table 3 also shows the vocabulary sizes of the monolingual corpora used for each language in the monolingual mining approach.