Aksharantar: Open Indic-language Transliteration datasets and models for the Next Billion Users

Transliteration is very important in the Indian language context due to the usage of multiple scripts and the widespread use of romanized inputs. However, few training and evaluation sets are publicly available. We introduce Aksharantar, the largest publicly available transliteration dataset for Indian languages, created by mining monolingual and parallel corpora, as well as by collecting data from human annotators. The dataset contains 26 million transliteration pairs for 21 Indic languages from 3 language families using 12 scripts. Aksharantar is 21 times larger than existing datasets and is the first publicly available dataset for 7 languages and 1 language family. We also introduce the Aksharantar testset comprising 103k word pairs spanning 19 languages that enables a fine-grained analysis of transliteration models on native-origin words, foreign words, frequent words, and rare words. Using the training set, we trained IndicXlit, a multilingual transliteration model that improves accuracy by 15% on the Dakshina test set and establishes strong baselines on the Aksharantar testset introduced in this work. The models, mining scripts, transliteration guidelines, and datasets are available at https://github.com/AI4Bharat/IndicXlit under open-source licenses. We hope the availability of these large-scale, open resources will spur innovation for Indic language transliteration and downstream applications.


Introduction
The Indian subcontinent is home to diverse languages across four major language families written in multiple scripts (Daniels and Bright, 1996). In various settings such as instant messaging, web search, and social media, these languages are commonly romanized owing to users' familiarity with the input tools for the Roman script. Often, there is a large diversity in how words are romanized: for instance, even the short word मैं (I) can be romanized in multiple ways (main, mai, mein, mei), which overlap with the ways of romanizing another short word, में (in). This widespread usage of romanization and lack of standardization implies that accurate transliteration models form a critical component in the NLP stack for Indian languages used by over 735 million Internet users (KPMG and Google, 2017). Further, accurate transliteration models have been shown to improve machine translation (Durrani et al., 2014b), romanized language models (Khanuja et al., 2021), NER (Klementiev and Roth, 2006), and script unification for multilingual models (Muller et al., 2021).
Given its importance, transliteration for Indian languages has received considerable research focus (Kumaran et al., 2010; Chen et al., 2018b; Kunchukuttan et al., 2018a; Roark et al., 2020). However, the state-of-the-art (SOTA) results as reported in Roark et al. (2020) on the Roman script to Indian script transliteration task have relatively low top-1 accuracy values, ranging from 33.2% to 67.6% with an average of 51.8% across 12 languages. We believe that the low accuracy is a result of limited training datasets that are not representative of the diverse variations of romanization. We aim to address this open challenge in a manner similar to recent advances for low-resource languages in machine translation (Ramesh et al., 2022; Costa-jussà et al., 2022) and speech recognition (Bhogale et al., 2022; Radford et al., 2022), namely by mining massive training corpora from web-scale data.
Our first major contribution is the Aksharantar dataset (Aksharantar means 'transliteration' in Sanskrit), which is 21 times larger than existing publicly available datasets and includes 7 new languages (Assamese, Bodo, Kashmiri, Manipuri, Nepali, Oriya, Sanskrit) and 1 new language family (Sino-Tibetan) for which no transliteration corpora were available previously. The parallel transliteration corpora have been mined from Wikidata (Vrandečić and Krötzsch, 2014), the Samanantar parallel translation corpora (Ramesh et al., 2022), and the IndicCorp monolingual corpora (Doddapaneni et al., 2022), along with a compilation of existing transliteration corpora. In addition, the corpora contain a diverse set of native language words that have been transliterated manually to ensure coverage of native words of different lengths, different n-gram characteristics, and infrequent words: characteristics that mined corpora lack.
Our next major contribution is the Aksharantar testset, an evaluation benchmark for romanized transliteration. Table 1 shows the statistics of the benchmark set, which comprises 103K word pairs spanning 19 languages. The benchmark contains (a) native language words with diverse n-gram characteristics, and (b) named entities of Indic and foreign origin spanning different entity categories. Most publicly available testsets focus on named entities (Chen et al., 2018a), but the representation of native words is important for input tools. While the Dakshina testset (Roark et al., 2020) represents native words, it includes only the most frequent words in Wikipedia, which are not representative of all native words. Our testset ensures greater diversity in native language word coverage. Our experiments confirm that our testset is indeed more diverse and challenging, making it more suitable for evaluating transliteration models. It is known that transliteration of English-origin and native-origin words exhibits distinct behavior (Ahmed et al., 2011; Khapra and Bhattacharyya, 2009); hence we create testsets for both word classes to enable fine-grained evaluation of transliteration models.
Our next contribution is a multilingual model for romanized to native script transliteration for Indian languages. Our model gives SOTA performance on the Dakshina benchmark (Roark et al., 2020) for all 12 languages in common with Dakshina, showing an improvement of 15% in accuracy over previous results. It also establishes a strong baseline on the Aksharantar testset.
Our final contribution is a detailed analysis of the model's performance on the rich Aksharantar testset. Ablation studies indicate that the increased data size as well as the manually collected diverse dataset are major contributors to the improved performance. The fine-grained testsets reveal named entities and low-frequency words as areas for improving transliteration models.
The code and models are available under an MIT license, the Aksharantar benchmark and all data we created manually are available under the CC-BY license, whereas all the mined data is available under the CC0 license.

Related Work
Existing Indic Transliteration Corpora. Very few transliteration corpora exist with Indian language-Roman script transliterations. Refer to Appendix A for a detailed listing and statistics. The most significant among these are the Dakshina dataset (Roark et al., 2020) and the BrahmiNet corpus (Kunchukuttan et al., 2015). Dakshina contains native language words sourced from Wikipedia and their romanizations created by native speakers; unlike Aksharantar, it mostly consists of commonly used, shorter Indic language words.

Mining Transliteration Pairs. Irvine et al. (2010) mine name pairs from Wikipedia using interlanguage links between pages in multiple languages (similar to our use of Wikidata titles' multilingual information). Some approaches mine transliteration pairs from comparable document pairs based on a variety of heuristic signals (Klementiev and Roth, 2006; Udupa et al., 2008, 2009). Sajjad et al. (2012) proposed a generative model for efficient unsupervised/semi-supervised mining of transliteration pairs. We employ the unsupervised mining method proposed by Sajjad et al. (2012) to mine transliteration pairs from parallel corpora. Richardson et al. (2013) mine transliteration pairs from monolingual corpora by transliterating the vocabulary of one language using a baseline system and then filtering the generated data.

Multilingual Models. Multilingual models have been shown to improve performance on low-resource languages for many NLP tasks by transfer from high-resource languages and by aligning representations of multiple languages in the same vector space (Johnson et al., 2017; Conneau et al., 2020). The transfer could be between genetically related languages (Nguyen and Chiang, 2017) or contact-related languages (Goyal et al., 2020). Multilingual models have been explored successfully for

Mining Transliteration pairs
We explore several sources for mining transliteration pairs. First, we compile existing publicly available transliteration corpora listed in Appendix A. Then, we explore large-scale mining of transliterations from Wikidata, parallel translation corpora, and monolingual corpora.

Mining from Wikidata
Wikidata (Vrandečić and Krötzsch, 2014) is a multilingual, structured database containing items that are entities, things, concepts, or terms. Each entity has labels that are common names of the item in multiple languages. We restrict ourselves to person and location entities since their labels will be transliterations. We extract such English-Indian language label pairs, creating transliteration pairs. Appendix B provides more details on Wikidata mining. The candidate pairs are filtered using a transliteration validator described in Section 4.3.

Mining from Translation Corpora
Parallel sentences can contain transliteration pairs in the form of named entities, loan words, and cognates. To mine these transliteration pairs, we first use an off-the-shelf word aligner, GIZA++ (Och and Ney, 2003), with the default settings to learn word alignments between parallel sentences. These aligned words can be either translations or transliterations. Then, we use the unsupervised method suggested by Sajjad et al. (2012), as implemented in the transliteration module (Durrani et al., 2014a) of Moses (Koehn et al., 2007), to mine transliteration pairs from these word alignments by distinguishing transliterations from non-transliterations. Please refer to Appendix C for more details. Using this approach, we mine transliteration pairs from the Samanantar parallel corpora (v0.3) (Ramesh et al., 2022), the largest publicly available parallel corpora for Indian languages when we started this project. The above-mentioned process can result in some wrong transliteration pairs being mined (see Table 2). To filter out such pairs, we use a rule-based transliteration validator (described in Section 4.3) which checks the correctness of consonant alignment between transliteration pairs and works well for the kinds of erroneous pairs mined by the above-mentioned method.

Mining from Monolingual Corpora
Monolingual corpora often contain borrowed words from other languages (particularly English). We mine transliteration pairs between English and Indian languages using only a list of words in the source and target languages. We first train multilingual transliteration models with the same setting described in Section 5 using available data (data from existing sources and data mined from parallel translation corpora) in both directions (M_ex: en → Indic and M_xe: Indic → en). We use the IndicCorp dataset (Doddapaneni et al., 2022) to create a list of words for English and each Indic language (L_x). Given a word w_x in L_x, we generate its transliteration (w'_e) using the M_xe model (e.g., जर्मिनेट → germinat). We find similar English words (w_e) from the English word list such that there exist at least three common character 4-grams between w'_e and w_e (e.g., germinated, germinate, germinating, germinates). The candidate pair (w_x, w_e) is scored using the models in both directions:

s(w_x, w_e) = 1/2 [ M_xe(w_x, w_e) + M_ex(w_e, w_x) ]

We retain candidate transliteration pairs with score (average character-level log probability in both directions) greater than a threshold t = -0.35, which was set after our analysis of transliteration pairs across languages.
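The 4-gram overlap filter and bidirectional scoring described above can be sketched as follows. The callables `hypothesize`, `score_xe`, and `score_ex` stand in for the M_xe/M_ex transliteration models (neural seq2seq models in practice); this is a minimal sketch under those assumptions, not the authors' mining code.

```python
def char_ngrams(word, n=4):
    """Set of character n-grams of a word."""
    return {word[i:i + n] for i in range(len(word) - n + 1)}

def shares_ngrams(cand, hyp, n=4, k=3):
    """True if a candidate shares at least k character n-grams with the
    model's hypothesized transliteration."""
    return len(char_ngrams(cand, n) & char_ngrams(hyp, n)) >= k

def pair_score(score_xe, score_ex, w_x, w_e):
    """Average of the two directional character-level log probabilities:
    s(w_x, w_e) = 0.5 * (M_xe(w_x, w_e) + M_ex(w_e, w_x))."""
    return 0.5 * (score_xe(w_x, w_e) + score_ex(w_e, w_x))

def mine_pairs(indic_words, english_words, hypothesize, score_xe, score_ex,
               threshold=-0.35):
    """Keep (w_x, w_e) candidates that pass the n-gram filter and whose
    averaged bidirectional score clears the threshold."""
    kept = []
    for w_x in indic_words:
        hyp = hypothesize(w_x)  # top transliteration from the M_xe model
        for w_e in english_words:
            if shares_ngrams(w_e, hyp):
                if pair_score(score_xe, score_ex, w_x, w_e) >= threshold:
                    kept.append((w_x, w_e))
    return kept
```

For example, with the hypothesis "germinat", the candidate "germinated" shares five character 4-grams and passes the filter, while an unrelated word like "garden" shares none.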

Quality of the mined data
To validate the quality of the mined corpora, we perform a human evaluation on a subset of mined pairs. We randomly sampled 500 mined Indic-Roman script pairs, drawn equally from the IndicCorp and Samanantar corpora, in 12 languages. Two passes of validation by different language validators were performed on this data. Annotators were asked to mark the pairs which were valid transliterations.
The accuracy of mining is defined as the percentage of valid pairs in the manually judged subset. We achieved a minimum accuracy of 80% per language and an average accuracy of 89% across all 12 languages. The results of the human evaluation, summarised in Table 3, show that data mined from Samanantar and IndicCorp has high accuracy.
We analyzed the pairs judged as invalid and found that they included the following errors. Vowel errors: a/e being added incorrectly at the end of transliterations, missing vowels, and wrong usage of vowels (e.g., अमिताभ → Amtabha [missing 'i' after 'm' and added 'a' at the end]). Suffix errors: suffixes wrongly transliterated or missed altogether, leading to partial transliterations (e.g., रोनाल्डोही → Ronaldo, टोकिया → Tokyo).
We found that most erroneous pairs were partial transliterations, which introduce limited albeit useful noise into the training data. The results of the human judgment and qualitative analysis confirm the high quality of the mined transliteration pairs, which makes them useful for training transliteration models.

Manual Data Collection
Collating existing sources and mining transliterations from web sources is insufficient for building a representative transliteration dataset because (i) mined corpora are predominantly composed of named entities, (ii) romanized native words in the Dakshina dataset only cover frequent words occurring on Wikipedia and may not ensure sufficient word diversity to account for various transliteration phenomena (since Wikipedia for most Indic languages is small), (iii) mined data only covers 12 languages for which sufficient monolingual/parallel corpora are available and which have high grapheme-to-phoneme correspondence making mining feasible, and (iv) we still need a diverse and accurate standard testset for all Indic languages. To address these needs, we collect transliteration pairs in 19 Indic languages from trained annotators across India. This section describes the data collection process, wherein (i) Indic words to be romanized are selected to ensure diversity and coverage across languages, and (ii) high-quality, manually curated romanizations for these Indic words are collected at scale by setting up a systematic process to ensure quality control and annotator productivity on a digital platform. We collect multiple romanizations for each native script word to capture the variations in romanization of native words, since words in Indic scripts have a more standardized orthography. Our data collection protocol ensures that we can collect diverse romanizations to train our transliteration models.

Sourcing Indic words
Words for manual transliteration in 19 languages were sourced from publicly available datasets. We use IndicCorp (Doddapaneni et al., 2022) to source Indic language words for 11 languages (Assamese, Bengali, Gujarati, Hindi, Kannada, Malayalam, Marathi, Oriya, Punjabi, Tamil, and Telugu). For 6 languages (Maithili, Konkani, Bodo, Nepali, Kashmiri, and Urdu) we use the LDC-IL corpus (Choudhary, 2021). We collect Sanskrit words from religious scriptures such as the Mahabharata (Sukthankar, 2017), while for Manipuri we use Wikipedia. We ensure that these source words are not already covered in the sources mentioned in Section 3. We select native script words for manual transliteration with the goal of ensuring coverage of words of varying length, diverse n-grams, common as well as infrequent words, and foreign origin words. We use a combination of the following methods for selecting diverse source words. Most frequent words: to account for the most frequent words in a language, we select the top 5000 words for each language.

N-gram Diversity:
We train a 4-gram character LM over all words for each language using KenLM with Kneser-Ney smoothing (Heafield, 2011; Heafield et al., 2013), whose probabilities are a good indicator of 4-gram frequencies in a given word. We compute log probability scores (normalized by word length and scaled to the 0-1 range) for each candidate word using the character LM. The words are then sharded into bins corresponding to the 10 probability deciles. Words are uniformly sampled from each bin, ensuring n-gram diversity in source words and complementing mined corpora, which are mostly composed of named entities and head inputs. We sampled a total of 10,000 words per language using this method.

Named Entities: We sampled 2000 named entities in English spanning 3 broad categories: names, locations, and organizations, covering Indian and foreign origin words. We sourced Indian and foreign personal names and locations by randomly sampling words from collections on websites dedicated to such lists. Organization names are sourced from the stock market list of 1600+ companies listed on the NSE. These 2000 named entities consist of 800 names (400 each of Indian and foreign origin), 800 locations (400 each of Indian and foreign origin), and 400 Indian organizations.
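The decile-binning step described under N-gram Diversity can be sketched as follows, assuming a `charlm_logprob` callable that returns a word's log probability under the 4-gram character LM (a KenLM model in the paper). This is an illustrative sketch of the sampling scheme, not the authors' exact pipeline.

```python
import random

def normalized_scores(words, charlm_logprob):
    """Length-normalized LM log probabilities, min-max scaled to [0, 1]."""
    raw = {w: charlm_logprob(w) / len(w) for w in words}
    lo, hi = min(raw.values()), max(raw.values())
    return {w: (s - lo) / (hi - lo) if hi > lo else 0.5
            for w, s in raw.items()}

def sample_by_decile(words, charlm_logprob, per_bin, seed=0):
    """Shard words into 10 probability deciles and sample uniformly
    from each bin, ensuring n-gram diversity in the selected words."""
    scores = normalized_scores(words, charlm_logprob)
    bins = [[] for _ in range(10)]
    for w, s in scores.items():
        bins[min(int(s * 10), 9)].append(w)  # clamp score 1.0 into bin 9
    rng = random.Random(seed)
    sampled = []
    for b in bins:
        rng.shuffle(b)
        sampled.extend(b[:per_bin])
    return sampled
```

Sampling uniformly over deciles rather than over words deliberately over-represents low-probability (rare n-gram) words relative to their corpus frequency.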

Annotation Process and QC
We collect transliterations via a two-step process akin to a maker-checker process. A human annotator creates multiple romanized variants for a native word. To aid transliterators, we provide an automatic rule-based transliteration validator, which flags potentially wrong transliterations and helps the transliterator correct mistakes made while entering word variants. The correctness of variants is checked by a human validator, who also has the freedom to enter unique word variants. Through multiple pilot projects, we studied different annotation styles, identified common annotation errors, and devised a set of basic instructions. Annotators were free to enter all common variants while following the instructions as much as possible. Due to budget constraints, the maximum number of variants is capped at 4 per transliterator and 2 per validator. We keep all variants in the Roman script for a given Indic word and create (Roman, Indic) script pairs for all of the variants. More details regarding annotators, annotation instructions, etc. are described in Appendix D.

Automatic Validation
To aid transliterators, we provide an automatic rule-based transliteration validator. The tool flags potentially wrong transliterations, helping the transliterator correct mistakes made while entering word variants. Typically, we found that the transliteration validator helped identify typographical errors and other mistypings and ensured consistency in transliterations. Note that this automatic validator only serves as a guide to transliterators, who can override its checks at their discretion. The transliteration validator is based on the Transliteration Equivalence algorithm for English (Roman script)-Hindi described in Khapra et al. (2014), which checks the equivalence of the consonant mappings in a potential transliteration pair. More details are described in Appendix E. In total, we collect 554k transliteration pairs across 19 languages, which are split into 451k training pairs and 103k test pairs.
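A heavily simplified sketch of the consonant-equivalence idea behind such a validator is shown below. The consonant map here is a tiny hypothetical fragment for Devanagari; the actual tool follows the full Transliteration Equivalence tables of Khapra et al. (2014).

```python
# Hypothetical, heavily simplified consonant map (Devanagari -> Roman);
# the real validator uses full Transliteration Equivalence tables.
CONSONANT_MAP = {
    "म": {"m"}, "त": {"t", "th"}, "भ": {"bh", "b"},
    "र": {"r"}, "न": {"n"}, "ल": {"l"}, "ड": {"d"},
}
ROMAN_VOWELS = set("aeiou")
DEV_VOWEL_SIGNS = set("ािीुूेैोौंःँ्")

def consonant_skeleton_dev(word):
    # Keep only consonants we know how to map; drop vowel signs.
    return [ch for ch in word
            if ch not in DEV_VOWEL_SIGNS and ch in CONSONANT_MAP]

def consonant_skeleton_roman(word):
    return [ch for ch in word.lower() if ch not in ROMAN_VOWELS]

def plausible_pair(dev_word, roman_word):
    """Flag a pair as implausible if the consonant sequences cannot be
    aligned one-to-one under the mapping table (a rough greedy check)."""
    dev_cons = consonant_skeleton_dev(dev_word)
    rom = "".join(consonant_skeleton_roman(roman_word))
    i = 0
    for c in dev_cons:
        matched = False
        # Try longer mappings first so "bh" is preferred over "b".
        for r in sorted(CONSONANT_MAP[c], key=len, reverse=True):
            r_cons = "".join(ch for ch in r if ch not in ROMAN_VOWELS)
            if rom.startswith(r_cons, i):
                i += len(r_cons)
                matched = True
                break
        if not matched:
            return False
    return i == len(rom)
```

Under this check, (मत, mat) and (भरत, bharat) align, while (मत, mar) is flagged because त cannot explain the trailing r.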

IndicXlit: A Multilingual Model for Transliteration
With the Aksharantar training set, we train a transliteration model, IndicXlit, for transliterating romanized Indic language input to the native script. IndicXlit is a single multilingual, multi-script transliteration model that supports 21 Indic languages. We train a joint model since: (a) low-resource languages benefit from transfer learning, (b) previous work shows that multilingual transliteration models are better at generating canonical spellings (Kunchukuttan et al., 2018a), and (c) maintenance is easier for a single model.

Model Architecture. We use a transformer-based encoder-decoder architecture (Vaswani et al., 2017). It is a multilingual character-level transliteration model (Kunchukuttan et al., 2021) in a one-to-many setting, which consumes a romanized character sequence and generates an output character sequence in the Indic language script. The input sequence includes a special target language tag token to specify the target language (Johnson et al., 2017). Model vocabulary, hyper-parameters, and training details are described in Appendix F. The model size is 11 million parameters.
Decoding. We use beam search with beam size 4. In addition, we re-rank the top-4 candidates using a revised score F_c obtained by combining 2 features: (i) a word-level unigram LM score (P_c), and (ii) the transliteration score (character-level log probability) (T_c), as a weighted combination F_c = α T_c + (1 − α) P_c.
We use α = 0.9 based on tuning the parameter on the development set.
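Assuming the fused score is a linear interpolation of the two features (the exact combination used in the original equation is not reproduced here), the re-ranking step can be sketched as:

```python
import math

def rerank(candidates, lm_prob, alpha=0.9):
    """Re-rank beam candidates by F_c = alpha * T_c + (1 - alpha) * P_c,
    where T_c is the model's character-level log probability and P_c is a
    word-level unigram LM log probability (assumed linear interpolation).

    candidates: list of (word, T_c) pairs from beam search.
    lm_prob:    dict mapping word -> unigram probability.
    """
    def fused(cand):
        word, t_c = cand
        p_c = math.log(lm_prob.get(word, 1e-9))  # floor for unseen words
        return alpha * t_c + (1 - alpha) * p_c
    return sorted(candidates, key=fused, reverse=True)
```

Even with a small LM weight (1 − α = 0.1), a much more frequent word can overtake a candidate the transliteration model slightly prefers, which is the intended effect for native-word inputs.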
Table 5 shows the statistics of the train and validation splits used to train IndicXlit.

Analysis of IndicXlit quality
We analyze IndicXlit's transliteration quality on the Dakshina and Aksharantar testsets. We strictly ensure that there is no word overlap between the training and test/validation sets for inference. Note that the testsets considered for overlap computation include the Dakshina testset. We remove a pair (en, t) from the training set if (i) the Roman script word en is present in the romanized validation/test set of any language pair, or (ii) the Indic script word t is present in the Indic language validation/test set of any language pair. We report top-1 word-level accuracy as our primary evaluation metric (Chen et al., 2018a). Additionally, we report top-3 and top-5 accuracies as well as F1-score in Appendix G as secondary evaluation metrics. We observe that the major trends are consistent across all metrics.
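The overlap-removal rule can be sketched as a simple set-based filter (illustrative only; in practice it is applied across the test/validation sets of all language pairs):

```python
def deduplicate_train(train_pairs, test_pairs):
    """Drop a training pair (en, t) if its Roman word appears on the
    Roman side of any test/dev pair, or its Indic word appears on the
    Indic side of any test/dev pair."""
    test_roman = {en for en, _ in test_pairs}
    test_indic = {t for _, t in test_pairs}
    return [(en, t) for en, t in train_pairs
            if en not in test_roman and t not in test_indic]
```

Filtering on either side independently is stricter than removing exact pair matches, which prevents the model from memorizing either the romanization or the native word of any evaluation item.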

Quality on Dakshina testset
We compare IndicXlit with the best reported results on the Dakshina testset (Table 4). Note that the Dakshina testset covers only 12 of the languages that are part of the Aksharantar dataset.
The IndicXlit model substantially improves over the results reported by Roark et al. (2020) on the Dakshina dataset, with a 15% improvement in average accuracy across languages. Since the size of the training data is a major difference between the two models, it is clear that large-scale mined transliteration pairs help to substantially improve transliteration quality. Multilingual training also helps improve transliteration quality. These observations are further supported by the ablation results reported in Section 7. The largest improvements are seen for mar (30.3%) and guj (25.7%), possibly because they are similar to the high-resource hin language, and mar also shares its script with Hindi. The smallest improvements are seen for tam (4.6%) and tel (8.9%).

Table 4: Top-1 accuracies reported on the Dakshina test set. We trained the monolingual and multilingual models on the Dakshina dataset using the same architecture as IndicXlit, so that the impact of the dataset can be isolated.

Quality on Aksharantar testset
We report the results of IndicXlit on the Aksharantar testset in Table 6, particularly looking at the accuracy on various sub-testsets to understand model performance on different categories of words.
Frequent words are easier. The performance on the Dakshina dataset and the AK-Freq dataset, both comprised of frequent words in the language, is similar. The AK-Freq testset has the best performance across all sub-testsets, suggesting that it is the easiest to transliterate. Its words are shorter on average and might be comprised of common n-grams, explaining the good performance.
Words with diverse n-grams are harder. On the other hand, the AK-Uni testset, comprised of uniformly sampled words with diverse n-gram characteristics, is much more challenging, with average accuracy 10 points lower than on the AK-Freq testset. This testset presents a challenging use case for transliteration systems. Lower accuracy on this testset can be attributed to the average length of the words and the rarity of their n-grams.

Named entities are the hardest. Named entity testsets are the most difficult, even though named entities constitute a large fraction of the mined training data. Given the larger grapheme-phoneme mismatch for foreign entities, lower performance on this set is not surprising. While performance on Indian named entities is better than on foreign named entities, their transliteration accuracy is still lower than on the uniformly sampled test set. This is surprising and warrants further investigation.

Some languages are harder. In terms of language-wise accuracy, the lowest-performing languages are those using the Arabic script (urd, kas) or those with less training data (asm, brx, ori).

Re-ranking helps on average.
Unigram re-ranking of the candidates substantially improves transliteration accuracy, by 12% on average across languages (see Table 6 for results). LM re-ranking mostly benefits native language words and high-resource languages with ample monolingual data for training LMs.

Re-ranking doesn't help for named entities. Unigram re-ranking shows limited benefits for named entities. This is not surprising, since named entities might not be well represented in the LM given their rarity. Similarly, low-resource languages with limited monolingual data benefit less from LM re-ranking. Infrequent words thus pose a challenge to the quality of transliteration models.

Error analysis
To understand the errors made by IndicXlit, we analysed the output of the model for 100 randomly sampled words each for Bengali, Gujarati, Hindi, Kannada, Marathi, Punjabi, and Telugu from the Dakshina dataset. The most common errors across languages concern vowels (60%) and similar consonants (25%), while other errors (15%) include gemination, acronyms, contextual ambiguity, valid alternatives, and language-specific errors. Appendix H provides a more detailed discussion of the error analysis with examples.

Ablation Studies
We describe various ablation studies, carried out on the Dakshina testset for 9 languages (Bengali, Gujarati, Hindi, Kannada, Malayalam, Marathi, Punjabi, Tamil, and Telugu), of the design choices of the IndicXlit model described in Section 5. Results of the following research questions are presented in Table 7.

Dataset     ben   guj   hin   kan   mal   mar   pan   tam   tel   avg
All         54.1  58.5  56.6  71.9  57.9  59.9  41.9  61.1  72.0  59.3
No manual   50.9  38.3  54.5  71.1  58.9  60.2  42.9  62.9  71.9  56.9

Table 9: Impact of manually collected pairs (micro-averaged accuracy over all testsets).
Impact of various transliteration corpora sources. We train separate monolingual models for each language. We initially trained a baseline model using only the Dakshina training set, followed by successive addition of transliteration pairs collected/mined from various sources. We observe a consistent increase in transliteration quality as transliteration pairs from various sources are added. In particular, we observe a substantial improvement in performance when we add word pairs mined from the monolingual corpora, IndicCorp, which constitutes the largest component of Aksharantar. The addition of manually collected transliteration pairs does not have an impact on these languages and the Dakshina testset, since IndicCorp already contains sufficient data to model the frequent words that are part of the Dakshina testset. However, as shown in Table 9, we observe that manually collected data improves the micro-averaged transliteration accuracy over Dakshina and all Aksharantar testsets, viz. AK-Freq, AK-Uni, AK-NEF, and AK-NEI. This suggests that manually collected data improves accuracy on the other testset categories. Moreover, manual data is necessary for extremely low-resource languages with no data in the public domain and for bootstrapping transliteration mining efforts.

Impact of Multilingual Models. We see that multilingual models show a slight improvement over monolingual results on the Dakshina benchmark. In another experiment, we compare monolingual and multilingual models (for 18 languages) using all sources (except manually collected datasets) and observe a substantial increase in accuracy for low-resource languages using multilingual models (Table 8). Thus, multilingual models substantially improve performance for low-resource languages, while at least retaining performance on high-resource languages with a single model.

Impact of script unification. Since each Indic script has a unique Unicode codepoint range and the Unicode standard accounts for similarities between Indic scripts, a 1-1 mapping between most characters of different scripts is possible. This can potentially improve transfer learning between languages. We experiment with single-script models, converting characters from all Brahmi-derived scripts to the Devanagari script using the IndicNLP library (Kunchukuttan, 2020). A special language token is added to every input sequence to identify the original Indic language, as described in Section 5. After decoding, the Devanagari script output is converted back to the target language's Indic script using the 1-1 mapping. We observe that single-script and multi-script models have similar performance. Since these models are already trained on an ample amount of data, the single-script model does not provide additional transfer learning benefits beyond multilingual representation learning.
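Because the Brahmi-derived blocks are laid out in parallel 128-codepoint ranges in Unicode, the 1-1 mapping can be approximated by a fixed codepoint offset, as sketched below. This is illustrative only: some codepoints have no counterpart in every script (e.g., Tamil omits several), and the paper uses the IndicNLP library's script conversion rather than this raw offset.

```python
# Start of each Brahmi-derived Unicode block (128 codepoints each).
BLOCK_START = {
    "devanagari": 0x0900, "bengali": 0x0980, "gurmukhi": 0x0A00,
    "gujarati": 0x0A80, "oriya": 0x0B00, "tamil": 0x0B80,
    "telugu": 0x0C00, "kannada": 0x0C80, "malayalam": 0x0D00,
}

def to_devanagari(text, src):
    """Map characters of a Brahmi-derived script onto Devanagari by
    codepoint offset; characters outside the block pass through."""
    start = BLOCK_START[src]
    return "".join(
        chr(ord(ch) - start + 0x0900) if start <= ord(ch) < start + 0x80 else ch
        for ch in text)

def from_devanagari(text, tgt):
    """Inverse mapping: Devanagari back to the target script."""
    start = BLOCK_START[tgt]
    return "".join(
        chr(ord(ch) - 0x0900 + start) if 0x0900 <= ord(ch) < 0x0980 else ch
        for ch in text)
```

For example, Bengali কম maps to Devanagari कम and back, since both blocks place the corresponding consonants at the same offsets.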
Given the small difference and the negligible model size overhead, we opt to use a multi-script model for all Indic languages to simplify the pre-processing of data and the incorporation of scripts such as the Arabic script, which cannot be easily mapped to the Devanagari script.

Impact of language family specific models. We observe that language-family-specific models are slightly better than a pan-Indic model. Given the small difference in quality and the convenience of maintaining and deploying a single model, we choose to train IndicXlit as a pan-Indic model.

Impact of word-level unigram LM re-ranking.
We observe a 12% improvement in accuracy by re-ranking the top-4 candidates. This gain is over and above the 15% gain obtained by using the Aksharantar training set.

Conclusion
In this work, we take a major step toward creating publicly available, open datasets and open-source models for transliteration in Indic languages. We introduce Aksharantar, the largest transliteration parallel corpus for 21 languages, containing 26 million transliteration pairs and covering 20 of the 22 languages listed in the Indian constitution. We also create a diverse, high-quality testset for romanized to Indic script transliteration, covering word pairs with various characteristics and enabling fine-grained analysis of different transliteration use cases. We also build IndicXlit, a transformer-based transliteration model for romanized input to Indic script transliteration. IndicXlit achieves state-of-the-art results on the Dakshina testset. We also provide baseline results on the new Aksharantar testset, along with a qualitative analysis of the model's performance.

Limitations
The benchmark for transliteration for the most part contains clean words (grammatically correct, single script, etc.). Data from the real world might be noisy (ungrammatical, mixed scripts, code-mixed, invalid characters, etc.). A more representative benchmark might be useful for such use cases. However, the use cases captured by this benchmark should suffice for the collection of clean transliteration corpora. This also represents a first step for many low-resource languages where no transliteration benchmark exists.
In this work, training data is limited to the 20 languages and test data is limited to the 19 languages listed in the 8th schedule of the Indian constitution. Further work is needed to extend the benchmark to many more widely used languages in India (which has about 30 languages with more than a million speakers). Subsequent to the acceptance of this work, we have also released training and testsets for one more Indic language, viz. Dogri (doi), which are available on the project website.
In this work, we describe word-level testsets. However, the typical use case for transliteration is keyboard input of sentences (or at least a sequence of words). In such cases, the context would be useful to improve transliteration. A sentence-level transliteration benchmark would be useful for evaluating such contextual transliteration models. The Dakshina dataset has sentence-level transliteration testsets for 12 languages. In a project concurrent to this work (Madhani et al., 2023), we have created sentence-level transliteration testsets for 22 Indic languages.
In this work, we have only explored romanized to native script transliteration. However, there is a need for native script to romanized models as well for processing romanized Indic language text that is also prevalent on the web. Subsequent to the acceptance of this work, we have also released an Indic to Roman script IndicXlit model trained on the Aksharantar corpus. This model is also available on the project website.

Ethics Statement
For the human annotations on the dataset, the language experts are native speakers of the languages from the Indian subcontinent. We collaborated with external agencies for the annotation task. The payment was based on their skill set and experience, was determined by the external agencies, and adhered to the government's norms. The dataset is free from harmful content. The annotators were made aware of the fact that the annotations would be released publicly, and the annotations contain no private information. The proposed benchmark builds upon existing datasets. These datasets and related works have been cited.
The annotations are collected on a publicly available dataset and will be released publicly for future use.

B Example of Wikidata mining
Figure 1 describes the structure of the Wikidata database. As shown in Figure 1, each entity has labels that are common names of the item in multiple languages. For example, the location entity Mumbai has its transliterations in multiple Indian languages. We extract such English-Indian language label pairs to create transliteration pairs. For multi-word labels, we use a simple method that worked well: we create all possible word pair candidates from the English and the Indian language labels and then filter the candidate pairs using the automatic transliteration validator described in Section 4.3. For example, the multi-word pair "Mahatma Gandhi" results in 4 candidate pairs: { Mahatma महात्मा, Mahatma गांधी, Gandhi महात्मा, Gandhi गांधी }. The validator then filters out the two incorrect cross pairs, { Mahatma गांधी } and { Gandhi महात्मा }. Please refer to Sajjad et al. (2012) for more details.
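The multi-word mining step can be sketched as follows. This is a minimal illustration: `mine_multiword_pairs` and the dictionary-based validator are hypothetical names, and the toy validator merely stands in for the rule-based consonant-mapping check of Section 4.3.

```python
from itertools import product

def mine_multiword_pairs(en_label, indic_label, is_valid_pair):
    """Generate all cross-product word pairs from a multi-word label pair,
    then keep only the pairs accepted by the transliteration validator."""
    candidates = product(en_label.split(), indic_label.split())
    return [(en, ind) for en, ind in candidates if is_valid_pair(en, ind)]

# Toy validator for illustration only: accepts pairs from a known dictionary.
# The real validator is the rule-based consonant-mapping check of Section 4.3.
KNOWN = {("Mahatma", "महात्मा"), ("Gandhi", "गांधी")}
pairs = mine_multiword_pairs("Mahatma Gandhi", "महात्मा गांधी",
                             lambda e, i: (e, i) in KNOWN)
# Keeps the two correct pairs; drops the two cross pairs.
```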

D Annotation Process in detail
Karya App We use Project Karya (Chopra et al., 2019; Abraham et al., 2020), an open-source crowdsourcing platform that makes digital language work more inclusive and accessible to the masses using smartphones, as our annotation platform. The app is used for collecting transliteration data from selected annotators. The user interface is shown in Figure 2.
Annotator details In all, we employ 68 annotators from two data annotation agencies as transliterators and validators, with the latter having more experience in linguistic tasks. The annotators were paid INR 2 (USD 0.026) per native language word. Transliteration annotation task Each transliteration micro-task contains 100 native words to be transliterated and then validated post transliteration.

E Automatic Validation Algorithm
The transliteration checker is based on the Transliteration Equivalence algorithm for English (Roman script)-Hindi described in Khapra et al. (2014), which checks equivalence of the consonant mappings in a potential transliteration pair. To achieve this, the algorithm takes two pieces of information: (i) a stop-list of vowels in the two languages, and (ii) a list of consonant mappings between the two languages. We incorporate these rules and extend the above-mentioned approach to other Indic languages with the aid of language experts. Table 12 shows a snippet of consonant mappings for the Kannada language. There is a large overlap in the consonant mapping rules across Indian languages, but we also cater to language-specific exceptions.
The transliteration validator first removes vowels and all characters present in a stop-list from the English variant, and maps each English consonant to the relevant Indic language consonant according to the consonant mapping table shown in Table 12. Once all possible Indic language variants of the English word are formulated, they are compared against the original Indic word to check the validity of the romanized transliteration. We check the effectiveness of the transliteration validator on transliteration pairs in the Dakshina train set and observe that it achieves a minimum accuracy of 90% across languages, as shown in Table 15. This indicates its utility and non-intrusiveness.
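A minimal sketch of this validation logic for English-Kannada, assuming a tiny, hypothetical consonant-mapping table (the real tables, cf. Table 12, are curated by language experts and cover all consonants and language-specific exceptions):

```python
from itertools import product

# Hypothetical snippet of an English-to-Kannada consonant mapping; purely
# illustrative, not the expert-curated table from the paper.
CONSONANT_MAP = {
    "k": ["ಕ", "ಖ"],
    "n": ["ನ", "ಣ"],
    "d": ["ದ", "ಧ", "ಡ"],
}
EN_STOPLIST = set("aeiou")                  # vowels removed on the Roman side
KN_NON_CONSONANTS = set("ಾಿೀುೂೆೇೈೊೋೌಂಃ್")   # vowel signs, anusvara, virama

def is_valid_pair(en_word, kn_word):
    """Consonant-skeleton equivalence check: strip vowels from both sides,
    then test whether some combination of mapped consonants reproduces
    the Kannada skeleton."""
    en_skeleton = [c for c in en_word.lower() if c not in EN_STOPLIST]
    kn_skeleton = "".join(c for c in kn_word if c not in KN_NON_CONSONANTS)
    if any(c not in CONSONANT_MAP for c in en_skeleton):
        return False  # a consonant we cannot map: reject the pair
    # Enumerate every possible Kannada rendering of the English skeleton.
    return any("".join(v) == kn_skeleton
               for v in product(*(CONSONANT_MAP[c] for c in en_skeleton)))
```

For example, `is_valid_pair("kannada", "ಕನ್ನಡ")` accepts the pair because the skeleton k-n-n-d can be rendered as ಕ-ನ-ನ-ಡ.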

F IndicXlit: Model parameters and Training details
Vocabulary The input vocabulary is the set of Roman script characters found in the training set, while the output vocabulary is the union of characters from various Indic language scripts found in the training set. The input and output vocabulary sizes are 28 and 780 characters, respectively.
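The vocabulary construction can be sketched as follows; this is a simplified illustration (`build_vocabs` is a hypothetical helper), since in practice the vocabularies are produced by fairseq preprocessing:

```python
def build_vocabs(pairs):
    """Character vocabularies: the input side collects Roman characters seen
    in training; the output side is the union of Indic-script characters
    across all languages' training data."""
    src_chars, tgt_chars = set(), set()
    for roman, indic in pairs:
        src_chars.update(roman)
        tgt_chars.update(indic)
    return sorted(src_chars), sorted(tgt_chars)

# Two toy pairs in different scripts (Devanagari and Kannada).
src, tgt = build_vocabs([("mumbai", "मुंबई"), ("bengaluru", "ಬೆಂಗಳೂರು")])
# src holds Roman characters only; tgt mixes characters of both scripts.
```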

Model parameters
We experimented with different hyperparameters for the model architecture, and the following parameters gave the best results on the Dakshina development set. The IndicXlit model has 6 encoder and decoder layers each, 256-dimensional input embeddings, a feedforward network (FFN) dimension of 1024, and 4 attention heads. We use the GELU activation function (Hendrycks and Gimpel, 2016) in the feedforward layer, and dropout=0.5. We preprocess multi-head attention, encoder attention, and each layer of the FFN with layernorm. We add layer normalization to the embeddings (Ba et al., 2016).
Training Details We use Fairseq (Ott et al., 2019) for training our transliteration models, specifically the translation multi simple epoch task. We optimize the cross-entropy loss using the Adam optimizer (Kingma and Ba, 2015) with Adam betas of (0.9, 0.98). We use a peak learning rate of 0.001, 4000 warmup steps, and the inverse-sqrt learning rate scheduler. We use a global batch size of 4096 pairs. Each minibatch contains examples from all language pairs. Due to the skew in data distribution across languages, we use temperature sampling (Arivazhagan et al., 2019) to oversample data from low-resource languages with temperature T = 1.5. We optimize the above-mentioned values of the hyperparameters over the Dakshina training and development set. We train the model on 4 A100 GPUs for a maximum of 50 epochs. Table 5 shows the statistics of the train and validation splits used to train IndicXlit.
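The temperature sampling scheme can be illustrated as follows. This is a sketch of the standard formulation, where language l is sampled with probability proportional to n_l^(1/T); in our setup fairseq handles this internally, and the corpus sizes below are toy values.

```python
def temperature_sampling_probs(sizes, T=1.5):
    """Per-language sampling probabilities p_l proportional to n_l**(1/T).
    T=1 reproduces the raw data distribution; T>1 upsamples low-resource
    languages (IndicXlit uses T=1.5)."""
    weights = {lang: n ** (1.0 / T) for lang, n in sizes.items()}
    total = sum(weights.values())
    return {lang: w / total for lang, w in weights.items()}

# Skewed toy corpus: the high-resource language has 100x the pairs.
probs = temperature_sampling_probs({"hin": 1_000_000, "mni": 10_000}, T=1.5)
# The low-resource share rises from ~1% (proportional) to ~4.4%.
```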

H Error Analysis in detail
To understand the errors made by IndicXlit, we analysed the output of the model for 100 randomly sampled words each for ben, guj, hin, kan, mar, pan, and tel from the Dakshina dataset. Table 16 summarizes the major transliteration errors. Vowels. The most common errors across languages are with respect to vowels, as reported in previous studies (Kunchukuttan et al., 2021). Insertion/deletion of the '◌ा' vowel diacritic along with confusion between short/long vowel diacritics constitute a large fraction of transliteration errors. Similar consonants. Another common source of errors is confusion between similar consonants, as shown in Table 16.

Figure 1:
Structure of a Wikidata record. The labels in multiple languages that give rise to transliteration pairs are highlighted in these examples. As described in Section 3.2, these transliteration pairs can be named entities, loan words, or cognates in parallel translation sentences. Unsupervised method by Sajjad et al. (2012) The unsupervised method suggested by Sajjad et al. (2012) is implemented in the transliteration module (Durrani et al., 2014a) of Moses (Koehn et al., 2007) to mine transliteration pairs from word alignments by distinguishing transliterations from non-transliterations. Their model is a combination of a transliteration sub-model and a non-transliteration sub-model, combined with interpolation weights. The parameters of the transliteration sub-model and the interpolation weights are learned during training, whereas the parameters of the non-transliteration sub-model are kept fixed after initialization. The training procedure ensures that the transliteration sub-model assigns most of its probability mass to transliteration pairs, whereas the non-transliteration sub-model distributes the probability mass evenly across all possible source and target word pairs. Hence, the trained model assigns a higher score to transliteration pairs and thus helps in identifying such pairs.
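The mixture scoring idea can be sketched as follows. This is a simplified, hypothetical rendering: `p_tr` stands in for the learned transliteration sub-model probability, the non-transliteration sub-model is uniform over characters, and the vocabulary sizes are illustrative, not the actual values used by Moses.

```python
def transliteration_posterior(e, f, p_tr, lam, src_vocab=26, tgt_vocab=64):
    """Posterior probability that (e, f) is a transliteration under a
    two-component mixture: lam * p_tr(e, f) against a fixed
    non-transliteration sub-model that is uniform over all source/target
    strings of the same lengths."""
    p_ntr = (1.0 / src_vocab) ** len(e) * (1.0 / tgt_vocab) ** len(f)
    joint_tr = lam * p_tr
    joint_ntr = (1.0 - lam) * p_ntr
    return joint_tr / (joint_tr + joint_ntr)

# A plausible character-model score dominates the uniform sub-model,
# so the pair is scored as a transliteration.
post = transliteration_posterior("gandhi", "गांधी", p_tr=1e-6, lam=0.5)
```

Because the uniform sub-model spreads its mass over all strings, even a modest transliteration score pushes the posterior toward 1, which is why thresholding this score separates transliteration pairs from noise.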

Figure 2:
Annotation UI in the Karya app.
Transliterators are instructed to write transliterations that are natural. A rule-based automatic transliteration validator (Appendix 4.3) is used as the first level of internal quality check for the proposed transliterations. The validator can reject wrong variants and enter new variants for a native script word missed by the transliterator on a similar interface, as shown in Figure 2b. The variants accepted or added by the validator constitute the final set of romanized variants for the input word.

Table 2:
Examples of incorrect mined pairs from translation corpora.

Table 5:
Training and validation set statistics for Aksharantar.All numbers are in thousands.
Table 11 describes examples from the Samanantar parallel translation corpus. Transliteration pairs

Table 11:
Examples of transliteration pairs from the Samanantar parallel translation corpus.

Table 16:
Summary of errors of IndicXlit outputs.