Bhasha-Abhijnaanam: Native-script and romanized Language Identification for 22 Indic languages

We create publicly available language identification (LID) datasets and models for all 22 Indian languages listed in the Indian constitution, covering both native-script and romanized text. First, we create Bhasha-Abhijnaanam, a language identification test set for native-script as well as romanized text which spans all 22 Indic languages. We also train IndicLID, a language identifier for all the above-mentioned languages in both native and romanized script. For native-script text, it has better language coverage than existing LIDs and is competitive with or better than other LIDs. IndicLID is the first LID for romanized text in Indian languages. Two major challenges for romanized-text LID are the lack of training data and low LID performance when languages are similar. We provide simple and effective solutions to these problems. In general, there has been limited work on romanized text in any language, and our findings are relevant to other languages that need romanized language identification. Our models are publicly available at https://ai4bharat.iitm.ac.in/indiclid under open-source licenses. Our training and test sets are also publicly available at https://ai4bharat.iitm.ac.in/bhasha-abhijnaanam under open-source licenses.


Introduction
In this work, we focus on building a language identifier for the 22 languages listed in the Indian constitution. With increasing digitization, there is a push to make NLP technologies like translation, ASR, conversational technologies, etc. (Bose, 2022) available as a public good at population scale (Chandorkar, 2022). A good language identifier is required to help build corpora in low-resource languages. For such languages, language identification is far from a solved problem due to noisy web crawls, small existing datasets, and similarity to high-resource languages (Caswell et al., 2020).
Existing publicly available LID tools like CLD3, LangID (Lui and Baldwin, 2011), FastText (Joulin et al., 2016) and NLLB (NLLB Team et al., 2022) have shortcomings with respect to Indian languages. They do not cover all of the above-mentioned 22 languages. In social media and chats, it is also common to use the roman script for most Indian languages, leading to substantial user-generated content in roman script. However, none of these LIDs support the detection of romanized Indian-language text (except CLD3's support for romanized Hindi). The widespread use of romanization implies that accurate romanized language identification models are a critical component of the NLP stack for Indian languages, given that this affects over 735 million internet users (KPMG and Google, 2017). Therefore, our work on developing accurate and efficient romanized language identification models has the potential to make a significant impact in the NLP space for Indian languages, particularly in the social media and chat application domains. Hence, we undertake the task of creating an LID for these 22 Indian languages. The main contributions of our work are as follows: • We create Bhasha-Abhijnaanam, a language identification test set for native-script as well as romanized text which spans 22 Indic languages. Previous benchmarks for native script do not cover all these languages (NLLB Team et al., 2022; Roark et al., 2020). The Dakshina test set for romanized text covers only 11 languages and contains ambiguous instances, such as named entities, that cannot be assigned to a particular language (Roark et al., 2020).
• We also train IndicLID, an LID for all the above-mentioned languages in both native and romanized script. For native-script training data, we sample sentences from diverse sources and oversample low-resource languages. The IndicLID native-script model has better language coverage than existing LIDs and is competitive with or better than other LIDs, with 98% accuracy and at least 6 times better throughput.
• To the best of our knowledge, ours is one of the first large-scale efforts for romanized LID in any language, a task that has not received much attention. A major challenge for romanized-text LID is the lack of romanized training data. We show that synthetic romanized training data created via transliteration can help train a reasonably good LID for romanized text. A simple linear classifier does not perform well for romanized text. Hence, we combine a simple but fast text classifier with a slower but more accurate classifier based on a pretrained language model to achieve a good trade-off between accuracy and speed.
Our findings are relevant to other languages that need LID for romanized text. We require native-script data and a transliteration model to create the synthetic romanized data for the target language. This romanized data serves as training data for the romanized LID.

Bhasha-Abhijnaanam benchmark
We describe the creation of the Bhasha-Abhijnaanam LID benchmark for 22 Indian languages in native and roman script. Table 1 describes the statistics of the Bhasha-Abhijnaanam benchmark. We build upon existing benchmarks to fill in the coverage and quality gaps and cost-efficiently cover all languages.

Native script test set.
We compile a native-script test set comprising 19 Indian languages and 11 scripts from the FLORES-200 devtest (NLLB Team et al., 2022) and the Dakshina sentence test set (Roark et al., 2020). We create native-script test sets for the remaining three languages (Bodo, Konkani, Dogri) and one script (Manipuri in the Meetei Mayek script) not covered in these datasets. For these new languages, we first sample English sentences from Wikipedia and ask in-house, professional translators to translate the sentences into the respective languages. This method ensured the quality and accuracy of our test samples, as well as minimizing potential biases.

Roman script test set.
We use the filtered Dakshina romanized test set for the languages it covers (Section 2.3 describes the filtering). To create a benchmark test set for the remaining languages, we sampled sentences from IndicCorp (Doddapaneni et al., 2022) and asked annotators to write the same in roman script. We did not specify any transliteration guidelines and annotators were free to transliterate in the most natural way they deemed fit. We additionally asked annotators to skip a sentence if they found it invalid (wrong language, offensive, truncated, etc.).

Romanized Dakshina testset filtering
The Dakshina romanized sentence test set includes short sentences which are just named entities and English loan words, which are not useful for romanized-text LID evaluation. To address this issue, we manually validated the Dakshina test sets for the languages we are interested in. We first identified potentially problematic sentences from the romanized Dakshina test set by applying two constraints: (i) the sentence is shorter than 5 words, or (ii) the native-script LID model is not confident about the parallel native-script sentence (prediction score less than 0.8). These sentences were then validated by native-language annotators. The annotators were asked to read the roman sentences and determine whether they were named entities or sentences whose language they could not determine. Such entries were filtered out. About 7% of the sentences were filtered. Table 2 describes the filtering statistics.
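The two flagging constraints above can be expressed as a small helper. The function name and score argument are illustrative stand-ins, but the thresholds (5 words, 0.8 prediction score) are the ones used in the filtering described here.

```python
MIN_WORDS = 5            # constraint (i): very short sentences
MIN_NATIVE_SCORE = 0.8   # constraint (ii): native-script LID confidence

def needs_manual_validation(roman_sentence: str, native_lid_score: float) -> bool:
    """Flag a Dakshina romanized sentence for annotator review if it is
    shorter than 5 words, or if the native-script LID was unsure about
    the parallel native-script sentence."""
    too_short = len(roman_sentence.split()) < MIN_WORDS
    low_confidence = native_lid_score < MIN_NATIVE_SCORE
    return too_short or low_confidence
```

Only sentences flagged by this pre-filter go to human annotators, which keeps the manual validation effort small.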

IndicLID Model
IndicLID is a classifier specifically for Indic languages that can predict 47 classes (24 native-script classes and 21 roman-script classes, plus English and Others). We create three classifier variants: a fast linear classifier, a slower classifier finetuned from a pretrained LM, and an ensemble of the two models which trades off speed versus accuracy.

Training dataset creation
Native-script training data. We compiled the training data sentences from various sources, viz. IndicCorp (Doddapaneni et al., 2022), NLLB (NLLB Team et al., 2022), Wikipedia, Vikaspedia and internal sources. To ensure a diverse and representative training dataset, we sampled 100k sentences per language-script combination in a balanced way across all these sources. We used oversampling for languages with less than 100k sentences. We tokenized and normalized the sentences using the Indic-NLP library (Kunchukuttan, 2020) with default settings.
Romanized training data. There is hardly any romanized corpus for Indian languages in the public domain. Hence, we explored the use of transliteration for creating synthetic romanized data. We create romanized training data by transliterating the native-script training data into roman script using the multilingual IndicXlit transliteration model (Indic-to-En version) (Madhani et al., 2022). The authors have provided results on the transliteration quality of the IndicXlit model; we rely on this analysis to ensure the quality of the generated training data.
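As a sketch, creating the synthetic romanized training data amounts to transliterating each labelled native-script sentence. The `transliterate` callable here is a hypothetical stand-in for a model such as IndicXlit, and the `__label__` prefix assumes FastText-style training files.

```python
from typing import Callable, Iterable

def make_romanized_training_data(
    native_sentences: Iterable[str],
    lang_code: str,
    transliterate: Callable[[str], str],
) -> list:
    """Turn native-script sentences into FastText-style romanized training
    lines, e.g. '__label__hin_Latn mera naam ...'. `transliterate` stands
    in for an Indic-to-roman transliteration model."""
    return [
        f"__label__{lang_code}_Latn {transliterate(sentence)}"
        for sentence in native_sentences
    ]
```

The same native-script corpus thus yields a parallel romanized training set at no additional annotation cost.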

Linear classifier
Linear classifiers using character n-gram features are widely used for LID (Jauhiainen et al., 2021). We use FastText (Joulin et al., 2016) to train our fast, linear classifier. It is a lightweight and efficient linear classifier that is well-suited for handling large-scale text data. It utilizes character n-gram features, which enables it to exploit subword information. This makes it particularly useful for dealing with rare words and allows it to discriminate between similar languages with similar spellings. We trained separate classifiers for native script (IndicLID-FTN) and roman script (IndicLID-FTR). We chose 8-dimensional word-vector models after experimentation as they maintain small model sizes without losing accuracy (refer to Appendix A for results).
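The character n-gram features that give such a classifier its robustness to spelling variation can be illustrated with a small extractor. The boundary markers and n-gram range below are illustrative FastText-style defaults, not the exact IndicLID-FTN/FTR settings.

```python
def char_ngrams(word: str, n_min: int = 2, n_max: int = 4) -> list:
    """Character n-grams with boundary markers: the subword features a
    FastText-style linear classifier hashes into its input vector."""
    w = f"<{word}>"  # mark word boundaries so prefixes/suffixes are distinct
    return [w[i:i + n]
            for n in range(n_min, n_max + 1)
            for i in range(len(w) - n + 1)]
```

Because similar spellings share many n-grams, even rare or unseen words contribute useful evidence to the classifier.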

Pretrained LM-based classifier
For romanized text, we observed that linear classifiers do not perform very well. Hence, we also experimented with models of larger capacity. In particular, we finetuned a pretrained LM on the romanized training dataset. We evaluated the following LMs: XLM-R (Conneau et al., 2020), IndicBERT-v2 (Doddapaneni et al., 2022) and MuRIL (Khanuja et al., 2021). The last two LMs are specifically trained for Indian languages, and MuRIL also incorporates synthetic romanized data in pre-training. Hyperparameters for finetuning are described in Appendix B. We used the IndicBERT-based classifier as the LM-based classifier (henceforth referred to as IndicLID-BERT) since it was amongst the best-performing romanized-text classifiers and had maximum language coverage.

Final Ensemble classifier
Our final IndicLID classifier is a pipeline of multiple classifiers. Figure 1 shows the overall workflow of the IndicLID classifier. The pipeline works as follows: (1) Depending on the amount of roman script in the input text, we invoke either the native-text or the romanized linear classifier. IndicLID-FTR is invoked for text containing >50% roman characters.
(2) For roman text, if IndicLID-FTR is not confident about its prediction, we redirect the request to IndicLID-BERT. We resort to this two-stage approach for romanized input to achieve a good trade-off between classifier accuracy and inference speed. The fast IndicLID-FTR's prediction is used if the model is confident about its prediction (probability of predicted class > 0.6), else the slower but more accurate IndicLID-BERT is invoked. This threshold provides a good trade-off (see Appendix C for more details).
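The routing logic of this pipeline can be sketched as below. The classifier callables are hypothetical stand-ins; only the two thresholds (more than 50% roman characters for script routing, 0.6 prediction probability for falling back to IndicLID-BERT) come from the description above.

```python
ROMAN_FRACTION_THRESHOLD = 0.5  # step (1): script-based routing
CONFIDENCE_THRESHOLD = 0.6      # step (2): fast-classifier confidence

def roman_fraction(text: str) -> float:
    """Fraction of alphabetic characters drawn from the basic Latin range."""
    letters = [c for c in text if c.isalpha()]
    if not letters:
        return 0.0
    return sum("a" <= c.lower() <= "z" for c in letters) / len(letters)

def predict(text, native_clf, roman_fast_clf, roman_bert_clf):
    """Two-stage IndicLID-style routing. Each classifier is a callable
    returning a (label, probability) pair."""
    # Step 1: mostly native-script input goes to the native classifier.
    if roman_fraction(text) <= ROMAN_FRACTION_THRESHOLD:
        return native_clf(text)
    # Step 2: try the fast roman classifier; fall back to the slower
    # LM-based classifier only when the fast model is unsure.
    label, prob = roman_fast_clf(text)
    if prob > CONFIDENCE_THRESHOLD:
        return label, prob
    return roman_bert_clf(text)
```

Most inputs are resolved by the fast path, so the expensive LM-based classifier is invoked only for a small, uncertain fraction of romanized inputs.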

Results and Discussion
We discuss the performance of various models on the benchmark and analyze the results. To prevent any overlap between the test/valid and train sets, we excluded the FLORES-200 test set (NLLB Team et al., 2022) while sampling native train samples from various sources. Additionally, we removed the training samples from the benchmark samples when collecting sentences for the benchmark test set. We also made sure that there was no overlap between the test and valid sets. To create the romanized training set, we simply transliterated the native training set. As the Dakshina test set (Roark et al., 2020) provides parallel sentences for the native and roman test sets, there was no overlap between the roman train and test sets.

Native script LID
We compare IndicLID-FTN with the NLLB model (NLLB Team et al., 2022) and the CLD3 model.
As we can see in Table 3, the LID performance of IndicLID-FTN is comparable to or better than other models. Our model is 10 times faster and 4 times smaller than the NLLB model. The model's footprint can be further reduced by model quantization (Joulin et al., 2016), which we leave for future work.

Roman script LID
Table 4 presents the results of different model variants on the romanized test set (see Appendix D for language-wise results). IndicLID-BERT is significantly better than IndicLID-FTR, but the throughput decreases significantly. The ensemble model (IndicLID) maintains the same LID performance as IndicLID-BERT with a 3x increase in throughput over IndicLID-BERT. Further speedups in model throughput can be achieved by creating distilled versions, which we leave for future work.

LID confusion analysis
The confusion matrix for IndicLID is shown in Figure 2. We see that the major confusions are between similar languages. The confusion matrix gives further insights into the impact of synthetic training data. Hindi is confused with languages like Nepali, Sanskrit, Marathi and Konkani, which use the same native script as Hindi (Devanagari). Since a multilingual transliteration model with significant Hindi data was used to create the synthetic romanized training data, the synthetic romanized forms of these languages may be more similar to Hindi than original romanized data would be.

Impact of input length. Figure 3 plots the LID accuracy for various input-length buckets. The LID is most confused for short inputs (<10 words), after which the performance is relatively stable.

Conclusion
We introduce an LID benchmark and models for native-script and romanized text in 22 Indian languages. These tools will serve as a basis for building NLP resources for Indian languages, particularly extremely low-resource ones that are "left behind" in the NLP world today (Joshi et al., 2020). Our work takes first steps towards LID of romanized text, and our analysis reveals directions for future work.

Acknowledgements
We would like to thank Jay Gala and Ishvinder Sethi for their help in coordinating the annotation work. Most importantly, we would like to thank all the annotators who helped create the Bhasha-Abhijnaanam benchmark.

Limitations
The benchmark for language identification for the most part contains clean sentences (grammatically correct, single script, etc.). Data from the real world might be noisy (ungrammatical, mixed scripts, code-mixed, invalid characters, etc.). A more representative benchmark might be useful for such use cases. However, the use cases captured by this benchmark should suffice for the collection of clean monolingual corpora. This also represents a first step for many languages where no LID benchmark exists.
The use of synthetic training data seems to create a gap in performance due to divergence between the train and test data distributions. Acquisition of original romanized text and methods to generate more natural synthetic romanized text are needed.
Note that the romanized LID model does not support Dogri, since the IndicXlit transliteration model does not support Dogri. However, since Dogri is written in the Devanagari script, using the transliterator for Hindi (which uses the same script) might be a good approximation to generate synthetic training data. We will explore this in the future.
This work is limited to the 22 languages listed in the 8th schedule of the Indian constitution. Further work is needed to extend the benchmark to many more widely used languages of India (which has about 30 languages with more than a million speakers).

Ethics Statement
For the human annotations on the dataset, the language experts are native speakers of the languages and are from the Indian subcontinent. They were paid a competitive monthly salary to help with the task. The salary was determined based on the skill set and experience of the expert and adhered to the norms of the government of our country. The dataset has no harmful content. The annotators were made aware of the fact that the annotations would be released publicly, and the annotations contain no private information. The proposed benchmark builds upon existing datasets. These datasets and related works have been cited.
The annotations are collected on a publicly available dataset and will be released publicly for future use.The IndicCorp dataset which we annotated has already been checked for offensive content.
All the datasets created as part of this work will be released under a CC-0 license.

B Finetuning pretrained language models
To fine-tune a language model, we added a softmax classification layer on top of the model and used our roman-script training data for finetuning. The results for these experiments are shown in Table 7. We found that IndicBERT and MuRIL performed similarly among the three models for our roman LID task. MuRIL leverages roman-script training data, while IndicBERT was trained only on native-script text yet performed similarly. However, IndicBERT supports 24 Indian languages, while MuRIL supports only 17. Therefore, we selected IndicBERT due to its superior coverage and comparable performance. We then further experimented with IndicBERT by unfreezing 1, 2, 4, 6, 8, and 11 layers. The results of all these experiments are reported in Table 8. We found that unfreezing 1 layer was enough for our task; unfreezing more layers did not provide any additional benefit.
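A minimal sketch of the partial-unfreezing setup, assuming BERT-style parameter names (`embeddings.*`, `encoder.layer.<i>.*`, `classifier.*`) as yielded by `model.named_parameters()` in PyTorch; the naming convention is an assumption for illustration, not taken from the paper.

```python
def freeze_for_finetuning(named_params, n_unfrozen_layers: int, n_layers: int = 12) -> None:
    """Freeze every parameter except the classification head and the top
    `n_unfrozen_layers` encoder layers (unfreezing 1 layer sufficed in
    the experiments described above)."""
    unfrozen_tags = {f"layer.{i}." for i in range(n_layers - n_unfrozen_layers, n_layers)}
    for name, param in named_params:
        # Keep gradients only for the classifier head and the top layers.
        param.requires_grad = (
            "classifier" in name or any(tag in name for tag in unfrozen_tags)
        )
```

Freezing most of the encoder keeps finetuning cheap while still adapting the model to the romanized LID task.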

C Analysis of speed/accuracy tradeoff
We experimented with IndicLID using different thresholds. If the probability score of IndicLID-FTR is below a certain threshold, we invoke the more powerful IndicLID-BERT model; otherwise, we go with the IndicLID-FTR prediction. The IndicLID-FTR model is quite fast compared to the IndicLID-BERT model. We can see a good trade-off between throughput and accuracy in Table 9 as we increase the threshold. As the threshold increases, the input is more likely to be routed to the IndicLID-BERT model, as we rely less on the IndicLID-FTR model.

Figure 3 :
Effect of input length on the romanized test set

Table 2 :
Statistics of the filtered Dakshina romanized test set

Table 4 :
Performance of IndicLID-FTR on the Bhasha-Abhijnaanam roman-script test set. Throughput is the number of sentences/second.

Table 5 :
Comparison of results on synthetic vs. original romanized test sets for the IndicLID model

Table 9 :
Trade-off between inference time and accuracy with different thresholds. Throughput is the number of sentences/second.

Table 10 :
Precision, recall and F1-score of the IndicLID-FTR, IndicLID-BERT and IndicLID roman-script models. All scores are calculated on the Bhasha-Abhijnaanam roman-script test set. Bold indicates the best performance among the three models for each language. Table 10 illustrates the language-specific performance of the models in detail. As we can see, IndicLID-BERT has a better representation than IndicLID-FTR for almost all languages, which leads to a better F1 score for IndicLID. However, for Sanskrit and Manipuri, the IndicLID-FTR model has a better representation than the IndicLID-BERT model, an interesting finding that warrants further investigation in future studies.