Naamapadam: A Large-Scale Named Entity Annotated Data for Indic Languages

We present Naamapadam, the largest publicly available Named Entity Recognition (NER) dataset for the 11 major Indian languages from two language families. The dataset contains more than 400k sentences annotated with a total of at least 100k entities from three standard entity categories (Person, Location, and Organization) for 9 out of the 11 languages. The training dataset has been automatically created from the Samanantar parallel corpus by projecting automatically tagged entities from an English sentence to the corresponding Indian language translation. We also create manually annotated test sets for 9 languages. We demonstrate the utility of the obtained dataset on the Naamapadam-test dataset. We also release IndicNER, a multilingual IndicBERT model fine-tuned on the Naamapadam training set. IndicNER achieves an F1 score of more than 80 for 7 out of 9 test languages. The dataset and models are available under open-source licences at https://ai4bharat.iitm.ac.in/naamapadam.


Introduction
Named Entity Recognition (NER) is a fundamental task in natural language processing (NLP) and is an important component for many downstream tasks like information extraction, machine translation, entity linking, co-reference resolution, etc. The most common entities of interest are person, location, and organization names, which are the focus of this work and of most work in NLP. Given high-quality NER data, it is possible to train good-quality NER systems with existing technologies (Devlin et al., 2019). For many high-resource languages, publicly available annotated NER datasets (Tjong Kim Sang, 2002; Tjong Kim Sang and De Meulder, 2003a; Pradhan et al., 2013; Benikova et al., 2014) as well as high-quality taggers (Wang et al., 2021; Li et al., 2020) are available.
However, most Indic languages do not have sufficient labeled NER data to build good-quality NER models. All existing corpora for Indic languages have been manually curated (Lalitha Devi et al., 2014; Murthy et al., 2018; Pathak et al., 2022; Murthy et al., 2022; Malmasi et al., 2022; Litake et al., 2022). Given the number of languages and the expenses and logistical challenges involved, these datasets are limited along various dimensions, viz. corpus size, language coverage, and broad domain representation. In recent years, zero-shot cross-lingual transfer from pre-trained models, fine-tuned on task-specific training data in English, has been proposed as a way to support various language understanding tasks for low-resource languages (Hu et al., 2020). However, this approach is more suitable for semantic tasks, and cross-lingual transfer does not work as well for syntactic tasks like NER when transferring across distant languages like English and Indian languages (Wu and Dredze, 2019; Karthikeyan et al., 2020; Ruder et al., 2021). Hence, there is a need for in-language NER training data for Indic languages.
In recent years, the paradigm of mining datasets from publicly available data sources has been successfully applied to various NLP tasks for Indic languages, like machine translation (Ramesh et al., 2022), machine transliteration (Madhani et al., 2022), and many natural language generation tasks (Kumar et al., 2022). These approaches have led to the creation of large-scale datasets and models with broad coverage of Indic languages in a short amount of time. Taking inspiration from these successes, we explore the automatic creation of NER datasets by utilizing publicly available parallel corpora for Indian languages and high-quality English named entity taggers. In this work, we undertake the task of building large-scale NER datasets and models for all major Indic languages.
The following are the contributions of our work: • We build Naamapadam, the largest publicly available NER dataset for Indic languages, covering 11 languages from 2 language families. Naamapadam contains 5.7M sentences and 9.4M entities across these languages from three categories: PERSON, LOCATION, and ORGANIZATION. This is significantly larger than other publicly available NER corpora for Indian languages in terms of the number of named entities and language coverage. Table 1 compares Naamapadam with other Indic language NER datasets. WikiANN is a highly noisy 'silver standard' dataset comprising annotations on Wikipedia article titles, which are not representative of natural language sentences (Pan et al., 2017). Other datasets cover only 7 Indian languages. Except for CFILT-Hindi, the other datasets are small in size.
• We utilize parallel translation corpora between English and Indic languages to create NER training corpora by projecting annotations from English sentences to their Indic language translations. This allows for inexpensive creation of data at scale, while maintaining high quality.
• We show that the projection approach is better than approaches based on zero-shot cross-lingual transfer. Hence, we recommend the use of a projection approach when a reasonable amount of parallel corpora is available. This is a valid assumption for many mid-resource languages which today lack good NER models.
• We create the Naamapadam test set, containing human-annotated test sets for 8 languages on general domain corpora, with around 1000 sentences per language, that can help in benchmarking NER models for Indic languages. (Naamapadam means 'named entity' in Sanskrit.) Existing test sets are limited to fewer languages or are domain-specific.
• We also train a multilingual NER model, IndicNER, supporting 11 Indic languages. Our models achieve an F1 score of more than 80% on most languages in the Naamapadam test set.
The models and datasets are available publicly under open-source licences.

Related Work
We discuss the state of NER datasets for Indian languages and common methods used to improve NER for low-resource languages.

NER data for Indian languages
Very limited NER corpora are available for Indian languages. They are mostly small in size and do not cover all major Indian languages. The FIRE-2014 dataset (Lalitha Devi et al., 2014) is available for 4 languages. It was created by collecting sentences/documents from Wikipedia, blogs, and online discussion forums. The WikiAnn dataset is available for around 16 Indian languages; it is, however, "silver standard" data automatically created by projecting annotations from English Wikipedia via cross-language links (Pan et al., 2017). Moreover, the examples are Wikipedia article titles, which are not representative of natural language sentences. Murthy et al. (2022) contributed the largest human-annotated dataset for Hindi (CFILT-Hindi) in terms of volume and diversity, with over 100k sentences, all annotated by a single expert over a span of several years. There are a few small datasets for Indian languages: CFILT-Marathi (Murthy et al., 2018), MahaNER (Litake et al., 2022), AsNER (Pathak et al., 2022) and MultiCoNER (Malmasi et al., 2022). In contrast, Naamapadam has greater language coverage and is much larger than other datasets. It is also representative of general domain text.

Annotation Projection
Named entity corpora can be created for low-resource languages by projecting named entity annotations from sentences in a (high-resource) source language onto the corresponding words in the translated sentence in the (low-resource) target language. Yarowsky et al. (2001) first demonstrated how annotations can be projected using word alignments, given parallel corpora between two languages. In addition to word alignments, projection can also be based on matching tokens via translation and entity dictionaries as well as transliteration (Zhang et al., 2016; Jain et al., 2019). Agerri et al. (2018) extended this approach to multiple languages by utilizing multi-way parallel corpora to project named entity labels from the source to the target language. They focus on using more than one source language having a corresponding parallel sentence to the target language. The idea is that if multiple languages project the same label onto an entity in the target language, there is a higher probability of that label being correct. When a parallel corpus is not available but good-quality MT systems are, annotated corpora in one language can be translated into another language, followed by annotation projection (Jain et al., 2019; Shah et al., 2010). Bilingual dictionaries or bilingual embeddings have been used for translation in low-resource scenarios (Mayhew et al., 2017; Xie et al., 2018).
The WikiAnn project creates 'silver standard' NER corpora using a weakly supervised approach leveraging knowledge bases and cross-lingual entity links to project English entity tags to other languages (Pan et al., 2017). WikiAnn is thus noisy and consists of short Wiki titles.

Zero-shot Cross-lingual Transfer
This method relies on shared multilingual representations to help low-resource languages by transferring information from high-resource language NER models. In particular, NER models fine-tuned on pre-trained language models like mBERT (Devlin et al., 2019) or XLM-RoBERTa (Conneau et al., 2020) for high-resource languages are used to tag low-resource language sentences (zero-shot NER). Pires et al. (2019) demonstrate that multilingual models perform well for zero-shot NER transfer on related languages. However, zero-shot performance is limited for distant languages (Wu and Dredze, 2019), particularly when there are structural/word-order differences between the two languages (Karthikeyan et al., 2020). Unlike many other NLP tasks, zero-shot cross-lingual NER has seen only limited benefit from recent advances in cross-lingual representation learning (Ruder et al., 2021). To overcome this limitation, a knowledge distillation approach has been proposed to create synthetic in-language training data. Here, the source language teacher NER model is used to create distillation data in the target language via zero-shot cross-lingual transfer, which is then used to train a target language model.

Mining NER Corpora
Following Yarowsky and Ngai (2001a,b), our method for building NER corpora is based on projecting NER tags from the English side of English-Indic parallel corpora to the corresponding Indic language words. For our work, we use the Samanantar parallel corpus (Ramesh et al., 2021), which is the largest publicly available parallel corpus between English and 11 Indic languages. Figure 1 illustrates our workflow for extracting named entity annotated Indic sentences from an English-Indic parallel sentence pair. It involves the following stages: (a) tagging the English sentence with a high-accuracy English NER model (Sec 3.1), (b) aligning English and Indic language words in the parallel sentence pair (Sec 3.2), (c) projecting NER tags from the English sentence to Indic words using the word alignments (Sec 3.3). These stages are further described in this section.

Labeling English Named Entities
Given the availability of large NER datasets like the CoNLL 2003 dataset (Tjong Kim Sang and De Meulder, 2003a) and substantial research in building high-quality NER models (Collobert et al., 2011; Lample et al., 2016; Devlin et al., 2019) for English, we tag the named entities on the English side of the parallel corpus using a publicly available, high-quality, off-the-shelf English NER tagger.
We evaluate a few off-the-shelf English NER models on the CoNLL 2003 dataset. The models evaluated, along with their F1 scores on the CoNLL 2003 test set, are reported in Table 2. All models show good accuracy, with LUKE being the best-performing one. Another consideration for choosing the English NER model is its performance on named entities of Indian origin: the Samanantar corpus mainly contains named entities of Indic origin, and performance on the CoNLL test set might not reflect model performance on such entities. Hence, we manually analyze the performance of these models on around 50 English sentences from the Samanantar corpus. Our qualitative analysis revealed the BERT-base-NER model to perform best among the three models shown above, so we used this model for tagging the English portion of the Samanantar parallel corpus. We ignore the MISC tags predicted by BERT-base-NER and focus on PERSON, LOCATION, and ORGANIZATION tags only: MISC is a very open-ended category, and we found that it was not easy to reliably align MISC-tagged entities from English to Indian languages.
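The tagging step reduces to running the English tagger and dropping MISC predictions. A minimal sketch is shown below; the entity-dictionary format mirrors the common Hugging Face pipeline convention, while the tagger output itself is hard-coded here for illustration:

```python
# Keep only PER/LOC/ORG entities from a tagger's output; drop MISC.
# The dict format (entity_group/word/start/end) follows the Hugging Face
# NER pipeline convention; the `tagged` list below is illustrative output,
# not a real model prediction.

KEPT_TYPES = {"PER", "LOC", "ORG"}

def filter_entities(entities):
    """Drop MISC (and any other non-core) entity predictions."""
    return [e for e in entities if e["entity_group"] in KEPT_TYPES]

# Hypothetical tagger output for an English Samanantar sentence.
tagged = [
    {"entity_group": "PER", "word": "Ravi Shankar Prasad", "start": 0, "end": 19},
    {"entity_group": "LOC", "word": "Taj Mahal", "start": 32, "end": 41},
    {"entity_group": "MISC", "word": "Padma Shri", "start": 50, "end": 60},
]

print(filter_entities(tagged))
```

Only the filtered PERSON/LOCATION/ORGANIZATION spans are carried forward to the projection step.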

Word Alignment
For every sentence pair in the parallel corpus, we align English words to the corresponding Indic language words. We explore two approaches for learning word alignments.
• GIZA++ (Och and Ney, 2003) implements the IBM word-alignment models (Brown et al., 1993) to learn word alignments given a parallel corpus. We learn the word alignment model in both directions using the parallel corpus and use the intersection of the bi-directional runs to obtain word alignments between the two languages. We use default GIZA++ settings.
• Awesome-align (Dou and Neubig, 2021) is a word alignment tool that extracts word alignments from contextualized embeddings. Specifically, mBERT (Devlin et al., 2019) is fine-tuned on parallel corpora with multiple objectives, such as Masked Language Modeling, Translation Language Modeling, and Parallel Sentence Identification, which encourage aligned words across languages to have similar contextualized embeddings; aligned words are then extracted from these embeddings.
We use the Translation Language Modeling and Self-training objectives in our experiments, and use softmax to normalize the alignment scores. We also experimented with entmax, but softmax gave the best F1 scores on the small test set. In all our experiments, Awesome-align refers to the mBERT model fine-tuned on the parallel corpus with softmax as the normalizing function.
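The extraction idea behind these contextual alignments can be sketched as follows: softmax-normalize the token similarity matrix in each direction, threshold, and keep the intersection. This is a simplified illustration of the approach, not the tool itself; the embeddings and the threshold value are stand-ins:

```python
import numpy as np

def extract_alignments(src_emb, tgt_emb, threshold=0.3):
    """Sketch of similarity-based alignment extraction in the spirit of
    Awesome-align: softmax-normalize the token similarity matrix in both
    directions and keep (i, j) pairs that survive the threshold in both
    (the intersection). src_emb: (m, d) array, tgt_emb: (n, d) array.
    The threshold here is illustrative, not the tool's default."""
    sim = src_emb @ tgt_emb.T                               # (m, n) similarities
    p_fwd = np.exp(sim) / np.exp(sim).sum(axis=1, keepdims=True)  # over target
    p_bwd = np.exp(sim) / np.exp(sim).sum(axis=0, keepdims=True)  # over source
    aligned = (p_fwd > threshold) & (p_bwd > threshold)
    return sorted((int(i), int(j)) for i, j in zip(*np.nonzero(aligned)))
```

With near-one-hot embeddings (e.g. `np.eye(3) * 5.0` for both sides), this returns the diagonal pairs `[(0, 0), (1, 1), (2, 2)]`, mirroring how confident contextual matches survive the bidirectional intersection.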

Projecting Named Entities
The next step is the projection of named entity labels from English to the Indic language side of the parallel corpus using English-Indic language word alignment information. The pseudo-code for the projection is presented in Algorithm 1. Some of the key desired properties of the entity projection algorithm are: • Adjacent entities of the same type should not be merged into one single entity.
• Small errors in word alignment should not cause drastic changes in the final NER projection.
In our algorithm, we project entities as a whole, i.e., the entire English entity phrase rather than word by word. Named entity phrases appear as contiguous words in both languages. Since every constituent English word of the named entity phrase is aligned to some Hindi word(s), we identify the minimal span of Hindi words that encompasses all the aligned Hindi words. We repeat this for every named entity identified in the English sentence. This ensures that the projection satisfies the first desired property.
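The minimal-span projection can be sketched as a short function; indices and alignments below are illustrative:

```python
def project_entity(entity_word_idxs, alignments):
    """Project one English entity (the indices of its words) to the target
    side: collect every target word aligned to any entity word and return
    the minimal contiguous span covering them, so the entity is projected
    as a whole rather than word by word. Returns None if nothing aligns."""
    entity_set = set(entity_word_idxs)
    tgt_idxs = [t for s, t in alignments if s in entity_set]
    if not tgt_idxs:
        return None
    return (min(tgt_idxs), max(tgt_idxs))   # inclusive target-side span

# Illustrative case: an entity spans English words 0..3, but only words 1
# and 3 are aligned (to target words 2 and 3); the minimal covering span
# still recovers the whole target entity.
alignments = [(1, 2), (3, 3), (4, 4)]
print(project_entity([0, 1, 2, 3], alignments))  # -> (2, 3)
```

Because each entity yields one contiguous target span, adjacent entities of the same type are never merged.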
A major limitation of the above algorithm is that word alignment errors can lead to incorrect named entity tagging on the Hindi side. For example, consider the case illustrated in Figure 2. Here, we use black arrows to indicate the alignment in the Hindi-to-English direction and blue arrows to indicate the alignment in the English-to-Hindi direction. The alignment from Hindi to English is correct. On the contrary, the alignment in the English-to-Hindi direction suffers due to the presence of additional Hindi words. The word Soren gets aligned to the additional Hindi words photo: and PTI, which are not part of the PERSON named entity.
We observe such errors to be very common in the GIZA++ alignments. In order to minimize them, we take advantage of the bidirectional alignments GIZA++ provides: we also have access to the English words aligned to every Indic word. We construct a new alignment by iterating through every Indic word and adding it to the aligned-word set of each English word present in its own aligned set (obtained from the backward mapping). We now have two different alignment sets for every English word. This can be understood better by looking at Figures 2 and 3. In Figure 2, the forward and backward mappings are shown with blue and black arrows, respectively. The backward mapping is converted into a forward mapping by reversing the arrows, as shown in Figure 3. Finally, the intersection of these two mappings (double arrows) is taken as the final alignment. The idea is that the probability of the same small errors being present in both the forward and the backward mapping is quite low. After this intersection is taken, the projection algorithm described above is applied. When we use Awesome-align (Dou and Neubig, 2021), on the other hand, the alignment of a source word to a set of target words is extremely rare; the model mostly aligns only a single word to each English word. This reduces alignment errors due to additional tokens. This could be a limitation when aligning ordinary words, where an English word may be represented by multiple Indic words, but in our use case of named entities, we observe that this happens rarely.
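The bidirectional clean-up step can be sketched as follows; the alignment dictionaries mirror the Figure 2 example in spirit (indices are illustrative):

```python
def intersect_alignments(fwd, bwd):
    """fwd: English->Indic alignment as {en_idx: set(indic_idxs)};
    bwd: Indic->English alignment as {indic_idx: set(en_idxs)}.
    Reverse bwd into an English->Indic mapping and intersect it with fwd,
    keeping only the links both directions agree on."""
    bwd_reversed = {}
    for indic_idx, en_idxs in bwd.items():
        for en_idx in en_idxs:
            bwd_reversed.setdefault(en_idx, set()).add(indic_idx)
    return {en: idxs & bwd_reversed.get(en, set())
            for en, idxs in fwd.items()}

# English word 0 ("Soren") wrongly picks up extra Indic words 5 ("photo:")
# and 6 ("PTI") in the forward direction; the backward direction only links
# Indic word 2 back to it, so the intersection drops the spurious links.
fwd = {0: {2, 5, 6}}
bwd = {2: {0}, 5: set(), 6: set()}
print(intersect_alignments(fwd, bwd))  # -> {0: {2}}
```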

Sentence Filtering
After NER projection, we apply the following filters to the tagged Indic sentences.

Sentences without named entities. Many English sentences in the Samanantar corpus are not annotated with any entities. We retain only a small fraction of such sentences (≈ 1%) for training the NER model, so that the model is also exposed to sentences without any NER tags.

Sentences with low-quality alignments. We observe that most of the errors in the Indic-tagged corpus arise from word alignment errors. Hence, we compute a word alignment quality score for each sentence pair. This score is the product of the probabilities of each aligned word pair (as provided by the forward alignment model in the case of GIZA++, and by the alignment model in the case of Awesome-align), normalized by the number of words in the sentence. We retain the top 30-40% of sentences to create the final NER-tagged data for Indic languages.
The statistics of the filtered sentences can be observed in Table 3. In general, only the top 30-40% of sentences were selected; these serve as the final dataset used for NER training in each language.
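The filtering step can be sketched as below. The exact form of the length normalization is an assumption on our part (a geometric mean in log space), as the text only says "product ... normalized by the number of words"; the `keep_fraction` value is likewise illustrative:

```python
import math

def alignment_quality(pair_probs, n_words):
    """Length-normalized alignment score for one sentence pair: the product
    of aligned word-pair probabilities, normalized by sentence length.
    Computed in log space for numerical stability; the exact normalization
    (here a geometric mean) is an assumption, not the paper's formula."""
    if not pair_probs:
        return 0.0
    log_score = sum(math.log(p) for p in pair_probs)
    return math.exp(log_score / n_words)

def filter_by_quality(scored_sentences, keep_fraction=0.35):
    """Keep the top `keep_fraction` of (sentence, score) pairs by score,
    mirroring the 'retain the top 30-40%' filter."""
    ranked = sorted(scored_sentences, key=lambda x: x[1], reverse=True)
    k = max(1, int(len(ranked) * keep_fraction))
    return ranked[:k]
```

For example, `alignment_quality([0.5, 0.5], 2)` gives 0.5, and ranking a list of scored sentence pairs with `filter_by_quality` keeps only the best-aligned fraction.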

Quantitative Analysis
To quantify the quality of the labeled data obtained, we select a small sample of 50 sentences and obtain manual annotations for 8 languages, namely Bengali, Gujarati, Hindi, Kannada, Malayalam, Marathi, Tamil, and Telugu. We also project the named entities onto this small set of 50 sentences using the projection approach discussed earlier. Since the ground truth is known, F1 scores can be calculated. Table 4 presents the F1 scores on the manually labeled set using the various projection approaches.
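We assume the usual CoNLL-style entity-level evaluation here, where an entity counts as correct only if both its span and its type match exactly; a minimal sketch:

```python
def entity_f1(gold, pred):
    """Entity-level F1 in the CoNLL style: an entity is correct only if
    its span and type both match exactly. gold/pred are sets of
    (start, end, type) triples for one evaluation set."""
    if not gold or not pred:
        return 0.0
    tp = len(gold & pred)                  # exact span-and-type matches
    precision = tp / len(pred)
    recall = tp / len(gold)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

gold = {(0, 3, "PER"), (7, 8, "LOC")}
pred = {(0, 3, "PER"), (7, 9, "LOC")}      # second span is off by one word
print(entity_f1(gold, pred))               # -> 0.5
```

This strictness is why partial projections (like the "Aam" example discussed later) count as errors.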
We observe that the GIZA++ and Awesome-align word alignment approaches obtain similar performance. On average, Awesome-align provides the best F1 scores; hence, moving forward, we use the datasets from the Awesome-align approach unless specified otherwise.

Qualitative Analysis
We now present a few examples from our projection method. Figure 4 presents examples of correct alignments and hence correct projections of NER tags. As can be seen, the alignment is fairly sparse, and the model aligns only those words in which it is extremely confident. In this sentence, both words "Ravi" and "Shankar" had to be aligned to "Ravishankar" in Hindi, but only "Ravi" was aligned. Due to our range projection, however, the entire entity "Shri Ravi Shankar Prasad" was projected successfully with the tag PERSON. Figure 5 shows an example of incorrect word alignment using the Awesome-align method. In this sentence, "AAP", the abbreviated name of a political party, is mapped only to "Aam" in Marathi instead of the entire phrase "Aam Aadmi pakshanche". This causes the projected entity to be only partially tagged with the entity type ORGANIZATION. Table 5 shows the statistics of the final Naamapadam dataset. We create train, dev, and test splits; the test sets are manually annotated as described later in Section 4. Most languages have training datasets of more than 100K sentences and 500K entities each, and some, like Hindi, have more than 1M training sentences. Compared to other datasets (see Table 1), Naamapadam has a significantly higher number of entities. Even though the dataset is slightly noisy due to alignment errors, we hope that its large size can compensate for the noise, as has been observed in many NLP tasks (Bansal et al., 2022).

Dataset Statistics
We have manually annotated test sets of around 500-1000 sentences for most languages. For Tamil (ta), we only have a small test set of 50 sentences. The Assamese (as) and Oriya (or) test sets are silver standard (the named entity projections have not been verified yet). Work on the creation of larger, manually annotated test sets for these languages is in progress.

Testset Creation
We have created two manually annotated test sets for Indian language NER evaluation: Naamapadam-test and Naamapadam-test-small. Naamapadam-test comprises 500-1000 annotated sentences per language for 8 languages, namely Bengali, Gujarati, Hindi, Kannada, Malayalam, Marathi, Punjabi, and Telugu. Naamapadam-test-small comprises 50 annotated sentences per language, with Tamil added to the languages in the large test set. The annotators were provided sentences with named entity annotations obtained using the methodology described in Section 3. The annotators had to verify whether the projected NER annotations were correct and rectify them if incorrect. They were asked to follow the CoNLL 2003 annotation guidelines (Tjong Kim Sang and De Meulder, 2003b). The human annotations were contributed by volunteers who are native speakers of the respective languages. We compute the inter-annotator agreement between two annotators for each language, on a sample, using Cohen's kappa coefficient (Cohen, 1960). The scores are shown in Table 6. They are all above 85%, signifying good-quality annotations.
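For reference, Cohen's kappa corrects the raw agreement between two annotators for the agreement expected by chance from each annotator's label distribution; a minimal sketch over per-token tag sequences:

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa between two annotators' tag sequences:
    kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement
    and p_e is the agreement expected by chance, computed from each
    annotator's marginal label frequencies."""
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(freq_a[t] * freq_b[t] for t in freq_a) / (n * n)
    return (p_o - p_e) / (1 - p_e)

# Two annotators disagreeing on one of four tags (illustrative sequences):
print(cohens_kappa(["O", "O", "PER", "LOC"], ["O", "O", "PER", "PER"]))  # -> 0.6
```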

Experimental Setup
We analyze the performance of models trained on the Naamapadam-train dataset against alternative approaches for low-resource NER and against models trained on publicly available datasets. To this end, we investigate the following research questions: RQ1: Is a model trained on the Naamapadam-train dataset better than zero-shot models?
RQ2: How does a model trained on Naamapadam-train data fare against models trained on other publicly available labeled datasets? We evaluate on the following test sets: (a) the Naamapadam-test set, and (b) publicly available test sets.

Test Datasets
In order to demonstrate the usefulness of our Naamapadam-train dataset, we fine-tune the mBERT model (Devlin et al., 2019) on it; the data mined using the GIZA++ and Awesome-align approaches is used to fine-tune for the language-specific entity recognition task. We evaluate on the following test sets:
• WikiANN: In terms of languages covered, this is the most widely used dataset. However, we observe the tagged data to be highly erroneous; it does not contain complete sentences, but just titles. We report results on this dataset for the sake of completeness, but do not recommend this test set for any evaluation of the NER task. The appendix discusses the issues with the WikiANN dataset.
• FIRE-2014: The FIRE-2014 dataset (Lalitha Devi et al., 2014) contains named entity annotated data for Hindi, Bengali, Malayalam, Tamil, and English. We use the test splits of these datasets to evaluate the performance of our model.
• MultiCoNER: We use the Hindi and Bengali named entity annotated data from Malmasi et al. (2022).
• CFILT: We use the CFILT-HiNER dataset created for Named Entity Recognition in Hindi (Murthy et al., 2022). The dataset was created from various government information webpages and newspaper articles, and the sentences were manually annotated. We also use the CFILT-Marathi dataset created for Named Entity Recognition in Marathi (Murthy et al., 2018).
• Naamapadam: We create human-annotated data of around 500-1K sentences for 8 languages, namely Bengali, Gujarati, Hindi, Kannada, Malayalam, Marathi, Punjabi, and Telugu. For Tamil, we create a small human-annotated set of 50 sentences. This serves as the gold-labelled test set in all our experiments.
For a fair comparison with models trained on our dataset, we include only PERSON, LOCATION, and ORGANIZATION entities. The remaining named entity types, where present (FIRE 2014, CFILT-Marathi), are treated as non-named entities.

NER Fine-tuning
Recently, sequence labeling via fine-tuning of pre-trained language models has become the norm (Devlin et al., 2019; Conneau et al., 2020; Kakwani et al., 2020). We fine-tune the pre-trained mBERT model (Devlin et al., 2019) and report the results in our experiments. The input to the model is a sequence of sub-word tokens that pass through the Transformer encoder layers. The output from the Transformer is an encoder representation for each token in the sequence. We take the encoder representation of the first sub-word (in case the word gets split into multiple sub-words), which is passed through the output layer: a linear layer followed by the softmax function. The model is trained using cross-entropy loss.
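The first-subword scheme is typically implemented by assigning the word's label to its first sub-word and an ignore index to the rest, so the loss is computed once per word. A sketch, assuming the `word_ids` mapping that Hugging Face fast tokenizers return (the label ids below are illustrative):

```python
def align_labels_to_subwords(word_labels, word_ids):
    """Assign each sub-word a training label: the word's tag for the first
    sub-word, and an ignore index (-100, skipped by cross-entropy loss)
    for continuation sub-words and special tokens. `word_ids` maps each
    sub-word position to its source word index (None for special tokens),
    as returned by Hugging Face fast tokenizers."""
    labels, prev = [], None
    for wid in word_ids:
        if wid is None:
            labels.append(-100)               # [CLS], [SEP], padding
        elif wid != prev:
            labels.append(word_labels[wid])   # first sub-word of the word
        else:
            labels.append(-100)               # continuation sub-word
        prev = wid
    return labels

# "Ravishankar spoke" -> sub-words: [CLS] Ravi ##shankar spoke [SEP]
word_ids = [None, 0, 0, 1, None]
word_labels = [1, 0]                          # 1 = B-PER, 0 = O (illustrative)
print(align_labels_to_subwords(word_labels, word_ids))
# -> [-100, 1, -100, 0, -100]
```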

Baseline Comparison
Our proposed approach can be seen as a cross-lingual approach, since the training data is created by projection from English to Indic sentences. Hence, we compare the performance of our model with zero-shot learning (Pires et al., 2019). We describe the baseline approach in detail below:

Zero-shot NER
To perform zero-shot transfer, we consider an mBERT model fine-tuned for the NER task in English. We use the publicly available fine-tuned NER model 3, which is trained for NER in 10 high-resource languages (English, Arabic, Chinese, and some European languages). We directly test the performance of this model on the Naamapadam large-test dataset (Bn, Gu, Hi, Kn, Ml, Mr, Pa, Te) and the Naamapadam small-test datasets (As, Or, Ta), respectively.

Monolingual and Multilingual Fine-Tuning
We employ a pre-trained mBERT model (Devlin et al., 2019) in both our monolingual and multilingual fine-tuning experiments. For multilingual fine-tuning, we combine the labeled data of all languages; each batch may contain a mix of sentences from all languages.

Results
We now present the results from our experiments.

RQ1
We now answer the question of whether models trained using our Naamapadam-train data are better than cross-lingual zero-shot models. Table 7 reports the results from our experiments. Apart from Hindi, Malayalam, and Marathi, we observe relatively poor results for the other Indic languages in the zero-shot setting. Zero-shot techniques perform quite well on high-resource languages like Hindi, scoring a respectable 75.96%. However, for Assamese and Oriya, the results are very poor.
We observe that the models trained using the Naamapadam-train dataset give the best F1 scores across languages. In general, we observe better performance from data obtained using Awesome-align (Dou and Neubig, 2021) compared to GIZA++ (Och and Ney, 2003).

RQ2a
In this section, we answer the question of whether models trained on the Naamapadam-train data fare better than models trained on other publicly available labelled datasets, evaluated on the Naamapadam-test set. Table 9 reports the results from our experiments. We observe that the model fine-tuned on Naamapadam-train data outperforms all other models by a significant margin, indicating the utility of our labelled data. Models trained using FIRE-2014 data and WikiANN data obtain similar performance, while models trained using MultiCoNER obtain the lowest performance. Only the model trained using CFILT-HiNER (Murthy et al., 2022), which is a large annotated dataset, obtains a reasonable F1 on Hindi. This underlines the importance of large, high-quality data and shows that projection methods can help create such data at scale.

RQ2b
In this section, we answer the question of whether models trained on Naamapadam-train data fare better than models trained on other publicly available labelled datasets when tested on publicly available test sets. Table 10 reports the results from our experiments. The column In-Dataset refers to training and testing on the same dataset. The column Naamapadam refers to training on Naamapadam-train data and testing on the test split of the respective dataset (Zero-Shot), and to further fine-tuning the same model on the train split of the in-dataset (Fine-Tune). We observe that, in general, fine-tuning our trained model on the in-dataset gives a further boost compared to fine-tuning a pre-trained LM directly on the in-dataset.

IndicNER: Multilingual Fine-tuning
Multilingual fine-tuning has been shown to outperform language-specific fine-tuning (Dhamecha et al., 2021). We therefore fine-tune a multilingual model on the combined data of all languages in Naamapadam-train. We refer to this model as IndicNER. Table 11 reports the results from our experiments. We observe that multilingual models on average perform better than monolingual models. For extremely low-resource languages like Assamese, the multilingual model performs much better than the others, with a jump in F1 score from 40 to 62.5.

Conclusion
In this work, we take a major step towards creating publicly available, open datasets and open-source models for named entity recognition in Indic languages. We introduce Naamapadam, the largest entity recognition corpus for 11 Indic languages, containing more than 100K sentences and covering 11 of the 22 languages listed in the Indian constitution. The corpus was created by projecting named entities from the English side to the Indic language side of English-Indic parallel corpora. Naamapadam also includes manually labelled test sets for 8 Indic languages. We also build IndicNER, an mBERT-based multilingual named entity recognition model for 11 Indic languages, and provide baseline results on our test set along with a qualitative analysis of model performance. The dataset and models are available publicly under open-source licenses. We hope the dataset will spur innovations in entity recognition and its downstream applications in the Indian NLP space.